1. Understanding cryptic schemata in large extract-transform-load systems. Albrecht, Alexander; Naumann, Felix. January 2012.
Extract-Transform-Load (ETL) tools are used for the creation, maintenance, and evolution of data warehouses, data marts, and operational data stores. ETL workflows populate those systems with data from various data sources by specifying and executing a DAG of transformations. Over time, hundreds of individual workflows evolve as new sources and new requirements are integrated into the system. The maintenance and evolution of large-scale ETL systems requires much time and manual effort. A key problem is to understand the meaning of unfamiliar attribute labels in source and target databases and ETL transformations. Hard-to-understand attribute labels lead to frustration and to time wasted in developing and understanding ETL workflows.
We present a schema decryption technique to support ETL developers in understanding cryptic schemata of sources, targets, and ETL transformations. For a given ETL system, our recommender-like approach leverages the large number of mapped attribute labels in existing ETL workflows to produce good and meaningful decryptions. In this way we are able to decrypt attribute labels consisting of several unfamiliar few-letter abbreviations, such as UNP_PEN_INT, which decrypts to UNPAID_PENALTY_INTEREST. We evaluate our schema decryption approach on three real-world repositories of ETL workflows and show that it suggests high-quality decryptions for cryptic attribute labels in a given schema. / Extract-Transform-Load (ETL) tools are frequently used for the creation, maintenance, and evolution of data warehouses, data marts, and operational databases. ETL workflows populate these systems with data from many different source systems. An ETL workflow consists of several transformation steps that form a DAG-structured graph. Over time, hundreds of individual ETL workflows arise as new data sources are integrated or new requirements are implemented. Maintaining and evolving large ETL systems requires much time and manual work. A central problem is understanding unknown attribute names in source and target databases and in ETL transformations. Hard-to-understand attribute names lead to frustration and to high time expenditure when developing and understanding ETL workflows.
We present a schema decryption technique that helps ETL developers understand cryptic schemata in source and target databases and in ETL transformations. For a given ETL system, our approach exploits the large number of mapped attribute names in the existing ETL workflows. In this way, good and meaningful decryptions are found, and we are able to decrypt attribute names that consist of unfamiliar abbreviations; for the attribute name UNP_PEN_INT, for example, the decryption UNPAID_PENALTY_INTEREST is suggested.
Our schema decryption approach was evaluated on three ETL repositories, and the evaluation showed that it suggests high-quality decryptions for cryptic attribute names.
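A rough sketch of the underlying idea, expanding each abbreviated token of a label with expansions mined from attribute pairs that are already mapped in existing ETL workflows, might look as follows. This is only an illustration, not the authors' implementation; the mined abbreviation dictionary, the frequency-based scoring, and all names in the code are assumptions made for the example.

```python
# Minimal sketch of recommender-style schema decryption (illustrative only,
# not the authors' implementation). It assumes an abbreviation dictionary has
# already been mined from attribute pairs mapped in existing ETL workflows,
# together with counts of how often each expansion was observed.

# Hypothetical mined mappings: abbreviation -> {expansion: observed frequency}
MINED_EXPANSIONS = {
    "UNP": {"UNPAID": 12, "UNPROCESSED": 3},
    "PEN": {"PENALTY": 9, "PENDING": 4},
    "INT": {"INTEREST": 15, "INTEGER": 6, "INTERNAL": 2},
}

def decrypt_label(label: str) -> str:
    """Expand each '_'-separated token to its most frequently mapped expansion."""
    expanded = []
    for token in label.split("_"):
        candidates = MINED_EXPANSIONS.get(token.upper())
        if candidates:
            # Pick the expansion seen most often in existing workflow mappings.
            expanded.append(max(candidates, key=candidates.get))
        else:
            expanded.append(token)  # keep unknown tokens unchanged
    return "_".join(expanded)

if __name__ == "__main__":
    print(decrypt_label("UNP_PEN_INT"))  # -> UNPAID_PENALTY_INTEREST
```

In the actual approach the candidate expansions and their weights come from the mapped attribute labels already present in the ETL repository; that repository-driven scoring is what the abstract describes as recommender-like.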
2. Logical Modeling of ETL Processes Using XMLP. Snehalatha, Suma. 05 August 2010.
No description available.
3. Information Integration in a Grid Environment: Applications in the Bioinformatics Domain. Radwan, Ahmed M. 16 December 2010.
Grid computing emerged as a framework for supporting complex operations over large datasets; it enables the harnessing of large numbers of processors working in parallel to solve computing problems that typically spread across various domains. We focus on the problems of data management in a grid/cloud environment. The broader context of designing a service-oriented architecture (SOA) for information integration is studied, identifying the main components for realizing this architecture. The BioFederator is a web services-based data federation architecture for bioinformatics applications. Based on collaborations with bioinformatics researchers, several domain-specific data federation challenges and needs are identified. The BioFederator addresses these challenges and provides an architecture that incorporates a series of utility services addressing issues such as automatic workflow composition, domain semantics, and the distributed nature of the data. The design also incorporates a series of data-oriented services that facilitate the actual integration of data.
Schema integration is a core problem in the BioFederator context. Previous methods for schema integration rely on the exploration, implicit or explicit, of the multiple design choices that are possible for the integrated schema. Such exploration relies heavily on user interaction; thus, it is time consuming and labor intensive. Furthermore, previous methods have ignored the additional information that typically results from the schema matching process, namely the weights and, in some cases, the directions associated with the correspondences. We propose a more automatic approach to schema integration that is based on the use of directed and weighted correspondences between the concepts that appear in the source schemas. A key component of our approach is a ranking mechanism for the automatic generation of the best candidate schemas. The algorithm gives more weight to schemas that combine the concepts with higher similarity or coverage; it thus makes certain decisions that would otherwise be taken by a human expert. We show that the algorithm runs in polynomial time and, moreover, performs well in practice. The proposed methods and algorithms are compared to state-of-the-art approaches.
The BioFederator design, services, and usage scenarios are discussed, and we demonstrate how the architecture can be leveraged in real-world bioinformatics applications. We performed a whole human genome annotation for nucleosome exclusion regions; the resulting annotations were studied and correlated with tissue specificity, gene density, and other important gene regulation features.
We also study data processing models in grid environments. MapReduce is a popular parallel programming model that is proven to scale. However, using low-level MapReduce for general data processing tasks poses the problem of developing, maintaining, and reusing custom low-level user code. Several frameworks have emerged to address this problem; they share a top-down approach in which a high-level language describes the problem semantics and the framework translates this description into MapReduce constructs. We highlight several issues in the existing approaches and propose instead a novel refined MapReduce model that addresses the maintainability and reusability issues without sacrificing the low-level controllability offered by writing MapReduce code directly.
We present MapReduce-LEGOS (MR-LEGOS), an explicit model for composing MapReduce constructs from simpler components, namely "Maplets", "Reducelets", and optionally "Combinelets". Maplets and Reducelets are standard MapReduce constructs that can be composed to define aggregated constructs describing the problem semantics. This composition can be viewed as defining a micro-workflow inside the MapReduce job. Using the proposed model, complex problem semantics can be defined in the encompassing micro-workflow provided by MR-LEGOS while keeping the building blocks simple. We discuss the design details, the main features, and usage scenarios of MR-LEGOS. Through experimental evaluation, we show that the proposed design is highly scalable and performs well in practice.
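A toy sketch of this composition idea, with a map phase assembled as a micro-workflow of two hypothetical Maplets feeding one Reducelet, could look as follows. The interfaces and names are invented for the illustration and are not the actual MR-LEGOS API; the sketch merely simulates the composition locally instead of running on a MapReduce cluster.

```python
# Illustrative sketch of composing a MapReduce job from smaller building
# blocks ("Maplets" and "Reducelets") in the spirit of MR-LEGOS. All
# interfaces and names below are hypothetical, not the thesis' actual API.

from collections import defaultdict
from typing import Iterable, List, Tuple

def tokenize_maplet(line: str) -> Iterable[Tuple[str, int]]:
    """Maplet 1: split a line into (word, 1) pairs."""
    for word in line.split():
        yield (word, 1)

def lowercase_maplet(pair: Tuple[str, int]) -> Iterable[Tuple[str, int]]:
    """Maplet 2: normalize keys; composed after tokenization."""
    key, value = pair
    yield (key.lower(), value)

def sum_reducelet(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducelet: aggregate all values observed for a key."""
    return (key, sum(values))

def run_micro_workflow(lines, maplet_1, maplet_2, reducelet):
    """Simulate a MapReduce job whose map phase is a micro-workflow of Maplets."""
    shuffled = defaultdict(list)
    for line in lines:
        for pair in maplet_1(line):            # first Maplet
            for key, value in maplet_2(pair):  # second Maplet, composed
                shuffled[key].append(value)    # simulated shuffle phase
    return [reducelet(key, values) for key, values in shuffled.items()]

if __name__ == "__main__":
    data = ["Grid grid computing", "grid data"]
    print(run_micro_workflow(data, tokenize_maplet, lowercase_maplet, sum_reducelet))
    # -> [('grid', 3), ('computing', 1), ('data', 1)]
```

The point of the composition is that each building block stays trivial while the aggregated map and reduce functions, and therefore the job semantics, can become arbitrarily rich.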
4. Metodika řešení transformačních úloh v BI (ETL) / Methodology for solution of transformation part of BI (ETL). Cimbaľák, Michal. January 2010.
An Extract, Transformation, and Load (ETL) system is responsible for the extraction of data from several sources, and for their cleansing, conforming, and insertion into a data warehouse. The thesis focuses on building an ETL methodology that supports a better understanding of the ETL process and its application in a real environment. I worked on the project "BI Cookbook" with Clever Decision, a software consulting company that specializes in Business Intelligence; the aim was to create the ETL part of the BI methodology. The methodology has been designed to provide easy-to-use, flexible, and extensible material for building ETL solutions. The main goal of this work is to introduce the created ETL methodology, all of its components, and the way in which it can be used. A further goal is to discuss options for implementing it and for modifying it for a wide range of uses. The main contribution of this work is to provide ideas for creating and implementing an ETL methodology, and for creating methodologies in general. The thesis is divided into three parts. The first, theoretical part deals with the basic definition, architecture, and implementation of an ETL system. The second part presents the proposed methodology. The last part covers possible options for implementing and modifying the methodology.
5. 透過ODS進行企業資訊系統整合之研究-以某企業為例 / Using ODS to integrate enterprise systems: A case study. Huang, Wan-ting (黃琬婷). Date unknown.
Because of rapid technological progress, business operations have also changed significantly. Not only do requirements change quickly, but enterprises must also respond to the external environment in real time. Enterprises therefore pay increasing attention to the issue of information system integration: they hope to transform function-oriented systems into process-oriented ones and to integrate and standardize information effectively, so that the enterprise can connect quickly with its external environment and improve overall operating performance.
There are many integration approaches, which can be divided roughly into four categories. At present there is no firm conclusion on which integration approach is the most efficient or effective. The main reason is that different integration cases have different integration requirements; discussing the benefits of information system integration only from a theoretical perspective cannot show its value concretely.
In view of this, the purpose of this study is to analyze and construct an integrated data model based on the operational model of the case organization. The study therefore designs the integrated data model and its mode of operation in phases. In the first and second phases, the case organization's processes are modeled as system data flows and business processes, and the points where information flow breaks down, together with their causes, are identified from both the business and the system perspective and described concretely. In the third phase, the study selects the integration approach best suited to this case, namely data-level integration, and designs an integrated data model that connects the information flows end to end to support the enterprise's decision needs. In the final phase, ETL is used to explain how the integrated system operates and to describe the problems the case organization may encounter when using ETL, together with preliminary solutions. / Owing to rapid advances in technology, enterprises have changed significantly along with that progress. They not only face rapidly changing requirements but also have to respond quickly to the external environment. Enterprise information system integration therefore receives more and more attention: enterprises hope to move their systems from function-oriented to process-oriented, because effectively integrating and standardizing information allows them to link quickly with the external environment and to enhance overall operating performance.
However, there are many kinds of integration approaches, and at present there is no definitive conclusion on how to integrate all the systems in a business efficiently. The main reason is that different integration cases have different integration requirements; the value of system integration therefore cannot be shown concretely through theoretical discussion alone.
This study aims to model the operations of a case company and to analyze and construct an integrated data model. The research has four phases. In phases 1 and 2, the study builds data flow models and business process models for the case and discovers information gaps. In phase 3, the study selects the most suitable method for this case, namely the data-level integration approach, to design the integrated data model. Finally, in phase 4, ETL is used to illustrate the operation of the integrated system and to describe the problems the case may encounter, together with initial solutions.
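As a rough illustration of what data-level integration via ETL into an ODS can look like, the sketch below conforms customer records from two hypothetical source systems to one schema and loads them into a single ODS table. The source layouts, mapping rules, and table names are invented for the example and are not taken from the case company.

```python
# Illustrative sketch of data-level integration into an ODS: records from two
# hypothetical source systems are conformed to one schema and merged on a
# common key. All layouts, field names, and mapping rules are invented.

import sqlite3

ORDER_SYSTEM_ROWS = [  # hypothetical order-management extract
    {"cust_no": "C001", "cust_name": "Acme Ltd", "region_code": "N"},
]
CRM_SYSTEM_ROWS = [    # hypothetical CRM extract with a different layout
    {"customer_id": 1, "name": "Acme Limited", "area": "North"},
]

def conform_order_row(row):
    """Map the order system's layout onto the ODS customer schema."""
    return {"customer_key": row["cust_no"],
            "customer_name": row["cust_name"].strip().upper(),
            "region": {"N": "NORTH", "S": "SOUTH"}.get(row["region_code"], "UNKNOWN")}

def conform_crm_row(row):
    """Map the CRM layout onto the same ODS customer schema."""
    return {"customer_key": "C{:03d}".format(row["customer_id"]),
            "customer_name": row["name"].strip().upper(),
            "region": row["area"].strip().upper()}

def load_ods(conformed_rows):
    """Load conformed rows into an ODS table (in-memory SQLite for the demo)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ods_customer ("
                 "customer_key TEXT PRIMARY KEY, customer_name TEXT, region TEXT)")
    for row in conformed_rows:
        # The shared conformed key merges records describing the same customer.
        conn.execute("INSERT OR REPLACE INTO ods_customer VALUES (?, ?, ?)",
                     (row["customer_key"], row["customer_name"], row["region"]))
    return conn

if __name__ == "__main__":
    rows = ([conform_order_row(r) for r in ORDER_SYSTEM_ROWS] +
            [conform_crm_row(r) for r in CRM_SYSTEM_ROWS])
    conn = load_ods(rows)
    print(conn.execute("SELECT * FROM ods_customer").fetchall())
```

The essential step is conforming both layouts to one target schema; once the sources agree on keys and codes, the ODS can serve as the integrated data layer that downstream decision processes read from.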
6. Návrh metodiky testování BI řešení / Design of methodology for BI solutions testing. Jakubičková, Nela. January 2011.
This thesis deals with Business Intelligence and its testing. It seeks to highlight the differences from classical software testing and, finally, to design a methodology for testing BI solutions that could be used in practice on real projects of BI companies. The aim of the thesis is to design a methodology for BI solutions testing based on theoretical knowledge of Business Intelligence and software testing, with an emphasis on the specific characteristics and requirements of BI and in accordance with Clever Decision's requirements, and to test it in practice on a real project in this company. The paper is based on a study of Czech and foreign literature in the fields of Business Intelligence and software testing, as well as on the recommendations and experience of Clever Decision's employees. It is one of the few sources, if not the first, dealing with a methodology for BI solutions testing in the Czech language, and it could also serve as a basis for more comprehensive BI testing methodologies. The thesis can be divided into a theoretical and a practical part. The theoretical part explains the purpose of using Business Intelligence in enterprises and the particular components of a BI solution, and then covers software testing itself and the various types of tests, with emphasis on the differences and specificities of Business Intelligence. The theoretical part is followed by the designed methodology for BI solutions testing, which uses a generic model for BI/DW solution testing. The highlight of the practical part is the description of testing a real BI project at Clever Decision according to the designed methodology.
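For instance, one concern specific to testing BI solutions, checking that loaded data reconciles with its source rather than only exercising code paths, can be illustrated with a minimal check like the one below; the table names, columns, and tolerance are invented for the example and are not taken from the designed methodology.

```python
# Minimal illustration of a data-reconciliation test of the kind a BI testing
# methodology typically prescribes: compare the row count and a control sum
# between a source (staging) table and the target table loaded by ETL.
# The table and column names here are invented examples.

import sqlite3

def reconcile(conn, source_table, target_table, amount_column):
    """Return True if row counts and control sums of `amount_column` match."""
    query = "SELECT COUNT(*), COALESCE(SUM({col}), 0) FROM {tab}"
    src_count, src_sum = conn.execute(
        query.format(col=amount_column, tab=source_table)).fetchone()
    tgt_count, tgt_sum = conn.execute(
        query.format(col=amount_column, tab=target_table)).fetchone()
    return src_count == tgt_count and abs(src_sum - tgt_sum) < 1e-6

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_sales (amount REAL)")
    conn.execute("CREATE TABLE dw_fact_sales (amount REAL)")
    conn.executemany("INSERT INTO stg_sales VALUES (?)", [(10.0,), (5.5,)])
    conn.executemany("INSERT INTO dw_fact_sales VALUES (?)", [(10.0,), (5.5,)])
    print("reconciliation passed:",
          reconcile(conn, "stg_sales", "dw_fact_sales", "amount"))
```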
7. Cardinality estimation in ETL processes. Lehner, Wolfgang; Thiele, Maik; Kiefer, Tim. 22 April 2022.
Cardinality estimation in ETL processes is particularly difficult. Aside from the well-known SQL operators, which are also used in ETL processes, there is a variety of operators without exact counterparts in the relational world. In addition, there are operators that support very specific data integration aspects. For such operators, no well-examined statistical approaches to cardinality estimation exist. We therefore propose a black-box approach and estimate the cardinality using a set of statistical models for each operator. We discuss different model granularities and develop an adaptive cardinality estimation framework for ETL processes. We map the abstract model operators to specific statistical learning approaches (regression, decision trees, support vector machines, etc.) and evaluate our cardinality estimates in an extensive experimental study.
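The black-box idea, keeping per-operator observations of input and output sizes and fitting a statistical model that predicts future output cardinalities, can be sketched as follows. The operator, the training observations, and the plain least-squares line are assumptions made for the illustration; the paper itself maps operators to richer model types such as regression, decision trees, and support vector machines.

```python
# Illustrative sketch of black-box, per-operator cardinality estimation:
# each ETL operator keeps observations (input cardinality, output cardinality)
# from past runs and fits a simple model to predict future output sizes.
# The operator, sample data, and least-squares line are invented for the demo.

class OperatorCardinalityModel:
    """One statistical model per (black-box) ETL operator."""

    def __init__(self):
        self.samples = []  # list of (input_cardinality, output_cardinality)

    def observe(self, n_in: float, n_out: float) -> None:
        self.samples.append((n_in, n_out))

    def fit_line(self):
        """Closed-form least squares for n_out ~ a * n_in + b."""
        n = len(self.samples)
        sx = sum(x for x, _ in self.samples)
        sy = sum(y for _, y in self.samples)
        sxx = sum(x * x for x, _ in self.samples)
        sxy = sum(x * y for x, y in self.samples)
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        b = (sy - a * sx) / n
        return a, b

    def estimate(self, n_in: float) -> float:
        """Predict the output cardinality for a given input cardinality."""
        a, b = self.fit_line()
        return max(0.0, a * n_in + b)

if __name__ == "__main__":
    # Hypothetical history of a deduplication-like operator's past executions.
    dedup = OperatorCardinalityModel()
    for observed_in, observed_out in [(1000, 910), (5000, 4420), (20000, 17300)]:
        dedup.observe(observed_in, observed_out)
    print(round(dedup.estimate(10000)))  # estimated output rows for 10,000 inputs
```

Treating every operator through the same observe/estimate interface is what makes the approach black-box: no operator-specific formula is needed, only the operator's observed input/output behavior.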