1 |
Aproximativní datové profilování / Aproximative data profiling. Kolek, Lukáš. January 2021.
Data profiling is the process of analyzing data and producing statistical summaries as output. As data sizes grow rapidly, it becomes harder to process all the data in a reasonable time: when a dataset no longer fits in RAM, exact single-pass algorithms cannot be run without resorting to slower storage. This diploma thesis focuses on the implementation, comparison, and selection of suitable algorithms for profiling large input data. Approximate algorithms make it possible to bound the memory used for the computation, keep the whole process in RAM, and thereby shorten the profiling run. The resulting tool computes frequency analysis, cardinality, quantiles, histograms, and other single-column statistics in a short time with a relative error below one percent.
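To illustrate the kind of bounded-memory, single-pass estimator such a tool can rely on, the sketch below implements a k-minimum-values (KMV) cardinality estimator in Python. It is a generic example, not the thesis implementation, and the sample values are invented.

```python
import hashlib
import heapq

def kmv_cardinality(values, k=1024):
    """Estimate the number of distinct values in one pass, keeping only
    the k smallest normalized hash values in memory."""
    heap = []      # max-heap (negated values) holding the k smallest hashes seen
    kept = set()   # hashes currently in the heap, to skip repeated values
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        x = h / 2**64                      # normalize the hash into [0, 1)
        if x in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -x)
            kept.add(x)
        elif x < -heap[0]:                 # smaller than the current k-th smallest
            evicted = -heapq.heappushpop(heap, -x)
            kept.discard(evicted)
            kept.add(x)
    if len(heap) < k:                      # fewer than k distinct values: exact count
        return len(heap)
    return int((k - 1) / -heap[0])         # standard KMV cardinality estimate

if __name__ == "__main__":
    data = (f"user-{i % 5000}" for i in range(100_000))  # 5,000 distinct values
    print(kmv_cardinality(data))           # prints an estimate close to 5000
```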
|
2 |
Information Extraction from data. Sottovia, Paolo. 22 October 2019.
Data analysis is the process of inspecting, cleaning, extracting, and modeling data with the intention of extracting useful information to support users in their decisions. With the advent of Big Data, data analysis has become more complicated due to the volume and variety of the data. The process begins with the acquisition of the data and the selection of the data that is useful for the desired analysis. With such amounts of data, even expert users are unable to inspect the data and judge whether a dataset is suitable for their purposes. In this dissertation, we focus on five problems in the broad data analysis process that help users find insights in data they do not know well. First, we analyze the data description problem, where the user is looking for a description of the input dataset. We introduce data descriptions: a compact, readable, and insightful formula of boolean predicates that represents a set of data records. Finding the best description for a dataset is computationally expensive and task-specific; we therefore introduce a set of metrics and heuristics for generating meaningful descriptions at interactive speed. Secondly, we look at the problem of order dependency discovery, which yields another kind of metadata that can help the user understand the characteristics of a dataset. Our approach leverages the observation that the discovery of order dependencies can be guided by the discovery of a more specific form of dependencies called order compatibility dependencies. Thirdly, textual data encodes much hidden information. To allow this data to reach its full potential, there has been increasing interest in extracting structural information from it. In this regard, we propose a novel approach for extracting events that is based on temporal co-reference among entities. We consider an event to be a set of entities that collectively experience relationships between them in a specific period of time. We developed a distributed strategy that is able to scale to the largest on-line encyclopedia available, Wikipedia. Then, we deal with the evolving nature of data by focusing on the problem of finding synonymous attributes in evolving Wikipedia infoboxes. Over time, several attributes have been used to indicate the same characteristic of an entity, which causes problems when analyzing the content of different time periods. To solve this, we propose a clustering strategy that combines two contrasting distance metrics. We developed an approximate solution that we assess over 13 years of Wikipedia history, demonstrating its flexibility and accuracy. Finally, we tackle the problem of identifying movements of attributes in evolving datasets. In an evolving environment, entities not only change their characteristics but sometimes exchange them over time. We propose a strategy that discovers such cases and test it on real datasets. We formally present the five problems and validate them both with theoretical results and with experimental evaluation, demonstrating that the proposed approaches scale efficiently to large amounts of data.
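To make the data description idea concrete, here is a minimal greedy sketch, assuming records are flat dictionaries and predicates are simple attribute = value tests; the metrics and heuristics in the dissertation are more elaborate, and the example table is invented.

```python
from collections import Counter
from typing import Dict, List, Tuple

Record = Dict[str, str]
Predicate = Tuple[str, str]  # (attribute, value), meaning attribute == value

def matches(record: Record, pred: Predicate) -> bool:
    attr, value = pred
    return record.get(attr) == value

def greedy_description(targets: List[Record], others: List[Record],
                       max_preds: int = 5) -> List[Predicate]:
    """Greedily pick attribute=value predicates whose disjunction covers
    the target records while matching none of the other records."""
    uncovered = list(targets)
    description: List[Predicate] = []
    while uncovered and len(description) < max_preds:
        # Candidate predicates: every attribute=value pair seen in uncovered targets.
        candidates = Counter((a, v) for r in uncovered for a, v in r.items())
        # Keep only predicates that match no record outside the target set.
        safe = [p for p in candidates if not any(matches(o, p) for o in others)]
        if not safe:
            break  # no exact description possible with single-attribute predicates
        best = max(safe, key=lambda p: candidates[p])  # covers most uncovered targets
        description.append(best)
        uncovered = [r for r in uncovered if not matches(r, best)]
    return description

if __name__ == "__main__":
    # Invented example: describe the "premium" customers within a small table.
    table = [
        {"country": "DE", "plan": "premium", "channel": "web"},
        {"country": "FR", "plan": "premium", "channel": "shop"},
        {"country": "FR", "plan": "basic", "channel": "web"},
        {"country": "IT", "plan": "basic", "channel": "shop"},
    ]
    targets = [r for r in table if r["plan"] == "premium"]
    others = [r for r in table if r["plan"] != "premium"]
    print(greedy_description(targets, others))  # [('plan', 'premium')]
```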
|
3 |
Advancing the discovery of unique column combinations. Abedjan, Ziawasch; Naumann, Felix. January 2011.
Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known GORDIAN algorithm and Apriori-based algorithms are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution, HCA-GORDIAN, combines the advantages of GORDIAN and our new algorithm HCA, and it significantly outperforms all previous work in many situations. / Unique-Spaltenkombinationen sind Spaltenkombinationen einer Datenbanktabelle, die nur einzigartige Werte beinhalten. Das Finden von Unique-Spaltenkombinationen spielt sowohl eine wichtige Rolle im Bereich der Grundlagenforschung von Informationssystemen als auch in Anwendungsgebieten wie dem Datenmanagement und der Erkenntnisgewinnung aus Datenbeständen. Vorhandene Algorithmen, die dieses Problem angehen, sind entweder Brute-Force oder benötigen zu viel Hauptspeicher. Deshalb können diese Algorithmen nur auf kleine Datenmengen angewendet werden. In dieser Arbeit werden der bekannte GORDIAN-Algorithmus und Apriori-basierte Algorithmen zum Zwecke weiterer Optimierung analysiert. Wir verbessern die Apriori Algorithmen durch eine effiziente Kandidatengenerierung und Heuristikbasierten Kandidatenfilter. Eine Hybride Lösung, HCA-GORDIAN, kombiniert die Vorteile von GORDIAN und unserem neuen Algorithmus HCA, welche die bisherigen Algorithmen hinsichtlich der Effizienz in vielen Situationen übertrifft.
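For intuition, a minimal level-wise sketch of unique column combination discovery is shown below; it tests projections for duplicates and prunes supersets of combinations already found to be unique. It is only an illustration of the general Apriori-style idea on an invented table, not the HCA or GORDIAN algorithm.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

def minimal_uniques(rows: List[Dict[str, str]],
                    columns: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return all minimal unique column combinations of a small table.

    Level-wise search: test all combinations of size 1, then 2, ...,
    skipping any candidate that contains an already-found unique
    combination (such candidates cannot be minimal)."""
    uniques: List[Tuple[str, ...]] = []
    for k in range(1, len(columns) + 1):
        for combo in combinations(columns, k):
            if any(set(u) <= set(combo) for u in uniques):
                continue  # pruned: a subset is already unique
            projection = [tuple(r[c] for c in combo) for r in rows]
            if len(set(projection)) == len(rows):  # no duplicate projected rows
                uniques.append(combo)
    return uniques

if __name__ == "__main__":
    # Invented example table.
    rows = [
        {"id": "1", "first": "Ada", "last": "Lovelace", "city": "London"},
        {"id": "2", "first": "Alan", "last": "Turing", "city": "London"},
        {"id": "3", "first": "Ada", "last": "Byron", "city": "Oxford"},
    ]
    print(minimal_uniques(rows, ["id", "first", "last", "city"]))
    # [('id',), ('last',), ('first', 'city')]
```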
|
4 |
Metodika auditu datové kvality / Data Quality Audit Methodology. Kotek, Aleš. January 2008.
The goal of this thesis is to summarize and describe the available know-how and experience of Adastra employees related to Data Quality Audits in organizations. The thesis as a whole should serve as a guideline for sales and implementation staff within Adastra Corp. The first part (chapters 2 and 3) is concerned with data quality in general: it provides various definitions of data quality, points out the importance and relevance of data quality in an organization, and describes the most important tools and data quality management solutions. The second part (chapters 4 and 5) builds on the theoretical basis of the previous chapters and forms the main methodological part of the thesis. Chapter 4 focuses on the business and sales side, defines the most important terms and principles used, and is a necessary precondition for correctly understanding the following chapter. Chapter 5 presents detailed procedures of the Data Quality Audit; individual activities are written in a standardized form to ensure clear, accurate, and brief step descriptions. The result of this thesis is the most detailed description of the Data Quality Audit at Adastra Corp., including all identified services and products.
|
5 |
Úloha profilace v řízení datové kvality / The role of data profiling in data quality management. Fišer, David. January 2014.
The goal of this thesis is to research the role of data profiling in data quality management and to assess the quality of existing data profiling software tools. The role of data profiling was established through an analysis of general methodologies for managing data quality. I compiled my own methodology for the analysis of data profiling tools, focusing on three main aspects: technical requirements, user friendliness, and functionality. Based on this methodology, the best software solution for data profiling was chosen, and the general shortcomings of the tested tools were also analysed.
|
6 |
Creating a Customizable Component Based ETL Solution for the Consumer / Skapandet av en anpassningsbar komponentbaserad ETL-lösning för konsumenten. Retelius, Philip; Bergström Persson, Eddie. January 2021.
In today's society, an enormous amount of data is created and stored in various databases. Since the data is in many cases spread across different databases, organizations with a lot of data want to be able to merge the separated data and extract value from this resource. Extract, Transform and Load (ETL) systems have made it possible to merge different databases with ease. However, the ETL market has been dominated by large actors such as Amazon and Microsoft, and the solutions offered are completely owned by these actors, which leaves the consumer with little ownership of the solution. Therefore, this thesis proposes a framework for creating a component-based ETL that gives consumers an opportunity to own and develop their own ETL solution, which they can customize to their own needs. The result of the thesis is a prototype ETL solution built to be configurable and customizable; it accomplishes this by staying independent of inflexible external libraries and by a level of modularity that makes adding and removing components easy. The results of this thesis are verified with a test that shows how two different files containing data can be combined. / I dagens samhälle skapas det en enorm mängd data som är lagrad i olika databaser. Eftersom data i många fall är lagrat i olika databaser, finns det en efterfrågan från organisationer med mycket data att kunna slå ihop separerad data och få en utvinning av denna resurs. Extract, Transform and Load System (ETL) är en lösning som gjort det möjligt att slå ihop olika databaser. Dock är problemet denna expansion av ETL teknologi. ETL marknaden blivit ägd av stora aktörer såsom Amazon och Microsoft och de lösningar som erbjuds är helt ägda av dem. Detta lämnar konsumenten med lite ägodel av lösningen. Därför föreslår detta examensarbete ett ramverk för att skapa ett komponentbaserat ETL verktyg som ger konsumenter en möjlighet att utveckla en egen ETL lösning som de kan skräddarsy efter deras egna förfogande. Resultatet av examensarbete är en prototyp ETL-lösning som är byggd för att kunna konfigurera och skräddarsy prototypen. Lösningen lyckas med detta genom att vara oberoende av oflexibla externa bibliotek och en nivå av modularitet som gör addering och borttagning av komponenter enkelt. Resultatet av detta examensarbete är verifierat av ett test som visar på hur två olika filer med innehållande data kan kombineras.
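As a rough illustration of the component idea (the verification test above combines two data files), here is a minimal sketch of a component-based pipeline in Python; the component interfaces, file names, and join key are assumptions for the example, not the thesis prototype.

```python
import csv
from typing import Dict, Iterable, List

Row = Dict[str, str]

class CsvExtractor:
    """Extract component: reads rows from a CSV file."""
    def __init__(self, path: str):
        self.path = path
    def extract(self) -> List[Row]:
        with open(self.path, newline="") as f:
            return list(csv.DictReader(f))

class JoinTransformer:
    """Transform component: merges two row sets on a shared key column."""
    def __init__(self, key: str):
        self.key = key
    def transform(self, left: Iterable[Row], right: Iterable[Row]) -> List[Row]:
        index = {r[self.key]: r for r in right}
        return [{**row, **index[row[self.key]]}
                for row in left if row[self.key] in index]

class CsvLoader:
    """Load component: writes rows to a CSV file."""
    def __init__(self, path: str):
        self.path = path
    def load(self, rows: List[Row]) -> None:
        if not rows:
            return
        with open(self.path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

if __name__ == "__main__":
    # Hypothetical files: customers.csv and orders.csv both contain a "customer_id" column.
    customers = CsvExtractor("customers.csv").extract()
    orders = CsvExtractor("orders.csv").extract()
    merged = JoinTransformer("customer_id").transform(orders, customers)
    CsvLoader("merged.csv").load(merged)
```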
|
7 |
Efficient and exact computation of inclusion dependencies for data integration. Bauckmann, Jana; Leser, Ulf; Naumann, Felix. January 2010.
Data obtained from foreign data sources often come with only superficial structural information, such as relation names and attribute names. Other types of metadata that are important for effective integration and meaningful querying of such data sets are missing. In particular, relationships among attributes, such as foreign keys, are crucial metadata for understanding the structure of an unknown database. The discovery of such relationships is difficult, because in principle for each pair of attributes in the database each pair of data values must be compared.
A precondition for a foreign key is an inclusion dependency (IND) between the key and the foreign key attributes. We present Spider, an algorithm that efficiently finds all INDs in a given relational database. It leverages the sorting facilities of the DBMS but performs the actual comparisons outside of the database to save computation. Spider analyzes very large databases up to an order of magnitude faster than previous approaches. We also evaluate in detail the effectiveness of several heuristics that reduce the number of necessary comparisons. Furthermore, we generalize Spider to find composite INDs covering multiple attributes, as well as partial INDs, which hold for all but a certain number of values. This last type is particularly relevant when integrating dirty data, as is often the case in the life sciences domain, which is our driving motivation.
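For intuition, a naive unary IND check can be written directly against value sets, as in the sketch below; Spider itself avoids materializing these sets by sorting the attributes' values and checking all inclusions in a single merge pass, so this is only a baseline illustration with an invented example.

```python
from typing import Dict, List, Sequence, Tuple

def unary_inds(columns: Dict[str, Sequence[str]]) -> List[Tuple[str, str]]:
    """Return all pairs (A, B) with A != B such that the values of A are
    contained in the values of B.

    Naive baseline: materializes each column as a set and tests every
    ordered pair of attributes."""
    value_sets = {name: set(values) for name, values in columns.items()}
    return [(a, b)
            for a in value_sets for b in value_sets
            if a != b and value_sets[a] <= value_sets[b]]

if __name__ == "__main__":
    # Invented columns: order.customer_id should be included in customer.id.
    columns = {
        "customer.id": ["1", "2", "3", "4"],
        "order.customer_id": ["1", "3", "3", "4"],
        "order.id": ["10", "11", "12"],
    }
    print(unary_inds(columns))  # [('order.customer_id', 'customer.id')]
```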
|
8 |
中文資訊擷取結果之錯誤偵測 / Error Detection on Chinese Information Extraction Results. 鄭雍瑋 (Cheng, Yung-Wei). Unknown Date.
資訊擷取是從自然語言文本中辨識出特定的主題或事件的描述,進而萃取出相關主題或事件元素中的對應資訊,再將其擷取之結果彙整至資料庫中,便能將自然語言文件轉換成結構化的核心資訊。然而資訊擷取技術的結果會有錯誤情況發生,若單只依靠人工檢查及更正錯誤的方式進行,將會是耗費大量人力及時間的工作。
在本研究論文中,我們提出字串圖形結構與字串特徵值兩種錯誤資料偵測方法。前者是透過圖形結構比對各資料內字元及字元間關聯,接著由公式計算出每筆資料的比對分數,藉由分數高低可判斷是否為錯誤資料;後者則是利用字串特徵值,來描述字串外表特徵,再透過SVM和C4.5機器學習分類方法歸納出決策樹,進而分類正確與錯誤二元資料。而此兩種偵測方法的差異在於前者隱含了圖學理論之節點位置與鄰點概念,直接比對原始字串內容;後者則是將原始字串轉換成特徵數值,進行分類等動作。
在實驗方面,我們以「總統府人事任免公報」之資訊擷取成果資料庫作為測試資料。實驗結果顯示,本研究所提出的錯誤偵測方法可以有效偵測出不合格的值組,不但能節省驗證資料所花費的成本,甚至可確保高資料品質的資訊擷取成果產出,促使資訊擷取技術更廣泛的實際應用。 / Given a targeted subject and a text collection, information extraction techniques provide the capability to populate a database in which each record entry is a subject instance documented in the text collection. However, even with the state-of-the-art IE techniques, IE task results are expected to contain errors. Manual error detection and correction are labor intensive and time consuming. This validation cost remains a major obstacle to actual deployment of practical IE applications with high validity requirement.
In this paper, we propose a string-graph-structure method and a string-feature-based method. The former uses a graph structure to compare the characters in each record and the relations between them; a score is then computed for each record by a formula and used to estimate whether the record is correct. The latter describes each string by a set of surface features and then applies C4.5 decision tree and SVM classifiers to label the data as valid or invalid. Both detection methods capture characteristics of the data and use them to verify its correctness. The difference between the two is that the former works directly on the raw string content, drawing on the graph-theoretic notions of node position and neighboring nodes, whereas the latter first transforms the raw string into feature values and then classifies those.
In our experiments, we use the IE results of government personnel directives as test data. The experiments verify that invalid IE values can be detected effectively using the string-graph-structure and string-feature-based methods. The contribution of our work is to reduce validation cost and to improve the quality of IE results, and we provide both analytical and empirical evidence that the usability of IE results can be effectively enhanced.
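A minimal sketch of the feature-based variant, assuming invented surface features and a tiny synthetic training set; the thesis uses C4.5 and SVM classifiers, whereas this example uses scikit-learn's CART-style decision tree as a stand-in.

```python
from sklearn.tree import DecisionTreeClassifier

def string_features(s: str):
    """Invented surface features describing the shape of an extracted value."""
    return [
        len(s),                                        # overall length
        sum(c.isdigit() for c in s) / max(len(s), 1),  # digit ratio
        sum(c.isspace() for c in s),                   # whitespace count
        int(any(c in ",;:|" for c in s)),              # contains separator punctuation
    ]

# Tiny synthetic training set: 1 = valid extracted value, 0 = invalid.
train_values = ["Director General", "Section Chief", "2008-05-01",
                "||;;", "Chief 99281 ;;", ""]
train_labels = [1, 1, 1, 0, 0, 0]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit([string_features(v) for v in train_values], train_labels)

# Classify new extraction results as valid or invalid.
for value in ["Deputy Minister", ";;|| 12"]:
    label = clf.predict([string_features(value)])[0]
    print(value, "->", "valid" if label else "invalid")
```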
|
9 |
Kvalita dat a efektivní využití rejstříků státní správy / Data Quality and Effective Use of Registers of State Administration. Rut, Lukáš. January 2009.
This diploma thesis deals with registers of state administration in terms of data quality. The main objective is to analyze ways of evaluating data quality and to apply an appropriate method to the data in the business register. Another objective is to analyze the possibilities of data cleansing and data quality improvement and to propose a solution to the inaccuracies found in the business register. The last goal of this paper is to analyze approaches to assigning person identifiers and to choose a suitable key for identifying persons in registers of state administration. The thesis is divided into several parts. The first one introduces the sphere of registers of state administration and closely analyzes several selected registers, especially in terms of which data they contain and how they are updated. A description of the legislative changes that come into effect in the middle of 2010, with special attention to their impact on data quality, is a major contribution of this part. The next part deals with identifiers of legal entities and natural persons and presents a possible solution for identifying entities in register data. The third part analyzes ways to determine data quality: the data profiling method is described in detail and applied to an extensive data quality analysis of the business register, with correct metadata and information about incorrect data as the outputs of this analysis. The last chapter deals with possibilities for solving data quality problems; three solution variants are proposed and compared. As a whole, the paper represents compact material on how to make effective use of the data contained in registers of state administration; nevertheless, the proposed solutions and described approaches can also be used in many other projects dealing with data quality.
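To make the data profiling step concrete, a basic single-column profile can be computed as in the sketch below; the chosen statistics (fill rate, distinct count, most frequent values, length range) are generic examples, not the exact profile used in the thesis, and the sample column is invented.

```python
from collections import Counter
from typing import Dict, List, Optional

def profile_column(values: List[Optional[str]]) -> Dict[str, object]:
    """Compute basic profiling statistics for one column."""
    non_null = [v for v in values if v not in (None, "")]
    lengths = [len(v) for v in non_null]
    freq = Counter(non_null)
    return {
        "rows": len(values),
        "fill_rate": len(non_null) / len(values) if values else 0.0,
        "distinct": len(freq),
        "top_values": freq.most_common(3),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
    }

if __name__ == "__main__":
    # Invented sample of a company identifier column from a business register.
    ids = ["27074358", "45274649", "", None, "27074358", "4527464"]
    print(profile_column(ids))
```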
|
10 |
Metodologie a problémy při transformaci dat a určení jejího významu v rámci integrace heterogenních informačních zdrojů / Methodology and problems of data transformation and determine its importance in the integration of heterogeneous information sources. Bartoš, Ivan. January 2012.
Methodology and issues of data transformation and its information value estimation during the integration of heterogeneous information sources. PhDr. Ivan Bartoš. This study focuses mainly on the issue of data and information transformation, a topic that is currently critical in several scientific and commercial areas. Information value, information quality, and the quality of the source data differ between systems, not only because of the different topologies of the information sources but also because of the different ways in which the information describing the entities of an enterprise is understood and stored. Such information systems, database systems in the scope of this thesis, may perform well as stand-alone systems; the issue appears the moment such heterogeneous systems need to be integrated and information has to be migrated between them. The thesis is logically divided into four major parts based on these issues. The first part describes the methods that can be used to classify the data quality of the source system (the one to be integrated) from which the information can be extracted. Based on the assumption that project and system documentation are commonly lacking, the methods introduced here can be used for such qualification even when the...
|