Global ETD Search

1	Scaling Big Data Cleansing Khayyat, Zuhair 31 July 2017 Data cleansing approaches have usually focused on detecting and fixing errors with little attention to big data scaling. This presents a serious impediment since identifying and repairing dirty data often involves processing huge input datasets, handling sophisticated error discovery approaches and managing huge arbitrary errors. With large datasets, error detection becomes overly expensive and complicated especially when considering user-defined functions. Furthermore, a distinctive algorithm is desired to optimize inequality joins in sophisticated error discovery rather than naïvely parallelizing them. Also, when repairing large errors, their skewed distribution may obstruct effective error repairs. In this dissertation, I present solutions to overcome the above three problems in scaling data cleansing. First, I present BigDansing as a general system to tackle efficiency, scalability, and ease-of-use issues in data cleansing for Big Data. It automatically parallelizes the user’s code on top of general-purpose distributed platforms. Its programming interface allows users to express data quality rules independently from the requirements of parallel and distributed environments. Without sacrificing their quality, BigDansing also enables parallel execution of serial repair algorithms by exploiting the graph representation of discovered errors. The experimental results show that BigDansing outperforms existing baselines up to more than two orders of magnitude. Although BigDansing scales cleansing jobs, it still lacks the ability to handle sophisticated error discovery requiring inequality joins. Therefore, I developed IEJoin as an algorithm for fast inequality joins. It is based on sorted arrays and space efficient bit-arrays to reduce the problem’s search space. By comparing IEJoin against well-known optimizations, I show that it is more scalable, and several orders of magnitude faster. 
BigDansing depends on vertex-centric graph systems, i.e., Pregel, to efficiently store and process discovered errors. Although Pregel scales general-purpose graph computations, it is not able to handle skewed workloads efficiently. Therefore, I introduce Mizan, a Pregel system that balances the workload transparently during runtime to adapt for changes in computing needs. Mizan is general; it does not assume any a priori knowledge of the graph structure or the algorithm behavior. Through extensive evaluations, I show that Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning. data cleansing inequality join graph processing big data pregel spark
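The abstract above describes IEJoin only as relying on sorted arrays and bit-arrays. As a rough illustration of that general idea, the following sketch finds all pairs that satisfy two inequality conditions at once by sorting on one attribute and marking visited tuples in a bit array laid out in the order of the other attribute. It is a simplified sketch, not the dissertation's algorithm: the function name `inequality_self_join`, the `(x, y)` tuple layout and the sample rows are assumptions made for this example.

```python
from bisect import bisect_right
from itertools import groupby


def inequality_self_join(rows):
    """Return index pairs (i, j) with rows[i].x < rows[j].x and rows[i].y > rows[j].y.

    rows is a list of (x, y) tuples. This is an illustrative simplification,
    not the IEJoin algorithm from the dissertation.
    """
    n = len(rows)

    # Sorted array over the second attribute, plus each tuple's position in it.
    by_y = sorted(range(n), key=lambda i: rows[i][1])
    y_sorted = [rows[i][1] for i in by_y]
    pos_in_y = [0] * n
    for pos, i in enumerate(by_y):
        pos_in_y[i] = pos

    # Bit array: seen[pos] is True once the tuple at that y-position
    # has a strictly smaller x than the tuple currently being probed.
    seen = [False] * n

    by_x = sorted(range(n), key=lambda i: rows[i][0])
    result = []
    # Process tuples in ascending x; equal x values form one group so that
    # the strict inequality on x is respected.
    for _, group in groupby(by_x, key=lambda i: rows[i][0]):
        group = list(group)
        for j in group:
            # Only positions holding a strictly larger y can match.
            start = bisect_right(y_sorted, rows[j][1])
            for pos in range(start, n):
                if seen[pos]:
                    result.append((by_y[pos], j))
        for j in group:
            seen[pos_in_y[j]] = True
    return result


# Hypothetical sample data: (x, y) pairs.
sample = [(100, 5), (200, 3), (150, 4), (200, 6)]
pairs = inequality_self_join(sample)
```

The sorted order on `x` guarantees the first condition for every marked bit, so only the second condition has to be checked, and the bit-array scan is restricted to the suffix of positions with a larger `y`.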
2	TREATMENT OF DATA WITH MISSING ELEMENTS IN PROCESS MODELLING RAPUR, NIHARIKA 02 September 2003 No description available. Engineering, Industrial missing data data complete incomplete data cleansing
3	Developing a data quality scorecard that measures data quality in a data warehouse Grillo, Aderibigbe January 2018 The main purpose of this thesis is to develop a data quality scorecard (DQS) that aligns the data quality needs of the data warehouse (DW) stakeholder group with selected data quality dimensions. To understand the research domain, a general and a systematic literature review (SLR) were carried out, after which the research scope was established. Using Design Science Research (DSR) as the methodology to structure the research, three iterations were carried out to achieve the research aim. In the first iteration, with DSR used as a paradigm, the artefact was built from the results of the general and systematic literature reviews, and a data quality scorecard (DQS) was conceptualised. The results of the SLR and the recommendations for designing an effective scorecard provided the input for the development of the DQS. The System Usability Scale (SUS) was used to validate the usability of the DQS, and the results of the first iteration suggest that the DW stakeholders found the DQS useful. The second iteration further evaluated the DQS through a run-through in the FMCG domain followed by semi-structured interviews. The thematic analysis of the interviews showed that the stakeholder participants found the DQS to be transparent, a useful additional reporting tool, integrated, easy to use and consistent, and that it increased their confidence in the data. However, the timeliness data dimension was found to be redundant, necessitating a modification to the DQS. The third iteration followed similar steps to the second, but with the modified DQS in the oil and gas domain. The results from the third iteration suggest that the DQS is a useful tool that is easy to use on a daily basis. 
The research contributes to theory by demonstrating a novel approach to DQS design. This was achieved by ensuring that the design of the DQS aligns with the data quality concern areas of the DW stakeholders and with the data quality dimensions. Further, this research lays a good foundation for the future by establishing a DQS model that can be used as a base for further development.
4	Datová kvalita, integrita a konsolidace dat v BI / Data Quality, Data intagrity and Data Consolidation in BI Smolík, Ondřej January 2008 This thesis deals with data quality in business intelligence. We present basic principles for building a data warehouse so as to achieve the highest data quality. We also present some data cleansing methods, such as deviation detection and name-and-address cleansing. The work also covers the origin of erroneous data and how to prevent it from being generated. In the second part of the thesis we demonstrate the presented methods and principles on a real data warehouse and suggest how to obtain sales data from business partners or customers.
5	An Analysis of Data Cleaning Tools : A comparative analysis of the performance and effectiveness of data cleaning tools Stenegren, Filip January 2023 In a world teeming with data, faulty or inconsistent data is inevitable, and data cleansing, a process that purges such discrepancies, becomes crucial. The purpose of the study is to determine the criteria by which data cleaning tools can be compared and evaluated, and to carry out a comparative analysis of two data cleansing tools, one developed for this study and the other provided for it. The result of the analysis should answer the question of which of the tools is superior and in what regard. 
The resulting criteria for comparison are execution time, RAM (Random Access Memory) and CPU (Central Processing Unit) usage, scalability and user experience. In systematic testing and evaluation, the developed tool outperformed the provided tool on efficiency criteria such as execution time and scalability, and it also has a slight edge in resource consumption. However, because the provided tool offers a GUI (Graphical User Interface), there is no definitive answer as to which tool is superior, since user experience and needs can outweigh any technical prowess. Thus, the conclusion as to which tool is superior may vary, depending on the specific needs of the user. Data Cleansing Data Cleaning Python Excel Regular-Expression Software Engineering
6	Housing Price Prediction over Countrywide Data : A comparison of XGBoost and Random Forest regressor models Henriksson, Erik, Werlinder, Kristopher January 2021 The aim of this research project is to investigate how an XGBoost regressor compares to a Random Forest regressor in predicting housing prices, using two data sets. The comparison considers training time, inference time and the three evaluation metrics R2, RMSE and MAPE. The data sets are described in detail, together with background on the regressor models used. The method involves substantial cleaning of the two data sets, hyperparameter tuning to find optimal parameters, and 5-fold cross-validation in order to obtain good performance estimates. The finding of this research project is that XGBoost performs better on both small and large data sets. While the Random Forest model can achieve results similar to those of the XGBoost model, it needs a much longer training time, between 2 and 50 times as long, and has a longer inference time, around 40 times as long. This makes XGBoost especially superior when used on larger data sets. Random Forest XGBoost predicting housing prices feature engineering ensemble learning boosting data cleansing 5-fold cross-validation Computer Sciences
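For readers unfamiliar with the evaluation setup named in this abstract, the three metrics and the 5-fold split can be sketched with their standard textbook definitions. This is a minimal sketch, not the authors' code: the function names, the unshuffled contiguous folds and the sample prices are assumptions made for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k contiguous, unshuffled folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Hypothetical prices, not data from the thesis.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
scores = (rmse(actual, predicted), mape(actual, predicted), r2_score(actual, predicted))
folds = list(k_fold_indices(10, k=5))
```

In a cross-validated comparison, each model would be trained on every `train` split and scored on the matching `test` split, and the per-fold scores averaged.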
7	Describing differences between overlapping databases Müller, Heiko 12 August 2009 The analysis of existing data is an important part of modern research, and data quality is therefore gaining importance in scientific research. Existing data cleansing methods, however, are only of limited use for scientific data. This is due, on the one hand, to the higher complexity of the data and, on the other, to our often still incomplete knowledge of the regularities in the respective domains. This thesis makes the following contributions to data quality and data cleansing of scientific data. In the first part, we give an overview of existing data cleansing methods and discuss their strengths and weaknesses. From our results we conclude that overlapping data sources have great potential for improving the correctness and accuracy of scientific data. Overlapping data sources reveal areas of potentially lower data quality in the form of (data) conflicts and at the same time offer an opportunity to improve quality through data integration. An important prerequisite for integrating overlapping data sources is the resolution of existing conflicts. In many cases the conflicts do not occur at random but follow a systematic cause. In the second part, we develop algorithms that support the discovery of systematic conflicts. We classify conflicts by characteristic patterns in the overlapping data. These contradiction patterns support an expert in defining conflict resolution strategies for data integration. In the third part, we use a process-oriented model to describe systematic conflicts in order to reveal dependencies between groups of conflicts. 
For this purpose we use sequences of set-oriented modification operations that transform one data source into the other. We present algorithms for determining minimal modification sequences for a given pair of data sources. The complexity of the problem necessitates the use of heuristics. In our experiments we show the promising quality of the results of our heuristics. / Data quality has become an issue in scientific research. Cleaning scientific data, however, is hampered by incomplete or fuzzy knowledge of regularities in the examined domain. A common approach to enhance the overall quality of scientific data is to merge overlapping sources by eliminating conflicts that exist between them. The main objective of this thesis is to provide methods to aid the developer of an integrated system over contradicting databases in the task of resolving value conflicts. We contribute by developing a set of algorithms to identify regularities in overlapping databases that occur in conjunction with conflicts between them. These regularities highlight systematic differences between the databases. Evaluated by an expert user, the discovered regularities provide insights on possible conflict reasons and help assess the quality of inconsistent values. Instead of inspecting individual conflicts, the expert user is now enabled to specify a conflict resolution strategy based on known groups of conflicts that share the same conflict reason. The thesis has three main parts. Part I gives a comprehensive review of existing data cleansing methods. We show why existing data cleansing techniques fall short for the domain of genome data and argue that merging overlapping data has outstanding ability to increase data accuracy, a quality criterion ignored by most of the existing cleansing approaches. Part II introduces the concept of contradiction patterns. 
We present a model for systematic conflicts and describe algorithms for efficiently detecting patterns that summarize characteristic data properties for conflict occurrence. These patterns help in providing answers to questions like “Which are the conflict-causing attributes, or values?” and “What kind of dependencies exist between the occurrences of contradictions in different attributes?”. In Part III, we define a model for systematic conflicts based on sequences of set-oriented update operations. Even though we only consider a restricted form of updates, our algorithms for computing minimal update sequences for pairs of databases require exponential space and time. We show that the problem is NP-hard for a restricted set of operations. However, we also present heuristics that lead to convincing results in all examples we considered. Data cleansing Genome data Contradiction pattern Update distance 004 Computer science 28 Computer science, data processing ddc:004
8	Řízení kvality dat v malých a středních firmách / Data quality management in small and medium enterprises Zelený, Pavel January 2010 This diploma thesis deals with data quality management. Many tools and methodologies support data quality management, including on the Czech market, but they are all aimed at large companies; small and medium-sized companies cannot afford them because of their high cost. The first goal of this thesis is to summarize the principles of these methodologies and, on that basis, to propose a simpler methodology for small and medium-sized companies. In the second part of the thesis, the methodology is developed and adapted for a specific company. The first step is to choose the data area of interest in the company. Because buying a software tool for data cleaning is not an option, relatively simple rules are defined that serve as the basis for cleaning scripts written in SQL. The scripts are used for automatic data cleaning. A subsequent analysis determines which data should be cleaned manually. The next step describes recommendations for removing duplicates from the database, using functionality of the company's production system. The last step of the methodology is to create a control mechanism that maintains the required data quality in the future. The thesis concludes with a data survey of four data sources, all from companies using the same production system. The survey presents an overview of data quality and supports decisions about data cleaning in those companies as well.
9	Kvalita dat a efektivní využití rejstříků státní správy / Data Quality and Effective Use of Registers of State Administration Rut, Lukáš January 2009 This diploma thesis deals with registers of state administration in terms of data quality. The main objective is to analyze ways of evaluating data quality and to apply an appropriate method to data in the business register. A further objective is to analyze the options for data cleansing and data quality improvement and to propose a solution for the inaccuracies found in the business register. The last goal is to analyze approaches to defining personal identifiers and to choose a suitable key for identifying persons in registers of state administration. The thesis is divided into several parts. The first part introduces the field of registers of state administration. It closely analyzes several selected registers, especially in terms of which data they contain and how they are updated. A major contribution of this part is its description of the legislative changes due to come into force in the middle of 2010, with special attention to their impact on data quality. The next part deals with the problems of identifiers for legal and natural persons and presents a possible approach to identifying entities in register data. The third part analyzes ways of determining data quality. The method called data profiling is described in detail and applied in an extensive data quality analysis of the business register. The outputs of this analysis are correct metadata and information about incorrect data. The last chapter deals with options for solving data quality problems; three solution variants are proposed and compared. As a whole, the thesis offers a compact guide to solving problems with the effective use of data held in registers of state administration. 
Nevertheless, the proposed solutions and described approaches can be used in many other projects that deal with data quality.
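The abstract names data profiling as the method applied to the business register but does not describe its steps. As a generic illustration of what column profiling usually involves, the sketch below counts missing and distinct values and summarises the character patterns found in one column. It is a minimal sketch under stated assumptions, not the method used in the thesis: the function names and the sample identifier values are hypothetical.

```python
from collections import Counter


def value_pattern(value):
    """Generalise a value to a pattern: digits become 9, letters become A."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in str(value)
    )


def profile_column(values):
    """Summarise one column: size, missing values, distinct values, patterns."""
    non_empty = [v for v in values if v is not None and str(v).strip() != ""]
    patterns = Counter(value_pattern(v) for v in non_empty)
    return {
        "count": len(values),
        "missing": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "patterns": patterns.most_common(),  # most frequent pattern first
    }


# Hypothetical identifier column, not data from the thesis.
ids = ["12345678", "87654321", "1234-567", None, "12345678", ""]
report = profile_column(ids)
```

Rare patterns in such a report (here the single value containing a hyphen) are typical candidates for closer inspection, and the counts double as simple metadata about the column.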
10	Master Data Integration hub - řešení pro konsolidaci referenčních dat v podniku / Master Data Integration hub - solution for company-wide consolidation of referrential data Bartoš, Jan January 2011 In current information systems, great emphasis is placed on integrating disparate applications into a cohesive package. While well-established technologies for functional and communication integration (ESB, message brokers, web services) already exist, tools and methodologies for continuous integration of disparate data sources at the enterprise-wide level are still in development. Master Data Management (MDM) is a major approach in the area of data integration, and reference data management in particular. It encompasses reference data integration, data quality management and reference data consolidation, metadata management, master data ownership, the principle of accountability for master data, and processes related to reference data management. The thesis focuses on the technological aspects of an MDM implementation realized by introducing a centralized repository for master data, the Master Data Integration Hub (MDI Hub). The MDI Hub is an application that integrates and consolidates reference data stored in disparate systems and applications on the basis of predefined workflows. It also handles the propagation of master data back to source systems and provides services such as dictionary management and data quality monitoring. The objective of the thesis is to cover the design and implementation aspects of the MDI Hub, which forms the application part of MDM. The introduction discusses the motivation for reference data consolidation and lists the techniques used in developing an MDI Hub solution. The main part of the thesis proposes the design of an MDI Hub reference architecture and suggests the activities performed in the process of MDI Hub implementation. 
The thesis is based on information from specialized publications, on knowledge gathered by delivering projects with the companies Adastra and Ataccama, and on co-workers' know-how and experience. The most important contribution of the thesis is a comprehensive view of MDI Hub design and the proposed MDI Hub reference architecture, which can serve as a basis for a particular MDI Hub implementation.
