  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
41

Data Editing and Logic: The covering set method from the perspective of logic

Boskovitz, Agnes, abvi@webone.com.au January 2008 (has links)
Errors in collections of data can cause significant problems when those data are used. Therefore the owners of data find themselves spending much time on data cleaning. This thesis is a theoretical work about one part of the broad subject of data cleaning - to be called the covering set method. More specifically, the covering set method deals with data records that have been assessed by the use of edits, which are rules that the data records are supposed to obey. The problem solved by the covering set method is the error localisation problem, which is the problem of determining the erroneous fields within data records that fail the edits. In this thesis I analyse the covering set method from the perspective of propositional logic. I demonstrate that the covering set method has strong parallels with well-known parts of propositional logic. The first aspect of the covering set method that I analyse is the edit generation function, which is the main function used in the covering set method. I demonstrate that the edit generation function can be formalised as a logical deduction function in propositional logic. I also demonstrate that the best-known edit generation function, written here as FH (standing for Fellegi-Holt), is essentially the same as propositional resolution deduction. Since there are many automated implementations of propositional resolution, the equivalence of FH with propositional resolution gives some hope that the covering set method might be implementable with automated logic tools. However, before any implementation, the other main aspect of the covering set method must also be formalised in terms of logic. This other aspect, to be called covering set correctibility, is the property that must be obeyed by the edit generation function if the covering set method is to successfully solve the error localisation problem. In this thesis I demonstrate that covering set correctibility is a strengthening of the well-known logical properties of soundness and refutation completeness. What is more, the proofs of the covering set correctibility of FH and of the soundness/completeness of resolution deduction have strong parallels: while the proof of soundness/completeness depends on the reduction property for counter-examples, the proof of covering set correctibility depends on the related lifting property. In this thesis I also use the lifting property to prove the covering set correctibility of the function defined by the Field Code Forest Algorithm. In so doing, I prove that the Field Code Forest Algorithm, whose correctness has been questioned, is indeed correct. The results about edit generation functions and covering set correctibility apply to both categorical edits (edits about discrete data) and arithmetic edits (edits expressible as linear inequalities). Thus this thesis gives the beginnings of a theoretical logical framework for error localisation, which might give new insights into the problem. In addition, these insights may help in developing new error-localisation tools based on automated logic. What is more, the strong parallels between the covering set method and aspects of logic are of aesthetic appeal.
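To make the stated parallel concrete, here is a minimal sketch (not taken from the thesis) that encodes edits as propositional clauses over signed integer literals and generates implied edits by resolution; the encoding, the example edits and the function names are illustrative assumptions, not the thesis's own formalism.

    # Minimal sketch: edits as propositional clauses (frozensets of signed ints),
    # with resolution generating implied edits, illustrating the parallel between
    # Fellegi-Holt edit generation and propositional resolution deduction.
    from itertools import combinations

    def resolve(c1, c2):
        """Return all resolvents of two clauses on complementary literals."""
        return [frozenset((c1 - {lit}) | (c2 - {-lit}))
                for lit in c1 if -lit in c2]

    def saturate(clauses):
        """Add resolvents until a fixed point is reached (the implied edits)."""
        clauses = set(clauses)
        while True:
            new = {r for c1, c2 in combinations(clauses, 2) for r in resolve(c1, c2)}
            if new <= clauses:
                return clauses
            clauses |= new

    # Literal 1 might stand for "age < 15" and literal 2 for "status = married":
    # the edit "a record cannot be both under 15 and married" is the clause {-1, -2}.
    edits = [frozenset({-1, -2}), frozenset({2, 3})]
    print(saturate(edits))   # also derives the implied edit {-1, 3}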
42

Enhancements of pre-processing, analysis and presentation techniques in web log mining / Žiniatinklio įrašų gavybos paruošimo, analizės ir rezultatų pateikimo naudotojui tobulinimas

Pabarškaitė, Židrina 13 July 2009 (has links)
As the Internet becomes an important part of our lives, more attention is paid to the quality of information and to how it is presented to the user. The research area of this work is web data analysis and methods for processing these data. The required knowledge is extracted from web servers' log files, in which the navigational patterns of all users are recorded. The research object of the dissertation is the web log mining process; the related topics are web log data preparation methods, data mining algorithms for prediction and classification tasks, and web text mining. The key target of the thesis is to understand the behaviour of website users by studying web logs and to improve the preparation, analysis and result-presentation steps of knowledge discovery, revealing new opportunities to the data analyst. While performing web log analysis, it was discovered that insufficient attention had been paid to the web log data cleaning process. By reducing the number of redundant records, the data mining process becomes more effective and faster; therefore a new, original cleaning framework was introduced which keeps only the records that correspond to real user clicks. People tend to understand technical information better when it resembles human language, so it is advantageous to mine web log data with decision trees, which generate web usage patterns in the form of rules understandable to humans. However, users' browsing histories have different lengths, therefore specific data preparation (forming fixed-length vectors) is needed before decision tree algorithms, not previously applied in practice to this task, can be used... [to full text]
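The cleaning framework itself is not reproduced in the abstract; purely as a rough illustration, the sketch below assumes common heuristics for keeping "real user clicks" (dropping embedded resources, failed requests and robot traffic). The thesis's actual rules are more elaborate and are not shown here.

    # Rough sketch of web log cleaning with assumed heuristics, not the thesis's framework.
    import re

    RESOURCE_EXT = re.compile(r"\.(gif|jpe?g|png|css|js|ico)$", re.IGNORECASE)
    ROBOT_HINTS = ("bot", "crawler", "spider")

    def is_real_click(record):
        """record: dict with 'url', 'status' and 'agent' fields parsed from a log line."""
        if RESOURCE_EXT.search(record["url"].split("?")[0]):
            return False                      # embedded image/style/script, not a click
        if not 200 <= record["status"] < 300:
            return False                      # failed request
        if any(h in record["agent"].lower() for h in ROBOT_HINTS):
            return False                      # robot traffic
        return True

    log = [
        {"url": "/index.html", "status": 200, "agent": "Mozilla/5.0"},
        {"url": "/logo.png",   "status": 200, "agent": "Mozilla/5.0"},
        {"url": "/page2.html", "status": 404, "agent": "Mozilla/5.0"},
    ]
    clicks = [r for r in log if is_real_click(r)]   # keeps only /index.html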
43

Statistical decisions in optimising grain yield

Norng, Sorn January 2004 (has links)
This thesis concerns Precision Agriculture (PA) technology, which involves methods developed to optimise grain yield by examining data quality and modelling the protein/yield relationship of wheat and sorghum fields in central and southern Queensland. An important part of developing strategies to optimise grain yield is an understanding of PA technology. This covers the major aspects of PA, which include all the components of a Site-Specific Crop Management System (SSCM): 1. spatial referencing, 2. crop, soil and climate monitoring, 3. attribute mapping, 4. decision support systems and 5. differential action. Understanding how all five components fit into PA significantly aids the development of data analysis methods. The development of PA depends on the collection, analysis and interpretation of information. A preliminary data analysis step is described which covers both non-spatial and spatial data analysis methods. The non-spatial analysis involves plotting methods (maps, histograms), standard distributions and statistical summaries (mean, standard deviation). The spatial analysis covers both undirected and directional variogram analyses. In addition to the data analysis, a theoretical investigation into GPS error is given. GPS plays a major role in the development of PA. A number of sources of error affect GPS and therefore affect the positioning measurements. An understanding of the distribution of the errors and of how they are related to each other over time is therefore needed to complement the understanding of the nature of the data. Understanding the error distribution and the data gives useful insights for model assumptions regarding position measurement errors. A review of filtering methods is given and new methods are developed, namely strip analysis and a double harvesting algorithm. These methods are designed specifically for controlled traffic and normal traffic respectively, but can be applied to all kinds of yield monitoring data. The data resulting from the strip analysis and the double harvesting algorithm are used to investigate the relationship between on-the-go yield and protein. The strategy is to use protein and yield in determining decisions with respect to nitrogen management. The agronomic assumption, based on plot trials, is that protein and yield have a significant relationship. We investigate whether there is any significant relationship between protein and yield at the local level to warrant this kind of assumption. Understanding PA technology and being aware of the sources of error that exist in data collection and data analysis are very important steps in developing management decision strategies.
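As a rough illustration of the undirected variogram analysis mentioned above (not the thesis's code or data), the following sketch computes an empirical semivariogram with numpy; the lag bins and the synthetic yield points are assumptions.

    # Sketch of an undirected empirical semivariogram; data and bins are synthetic.
    import numpy as np

    def empirical_variogram(coords, values, lags, tol):
        """gamma(h) = 0.5 * mean((z_i - z_j)^2) over pairs whose distance is within tol of h."""
        d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
        sq = (values[:, None] - values[None, :]) ** 2
        iu = np.triu_indices(len(values), k=1)          # each pair of points once
        d, sq = d[iu], sq[iu]
        gamma = []
        for h in lags:
            mask = np.abs(d - h) < tol
            gamma.append(0.5 * sq[mask].mean() if mask.any() else np.nan)
        return np.array(gamma)

    rng = np.random.default_rng(0)
    coords = rng.uniform(0, 100, size=(200, 2))         # harvester positions (m), synthetic
    yields = rng.normal(5.0, 1.0, size=200)             # yield (t/ha), synthetic
    print(empirical_variogram(coords, yields, lags=np.arange(5, 50, 5), tol=2.5))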
44

Amélioration de la qualité des données : correction sémantique des anomalies inter-colonnes / Improved data quality : correction of semantic inter-column anomalies

Zaidi, Houda 01 February 2017 (has links)
Data quality is a major challenge within an organization: it strongly influences the quality of its services and its profitability, and the cost of anomalies can be very high, especially for large databases in enterprises that must exchange information between systems and integrate large amounts of data. Decision making based on erroneous data harms the activities of organizations, and as the quantity of data keeps increasing, so do the risks of anomalies. The automatic correction of these anomalies is therefore becoming more important both in business and in the academic world. This thesis addresses the problem of improving data quality in large data sets. We propose an approach that helps the user better understand the semantics and the structure of the data and define the actions to carry out on them. We consider anomalies within a single column as well as anomalies between columns related to functional dependencies, and we propose several ways to correct these defects automatically, handling null values and semantic dependencies between columns, while paying attention to the performance of the resulting processing.
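For illustration only (the thesis's correction method is richer and handles semantic dependencies and null values in its own way), the sketch below assumes a repair-by-majority strategy for violations of a functional dependency X -> Y.

    # Sketch: repair values of column y that violate the functional dependency x -> y,
    # using the majority y value observed for each x value. Illustrative only.
    from collections import Counter

    def repair_fd(rows, x, y):
        majority = {}
        for xv in {r[x] for r in rows}:
            ys = [r[y] for r in rows if r[x] == xv and r[y] is not None]
            if ys:
                majority[xv] = Counter(ys).most_common(1)[0][0]
        for r in rows:
            if r[x] in majority:
                r[y] = majority[r[x]]       # also fills nulls when a majority exists
        return rows

    rows = [{"zip": "75001", "city": "Paris"},
            {"zip": "75001", "city": "Paris"},
            {"zip": "75001", "city": "Lyon"},   # violates zip -> city, corrected to Paris
            {"zip": "69001", "city": None}]     # no non-null witness, left as-is
    repair_fd(rows, "zip", "city")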
45

Ensemble Stream Model for Data-Cleaning in Sensor Networks

Iyer, Vasanth 16 October 2013 (has links)
Ensemble stream modeling and data cleaning are sensor information processing systems with different training and testing methods by which their goals are cross-validated. This research examines a mechanism that seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events so as to eliminate uncorrelated noise and choose the most likely model without overfitting, thus obtaining higher model confidence. Higher-quality streams can be realized by combining many short streams into an ensemble that has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction for a bush or natural forest-fire event, we take the burnt area (BA*), the sensed ground truth obtained from logs, as our target variable. Even though this is an obvious model choice, the results are disappointing, for two reasons: first, the histogram of fire activity is highly skewed; second, the measured sensor parameters are highly correlated. Since using non-descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory and conceptual knowledge is learned from the sensor streams. Second is the process of feature induction by cross-validating attributes with single or multi-target variables to minimize the training error. We use the F-measure score, which combines precision and recall, to determine the false-alarm rate of fire events. The multi-target data-cleaning trees use the information purity of the target leaf nodes to learn higher-order features. A sensitive variance measure such as the F-test is performed at each node's split to select the best attribute. The ensemble stream model approach proved to improve when using complicated features with a simpler tree classifier. The ensemble framework for data cleaning and the enhancements to quantify the quality of fitness of sensors (30% spatial, 10% temporal, and 90% mobility reduction) led to the formation of streams for sensor-enabled applications, which further motivates the novelty of stream quality labeling and its importance in handling the vast amounts of real-time mobile streams generated today.
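As a small illustration of the F-measure used above (toy labels, not the sensor data), the score combines the precision and recall of the flagged fire events.

    # Sketch: F-measure of flagged fire events against ground truth; toy values only.
    def f_measure(y_true, y_pred, beta=1.0):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    y_true = [1, 1, 0, 0, 1, 0]   # ground-truth fire events (e.g. from burnt-area logs)
    y_pred = [1, 0, 0, 1, 1, 0]   # events flagged by the ensemble
    print(f_measure(y_true, y_pred))   # 0.666...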
46

Data-based Explanations of Random Forest using Machine Unlearning

Tanmay Laxman Surve (17537112) 03 December 2023 (has links)
Tree-based machine learning models, such as decision trees and random forests, are among the most widely used machine learning models, primarily because of their predictive power in supervised learning tasks and their ease of interpretation. Despite their popularity and power, these models have been found to produce unexpected or discriminatory behavior. Given their overwhelming success on most tasks, it is of interest to identify the root causes of this unexpected and discriminatory behavior. However, there has not been much work on understanding and debugging tree-based classifiers in the context of fairness. We introduce FairDebugger, a system that utilizes recent advances in machine unlearning research to determine the training data subsets responsible for model unfairness. Given a tree-based model learned on a training dataset, FairDebugger identifies the top-k training data subsets responsible for model unfairness, or bias, by measuring the change in model parameters when parts of the underlying training data are removed. We describe the architecture of FairDebugger and walk through real-world use cases to demonstrate how FairDebugger detects these patterns and their explanations.
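The abstract does not give FairDebugger's internals; the sketch below only illustrates the underlying idea (score candidate training subsets by how much removing them changes a simple fairness gap) and uses full retraining where the real system relies on machine unlearning. The function names, metric and data shapes are assumptions.

    # Sketch of the leave-subset-out idea; FairDebugger itself uses machine
    # unlearning rather than retraining from scratch. X, y and sensitive_test are
    # numpy arrays; labels and predictions are assumed to be binary 0/1.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def parity_gap(model, X, sensitive):
        """Demographic parity difference of the model's predictions."""
        pred = model.predict(X)
        return abs(pred[sensitive == 1].mean() - pred[sensitive == 0].mean())

    def subset_influence(X, y, X_test, sensitive_test, subsets):
        """Return, per candidate subset, the drop in fairness gap when it is removed."""
        base = RandomForestClassifier(random_state=0).fit(X, y)
        base_gap = parity_gap(base, X_test, sensitive_test)
        scores = []
        for idx in subsets:                            # idx: row indices of one subset
            keep = np.setdiff1d(np.arange(len(y)), idx)
            m = RandomForestClassifier(random_state=0).fit(X[keep], y[keep])
            scores.append(base_gap - parity_gap(m, X_test, sensitive_test))
        return scores                                  # larger = subset contributes more bias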
47

Travel time estimation in congested urban networks using point detectors data

Mahmoud, Anas Mohammad 02 May 2009 (has links)
A model for estimating travel time on short arterial links of congested urban networks, using currently available technology, is introduced in this thesis. The objective is to estimate travel time with an acceptable level of accuracy for real-life traffic problems such as congestion management and emergency evacuation. To achieve this research objective, various travel time estimation methods, including highway trajectories, multiple linear regression (MLR), artificial neural networks (ANN) and K-nearest neighbor (K-NN), were applied and tested on the same dataset. The results demonstrate that the ANN and K-NN methods outperform the linear methods by a significant margin and show particularly good performance in detecting congested intervals. To ensure the quality of the analysis results, a set of procedures and algorithms based on traffic flow theory and test-field information was introduced to validate and clean the data used to build, train and test the different models.
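As a rough sketch of the K-NN idea compared above (the feature set and the toy numbers are assumptions, not the thesis's detector data), travel time is estimated as the average over the k most similar historical detector readings.

    # Sketch: K-NN travel-time estimation from point-detector features.
    import numpy as np

    def knn_travel_time(history_X, history_t, query, k=3):
        """history_X: (n, d) detector features; history_t: (n,) measured link travel times."""
        d = np.linalg.norm(history_X - query, axis=1)   # features should be standardized first
        nearest = np.argsort(d)[:k]
        return history_t[nearest].mean()

    # assumed features per interval: [flow (veh/h), occupancy, speed (km/h)]
    history_X = np.array([[900, 0.12, 45], [1400, 0.35, 20], [1350, 0.33, 22], [600, 0.08, 50]])
    history_t = np.array([62.0, 140.0, 133.0, 58.0])    # seconds over the link
    print(knn_travel_time(history_X, history_t, np.array([1380, 0.34, 21])))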
48

工商及服務業普查資料品質之研究 / Data quality research of industry and commerce census

邱詠翔 Unknown Date (has links)
Data quality affects the quality of decisions and the results of the actions based on them, so it has received increasing attention in recent years. This study uses two databases: an industrial innovation survey database and the Year-95 (2006) Industry and Commerce Census database. Data quality is an important issue for any database: databases often contain erroneous records, which bias analysis results, so data cleaning and consolidation are necessary before any analysis. The population and sample distributions show that, before cleaning and consolidation, the average number of employees is 92.08 in the innovation survey and 135.54 in the census data; after cleaning and consolidation, a comparison of the correlation, similarity and distance of the employee counts in the two databases shows that they are highly consistent, with averages of 39.01 and 42.12 respectively, much closer to the population average of 7.05 employees. This also demonstrates the importance of data cleaning. The method used in this study is post-stratified sampling, and the main objective is to use the industrial innovation survey sample to assess the accuracy of the Year-95 (2006) Industry and Commerce Census population data. The innovation survey sample overestimates both the number of employees and the operating revenue of the census population; the presumed reason is that the frame of the innovation survey is the directory of the five thousand largest enterprises published by China Credit Information Service, whereas the census frame covers general enterprises. We therefore validate with the census sample that corresponds to the innovation survey sample and find that the two samples are highly consistent.
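As a small illustration of the post-stratified estimator used in the study (the strata and all numbers below are invented for the example, not the census figures), sample means per stratum are weighted by known population stratum sizes.

    # Sketch of a post-stratified estimate of mean employees per enterprise; toy numbers.
    def post_stratified_mean(sample, pop_sizes):
        """sample: {stratum: sampled values}; pop_sizes: {stratum: population size N_h}."""
        N = sum(pop_sizes.values())
        return sum(pop_sizes[h] / N * (sum(v) / len(v)) for h, v in sample.items())

    sample = {"manufacturing": [120, 80, 95], "services": [12, 20, 9, 15]}
    pop_sizes = {"manufacturing": 30_000, "services": 170_000}
    print(post_stratified_mean(sample, pop_sizes))   # weighted toward the large 'services' stratum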
49

Predikce hodnot v čase / Prediction of Values on a Time Line

Maršová, Eliška January 2016 (has links)
This work deals with the prediction of numerical series, which is applicable to the prediction of stock prices. It explains procedures for analysing and working with price charts, and it also explains machine learning methods. This knowledge is used to build a program that finds patterns in numerical series for estimation.
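The abstract does not detail the program; purely as an illustration of pattern-based estimation in a numerical series (a generic technique, not necessarily the one implemented in the thesis), the most similar past window can be used to predict the next value.

    # Sketch: predict the next value from the most similar past window; generic illustration.
    import numpy as np

    def predict_next(series, window=5):
        series = np.asarray(series, dtype=float)
        target = series[-window:]
        best_d, best_next = np.inf, series[-1]
        for i in range(len(series) - window - 1):
            d = np.linalg.norm(series[i:i + window] - target)
            if d < best_d:
                best_d, best_next = d, series[i + window]
        return best_next

    prices = [10, 11, 12, 11, 10, 11, 12, 13, 12, 11, 10, 11, 12, 13]
    print(predict_next(prices))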
50

Analýza dat síťové komunikace mobilních zařízení / Analysis of Mobile Devices Network Communication Data

Abraham, Lukáš January 2020 (has links)
At the beginning, the work describes the DNS and SSL/TLS protocols, mainly dealing with communication between devices using these protocols. It then covers data preprocessing and data cleaning. Furthermore, the thesis deals with basic data mining techniques such as classification, association rules, information retrieval, regression analysis and cluster analysis. The next chapter describes how to identify mobile devices on the network. We then evaluate the data sets containing data collected from communication over the above-mentioned protocols, which are used in the practical part. After that, we present the design of a system for analyzing network communication data, describe the libraries used and the entire system implementation, and perform a large number of experiments, which we evaluate at the end.
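For illustration only, a tiny example of the classification step (the feature names, values and labels are assumptions, not the thesis's dataset or implementation):

    # Sketch: classify device type from simple per-device traffic features; toy data.
    from sklearn.tree import DecisionTreeClassifier

    # assumed features: [distinct DNS domains queried per hour, TLS handshakes per hour]
    X = [[40, 15], [220, 90], [35, 10], [250, 120], [30, 8]]
    y = ["mobile", "desktop", "mobile", "desktop", "mobile"]

    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(clf.predict([[45, 12]]))   # -> likely 'mobile' under this toy model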
