11 |
Building the Dresden Web Table Corpus: A Classification Approach
Lehner, Wolfgang, Eberius, Julian, Braunschweig, Katrin, Hentsch, Markus, Thiele, Maik, Ahmadov, Ahmad 12 January 2023
In recent years, researchers have recognized relational tables on the Web as an important source of information. To assist this research we developed the Dresden Web Table Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC), which contains 3.6 billion web pages and is 266 TB in size. As the vast majority of HTML tables are used for layout purposes and only a small share contains genuine tables with different surface forms, accurate table detection is essential for building a large-scale Web table corpus. Furthermore, correctly recognizing the table structure (e.g. horizontal listings, matrices) is important in order to understand the role of each table cell, distinguishing between label and data cells. In this paper, we present an extensive table layout classification that enables us to identify the main layout categories of Web tables with very high precision. To this end, we identify and develop a wide range of table features, and apply different feature selection techniques and several classification algorithms. We evaluate the effectiveness of the selected features and compare the performance of various state-of-the-art classification algorithms. Finally, the winning approach is employed to classify millions of tables, resulting in the Dresden Web Table Corpus (DWTC).
|
12 |
Towards a Hybrid Imputation Approach Using Web Tables
Lehner, Wolfgang, Ahmadov, Ahmad, Thiele, Maik, Eberius, Julian, Wrembel, Robert 12 January 2023
Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With new emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is additionally reinforced. While traditionally the process of filling in missing values is addressed by the data imputation community using statistical techniques, we complement these approaches by using external data sources from the data lake or even the Web to look up missing values. In this paper we propose a novel hybrid data imputation strategy that takes into account the characteristics of an incomplete dataset and, based on that, chooses the best imputation approach, i.e. either a statistical approach such as regression analysis, a Web-based lookup, or a combination of both. We formalize and implement both imputation approaches, including a Web table retrieval and matching system, and evaluate them extensively using a corpus with 125M Web tables. We show that applying statistical techniques in conjunction with external data sources leads to an imputation system that is robust, accurate, and has high coverage at the same time.
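The idea of combining a lookup with a statistical fallback can be illustrated with a small sketch. It is not the authors' system: the selection rule used here (prefer an external lookup, fall back to a one-variable linear regression) is a simplifying assumption, and the names `hybrid_impute`, `fit_line`, `web_lookup` and the toy city data are all hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def hybrid_impute(rows, key, x_col, y_col, lookup):
    """Fill missing y_col values: external lookup first, regression as fallback.

    Assumed rule for illustration only; the paper chooses the approach
    based on characteristics of the incomplete dataset.
    """
    complete = [r for r in rows if r[y_col] is not None]
    a, b = fit_line([r[x_col] for r in complete], [r[y_col] for r in complete])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not modified
        if r[y_col] is None:
            if r[key] in lookup:
                r[y_col], r["source"] = lookup[r[key]], "lookup"
            else:
                r[y_col], r["source"] = a + b * r[x_col], "regression"
        filled.append(r)
    return filled

# Hypothetical data: population is missing for cities C and D.
rows = [
    {"city": "A", "area": 1.0, "pop": 10.0},
    {"city": "B", "area": 2.0, "pop": 20.0},
    {"city": "C", "area": 3.0, "pop": None},
    {"city": "D", "area": 4.0, "pop": None},
]
web_lookup = {"C": 31.0}  # stands in for a value matched from a Web table
result = hybrid_impute(rows, "city", "area", "pop", web_lookup)
```

In this toy run, city C is filled from the lookup and city D from the regression line fitted on A and B, which shows how the two sources together can raise coverage.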
|
13 |
Activity Recogniton Using Accelerometer and Gyroscope Data From Pocket-Worn Smartphones
Söderberg, Oskar, Blommegård, Oscar January 2021
Human Activity Recognition (HAR) is a widely researched field that has gained importance due to recent advancements in sensor technology and machine learning. In HAR, sensors are used to identify the activity that a person is performing. In this project, the six everyday life activities walking, biking, sitting, standing, ascending stairs and descending stairs are classified using smartphone accelerometer and gyroscope data collected by three subjects in their everyday life. To perform the classification, two different machine learning algorithms, Artificial Neural Network (ANN) and Support Vector Machine (SVM), are implemented and compared. Moreover, we compare the accuracy of the two sensors, both individually and combined. Our results show that the accuracy is higher using only the accelerometer data compared to using only the gyroscope data. For the accelerometer data, the accuracy is greater than 95% for both algorithms, whereas it is only between 83% and 93% using gyroscope data. Also, there is a small synergy effect when using both sensors, yielding higher accuracy than for any individual sensor data, and reaching 98.5% using ANN. Furthermore, for all sensor types, the ANN outperforms the SVM algorithm, with an accuracy 1.5 to 9 percentage points higher. / Activity recognition is a closely studied research area that has recently grown in popularity owing to new advances in sensor technology and machine learning. In activity recognition, sensors are used to identify which activity a person is performing. In this project we examine the six everyday activities walking, cycling, sitting, standing and climbing stairs (up/down) using accelerometer and gyroscope data from a smartphone, collected by three different people. Two machine learning algorithms are implemented and compared: Artificial Neural Network (ANN) and Support Vector Machine (SVM). We also compare the accuracy of the two sensors, both individually and combined.
Our results show that the accuracy is higher when only the accelerometer data is used than when only the gyroscope data is used. For the accelerometer data, an accuracy above 95% is obtained for both algorithms, while the figure is only between 83% and 93% for the gyroscope data. In addition, there is a synergy effect when both sensors are used, and the accuracy then reaches 98.5% with ANN. Our results further show that ANN has an accuracy 1.5 to 9 percentage points better than SVM for all sensors. / Bachelor's degree project in electrical engineering 2021, KTH, Stockholm
|
14 |
A STUDY ON THE IMPACT OF PREPROCESSING STEPS ON MACHINE LEARNING MODEL FAIRNESS
Sathvika Kotha (18370548) 17 April 2024
The success of machine learning techniques in widespread applications has taught us that with respect to accuracy, the more data, the better the model. However, for fairness, data quality is perhaps more important than quantity. Existing studies have considered the impact of data preprocessing on the accuracy of ML model tasks. However, the impact of preprocessing on the fairness of the downstream model has neither been studied nor well understood. Throughout this thesis, we conduct a systematic study of how data quality issues and data preprocessing steps impact model fairness. Our study evaluates several preprocessing techniques for several machine learning models trained over datasets with different characteristics and evaluated using several fairness metrics. It examines different data preparation techniques, such as changing categories into numbers, filling in missing information, and smoothing out unusual data points. The study measures fairness using standards that check if the model treats all groups equally, predicts outcomes fairly, and gives similar chances to everyone. By testing these methods on various types of data, the thesis identifies which combinations of techniques can make the models both accurate and fair. The empirical analysis demonstrated that preprocessing steps like one-hot encoding, imputation of missing values, and outlier treatment significantly influence fairness metrics. Specifically, models preprocessed with median imputation and robust scaling exhibited the most balanced performance across fairness and accuracy metrics, suggesting a potential best practice guideline for equitable ML model preparation. Thus, this work sheds light on the importance of data preparation in ML and emphasizes the need for careful handling of data to support fair and ethical use of ML in society.
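To make the preprocessing steps and the fairness measurement named in this abstract concrete, here is a minimal sketch using only the Python standard library. It is not the thesis code: the data, the threshold "model" and the function names are hypothetical, and demographic parity difference is used as one plausible example of a group fairness metric.

```python
from statistics import median, quantiles

def median_impute(values):
    """Replace None with the median of the observed values."""
    m = median([v for v in values if v is not None])
    return [m if v is None else v for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4)
    return [(v - med) / (q3 - q1) for v in values]

def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = []
    for g in set(groups):
        preds = [p for p, gg in zip(predictions, groups) if gg == g]
        rates.append(sum(preds) / len(preds))
    return max(rates) - min(rates)

# Hypothetical feature with missing entries and a protected attribute.
income = [30, None, 50, 70, None, 90]
group = ["a", "a", "a", "b", "b", "b"]

imputed = median_impute(income)
scaled = robust_scale(imputed)
# Stand-in for a trained model: predict positive at or above the median.
predictions = [1 if v >= 0 else 0 for v in scaled]
gap = demographic_parity_difference(predictions, group)
```

Swapping `median_impute` for another imputation strategy and recomputing `gap` is the kind of comparison the study describes, here reduced to a single feature.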
|
15 |
Preprocessing and analysis of environmental data : Application to the water quality assessment of Mexican rivers / Pré-traitement et analyse des données environnementales : application à l'évaluation de la qualité de l'eau des rivières mexicaines
Serrano Balderas, Eva Carmina 31 January 2017
Data acquired during environmental monitoring may be subject to different types of anomalies (i.e., incomplete, inconsistent, inaccurate or outlying data). These anomalies, which impair the quality of environmental data, can have serious consequences for the interpretation of results and the assessment of ecosystems. The choice of data preprocessing methods is therefore crucial for the validity of the results of statistical analyses, yet it is rather poorly defined. To study this question, the thesis focused on data acquisition and on data preprocessing protocols in order to guarantee the validity of the results of data analysis, in particular with the aim of recommending the most suitable sequence of preprocessing tasks. We propose to control the entire data production process, from collection in the field to analysis; in the case of water quality assessment, this involves the chemical and hydrobiological analysis of samples, producing the data that were subsequently analyzed with a set of statistical and data mining methods.
In particular, the multidisciplinary contributions of the thesis are: (1) in water chemistry: a methodological procedure for determining the quantities of organochlorine pesticides in water samples collected in the field using SPE-GC-ECD (Solid Phase Extraction - Gas Chromatography - Electron Capture Detector) techniques; (2) in hydrobiology: a methodological procedure for assessing water quality in four Mexican rivers using biological indicators based on macroinvertebrates; (3) in data science: a method for assessing and guiding the choice of preprocessing procedures for the data produced in the two previous steps, as well as their analysis; and finally, (4) the development of an integrated analytical environment in the form of an application developed in R for the statistical analysis of environmental data in general and water quality analysis in particular. Finally, we applied our proposals to the specific case of water quality assessment of the Mexican rivers Tula, Tamazula, Humaya and Culiacan as part of this thesis, which was carried out partly in Mexico and partly in France. / Data obtained from environmental surveys may be prone to various anomalies (i.e., incomplete, inconsistent, inaccurate or outlying data). These anomalies affect the quality of environmental data and can have considerable consequences when assessing environmental ecosystems. Selection of data preprocessing procedures is crucial to validate the results of statistical analysis; however, such selection is poorly defined. To address this question, the thesis focused on data acquisition and data preprocessing protocols in order to ensure the validity of the results of data analysis and, in particular, to recommend the most suitable sequence of preprocessing tasks. We propose to control every step in the data production process, from collection in the field to analysis.
In the case of water quality assessment, this involves the steps of chemical and hydrobiological analysis of samples, producing data that were subsequently analyzed by a set of statistical and data mining methods. The multidisciplinary contributions of the thesis are: (1) in environmental chemistry: a methodological procedure to determine the content of organochlorine pesticides in water samples using the SPE-GC-ECD (Solid Phase Extraction - Gas Chromatography - Electron Capture Detector) techniques; (2) in hydrobiology: a methodological procedure to assess the quality of water in four Mexican rivers using macroinvertebrate-based biological indices; (3) in data science: a method to assess and guide the selection of preprocessing procedures for data produced in the two previous steps, as well as their analysis; and (4) the development of a fully integrated analytics environment in R for statistical analysis of environmental data in general, and for water quality data analytics in particular. Finally, within the context of this thesis, which was developed between Mexico and France, we have applied our methodological approaches to the specific case of water quality assessment of the Mexican rivers Tula, Tamazula, Humaya and Culiacan.
|
16 |
Metody pro získávání asociačních pravidel z dat / Methods for Mining Association Rules from Data
Uhlíř, Martin January 2007
The aim of this thesis is to implement the Multipass-Apriori method for mining association rules from text data. After an introduction to the field of knowledge discovery, the specific aspects of text mining are discussed. Preprocessing is a very important step in the mining process; in this case, stemming and a stop-word dictionary are necessary. The next part of the thesis deals with the meaning, usage and generation of association rules. The main part focuses on the description of the Multipass-Apriori method, which was implemented. Based on the tests carried out, the best way of dividing the data into partitions and the best way of sorting the itemsets were determined. As part of the testing, the Multipass-Apriori method was compared with the Apriori method.
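For orientation, the baseline that the thesis compares against can be sketched briefly. The block below is plain Apriori, not Multipass-Apriori (the partitioning and itemset ordering studied in the thesis are not reproduced), with a stop-word filter as the only preprocessing; stemming is omitted. The texts, the stop-word list and the function names are hypothetical.

```python
from itertools import combinations

STOP_WORDS = {"the", "of", "and"}  # hypothetical stop-word dictionary

def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def apriori(transactions, min_support):
    """Return {itemset: count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

texts = ["mining of the data rule", "data and mining", "the data rule", "text mining"]
frequent = apriori([tokenize(t) for t in texts], min_support=2)
```

With a minimum support of two documents, the pairs {data, mining} and {data, rule} are frequent, while {mining, rule} is not, so the three-item candidate is pruned without being counted.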
|
17 |
Pattern analysis of the user behaviour in a mobile application using unsupervised machine learning / Mönsteranalys av användarbeteenden i en mobilapp med hjälp av oövervakad maskininlärning
Hrstic, Dusan Viktor January 2019
The continuously increasing amount of logged data increases the possibility of making new discoveries about user interaction with the application for which the data is logged. Traces in the data may reveal specific user behavioural patterns, which can show how to improve the development of the application by revealing the ways in which it is used. In this thesis, unsupervised machine learning techniques are used to group users according to their use of the SEB Privat Android mobile application. The user interactions in the application are first extracted, then various data preprocessing techniques are applied to prepare the data for clustering, and finally two clustering algorithms, HDBSCAN and K-medoids, are used to cluster the data. Three types of user behaviour were found with both the K-medoids and the HDBSCAN algorithm: users who tend to interact more with the application and navigate through its deeper layers, users who only quickly check their account balance or transactions, and regular users. Among the resulting features chosen with the help of feature selection methods, 73% are related to user behaviour. The findings can be used by developers to improve the user interface and overall functionality of the application. The user flow can thus be optimized according to the patterns in which users tend to navigate through the application. / An ever-growing amount of data increases the possibility of making new discoveries about the use of a mobile application for which data is logged. Traces in the data can reveal specific user behaviours that can improve the application's development by indicating how the application is used. In this thesis, unsupervised machine learning techniques are used to group users according to their use of the SEB Privat Android mobile application.
The user interactions in the application are first extracted, then various data processing techniques are used to prepare the data for clustering, and finally two clustering algorithms, HDBSCAN and K-medoids, are run to group the data. Three distinct types of user behaviour were found with both the K-medoids and the HDBSCAN algorithm. There are users who tend to interact more with the application and navigate through its deeper layers, users who only quickly check their account balance or transactions, and finally regular users. Among the resulting attributes chosen with the help of attribute selection techniques, 73% are related to user behaviour. The findings of this thesis can be used by developers to improve the user interface and overall functionality of the application. The user flow can thus be optimized with regard to the ways in which users tend to navigate through the application.
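As a rough illustration of one of the two clustering algorithms mentioned, the following sketch implements a simple alternating K-medoids (assign each point to its nearest medoid, then re-pick each cluster's medoid). It is an assumption-laden toy: the thesis's actual features, distance measure and initialization are not given in the abstract, so the two-dimensional "sessions per week, screen depth" points and the function name `k_medoids` are hypothetical.

```python
from math import dist

def k_medoids(points, medoids, max_iter=100):
    """Alternating K-medoids on distinct points; returns {medoid: members}."""
    medoids = list(medoids)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: dist(p, m))
            clusters[nearest].append(p)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, q) for q in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids
    return clusters

# Hypothetical users described by (sessions per week, average screen depth).
users = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
clusters = k_medoids(users, medoids=[(1, 1), (2, 1)])
```

Even with a poor starting choice of two nearby medoids, the toy data settles into a light-usage group and a heavy-usage group after two passes; unlike K-means, each cluster centre is always an actual user.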
|
18 |
Assessing Viability of Open-Source Battery Cycling Data for Use in Data-Driven Battery Degradation Models
Ritesh Gautam (17582694) 08 December 2023
Lithium-ion batteries are being used increasingly often to provide power for systems that range all the way from common cell phones and laptops to advanced electric automotive and aircraft vehicles. However, as is the case for all battery types, lithium-ion batteries are prone to naturally occurring degradation phenomena that limit their effective use in these systems to a finite amount of time. This degradation is caused by a plethora of variables and conditions, including environmental conditions, physical stress/strain on the body of the battery cell, and charge/discharge parameters and cycling. Accurately and reliably being able to predict this degradation behavior in battery systems is crucial for any party looking to implement and use battery-powered systems. However, due to the complicated non-linear multivariable processes that affect battery degradation, this can be difficult to achieve. Compared to traditional methods of battery degradation prediction and modeling, like equivalent circuit models and physics-based electrochemical models, data-driven machine learning tools have been shown to be able to handle predicting and classifying the complex nature of battery degradation without requiring any prior knowledge of the physical systems they are describing.
One of the most critical steps in developing these data-driven neural network algorithms is data procurement and preprocessing. Without large amounts of high-quality data, no matter how advanced and accurate the architecture, the neural network prediction tool will not be as effective as one trained on vast quantities of high-quality data. This work aims to gather battery degradation data from a wide variety of sources and studies, examine how the data was produced, test the effectiveness of the data in the Interfacial Multiphysics Laboratory’s autoencoder-based neural network tool CD-Net, and analyze the results to determine factors that make battery degradation datasets perform better for use in machine learning/deep learning tools. This work also aims to relate these findings to other data-driven models by comparing the CD-Net model’s performance with the publicly available ElasticNet model from BEEP (Battery Evaluation and Early Prediction). The reported accuracy and prediction models from the CD-Net and ElasticNet tools demonstrate that larger datasets with actively selected training/testing designations and fewer errors in the data produce much higher quality neural networks that are much more reliable in estimating the state-of-health of lithium-ion battery systems. The results also demonstrate that data-driven models are much less effective when trained using data from multiple different cell chemistries, form factors, and cycling conditions compared to more congruent datasets when attempting to create a generalized prediction model applicable to multiple forms of battery cells and applications.
|
19 |
Bullying Detection through Graph Machine Learning : Applying Neo4j’s Unsupervised Graph Learning Techniques to the Friends Dataset
Enström, Olof, Eid, Christoffer January 2023
In recent years, the pervasive issue of bullying, particularly in academic institutions, has witnessed a surge in attention. This report centers around the utilization of the Friends Dataset and Graph Machine Learning to detect possible instances of bullying in an educational setting. The importance of this research lies in the potential it has to enhance early detection and prevention mechanisms, thereby creating safer environments for students. Leveraging graph theory, Neo4j, Graph Data Science Library, and similarity algorithms, among other tools and methods, we devised an approach for processing and analyzing the dataset. Our method involves data preprocessing, application of similarity and community detection algorithms, and result validation with domain experts. The findings of our research indicate that Graph Machine Learning can be effectively utilized to identify potential bullying scenarios, with a particular focus on discerning community structures and their influence on bullying. Our results, albeit preliminary, represent a promising step towards leveraging technology for bullying detection and prevention.
|
20 |
API data gathering and structuring for machine learning and human use : Optimizing API data for both financial machine learning and being easy to read and use by the end user / API data insamling och strukturering för maskininlärning och människa : Optimisterna API data för både finansiell maskininlärning och enkelt att läsa och använda för användaren
Forshällen, Axel January 2022
This thesis looks into how to implement an abstraction layer between transaction data gathered from Revised Payment Services Directive (PSD2) compliant banks via an Application Programming Interface (API) and a database, with a human user interface for reading and structuring the data. APIs for data sources tend not to have a standardized structure, and this creates a problem for machine learning. The result is that either the machine learning component has to be built around the data collected from the API, or the data has to be transformed and reformatted to fit the structure of the machine learning component's database. An application will use the abstraction layer to fetch data and to allow the user to set up how the data should be reformatted before being sent to the machine learning component's database. The application has to display the data in an easy-to-read format and needs to be usable by a human user. The main questions are (i) how this abstraction should be implemented, (ii) how much of it can be automated, and (iii) what the optimal design for the components is. PSD2 open banking systems in Sweden use Representational State Transfer (REST) APIs and provide data in the JavaScript Object Notation (JSON) format, which can be considered the de facto standard. The abstractions can be divided into three areas: authorization, account and transaction access, and transaction data. Of these areas, only the transaction data could be abstracted fully. The account and transaction access process could be partly abstracted, while the authorization process can only be broken down into steps, as there is no possibility of abstracting the parameters used by the different banks. The project aimed to produce a fully functioning application for gathering data via PSD2 open banking, where the user can configure the application through a simple system that does not require much knowledge of coding.
While the process of fetching transaction data from PSD2 APIs is simplified, the goal of being useful to a person without knowledge of coding is currently impossible to reach unless PSD2 open banking is standardized or more advanced tools are used. / This thesis examines how to implement an abstraction layer between transaction data collected from Revised Payment Services Directive (PSD2) compliant banks via an Application Programming Interface (API) and a database, with an interface for people to use for reading and structuring the data. APIs for data sources tend not to have a standardized structure, which creates problems for machine learning. As a result, the machine learning component must either be built around the data fetched from the APIs, or the data must be transformed and reformatted to fit the structure of the machine learning component's database. The application needs to display the data in a format that is easy to read and be easy for a person to use. The main questions are (i) how the abstraction should be implemented, (ii) how much can be automated, and (iii) what the optimal design is for the components. PSD2 open banking systems in Sweden use Representational State Transfer (REST) APIs and provide data in the JavaScript Object Notation (JSON) format, which can be regarded as a de facto standard. The abstractions can be divided into three areas: authorization, access to accounts and transactions, and transaction data. Of these three areas, only transaction data could be fully abstracted. The process for access to accounts and transactions could be partly abstracted, while authorization could only be broken down into steps because there is no possibility of abstracting the parameters used by different banks.
This project tried to produce a fully functioning application for collecting data via PSD2-compliant open banking systems, where the user can configure the application through a simple system that would not require coding experience. The process of fetching transaction data from PSD2 APIs can be simplified, but the goal of being useful to a person who cannot program is impossible to reach unless PSD2 open banking is standardized or more advanced tools are used.
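The part of the abstraction that the thesis reports as fully achievable, the transaction data, can be pictured as a per-bank field mapping onto one common record. The sketch below is an assumption, not the thesis implementation: the bank names, JSON field names and the `FIELD_MAPS` configuration are invented for illustration and do not reflect any real bank's PSD2 API.

```python
import json

# Hypothetical per-bank configuration: common field -> dotted path in the
# bank's JSON payload. A non-programmer would edit only this mapping.
FIELD_MAPS = {
    "bank_a": {"id": "transactionId", "amount": "amount.value",
               "currency": "amount.currency", "date": "bookingDate"},
    "bank_b": {"id": "tx_ref", "amount": "sum",
               "currency": "ccy", "date": "booked"},
}

def get_path(obj, dotted):
    """Follow a dotted path such as 'amount.value' through nested dicts."""
    for part in dotted.split("."):
        obj = obj[part]
    return obj

def normalise(bank, raw):
    """Convert one bank-specific transaction into the common record."""
    mapping = FIELD_MAPS[bank]
    record = {target: get_path(raw, source) for target, source in mapping.items()}
    record["amount"] = float(record["amount"])  # some APIs send amounts as strings
    return record

# Two invented payloads describing the same transaction in different shapes.
payload_a = json.loads('{"transactionId": "t-1", "amount": {"value": "12.50", '
                       '"currency": "SEK"}, "bookingDate": "2022-03-01"}')
payload_b = json.loads('{"tx_ref": "t-1", "sum": 12.5, "ccy": "SEK", '
                       '"booked": "2022-03-01"}')

record_a = normalise("bank_a", payload_a)
record_b = normalise("bank_b", payload_b)
```

Both payloads reduce to the same record, which is what lets a downstream machine learning database keep a single schema. Authorization is deliberately left out, since the abstract notes that its parameters differ per bank and could not be abstracted.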
|