301

LB-CNN & HD-OC, DEEP LEARNING ADAPTABLE BINARIZATION TOOLS FOR LARGE SCALE IMAGE CLASSIFICATION

Timothy G Reese (13163115) 28 July 2022
The computer vision task of classifying natural images is a primary driving force behind modern AI algorithms. Deep Convolutional Neural Networks (CNNs) demonstrate state-of-the-art performance in large-scale multi-class image classification tasks. However, due to their many layers and millions of parameters, these models are considered black-box algorithms, and their decisions are further obscured by a cumbersome multi-class decision process. The literature offers another approach, called class binarization, which determines the multi-class prediction outcome through a sequence of binary decisions. The focus of this dissertation is on integrating the class-binarization approach to multi-class classification with deep learning models, such as CNNs, for addressing large-scale image classification problems. Three works are presented to address the integration.

In the first work, Error Correcting Output Codes (ECOCs) are integrated into CNNs by inserting a latent-binarization layer prior to the CNN's final classification layer. This approach encapsulates both the encoding and decoding steps of ECOC in a single CNN architecture. EM and Gibbs sampling algorithms are combined with back-propagation to train CNN models with Latent Binarization (LB-CNN). The training process of LB-CNN guides the model to discover hidden relationships similar to the semantic relationships known a priori between the categories. The proposed models and algorithms are applied to several image recognition tasks, producing excellent results.

In the second work, Hierarchically Decodeable Output Codes (HD-OCs) are proposed to compactly describe a hierarchical probabilistic binary decision process over the features of a CNN. HD-OCs enforce more homogeneous assignments of the categories to the dichotomy labels. A novel concept called average decision depth is presented to quantify the average number of binary questions needed to classify an input. An HD-OC is trained using a hierarchical log-likelihood loss that is empirically shown to orient the output of the latent feature space to resemble the hierarchical structure described by the HD-OC. Experiments are conducted at several different scales of category labels and demonstrate strong performance and powerful insights into the decision process of the model.

In the final work, the literature of enumerative combinatorics and partially ordered sets is used to establish a unifying framework of class-binarization methods under the Multivariate Bernoulli family of models. The unifying framework theoretically establishes simple relationships for transitioning between the different binarization approaches. Such relationships provide useful investigative tools for discovering statistical dependencies between large groups of categories, for incorporating taxonomic information, and for enforcing structural model constraints. The unifying framework lays the groundwork for future theoretical and methodological work on the fundamental issues of large-scale multi-class classification.
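To make the class-binarization idea concrete, here is a minimal ECOC sketch (illustrative only, not the dissertation's LB-CNN): each class is encoded as a binary codeword, one binary classifier is trained per codeword bit, and the multi-class prediction is decoded by minimum Hamming distance. The codeword matrix and the choice of base classifier are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy codeword matrix: each row encodes one of 4 classes as 3 dichotomy labels.
CODES = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
])

def train_ecoc(X, y):
    """Train one binary classifier per codeword column (dichotomy)."""
    return [LogisticRegression().fit(X, CODES[y, bit])
            for bit in range(CODES.shape[1])]

def predict_ecoc(classifiers, X):
    """Decode: pick the class whose codeword is nearest in Hamming distance."""
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    dists = np.abs(bits[:, None, :] - CODES[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)
```

In the dissertation, the binary decisions sit inside the network as a latent layer trained with EM/Gibbs sampling; the sketch only shows the external encoding/decoding logic that such a layer internalizes.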
302

Expeditious Causal Inference for Big Observational Data

Yumin Zhang (13163253) 28 July 2022
This dissertation addresses two significant challenges in the causal inference workflow for Big Observational Data. The first is designing Big Observational Data with high-dimensional and heterogeneous covariates. The second is performing uncertainty quantification for estimates of causal estimands obtained from the application of black-box machine learning algorithms to the designed Big Observational Data. The methodologies developed by addressing these challenges are applied to the design and analysis of Big Observational Data from a large public university in the United States.

Distributed Design

A fundamental issue in causal inference for Big Observational Data is confounding due to covariate imbalances between treatment groups. This can be addressed by designing the study prior to analysis. The design ensures that subjects in the different treatment groups with comparable covariates are subclassified or matched together. Analyzing such a designed study helps to reduce biases arising from the confounding of covariates with treatment. Existing design methods, developed for traditional observational studies with a single designer, can yield unsatisfactory designs with suboptimal covariate balance for Big Observational Data because they cannot accommodate the massive dimensionality, heterogeneity, and volume of the Big Data. We propose a new framework for the distributed design of Big Observational Data amongst collaborative designers. Our framework first assigns subsets of the high-dimensional and heterogeneous covariates to multiple designers. The designers then summarize their covariates into lower-dimensional quantities, share their summaries with the others, and design the study in parallel based on their assigned covariates and the summaries they receive. The final design is selected by comparing balance measures for all covariates across the candidates and identifying the best amongst them. We perform simulation studies and analyze datasets from the 2016 Atlantic Causal Inference Conference Data Challenge to demonstrate the flexibility and power of our framework for constructing designs with good covariate balance from Big Observational Data.

Designed Bootstrap

The combination of modern machine learning algorithms with the nonparametric bootstrap can enable effective predictions and inferences on Big Observational Data. An increasingly prominent and critical objective in such analyses is to draw causal inferences from the Big Observational Data. A fundamental step in addressing this objective is to design the observational study prior to the application of machine learning algorithms. However, applying the traditional nonparametric bootstrap to Big Observational Data requires excessive computational effort, because every bootstrap sample would need to be re-designed under the traditional approach, which can be prohibitive in practice. We propose a design-based bootstrap for deriving causal inferences with reduced bias from the application of machine learning algorithms on Big Observational Data. Our bootstrap procedure operates by resampling from the original designed observational study. It eliminates the need for the additional, costly design steps performed on each bootstrap sample under the standard nonparametric bootstrap. We demonstrate the computational efficiency of this procedure compared to the traditional nonparametric bootstrap, and its equivalence in terms of confidence interval coverage rates for average treatment effects, by means of simulation studies and a real-life case study.

Case Study

We apply the distributed design and designed bootstrap methodologies in a case study involving institutional data from a large public university in the United States. The institutional data contains comprehensive information about the undergraduate students in the university, ranging from their academic records to on-campus activities. We study the causal effects of undergraduate students' attempted course load on their academic performance based on a selection of covariates from these data. Ultimately, our real-life case study demonstrates how our methodologies enable researchers to use straightforward design procedures to obtain valid causal inferences, with reduced computational effort, from the application of machine learning algorithms on Big Observational Data.
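The computational argument for the designed bootstrap can be sketched in a few lines: once the study has been designed (matched), each bootstrap replicate resamples the matched pairs rather than re-running the costly design step. The pair representation and the ATE estimator below are illustrative assumptions, not the dissertation's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def designed_bootstrap_ate(pairs, n_boot=1000):
    """Bootstrap the average treatment effect by resampling matched pairs.

    `pairs` is an (n, 2) array of (treated_outcome, control_outcome) rows
    from an already-designed (matched) observational study, so no costly
    re-matching is needed inside the bootstrap loop.
    """
    n = len(pairs)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample pairs with replacement
        sample = pairs[idx]
        estimates[b] = (sample[:, 0] - sample[:, 1]).mean()
    return estimates.mean(), np.percentile(estimates, [2.5, 97.5])
```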
303

Reducing software complexity by hidden structure analysis : Methods to improve modularity and decrease ambiguity of a software system

Bjuhr, Oscar, Segeljakt, Klas January 2016
Software systems can be represented as directed graphs where components are nodes and dependencies between components are edges. Improvements in system complexity and reductions in interference between development teams can be achieved by applying hidden structure analysis. However, since systems can contain thousands of dependencies, a concrete method is needed for selecting the dependencies that are most beneficial to remove. In this thesis, two solutions to this problem are introduced: dominator analysis and cluster analysis. Dominator analysis examines the cost/gain ratio of detaching individual components from a cyclic group. Cluster analysis finds the most beneficial subgroups to split within a cyclic group. The aim of both methods is to reduce the size of cyclic groups, which are sets of co-dependent components. As a result, the system architecture becomes less prone to errors propagated by modifications of components. Both techniques derive from graph theory and data science but have not previously been applied to the area of hidden structures. A subsystem at Ericsson is used as a testing environment, and specific dependencies in the structure which might impede the development process have been discovered. The outcome of the thesis is four to-be scenarios of the system, displaying the effect of removing these dependencies. The to-be scenarios show that the architecture can be significantly improved by removing a few direct dependencies.
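In graph terms, the cyclic groups are the non-trivial strongly connected components of the dependency graph, and a candidate dependency removal can be scored by how much it shrinks them. A minimal sketch follows; the gain metric here is a simplified stand-in for the thesis's dominator and cluster analyses.

```python
import networkx as nx

def cyclic_groups(g: nx.DiGraph):
    """Cyclic groups: strongly connected components with more than one node."""
    return [c for c in nx.strongly_connected_components(g) if len(c) > 1]

def gain_of_removal(g: nx.DiGraph, edge):
    """Toy gain metric: reduction in the size of the largest cyclic group
    when one dependency (edge) is removed."""
    before = max((len(c) for c in cyclic_groups(g)), default=0)
    h = g.copy()
    h.remove_edge(*edge)
    after = max((len(c) for c in cyclic_groups(h)), default=0)
    return before - after

g = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])
print(gain_of_removal(g, ("C", "A")))  # removing C->A breaks the 3-cycle: gain 3
```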
304

A Comparative Study of Machine Learning Algorithms

Le Fort, Eric January 2018
The selection of the machine learning algorithm used to solve a problem is an important choice. This paper outlines research measuring three performance metrics for eight different algorithms on a prediction task involving undergraduate admissions data. The algorithms tested are k-nearest neighbours, decision trees, random forests, gradient tree boosting, logistic regression, Naive Bayes, support vector machines, and artificial neural networks. These algorithms were compared in terms of accuracy, training time, and execution time. / Thesis / Master of Applied Science (MASc)
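The comparison protocol described above is straightforward to reproduce; a hedged sketch with scikit-learn, using a synthetic dataset as a stand-in for the non-public admissions data:

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "kNN": KNeighborsClassifier(), "tree": DecisionTreeClassifier(),
    "forest": RandomForestClassifier(), "boosting": GradientBoostingClassifier(),
    "logistic": LogisticRegression(max_iter=1000), "naive_bayes": GaussianNB(),
    "svm": SVC(), "mlp": MLPClassifier(max_iter=500),
}
for name, model in models.items():
    t0 = time.perf_counter(); model.fit(X_tr, y_tr); t_fit = time.perf_counter() - t0
    t0 = time.perf_counter(); acc = model.score(X_te, y_te); t_pred = time.perf_counter() - t0
    print(f"{name:11s} acc={acc:.3f} train={t_fit:.3f}s predict={t_pred:.3f}s")
```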
305

Quantitative Methods of Statistical Arbitrage

Boming Ning (18414465) 22 April 2024
<p dir="ltr">Statistical arbitrage is a prevalent trading strategy which takes advantage of mean reverse property of spreads constructed from pairs or portfolios of assets. Utilizing statistical models and algorithms, statistical arbitrage exploits and capitalizes on the pricing inefficiencies between securities or within asset portfolios. </p><p dir="ltr">In chapter 2, We propose a framework for constructing diversified portfolios with multiple pairs trading strategies. In our approach, several pairs of co-moving assets are traded simultaneously, and capital is dynamically allocated among different pairs based on the statistical characteristics of the historical spreads. This allows us to further consider various portfolio designs and rebalancing strategies. Working with empirical data, our experiments suggest the significant benefits of diversification within our proposed framework.</p><p dir="ltr">In chapter 3, we explore an optimal timing strategy for the trading of price spreads exhibiting mean-reverting characteristics. A sequential optimal stopping framework is formulated to analyze the optimal timings for both entering and subsequently liquidating positions, all while considering the impact of transaction costs. Then we leverages a refined signature optimal stopping method to resolve this sequential optimal stopping problem, thereby unveiling the precise entry and exit timings that maximize gains. Our framework operates without any predefined assumptions regarding the dynamics of the underlying mean-reverting spreads, offering adaptability to diverse scenarios. Numerical results are provided to demonstrate its superior performance when comparing with conventional mean reversion trading rules.</p><p dir="ltr">In chapter 4, we introduce an innovative model-free and reinforcement learning based framework for statistical arbitrage. For the construction of mean reversion spreads, we establish an empirical reversion time metric and optimize asset coefficients by minimizing this empirical mean reversion time. In the trading phase, we employ a reinforcement learning framework to identify the optimal mean reversion strategy. Diverging from traditional mean reversion strategies that primarily focus on price deviations from a long-term mean, our methodology creatively constructs the state space to encapsulate the recent trends in price movements. Additionally, the reward function is carefully tailored to reflect the unique characteristics of mean reversion trading.</p>
306

Explaining Generative Adversarial Network Time Series Anomaly Detection using Shapley Additive Explanations

Cher Simon (18324174) 10 July 2024
<p dir="ltr">Anomaly detection is an active research field that widely applies to commercial applications to detect unusual patterns or outliers. Time series anomaly detection provides valuable insights into mission and safety-critical applications using ever-growing temporal data, including continuous streaming time series data from the Internet of Things (IoT), sensor networks, healthcare, stock prices, computer metrics, and application monitoring. While Generative Adversarial Networks (GANs) demonstrate promising results in time series anomaly detection, the opaque nature of generative deep learning models lacks explainability and hinders broader adoption. Understanding the rationale behind model predictions and providing human-interpretable explanations are vital for increasing confidence and trust in machine learning (ML) frameworks such as GANs. This study conducted a structured and comprehensive assessment of post-hoc local explainability in GAN-based time series anomaly detection using SHapley Additive exPlanations (SHAP). Using publicly available benchmarking datasets approved by Purdue’s Institutional Review Board (IRB), this study evaluated state-of-the-art GAN frameworks identifying their advantages and limitations for time series anomaly detection. This study demonstrated a systematic approach in quantifying the extent of GAN-based time series anomaly explainability, providing insights for businesses when considering adopting generative deep learning models. The presented results show that GANs capture complex time series temporal distribution and are applicable for anomaly detection. The analysis from this study shows SHAP can identify the significance of contributing features within time series data and derive post-hoc explanations to quantify GAN-detected time series anomalies.</p>
307

Data mining and predictive analytics application on cellular networks to monitor and optimize quality of service and customer experience

Muwawa, Jean Nestor Dahj 11 1900
This research study focuses on applying Data Mining and Machine Learning models to cellular network traffic, with the objective of arming Mobile Network Operators with a full view of the performance branches (Services, Device, Subscribers). The purpose is to optimize and minimize the time needed to detect service and subscriber behaviour patterns. Different data mining techniques and predictive algorithms have been applied to real cellular network datasets to uncover data usage patterns using specific Key Performance Indicators (KPIs) and Key Quality Indicators (KQIs). The following tools are used to develop the concept: RStudio for Machine Learning and process visualization, Apache Spark and SparkSQL for (big) data processing, and clicData for service visualization. Two use cases have been studied in this research. In the first, data and predictive analytics are applied in the field of telecommunications to efficiently address users' experience, with the goal of increasing customer loyalty and decreasing churn (customer attrition). Using real cellular network transactions, predictive analytics are used to identify customers who are likely to churn, which can result in revenue loss. Prediction algorithms and models including Classification Tree, Random Forest, Neural Networks and Gradient Boosting have been used together with exploratory data analysis to determine relationships between the predictor variables. The data is segmented into two sets: a training set to train the model and a testing set to test it. The best-performing model is selected based on prediction accuracy, sensitivity, specificity and the confusion matrix on the test set. The second use case analyses Service Quality Management using modern data mining techniques and the advantages of in-memory big data processing with Apache Spark and SparkSQL to save cost on tool investment; thus, a low-cost Service Quality Management model is proposed and analyzed. With the increase in smartphone adoption and access to mobile internet services, applications such as streaming and interactive chat require a certain service level to ensure customer satisfaction. As a result, an SQM framework is developed with a Service Quality Index (SQI) and Key Performance Index (KPI). The research concludes with recommendations and future studies around modern technology applications in telecommunications, including the Internet of Things (IoT), Cloud and recommender systems. / Cellular networks have evolved and are still evolving, from traditional circuit-switched GSM (Global System for Mobile Communication), which supported only voice services and extremely low data rates, to all-packet LTE networks accommodating high-speed data used for service applications such as video streaming, video conferencing and heavy torrent downloads; and, in the near future, the roll-out of fifth-generation (5G) cellular networks, intended to support complex technologies such as IoT (Internet of Things) and High Definition video streaming, and projected to carry massive amounts of data. With high demand for network services and easy access to mobile phones, billions of transactions are performed by subscribers. The transactions appear in the form of SMSs, handovers, voice calls, web browsing activities, video and audio streaming, and heavy downloads and uploads.

Nevertheless, the stormy growth in data traffic and the high requirements of new services introduce bigger challenges for Mobile Network Operators (MNOs) in analysing the big data traffic flowing in the network. Quality of Service (QoS) and Quality of Experience (QoE) therefore become a challenge: inefficiency in mining and analysing data and in applying predictive intelligence to network traffic can produce a high rate of unhappy subscribers, revenue loss and a negative perception of services. Researchers and service providers are investing in data mining, machine learning and AI (Artificial Intelligence) methods to manage services and experience. / Electrical and Mining Engineering / M. Tech (Electrical Engineering)
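A hedged sketch of the first use case's evaluation protocol (the thesis works in RStudio/Spark; the equivalent train/test, confusion-matrix workflow is shown here in Python, with synthetic data standing in for the operator's KPI records):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-subscriber KPI features and a churn label.
X, y = make_classification(n_samples=5000, n_features=15, weights=[0.8],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

for model in (RandomForestClassifier(random_state=1),
              GradientBoostingClassifier(random_state=1)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    sensitivity = tp / (tp + fn)   # share of actual churners caught
    specificity = tn / (tn + fp)   # share of loyal subscribers kept out
    print(f"{type(model).__name__}: acc={accuracy_score(y_te, pred):.3f} "
          f"sens={sensitivity:.3f} spec={specificity:.3f}")
```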
308

Privacy preserving software engineering for data driven development

Tongay, Karan Naresh 14 December 2020
The exponential rise in the generation of data has introduced many new areas of research, including data science, data engineering, machine learning and artificial intelligence, to name a few. It has become important for any industry or organization to precisely understand and analyze its data in order to extract value from it. That value can only be realized when the data is put into practice in the real world, and the most common approach to doing so in the technology industry is through software engineering. This brings into the picture the area of privacy-oriented software engineering, reflected in the rise of data protection regulations such as the GDPR (General Data Protection Regulation) and the PDPA (Personal Data Protection Act). Many organizations, governments and companies that have accumulated huge amounts of data over time may conveniently use the data to increase business value; at the same time, the privacy aspects associated with sensitive data, especially the personal information of individuals, can easily be circumvented when designing a software engineering model for these types of applications. Even before the software engineering phase of any data processing application, there can often be one or more data sharing agreements or privacy policies in place. Every organization may have its own way of maintaining data privacy practices for data-driven development, so there is a need to generalize or categorize these approaches into tactics that can be referred to by other practitioners trying to integrate data privacy practices into their development. This qualitative study provides an understanding of the various approaches and tactics being practised within the industry for privacy-preserving data science in software engineering, and discusses a tool for data usage monitoring to identify unethical data access. Finally, we studied strategies for secure data publishing and conducted experiments using sample data to demonstrate how these techniques can help secure private data before publishing. / Graduate
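The data-usage-monitoring idea can be illustrated in a few lines: log every read of a field marked sensitive, so that unexpected access patterns can later be audited against the applicable privacy policy. This is a hedged sketch of the general pattern, not the thesis's tool; the field list and record wrapper are invented for the example.

```python
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("data-access-audit")

SENSITIVE = {"email", "phone"}   # fields covered by a privacy policy (illustrative)

@dataclass
class MonitoredRecord:
    """Wraps a user record and logs reads of sensitive attributes."""
    _data: dict = field(default_factory=dict)

    def get(self, key, purpose):
        if key in SENSITIVE:
            audit.info("read of %r for purpose %r", key, purpose)
        return self._data.get(key)

record = MonitoredRecord({"name": "A. User", "email": "a@example.org"})
record.get("email", purpose="newsletter")   # emits an audit log line
```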
309

Development of a GIS-based decision support tool for environmental impact assessment and due-diligence analyses of planned agricultural floating solar systems

Prinsloo, Frederik Christoffel 08 1900
In recent years, there have been tremendous advances in information technology, robotics, communication technology, nanotechnology, and artificial intelligence, resulting in the merging of physical, digital, and biological worlds that has come to be known as the "fourth industrial revolution". In this context, the present study engages such technology in the green economy to tackle the techno-economic and environmental impact assessment challenges associated with floating solar system applications in the agricultural sector of South Africa. This exploratory study aimed to examine the development of a Geographical Information System (GIS)-based support platform for Environmental Impact Assessment (EIA) and due-diligence analyses of future planned agricultural floating solar systems, especially with the goal of addressing the vast differences between the environmental impacts of land-based and water-based photovoltaic energy systems. A research gap was identified in the planning processes for implementing floating solar systems in South Africa's agricultural sector. This inspired the development of a novel GIS-based modelling tool to assist with energy infrastructure planning for floating solar systems in the renewable energy discourse. In this context, there are significant challenges and future research avenues for technical and environmental performance modelling in the new sustainable energy transformation. The present dissertation ventured into the conceptualisation, design and development of a GIS-based software decision support tool to assist environmental impact practitioners, project owners and landscape architects in performing environmental scoping and environmental due-diligence analyses for planned floating solar systems in the local agricultural sector. The project aimed at the design and development of a dedicated GIS toolset to determine the environmental feasibility of using floating solar systems in agricultural applications in South Africa. The research objectives included the use of computational modelling and simulation techniques to theoretically determine energy yield predictions and compute environmental impacts/offsets for future planned agricultural floating solar systems in South Africa. The toolset succeeded in determining these aspects in applications where floating solar systems would substitute for Eskom grid power. The study succeeded in developing a digital GIS-based computer simulation model for floating solar systems capable of (a) predicting the anticipated energy yield, (b) calculating the environmental offsets achieved by substituting coal-fired generation with floating solar panels, (c) determining the environmental impact and land-use preservation benefits of any floating solar system, and (d) relating these metrics to water-energy-land-food (WELF) nexus parameters suitable for project viability analysis and decision support. The research project has demonstrated how the proposed GIS toolset supports the body of geographical knowledge in the fields of Energy and Environmental Geography. The new toolset, called EIAcloudGIS, was developed to assist in solving challenges around energy and environmental sustainability analysis when planning new floating solar installations on farms in South Africa.

Experiments conducted during the research showed how the geographical study in general, and the toolset in particular, succeeded in solving a real-world problem. Through the formulation and development of GIS-based computer simulation models embedded into GIS layers, this new tool practically supports the National Environmental Management Act (NEMA, Act No. 107 of 1998) and, in particular, its associated EIA processes. The tool also simplifies and semi-automates certain aspects of the environmental impact analysis process for newly envisioned and planned floating solar installations in South Africa. / Geography / M.Sc. (Geography)
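Points (a) and (b) reduce to simple, parameterised arithmetic that a GIS layer can evaluate per site; a hedged sketch in which the specific yield, cooling gain and grid emission factor are illustrative round numbers rather than values from the study:

```python
def annual_yield_kwh(capacity_kwp, specific_yield=1900, fps_gain=1.05):
    """Annual energy from a floating PV plant: nameplate capacity times a
    site-specific yield (kWh/kWp/yr), with a modest cooling gain for
    water-based mounting. All factors here are illustrative assumptions."""
    return capacity_kwp * specific_yield * fps_gain

def co2_offset_tonnes(energy_kwh, grid_factor_kg_per_kwh=0.95):
    """CO2 avoided by displacing coal-heavy grid power (factor assumed)."""
    return energy_kwh * grid_factor_kg_per_kwh / 1000.0

e = annual_yield_kwh(500)           # a hypothetical 500 kWp floating system
print(f"{e:,.0f} kWh/yr, ~{co2_offset_tonnes(e):,.0f} t CO2 avoided")
```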
310

Towards Data Wrangling Automation through Dynamically-Selected Background Knowledge

Contreras Ochando, Lidia 04 February 2021
Data science is essential for the extraction of value from data. However, the most tedious part of the process, data wrangling, implies a range of mostly manual formatting, identification and cleansing manipulations. Data wrangling still resists automation partly because the problem strongly depends on domain information, which becomes a bottleneck for state-of-the-art systems as the diversity of domains, formats and structures of the data increases. In this thesis we focus on generating algorithms that take advantage of the domain knowledge for the automation of parts of the data wrangling process. We illustrate the way in which general program induction techniques, instead of domain-specific languages, can be applied flexibly to problems where knowledge is important, through the dynamic use of domain-specific knowledge. More generally, we argue that a combination of knowledge-based and dynamic learning approaches leads to successful solutions. We propose several strategies to automatically select or construct the appropriate background knowledge for several data wrangling scenarios. The key idea is based on choosing the best specialised background primitives according to the context of the particular problem to solve. We address two scenarios. In the first one, we handle personal data (names, dates, telephone numbers, etc.) that are presented in very different string formats and have to be transformed into a unified format.
The problem is how to build a compositional transformation from a large set of primitives in the domain (e.g., handling months, years, days of the week, etc.). We develop a system (BK-ADAPT) that guides the search through the background knowledge by extracting several meta-features from the examples characterising the column domain. In the second scenario, we face the transformation of data matrices in generic programming languages such as R, using an input matrix and some cells of the output matrix as examples. We also develop a system guided by a tree-based search (AUTOMAT[R]IX) that uses several constraints, prior primitive probabilities and textual hints to efficiently learn the transformations. With these systems, we show that the combination of inductive programming with the dynamic selection of the appropriate primitives from the background knowledge is able to improve the results of other state-of-the-art and more specific data wrangling approaches. / This research was supported by the Spanish MECD Grant FPU15/03219;and partially by the Spanish MINECO TIN2015-69175-C4-1-R (Lobass) and RTI2018-094403-B-C32-AR (FreeTech) in Spain; and by the ERC Advanced Grant Synthesising Inductive Data Models (Synth) in Belgium. / Contreras Ochando, L. (2020). Towards Data Wrangling Automation through Dynamically-Selected Background Knowledge [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/160724 / TESIS
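The first scenario's core mechanic, composing domain primitives until an input/output example is satisfied, can be caricatured in a few lines. The primitive library below is an invented miniature, and BK-ADAPT's actual search is guided by meta-features rather than brute-force enumeration:

```python
from itertools import permutations

# A miniature domain library of string primitives (illustrative only).
PRIMITIVES = {
    "strip": str.strip,
    "lower": str.lower,
    "month_to_num": lambda s: s.replace("July", "07"),
    "dash_to_slash": lambda s: s.replace("-", "/"),
}

def induce_pipeline(example_in, example_out, max_len=3):
    """Search compositions of primitives consistent with one input/output example."""
    for n in range(1, max_len + 1):
        for combo in permutations(PRIMITIVES, n):
            value = example_in
            for name in combo:
                value = PRIMITIVES[name](value)
            if value == example_out:
                return combo
    return None

print(induce_pipeline("  28-July-2022 ", "28/07/2022"))
# -> ('strip', 'month_to_num', 'dash_to_slash')
```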
