11

Data Build Tool (DBT) Jobs in Hopsworks

Chen, Zidi January 2022
Feature engineering at scale is critical and challenging in the machine learning pipeline. Modern data warehouses enable data analysts to do feature engineering by transforming, validating and aggregating data in Structured Query Language (SQL). To help data analysts with this work, Data Build Tool (DBT), an open-source tool, was proposed to build and orchestrate SQL pipelines. Hopsworks, an open-source scalable feature store, would like to add support for DBT so that data scientists can do feature engineering in Python, Spark, Flink, and SQL on a single platform. This project aims to develop a concept for how to build this support and then implement it. The project checks the feasibility of the solution using a sample DBT project. According to measurements, the working solution needs around 800 MB of space on the server and takes more time than executing DBT commands locally. However, it persistently stores the results of each execution in HopsFS, where they remain available to users. By adding this novel support for SQL using DBT, Hopsworks might be one of the most complete platforms for feature engineering to date.
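To make the described job concrete, here is a minimal sketch (my illustration, not Hopsworks' actual implementation) of a server-side job that executes the DBT CLI against an uploaded project and persists each run's logs and artifacts; the paths and project layout are hypothetical, and HopsFS would in practice be accessed via its own API or mount:

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical paths: an uploaded DBT project and a persistent results area in HopsFS.
PROJECT_DIR = Path("/srv/jobs/sample_dbt_project")
RESULTS_DIR = Path("/hopsfs/Projects/demo/dbt_runs")

def run_dbt_job(run_id: str) -> int:
    """Execute `dbt run` for the project and persist its artifacts per execution."""
    result = subprocess.run(
        ["dbt", "run", "--project-dir", str(PROJECT_DIR)],
        capture_output=True, text=True,
    )
    # Persist logs and DBT's target/ artifacts so every execution stays available to users.
    out_dir = RESULTS_DIR / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "stdout.log").write_text(result.stdout)
    (out_dir / "stderr.log").write_text(result.stderr)
    target = PROJECT_DIR / "target"
    if target.exists():
        shutil.copytree(target, out_dir / "target", dirs_exist_ok=True)
    return result.returncode
```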
12

Balancing Privacy and Accuracy in IoT using Domain-Specific Features for Time Series Classification

Lakhanpal, Pranshul 01 June 2023
ε-Differential Privacy (DP) has been widely used for anonymizing data to protect sensitive information and for machine learning (ML) tasks. However, there is a trade-off between privacy and ML accuracy, since ε-DP reduces a model's accuracy on classification tasks. Moreover, few studies have applied DP to time series from sensors and Internet-of-Things (IoT) devices. In this work, we try to bring the accuracy of ML models trained on ε-DP data as close as possible to that of ML models trained on non-anonymized data, for two different physiological time series. We propose to transform time series into domain-specific 2D (image) representations, such as scalograms, recurrence plots (RP), and their joint representation, as inputs for training classifiers. These image transformations are irreversible, which renders our proposed approach secure by preventing data leaks. The images allow us to apply state-of-the-art image classifiers and obtain accuracy comparable to classifiers trained on non-anonymized data by exploiting additional information, such as textured patterns, in these images. To achieve classifier performance on anonymized data close to that on non-anonymized data, it is important to identify the value of ε and the input feature. Experimental results demonstrate that the performance of ML models trained on scalograms and RP was comparable to ML models trained on their non-anonymized versions. Motivated by the promising results, an end-to-end IoT ML edge-cloud architecture capable of detecting input drifts is designed that employs our technique to train ML models on ε-DP physiological data. Our classification approach ensures the privacy of individuals while processing and analyzing the data at the edge securely and efficiently.
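A minimal sketch of the two steps the abstract combines (the standard Laplace mechanism for ε-DP plus a recurrence-plot transform, not the authors' exact code): the sensitivity and threshold values below are assumed placeholders:

```python
import numpy as np

def laplace_mechanism(series: np.ndarray, epsilon: float, sensitivity: float = 1.0) -> np.ndarray:
    """Add Laplace noise scaled to sensitivity/epsilon (standard epsilon-DP mechanism)."""
    scale = sensitivity / epsilon  # sensitivity=1.0 is an assumed placeholder
    return series + np.random.laplace(loc=0.0, scale=scale, size=series.shape)

def recurrence_plot(series: np.ndarray, threshold: float) -> np.ndarray:
    """Binary recurrence plot: R[i, j] = 1 where |x_i - x_j| < threshold."""
    dist = np.abs(series[:, None] - series[None, :])  # pairwise distances between time points
    return (dist < threshold).astype(np.uint8)

# Example: anonymize a stand-in physiological signal, then build the irreversible
# 2D representation used as input to an image classifier.
signal = np.sin(np.linspace(0, 8 * np.pi, 256))
private = laplace_mechanism(signal, epsilon=1.0)
rp_image = recurrence_plot(private, threshold=0.5)
```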
13

LSTM Feature Engineering Through Time Series Similarity Embedding

Bångerius, Sebastian January 2022
Time series prediction has many applications. In cases with simultaneous series (like measurements of weather from multiple stations, or multiple stocks on the stock market) it is not unlikely that series from different measurement origins behave similarly, or respond to the same contextual signals. Training input to a prediction model could be constructed from all simultaneous measurements to try to capture the relations between the measurement origins. A generalized approach is to train a prediction model on samples from any individual measurement origin. The data mass is the same in both cases, but the first case uses fewer samples of a larger width, while the second uses a higher number of smaller samples. The first, high-width option risks over-fitting as a result of fewer training samples per input variable. The second, general option has no way to learn relations between the measurement origins. Amending the general model with contextual information would allow keeping a high samples-per-variable ratio without losing the ability to take the origin of the measurements into account. This thesis presents a vector embedding method for measurement origins in an environment with a shared response to contextual signals. The embeddings are based on multi-variate time series from the origins. The embedding method is inspired by the co-occurrence matrices commonly used in Natural Language Processing. The similarity measures used between the series are Dynamic Time Warping (DTW), step-wise Euclidean distance, and Pearson correlation. The dimensionality of the resulting embeddings is reduced by Principal Component Analysis (PCA) to increase information density and effectively preserve variance in the similarity space. The created embedding system allows contextualization of samples, akin to the human intuition that comes from knowing where measurements were taken, like knowing what sort of company a stock ticker represents, or what environment a weather station is located in. In the embedded space, embeddings of series from fundamentally similar measurement origins are closely located, so that information regarding the behavior of one can be generalized to its neighbors. The resulting embeddings resonate well with existing clustering methods on a weather dataset, and partially on a financial dataset, and provide a performance improvement for an LSTM network acting on the financial dataset. The similarity embeddings also outperform an embedding layer trained together with the LSTM.
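A minimal sketch of the embedding idea (my illustration of the described pipeline, not the thesis code), using Pearson correlation as one of the three named similarity measures and PCA for the dimensionality reduction:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: one series per measurement origin (e.g., a weather station or a stock).
rng = np.random.default_rng(0)
n_origins, series_len = 10, 500
series = rng.standard_normal((n_origins, series_len)).cumsum(axis=1)

# Pairwise similarity matrix: each row describes an origin by its similarity to all
# others, analogous to a co-occurrence matrix in NLP.
similarity = np.corrcoef(series)                  # shape (n_origins, n_origins)

# PCA compresses the similarity rows into dense embeddings that preserve variance
# in the similarity space.
embeddings = PCA(n_components=3).fit_transform(similarity)

# Embeddings of similarly behaving origins land close together; they can be
# concatenated onto each LSTM input sample as contextual features.
print(embeddings.shape)  # (10, 3)
```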
14

Understanding Propagation of Malicious Information Online

January 2020
The recent proliferation of online platforms has not only revolutionized the way people communicate and acquire information but has also led to the propagation of malicious information (e.g., online human trafficking, the spread of misinformation, etc.). Propagation of such information occurs at an unprecedented scale that could ultimately pose imminent, societally significant threats to the public. To better understand the behavior and impact of malicious actors and counter their activity, social media authorities need to deploy certain capabilities to reduce their threats. Due to the large volume of this data and limited manpower, the burden usually falls on automatic approaches to identify these malicious activities. However, this is a subtle task for online platforms due to several challenges: (1) malicious users have strong incentives to disguise themselves as normal users (e.g., intentional misspellings, camouflaging, etc.), (2) malicious users are highly likely to be key users in making harmful messages go viral and thus need to be detected early in their life span to stop their threats from reaching a vast audience, and (3) available data for training automatic approaches for detecting malicious users are usually either highly imbalanced (i.e., far more normal users than malicious users) or comprise insufficient labeled data. To address the above-mentioned challenges, in this dissertation I investigate the propagation of online malicious information from two broad perspectives: (1) content posted by users and (2) information cascades formed by resharing mechanisms in social media. More specifically, first, non-parametric and semi-supervised learning algorithms are introduced to discern potential patterns of human trafficking activities that are of high interest to law enforcement. Second, a time-decay, causality-based framework is introduced for early detection of “Pathogenic Social Media (PSM)” accounts (e.g., terrorist supporters). Third, due to the lack of sufficient annotated data for training PSM detection approaches, a semi-supervised causal framework is proposed that utilizes causal-related attributes from unlabeled instances to compensate for the lack of labeled data. Fourth, a feature-driven approach for PSM detection is introduced that leverages different sets of attributes from users’ causal activities, account-level and content-related information, as well as attributes of URLs shared by users. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2020
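As a toy illustration of the feature-driven direction under class imbalance (not the dissertation's actual framework), class weighting lets a detector learn from the rare malicious class without discarding data; all feature names and data below are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for user-level features (e.g., posting rate, URL share ratio,
# cascade reach); ~3% positive class mimics the heavy imbalance described above.
X, y = make_classification(n_samples=5000, n_features=6, weights=[0.97], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights the rare malicious class during training.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
risk = clf.predict_proba(X_te)[:, 1]  # rank accounts by estimated risk for analyst triage
```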
15

Root-cause analysis with data-driven methods and machine learning in lithium-ion battery tests : Master's thesis about detecting deviations with PCA

Rademacher, Frans January 2022
The increased demand for energy storage systems and electric vehicles on the market results in high demand for lithium-ion batteries. As a lithium-ion battery manufacturer, Northvolt runs quality tests on its products to assess their performance, life and safety. Tested batteries most often behave as expected, but sometimes deviations occur. Today, anomaly detection is most often performed by plotting produced data and comparing it to other test data to find which parameters are deviating. The purpose of this thesis is to automate anomaly detection, and the proposed solution is to use state-of-the-art machine learning methods, both supervised and unsupervised. Before applying machine learning, the feature engineering is presented: it describes which parameters are extracted from the experiment data sets. Then the supervised machine learning framework is described. For the unsupervised machine learning, a principal component analysis is presented to locate deviations. The thesis also presents a differential capacity analysis, as this could be incorporated into the features in the future. The results show that the subset of labeled data for supervised learning is too small to produce a model that predicts future deviations. The extracted features are also used in the principal component analysis, where the results show deviations (outliers) and aid in targeting the anomalies. These can then be used to determine the root cause of particular anomalies and mitigate future deviations.
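A minimal sketch of PCA-based deviation detection on extracted test features (an illustration under assumed data, not Northvolt's pipeline): fit PCA to the feature matrix and flag tests whose reconstruction error is unusually large:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_outliers(features: np.ndarray, n_components: int = 2, quantile: float = 0.95):
    """Flag rows whose PCA reconstruction error exceeds the given quantile."""
    X = StandardScaler().fit_transform(features)
    pca = PCA(n_components=n_components).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    error = np.square(X - X_hat).sum(axis=1)     # per-test reconstruction error
    return error > np.quantile(error, quantile)  # True = deviating test

# Example with a synthetic matrix (rows = battery tests, columns = extracted features).
rng = np.random.default_rng(1)
features = rng.standard_normal((200, 8))
features[:5] += 6.0                              # inject a few deviating tests
print(np.flatnonzero(pca_outliers(features)))    # indices of the flagged tests
```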
16

Predicting High Stress Regions in a Microstructure using Convolutional Neural Networks

Kumar, Navneet January 2022
No description available.
17

Predicting expert moves in the game of Othello using fully convolutional neural networks

Hlynur Davíð, Hlynsson January 2017
Careful feature engineering is an important factor in artificial intelligence for games. In this thesis I investigate the benefit of delegating the engineering effort to the model rather than the features, using the board game Othello as a case study. Convolutional neural networks of varying depths are trained to play in a human-like manner by learning to predict actions from tournament games. My main result is that, using a raw board state representation, a network can be trained to achieve 57.4% prediction accuracy on a test set, surpassing the previous state of the art in this task. The accuracy is increased to 58.3% by adding several common handcrafted features as input to the network, but at the cost of more than half again as much computation time.
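A minimal Keras sketch of a fully convolutional move predictor of the kind described (depth, filter counts and board encoding are illustrative assumptions, not the thesis architecture): the raw 8×8 board, encoded as two planes, is mapped to a distribution over the 64 squares without any dense layers:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_move_predictor(depth: int = 4, filters: int = 64) -> tf.keras.Model:
    board = layers.Input(shape=(8, 8, 2))        # planes: own discs, opponent discs
    x = board
    for _ in range(depth):                       # varying depth, as in the experiments
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(1, 1)(x)                   # one score per square, fully convolutional
    moves = layers.Softmax()(layers.Flatten()(x))  # distribution over the 64 squares
    return tf.keras.Model(board, moves)

model = build_move_predictor()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(board_states, one_hot_expert_moves, ...) on positions from tournament games
```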
18

Housing Price Prediction over Countrywide Data : A comparison of XGBoost and Random Forest regressor models

Henriksson, Erik, Werlinder, Kristopher January 2021
The aim of this research project is to investigate how an XGBoost regressor compares to a Random Forest regressor in predictive performance on housing prices, with the help of two data sets. The comparison considers training time, inference time and three evaluation metrics: R2, RMSE and MAPE. The data sets are described in detail together with background about the regressor models that are used. The method involves substantial cleaning of the two data sets, hyperparameter tuning to find optimal parameters, and 5-fold cross-validation in order to achieve good performance estimates. The finding of this research project is that XGBoost performs better on both small and large data sets. While the Random Forest model can achieve results similar to the XGBoost model, it needs a much longer training time, between 2 and 50 times as long, and has a longer inference time, around 40 times as long. This makes XGBoost especially superior when used on larger sets of data.
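A minimal sketch of the comparison setup (illustrative only; the California housing data stands in for the thesis data sets, and the hyperparameter tuning step is omitted): both regressors are evaluated with 5-fold cross-validation, recording fit and score times alongside the metric:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate
from xgboost import XGBRegressor

X, y = fetch_california_housing(return_X_y=True)

models = {
    "xgboost": XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=300, random_state=0),
}

for name, model in models.items():
    # cross_validate reports fit_time and score_time per fold alongside the test score.
    cv = cross_validate(model, X, y, cv=5, scoring="r2")
    print(f"{name}: R2={cv['test_score'].mean():.3f} "
          f"fit={cv['fit_time'].mean():.1f}s predict={cv['score_time'].mean():.2f}s")
```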
19

Machine Learning for Improving Detection of Cooling Complications : A case study

Bruksås Nybjörk, William January 2022
The growing market for cold chain pharmaceuticals requires reliable and flexible logistics solutions that ensure the quality of the drugs. These pharmaceuticals must be kept cool to retain their function and effect, so it is of the utmost importance to keep them within the specified temperature interval. Temperature-controlled containers are a common logistics solution for cold chain pharmaceutical freight. One of the leading manufacturers of these containers provides lease and shipment services while also regularly assessing the cooling function. A method is applied for detecting cooling issues and preventing impaired containers from being sent to customers. However, the method tends to misclassify containers, missing some faulty containers while also classifying functional containers as faulty. This thesis aims to investigate and identify the dependent variables associated with cooling performance, and then apply machine learning to evaluate whether recall and precision can be improved. An improvement could lead to faster response, less waste and even more reliable freight, which could be vital for both companies and patients. The labeled dataset has a binary outcome (no cooling issues, cooling issues) and is heavily imbalanced, since the containers are of high quality and undergo frequent testing and maintenance; therefore, only a small fraction has cooling issues. After analyzing the data, extensive deviations were identified, which suggested that the labeled data was misclassified. The suspected misclassification was corrected and compared to the original data. A Random Forest classifier in combination with random oversampling and threshold tuning resulted in the best performance for the corrected class labels: recall reached 86% and precision 87%, which is a very promising result. A Random Forest classifier in combination with random oversampling resulted in the best score for the original class labels: recall reached 77% and precision 44%, which is much lower than for the adjusted class labels but still a valid result given the believed extent of misclassification. Power output variables, compressor error variables and the standard deviation of the inside temperature showed a clear connection to cooling complications. Clear links could also be found to the critical cases where the set temperature could not be met. These cases could therefore be easily detected but are harder to prevent, since they often appear without warning.
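A minimal sketch of the best-performing combination described above (random oversampling, a Random Forest, and decision-threshold tuning) using scikit-learn and imbalanced-learn; the data is synthetic and the hyperparameters are assumptions, not the company's pipeline:

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the imbalanced container data (~5% with cooling issues).
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training split to avoid leaking duplicates into the test set.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_res, y_res)

# Threshold tuning: sweep the probability cutoff instead of using the default 0.5.
proba = clf.predict_proba(X_te)[:, 1]
for threshold in (0.3, 0.4, 0.5):
    pred = (proba >= threshold).astype(int)
    print(f"t={threshold}: recall={recall_score(y_te, pred):.2f} "
          f"precision={precision_score(y_te, pred):.2f}")
```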
20

TEST ORACLE AUTOMATION WITH MACHINE LEARNING : A FEASIBILITY STUDY

Imamovic, Nermin January 2018
The train represents a complex system, where every sub-system has an important role. If a sub-system doesn’t work as it should, the correctness of the whole train can be uncertain. To ensure that the system works properly, we should test each sub-system individually and integrate them together in the whole system. Each of these sub-systems consists of different modules with different functionalities that should be tested. Testing different functionalities often requires different approaches. Some functionalities require domain knowledge from a human expert, such as the classification of signals in different use cases in Propulsion and Controls (PPC) at Bombardier Transportation. For this reason, we need to simulate the use of expert knowledge in the given domain. We investigate the use of machine learning techniques for solving these cases and for creating a system that will automatically classify different signals using previous human knowledge. This case study is conducted at Bombardier Transportation (BT) in Västerås, in the Train Control Management System (TCMS) and Propulsion and Controls (PPC) departments, where data is collected, analyzed and evaluated. We propose a method for solving the oracle problem based on a machine learning approach for a certain use case. We also explain the steps that can be used to solve the test oracle problem where signals are part of the verdict process.
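A minimal sketch of a learned test oracle of the kind proposed (illustrative only): a classifier trained on features of signals previously judged by an expert issues verdicts for new test runs; the feature extraction and data below are hypothetical placeholders:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def signal_features(signal: np.ndarray) -> np.ndarray:
    """Hypothetical feature extraction: simple summary statistics per signal."""
    return np.array([signal.mean(), signal.std(), signal.min(), signal.max()])

# Signals previously classified by a domain expert (pass=1, fail=0) form the training set;
# here the verdicts are synthetic and correlate with signal variance.
rng = np.random.default_rng(7)
signals = [rng.standard_normal(500) * (1 + (i % 2)) for i in range(200)]
X = np.stack([signal_features(s) for s in signals])
y = np.array([i % 2 for i in range(200)])

# The trained model then stands in for the expert when new test runs need a verdict.
oracle = SVC(probability=True)
print(cross_val_score(oracle, X, y, cv=5).mean())  # estimated agreement with the expert
```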
