21

Compression Selection for Columnar Data using Machine-Learning and Feature Engineering

Persson, Douglas, Juelsson Larsen, Ludvig January 2023 (has links)
There is a continuously growing demand for solutions that provide both efficient storage and efficient retrieval of big data for analytical purposes. This thesis researches the use of machine learning together with feature engineering to recommend the most cost-effective compression algorithm and encoding combination for columns in a columnar database management system (DBMS). The framework consists of a cost function calculated from compression time, decompression time, and compression ratio. An XGBoost machine-learning model is trained on labels provided by the cost function to recommend the most cost-effective combination for columnar data within a column- or vector-oriented DBMS. While the methods are applied to ClickHouse, one of the most popular open-source column-oriented DBMSs on the market, the results are broadly applicable to column-oriented data that shares data types and characteristics with IoT telemetry data. Using billions of available rows of numeric real business data obtained at Axis Communications in Lund, Sweden, a set of features is engineered to accurately describe the characteristics of a given column. The proposed framework allows for weighting the business interests (compression time, decompression time, and compression ratio) to determine the individually optimal cost-effective solution. The model reaches an accuracy of 99% on the test dataset and of 90.1% on unseen data by leveraging data features that are predictive of the performance of compression algorithms and encodings. Following ClickHouse strategies and the most suitable practices in the field, combinations of general-purpose compression algorithms and data encodings are analysed that together yield the best results in efficiently compressing the data of certain columns. Applying the unweighted recommended combinations to all columns increased the average compression speed by 95.46%, reducing the time to compress the columns from 31.17 seconds to 13.17 seconds. Additionally, the decompression speed increased by 59.87%, reducing the time to decompress the columns from 2.63 seconds to 2.02 seconds, at the cost of decreasing the compression ratio by 66.05% and increasing the storage requirements by 94.9 MB. In column and vector databases, chunks of data belonging to a certain column are often stored together on disk. Therefore, choosing the right compression algorithm can lower storage requirements and boost database throughput.
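As a rough illustration of the framework's core idea, the sketch below labels each column with its cheapest codec combination via a weighted cost function and fits an XGBoost classifier on per-column features. The cost formula, weights, features, codec names, and data are illustrative assumptions; the abstract does not publish them.

```python
# Hedged sketch: cost-based labelling plus an XGBoost recommender.
# All concrete values below are assumptions, not figures from the thesis.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# Hypothetical benchmark results: one row per (column, codec combination).
bench = pd.DataFrame({
    "column_id":   [0, 0, 1, 1],
    "combination": ["LZ4+Delta", "ZSTD+Gorilla", "LZ4+Delta", "ZSTD+Gorilla"],
    "comp_time":   [0.8, 2.1, 1.1, 1.9],  # seconds
    "decomp_time": [0.2, 0.5, 0.3, 0.4],  # seconds
    "ratio":       [3.5, 6.0, 2.8, 3.0],  # uncompressed size / compressed size
})

# Weighted business interests: lower cost is better, a higher ratio helps.
w_c, w_d, w_r = 1.0, 1.0, 1.0
bench["cost"] = (w_c * bench["comp_time"]
                 + w_d * bench["decomp_time"]
                 - w_r * bench["ratio"])

# Label each column with its cheapest combination.
best = bench.loc[bench.groupby("column_id")["cost"].idxmin()]
y = best["combination"].astype("category").cat.codes.to_numpy()

# Assumed engineered per-column features (e.g. cardinality, monotonicity).
X = np.array([[0.9, 0.1],
              [0.2, 0.7]])

model = XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)
print(model.predict(X))  # recommended combination index per column
```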
22

Encoding Temporal Healthcare Data for Machine Learning

Laczik, Tamás January 2021 (has links)
This thesis contains a review of previous work in the fields of encoding sequential healthcare data and predicting graft-versus-host disease, a medical condition, based on patient history using machine learning. A new encoding of such data is proposed for machine-learning purposes. The proposed encoding, called bag of binned weighted events, is a combination of two strategies proposed in previous work, called bag of binned events and bag of weighted events. An empirical experiment is designed to compare the predictive performance of the proposed encoding over various binning windows to that of the previous encodings, based on the area under the receiver operating characteristic curve (AUC) metric. The experiment is carried out on real-world healthcare data obtained from Swedish registries, using the random forest and logistic regression algorithms. After filtering the data, resolving quality issues, and tuning the models' hyperparameters, final results are obtained. These results indicate that the proposed encoding strategy performs on par with, or slightly better than, the bag of weighted events, and outperforms the bag of binned events in most cases; the differences in metrics are small, however. It is also observed that the proposed encoding usually performs better with longer binning windows, which may be attributed to data noise. Future work is proposed in the form of repeating the experiment with different datasets and models, as well as changing the binning-window length of the baseline algorithms.
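The encoding itself can be pictured with a short sketch: events are counted into time bins, but each event contributes a recency weight instead of a raw count. The bin width, number of bins, exponential half-life weighting, and event codes below are illustrative assumptions, not the thesis's exact parameters.

```python
# Hedged sketch of a "bag of binned weighted events" vector.
def encode(events, bin_days=30, n_bins=6, half_life_days=180, vocab=None):
    """events: list of (event_code, age_in_days_before_prediction_time)."""
    vocab = vocab or sorted({code for code, _ in events})
    index = {code: i for i, code in enumerate(vocab)}
    vec = [0.0] * (n_bins * len(vocab))
    for code, age in events:
        b = min(int(age // bin_days), n_bins - 1)  # which time bin
        w = 0.5 ** (age / half_life_days)          # recency weight
        vec[b * len(vocab) + index[code]] += w     # weighted count, not 0/1
    return vec

history = [("ICD:C91", 10), ("ATC:L01", 45), ("ICD:C91", 400)]
print(encode(history))
```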
23

Feature Engineering to Deal with Noisy Data in Sparse Identification through Classification and Regression Perspectives

Franca, Thayna da Silva 15 July 2021 (has links)
Dynamical systems play a fundamental role in understanding phenomena inherent to several fields of science. Over the last decade, the technological advances achieved throughout years of research have given rise to a data-oriented strategy, enabling the inference of models capable of representing dynamical systems. Moreover, regardless of the sensor types adopted to perform the data-acquisition procedure, some degree of noise corruption in the data is to be expected. Generically, the identification task is directly affected by this noisy scenario, which can result in the false discovery of a seemingly generalizable model. In other words, noise corruption may give rise to an unfaithful mathematical representation of a given system. In this thesis, with respect to the identification task, it is demonstrated how robustness to noise may be improved through the hybridization of machine-learning techniques such as data augmentation, sparse regression, feature selection, feature extraction, information criteria, grid search, and cross-validation. Specifically, through classification and regression perspectives, the success of the proposed strategy is demonstrated on numerical examples such as logistic growth, the Duffing oscillator, the FitzHugh-Nagumo model, the Lorenz attractor, and a Susceptible-Infectious-Recovered (SIR) model of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2).
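The core sparse-identification step can be illustrated on the logistic-growth example mentioned above: build a library of candidate terms and let a sparse regressor pick the active ones from noisy samples. This is a generic sketch only; the thesis's full pipeline (augmentation, feature selection, information criteria, grid search, cross-validation) is not reproduced.

```python
# Hedged sketch: recover dx/dt = r*x*(1 - x/K) from noisy data with a
# candidate library and sparse (Lasso) regression. Parameters are assumed.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
r, K = 1.0, 10.0
t = np.linspace(0, 10, 200)
x = K / (1 + 9 * np.exp(-r * t))            # analytic logistic solution
x_noisy = x + rng.normal(0, 0.01, x.shape)  # simulated sensor noise

dxdt = np.gradient(x_noisy, t)              # numerical derivative
library = np.column_stack([x_noisy, x_noisy**2, x_noisy**3])  # candidate terms

model = Lasso(alpha=0.01, fit_intercept=False).fit(library, dxdt)
print(dict(zip(["x", "x^2", "x^3"], model.coef_.round(3))))
# Expected (approximately): x -> r = 1.0, x^2 -> -r/K = -0.1, x^3 -> 0
```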
24

Text feature mining using pre-trained word embeddings

Sjökvist, Henrik January 2018 (has links)
This thesis explores a machine-learning task where the data contains not only numerical features but also free-text features. In order to employ a supervised classifier and make predictions, the free-text features must be converted into numerical features. In this thesis, an algorithm is developed to perform that conversion. The algorithm uses a pre-trained word embedding model which maps each word to a vector. The vectors for the words belonging to the same sentence are then combined to form a single sentence embedding. The sentence embeddings for the whole dataset are clustered to identify distinct groups of free-text strings, and the cluster labels are output as the numerical features. The algorithm is applied to a specific case concerning operational risk control in banking. The data consists of modifications made to trades in financial instruments. Each such modification comes with a short text string which documents the modification: a trader comment. Converting these strings to numerical trader-comment features is the objective of the case study. A classifier is trained and used as an evaluation tool for the trader-comment features, and its performance is measured with and without them. Multiple models for generating the features are evaluated; all of them lead to an improvement in classification rate over not using a trader-comment feature. The best performance is achieved with a model where the sentence embeddings are generated using the SIF weighting scheme and then clustered using the DBSCAN algorithm.
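A minimal sketch of the best-performing pipeline, SIF-weighted sentence embeddings followed by DBSCAN clustering, is given below. The toy word vectors and frequencies stand in for a real pre-trained embedding model, and the SIF parameter and DBSCAN settings are assumptions.

```python
# Hedged sketch: SIF-weighted averaging of word vectors into sentence
# embeddings, common-component removal, then DBSCAN cluster labels.
import numpy as np
from sklearn.cluster import DBSCAN

word_vec = {"amend": np.array([0.9, 0.1]), "fee": np.array([0.8, 0.3]),
            "late": np.array([0.1, 0.9]), "booking": np.array([0.2, 0.8])}
word_freq = {"amend": 0.01, "fee": 0.02, "late": 0.02, "booking": 0.01}
a = 1e-3  # SIF smoothing parameter (assumed)

def sif_embed(sentences):
    emb = np.array([
        np.mean([a / (a + word_freq[w]) * word_vec[w] for w in s], axis=0)
        for s in sentences])
    u = np.linalg.svd(emb, full_matrices=False)[2][0]  # first principal direction
    return emb - emb @ np.outer(u, u)                  # remove common component

sents = [["amend", "fee"], ["fee", "amend"], ["late", "booking"]]
labels = DBSCAN(eps=0.5, min_samples=1).fit_predict(sif_embed(sents))
print(labels)  # cluster id per trader comment; -1 would mean noise
```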
25

Data-Driven Traffic Forecasting for Completed Vehicle Simulation: A Case Study with Volvo Test Trucks

Shahrokhi, Samaneh January 2023 (has links)
This thesis offers a thorough investigation into the application of machine-learning algorithms for predicting the presence of vehicles in a traffic setting. The research primarily focuses on enhancing vehicle simulation by employing data-driven traffic-prediction methods, approaching the problem as a binary classification task. Various supervised learning algorithms, including Random Forest (RF), Gradient Boosting (GB), Support Vector Machine (SVM), and Logistic Regression (LogReg), were evaluated and tested. The thesis encompasses six distinct implementations, each involving different combinations of algorithms, feature engineering, hyperparameter tuning, and data splitting. The performance of each model was assessed using metrics such as accuracy, precision, recall, and F1-score, and visualizations such as ROC-AUC curves were used to gain insight into their discrimination capabilities. While the RF model achieved the highest accuracy at 97%, the AUC score of Combination 2 (RF+GB) suggests that this ensemble model strikes a better balance between high accuracy (86%) and effective class separation (99%). Ultimately, the study identifies an ensemble model as the preferred choice, leading to significant improvements in prediction accuracy. The research also framed the problem as a time-series prediction task, exploring the use of Long Short-Term Memory (LSTM) and Auto-Regressive Integrated Moving Average (Auto-ARIMA) models; however, this approach proved impractical due to the dataset's discrete and non-sequential nature. This research contributes to the advancement of vehicle simulation and traffic forecasting, demonstrating the potential of machine learning in addressing complex real-world scenarios.
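One plausible realization of the RF+GB combination is a soft-voting ensemble that averages the two models' predicted probabilities, sketched below on synthetic data; the thesis's actual combination scheme and hyperparameters are not given in the abstract.

```python
# Hedged sketch: soft-voting RF+GB ensemble evaluated with accuracy and AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    voting="soft")  # average the two models' predicted probabilities
ensemble.fit(X_tr, y_tr)

proba = ensemble.predict_proba(X_te)[:, 1]
print("accuracy:", accuracy_score(y_te, ensemble.predict(X_te)))
print("ROC-AUC: ", roc_auc_score(y_te, proba))
```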
26

Machine Learning Approaches to Dribble Hand-off Action Classification with SportVU NBA Player Coordinate Data

Stephanos, Dembe 01 May 2021 (has links)
Recently, the strategies of National Basketball Association teams have evolved with the skillsets of players and the emergence of advanced analytics. One of the most effective actions in dynamic offensive basketball strategies is the dribble hand-off (DHO). This thesis proposes an architecture for a classification pipeline that detects DHOs in an accurate and automated manner. The pipeline combines player tracking data and event labels, a rule set to identify candidate actions, manual review of game recordings to label the candidates, and an embedding of player trajectories into hexbin cell paths before the completed training set is passed to the classification models. The resulting training set is examined using the information gain of extracted and engineered features and the effectiveness of various machine-learning algorithms. Finally, we provide a comprehensive accuracy evaluation of the classification models to compare machine-learning algorithms and highlight their subtle differences in this problem domain.
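The trajectory-embedding step can be sketched as mapping raw (x, y) court coordinates onto a path of hexagonal grid cells and collapsing consecutive repeats. The cell size and the axial pointy-top coordinate scheme below are assumptions; the abstract only states that trajectories are embedded into hexbin cell paths.

```python
# Hedged sketch: coordinates -> axial hex cells -> deduplicated cell path.
import math

def hex_cell(x, y, size=2.0):
    """Map a court coordinate to an axial (q, r) pointy-top hex cell."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 * y / 3) / size
    # cube rounding to the nearest hex center
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz

def cell_path(trajectory):
    """Collapse a coordinate trajectory into its sequence of distinct cells."""
    path = [hex_cell(x, y) for x, y in trajectory]
    return [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]

print(cell_path([(0.0, 0.0), (0.5, 0.2), (3.1, 1.8), (6.0, 2.2)]))
```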
27

Predicting the Impact of Supply Chain Disruptions Using Statistical Analysis and Machine Learning

Andersson, Hannes, Sjöberg, John January 2023 (has links)
The dairy business is vulnerable to supply chain disruptions, since large safety stocks to cover losses are not always a viable option; it is therefore crucial to maintain a smooth supply chain to ensure stable delivery accuracy. Disruptions are unpredictable and hard to avoid, especially when production errors cause lost production volume. This thesis proposes the use of machine learning and statistical modelling, together with data from Arla, to predict when a shortage will occur and how long it will last, allowing proactive decisions that mitigate the consequences of a disruption. The aim is to create one predictive model for delay and one for duration based on data from multiple products, and to explore how the features and methods used can capture the product-specific characteristics in the data and thereby improve the models. The model used to evaluate these factors was a random forest classifier, and permutation feature importance was used to determine the relevant features. The issue of imbalanced data was handled by first grouping the data and then applying the oversampling method SMOTE. The two models were trained on different datasets: the duration model was trained on all disruptions, while the delay model was trained only on the subset where a shortage had occurred. One finding was that applying SMOTE yielded the best results. The best duration model had an accuracy of 62%, with precision and recall of 79% and 76% respectively for the majority class, but much lower values for the other classes, with combined averages of 21% and 24%. The most important feature for duration was the quotient describing the lost production. The best delay model had an accuracy of 62% with more balanced predictions across all classes and an average precision and recall of 59% and 57%. The most important feature for delay was how often a product is produced.
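A minimal sketch of the imbalance handling and feature-relevance steps named above: oversample with SMOTE, fit a random forest, and score features with permutation importance. The toy data and class proportions are placeholders for the Arla dataset.

```python
# Hedged sketch: SMOTE oversampling + random forest + permutation importance.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Imbalanced toy data: three duration classes, one heavily dominant.
X, y = make_classification(n_samples=1500, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.8, 0.15, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # balance classes
clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)

result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print("accuracy:", clf.score(X_te, y_te))
print("feature importances:", result.importances_mean.round(3))
```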
28

Viewership forecast on a Twitch broadcast : Using machine learning to predict viewers on sponsored Twitch streams

Malm, Jonas, Friberg, Martin January 2022 (has links)
Today, the video game industry is larger than the sports and film industries combined, and the largest streaming platform, Twitch, with an average of 2.8 million concurrent viewers, offers gaming and non-gaming brands the possibility to market their products. Estimating streamers' viewership is central to these marketing campaigns, but no large-scale studies have previously been conducted to predict viewership. This paper evaluates three machine-learning algorithms with regard to three error metrics (MAE, MAPE, and RMSE) and presents novel features for predicting viewership. Models are selected through recursive feature elimination using k-fold cross-validation with respect to MAE and MAPE separately. The models are evaluated on an independent test set and show promising results, on par with manual expert predictions; none of the models can be said to be significantly better than another. XGBoost optimized for MAPE obtained the lowest MAE of 282.54 and the lowest MAPE of 41.36% on the test set, compared to expert predictions with an MAE of 288.06 and a MAPE of 83.05%. Furthermore, the study illustrates the importance of past viewership and streamer variety for predicting future viewership.
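The model-selection setup, recursive feature elimination with k-fold cross-validation scored on MAPE around an XGBoost regressor, might look like the sketch below; the viewership features themselves (past viewership, streamer variety, etc.) are placeholders.

```python
# Hedged sketch: RFECV around XGBoost, scored on negative MAPE.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=15, n_informative=6,
                       noise=10.0, random_state=0)
y = y - y.min() + 100  # keep targets positive so MAPE is well-behaved

selector = RFECV(
    estimator=XGBRegressor(n_estimators=100, random_state=0),
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_absolute_percentage_error",
)
selector.fit(X, y)
print("features kept:", selector.support_.sum(), "of", X.shape[1])
```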
29

Computer evaluation of musical timbre transfer on drum tracks

Lee, Keon Ju 09 August 2021 (has links)
Musical timbre transfer is the task of re-rendering the musical content of a given source using the rendering style of a target sound; the source keeps its musical content, e.g., pitch, microtiming, orchestration, and syncopation. I specifically focus on transferring the style of percussive patterns extracted from polyphonic audio using a MelGAN-VC model [57], trained on the acoustic properties of each genre. Evaluating audio style transfer is challenging and typically requires user studies. An analytical methodology for evaluating musical timbre transfer, based on supervised and unsupervised learning and including visualization, is proposed and used to evaluate the MelGAN-VC model on drum tracks. The method uses audio features to analyze the results of the timbre transfer based on classification probabilities from a Random Forest classifier. The K-means algorithm clusters unlabeled instances using the same audio features, and the style-transferred results are visualized with the t-SNE dimensionality-reduction technique, which helps interpret relations between musical genres and compare results with those of the Random Forest classifier.
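A compact sketch of the proposed evaluation methodology on placeholder audio features: genre probabilities from a Random Forest, unsupervised K-means labels, and a t-SNE projection for visual comparison. Feature extraction from the audio itself is out of scope here, and the random data is purely structural.

```python
# Hedged sketch: RF genre probabilities + K-means + t-SNE on stand-in features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_orig = rng.normal(size=(120, 20))                   # features of original tracks
y_genre = rng.integers(0, 4, size=120)                # known genre labels
X_trans = X_orig + rng.normal(0, 0.3, X_orig.shape)   # style-transferred tracks

clf = RandomForestClassifier(random_state=0).fit(X_orig, y_genre)
proba = clf.predict_proba(X_trans)                    # genre-likeness of outputs
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_trans)

emb = TSNE(n_components=2, random_state=0).fit_transform(X_trans)  # 2-D view
print(proba[:3].round(2), clusters[:10], emb.shape)
```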
30

Insurance Fraud Detection using Unsupervised Sequential Anomaly Detection

Hansson, Anton, Cedervall, Hugo January 2022 (has links)
Fraud is a common crime within the insurance industry, and insurance companies want to identify fraudulent claimants quickly, as fraud often results in higher premiums for honest customers. Due to the digital transformation, the sheer volume and complexity of available data have grown, and manual fraud detection is no longer suitable. This work aims to automate the detection of fraudulent claimants and gain practical insights into fraudulent behavior using unsupervised anomaly detection, which, compared to supervised methods, allows for a more cost-efficient and practical application in the insurance industry. To obtain interpretable results and benefit from the temporal dependencies in human behavior, we propose two variations of LSTM-based autoencoders to classify sequences of insurance claims. Autoencoders can provide feature importances that give insight into the models' predictions, which is essential when models are put into practice. This approach relies on the assumption that outliers in the data are fraudulent. The models were trained and evaluated on a dataset we engineered using data from a Swedish insurance company, where the few labeled frauds that existed were used solely for validation and testing. Experimental results show state-of-the-art performance, and further evaluation shows that the combination of autoencoders and LSTMs is efficient but performs similarly to the employed baselines. This thesis provides an entry point for interested practitioners to learn the key aspects of anomaly detection within fraud detection by thoroughly discussing the subject at hand and the details of our work. / Conducted digitally via Zoom.
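One of the LSTM-based autoencoder variations might, under assumed layer sizes and input shape, look like the Keras sketch below, with reconstruction error serving as the anomaly score and a quantile threshold flagging potential fraud; the thesis's actual architectures are not specified in the abstract.

```python
# Hedged sketch of an LSTM autoencoder anomaly detector for claim
# sequences. Sizes, shapes, and the 99% threshold are assumptions;
# random data stands in for the engineered claims dataset.
import numpy as np
from tensorflow.keras import layers, models

timesteps, n_features = 20, 8  # claims per sequence, features per claim
model = models.Sequential([
    layers.Input(shape=(timesteps, n_features)),
    layers.LSTM(32),                         # encode the sequence to one vector
    layers.RepeatVector(timesteps),          # repeat it for every decoder step
    layers.LSTM(32, return_sequences=True),
    layers.TimeDistributed(layers.Dense(n_features)),
])
model.compile(optimizer="adam", loss="mse")

X = np.random.default_rng(0).normal(size=(256, timesteps, n_features)).astype("float32")
model.fit(X, X, epochs=5, batch_size=32, verbose=0)  # learn to reconstruct normal behavior

errors = np.mean((X - model.predict(X, verbose=0)) ** 2, axis=(1, 2))
threshold = np.quantile(errors, 0.99)  # flag the worst-reconstructed 1%
print("potential fraud indices:", np.where(errors > threshold)[0])
```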
