
The Compression of IoT operational data time series in vehicle embedded systems

Xing, Renzhi January 2018 (has links)
This thesis examines compression algorithms for time series operational data collected from the Controller Area Network (CAN) bus in an automotive Internet of Things (IoT) setting. The purpose of a compression algorithm is to decrease the size of a set of time series data (such as vehicle speed, wheel speed, etc.) so that the data transmitted from the vehicle is smaller, which lowers the cost of transmission while enabling potentially better offboard data analysis. The project helped improve the quality of the data collected by data analysts and reduced the cost of data transmission. Since time series compression mostly concerns data storage and transmission, the main difficulty in this project was deciding where to place the combination of compression and transmission within the limited performance of the onboard embedded systems. These embedded systems have limited hardware and software resources, so the efficiency of the compression algorithm becomes very important. Additionally, there is a tradeoff between compression ratio and real-time performance, and the error introduced by the compression algorithm must remain smaller than an expected value. The combined compression algorithm contains two phases: (1) an online lossy compression algorithm, piecewise approximation, which shrinks the total number of data samples while maintaining a guaranteed precision, and (2) a lossless compression algorithm, Delta-XOR encoding, which compresses the output of the lossy algorithm. The algorithm was tested with four typical time series samples from real CAN logs with different functions and properties. The similarities and differences between these logs are discussed; these differences helped determine which algorithms should be used in each phase. After experiments that compared the different algorithms and checked their performance, a simulation was implemented based on the experiment results. The results of this simulation show that the combined compression algorithm can meet a required compression ratio by controlling the error bound. Finally, the possibility of improving the compression algorithm in the future is discussed.
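The thesis text does not include an implementation here; the following Python sketch only illustrates the two-phase idea described in the abstract: a piecewise-constant lossy pass with a fixed error bound, followed by delta encoding of sample indices and XOR of the value bit patterns. The function names, the error bound, and the example signal are illustrative assumptions, not material from the thesis, whose exact piecewise-approximation and Delta-XOR variants are not reproduced.

import struct

def lossy_piecewise(samples, error_bound):
    """Phase 1 (assumed variant): keep a sample only when it drifts more than
    error_bound from the last kept value (piecewise-constant approximation)."""
    kept = [(0, samples[0])]
    for i, value in enumerate(samples[1:], start=1):
        if abs(value - kept[-1][1]) > error_bound:
            kept.append((i, value))
    return kept

def lossless_delta_xor(kept):
    """Phase 2 (assumed variant): delta-encode indices and XOR consecutive
    value bit patterns, so small changes yield many zero bits that pack well."""
    encoded = []
    prev_index, prev_bits = 0, 0
    for index, value in kept:
        bits = struct.unpack("<Q", struct.pack("<d", value))[0]
        encoded.append((index - prev_index, bits ^ prev_bits))
        prev_index, prev_bits = index, bits
    return encoded

# Example: a slowly varying "vehicle speed" signal with an error bound of 0.5.
speed = [50.0, 50.1, 50.2, 50.9, 51.0, 52.3, 52.4, 52.4, 53.0]
kept = lossy_piecewise(speed, error_bound=0.5)
encoded = lossless_delta_xor(kept)
print(len(speed), "samples reduced to", len(kept), "kept points")

Raising the error bound keeps fewer samples and increases the compression ratio, which mirrors the error-bound/ratio tradeoff the abstract describes.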

Experimental Study on Machine Learning with Approximation to Data Streams

Jiang, Jiani January 2019 (has links)
Real-time transfer of data streams enables many data analytics and machine learning applications in areas such as massive IoT and industrial automation. The large data volume of these streams is a significant burden not only on the transport network but also on the corresponding application servers. Therefore, researchers focus on reducing the amount of data that needs to be transferred through data compression and approximation. Data compression techniques such as lossy compression can significantly reduce data volume at the price of information loss, and how to compress the data is highly dependent on the corresponding application. However, when the decompressed data is used in a data analysis application such as machine learning, the results may be affected by the information loss. In this paper, the author studies the impact of data compression on machine learning applications. In particular, from an experimental perspective, it shows the tradeoff among the approximation error bound, the compression ratio, and the prediction accuracy of multiple machine learning methods. The author believes that, with a proper choice of parameters, data compression can dramatically reduce the amount of data transferred with limited impact on the machine learning applications.
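As a rough illustration of the tradeoff discussed in this abstract, the sketch below compresses a synthetic sensor stream under several error bounds, reconstructs it, and reports the compression ratio together with the accuracy of a simple scikit-learn regressor trained on the reconstructed data. The compression scheme, model choice, and data are assumptions for illustration; they are not the methods evaluated in the thesis.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def compress_reconstruct(signal, error_bound):
    """Piecewise-constant lossy compression plus reconstruction: a kept value
    is repeated until the signal drifts beyond the bound."""
    kept = 1
    recon = np.empty_like(signal)
    last = recon[0] = signal[0]
    for i in range(1, len(signal)):
        if abs(signal[i] - last) > error_bound:
            last = signal[i]
            kept += 1
        recon[i] = last
    return recon, len(signal) / kept   # reconstruction and compression ratio

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 2000)
x = np.sin(t) + 0.1 * rng.normal(size=t.size)   # sensor-like stream (assumed)
y = 2.0 * np.sin(t) + 0.5                       # target to predict (assumed)

for bound in (0.01, 0.05, 0.2):
    x_rec, ratio = compress_reconstruct(x, bound)
    X_train, X_test, y_train, y_test = train_test_split(
        x_rec.reshape(-1, 1), y, test_size=0.3, random_state=0)
    model = Ridge().fit(X_train, y_train)
    print(f"bound={bound}  ratio={ratio:.1f}  "
          f"R2={r2_score(y_test, model.predict(X_test)):.3f}")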

Implementation of Hierarchical and K-Means Clustering Techniques on the Trend and Seasonality Components of Temperature Profile Data

Ogedegbe, Emmanuel 01 December 2023 (has links) (PDF)
In this study, time series decomposition techniques are used in conjunction with K-means clustering and hierarchical clustering, two well-known clustering algorithms, applied to climate data, and their implementations and results are compared. The main objective is to identify similar climate trends and group geographical areas with similar environmental conditions. Climate data from specific places are collected and analyzed as part of the project, and each time series is split into trend, seasonality, and residual components. In order to categorize growing regions according to their climatic tendencies, the decomposed time series are then subjected to K-means clustering and hierarchical clustering with dynamic time warping. The resulting clusters are evaluated to understand how the climates of different regions compare and how regions cluster based on the general trend of the temperature profile over the full growing season, as opposed to the seasonality component, for the various locations.
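A minimal sketch of the workflow described above, assuming the climate data is arranged as one daily temperature column per location: decompose each series with statsmodels, keep the trend component, and cluster locations with K-means and hierarchical clustering. Note that the study pairs hierarchical clustering with dynamic time warping, while this sketch uses Euclidean (Ward) distance for brevity; the data, column names, and cluster counts are invented.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed layout: one daily temperature column per location, datetime index.
rng = np.random.default_rng(1)
idx = pd.date_range("2020-01-01", periods=3 * 365, freq="D")
temps = pd.DataFrame(
    {f"site_{i}": 15 + 10 * np.sin(2 * np.pi * idx.dayofyear / 365 + i)
                  + rng.normal(0, 1, len(idx))
     for i in range(6)},
    index=idx)

# Decompose each location's series and keep the trend component.
trends = pd.DataFrame({
    col: seasonal_decompose(temps[col], model="additive", period=365).trend
    for col in temps}).dropna()

# Cluster locations on their trend curves (rows = locations).
X = trends.T.values
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hc_labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
print("K-means:", dict(zip(trends.columns, km_labels)))
print("Hierarchical:", dict(zip(trends.columns, hc_labels)))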

Extraction of Global Features for enhancing Machine Learning Performance / Extraktion av Globala Egenskaper för förbättring av Maskininlärningsprestanda

Tesfay, Abyel January 2023 (has links)
Data science plays an essential role in helping many organizations and industries become data-driven in their decision-making and workflows, as models can provide relevant input in areas such as social media, the stock market, and manufacturing. To train models of good quality, data preparation methods such as feature extraction are used to extract relevant features. However, global features are often ignored when feature extraction is performed on time-series datasets. This thesis investigates how state-of-the-art tools and methods in data preparation and analytics can be used to extract global features, and evaluates whether such data can improve the performance of ML models. Global features refer to information that summarizes a full dataset, such as the mean and median values of a numeric dataset. They can be used as inputs to help models understand the dataset and generalize better to new data. The thesis began with a literature study of feature extraction methods, time-series data, the definition of global features, and their benefits in bioprocessing. Global features were then analyzed and extracted using tools and methods for data manipulation and feature extraction. The data used in the study consists of bioprocessing measurements of E. coli cell growth as time-series data. The global features were evaluated through a performance comparison between models trained on the dataset combined with the global features and models trained only on the full dataset. The study presents a method to extract global features with open-source tools and libraries, namely the Python language and the NumPy, Pandas, Matplotlib, and scikit-learn libraries. The quality of the global features depends on experience in data science, the complexity of the data structure, and domain knowledge. The results show that the best models, trained on the dataset and global features combined, perform on average 15-18% better than models trained only on the dataset. The performance depends on the type and number of global features combined with the dataset. Global features could be useful in manufacturing industries such as pharmaceuticals and chemicals by helping models predict the inputs that lead to the desired trends and outputs, which could help promote sustainable production in various industries.
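A minimal sketch of the global-feature idea under the assumptions above: summary statistics (mean, median, standard deviation, maximum) are computed per cultivation run with Pandas, broadcast back to every timestep, and compared against a baseline feature set using a scikit-learn model. The column names, toy data, and model are illustrative; the thesis's actual bioprocess dataset and pipeline are not reproduced.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Assumed layout: one row per timestep, a 'run_id' column grouping each
# cultivation run, a sensor reading, and a target to predict.
rng = np.random.default_rng(2)
frames = []
for run in range(20):
    t = np.arange(100)
    od = np.exp(0.03 * t) + rng.normal(0, 0.05, t.size)   # cell-density proxy
    frames.append(pd.DataFrame({
        "run_id": run, "time": t, "od": od,
        "target": 0.5 * od + rng.normal(0, 0.05, t.size)}))
data = pd.concat(frames, ignore_index=True)

# Global features: one summary row per run, broadcast back to every timestep.
global_feats = data.groupby("run_id")["od"].agg(
    od_mean="mean", od_median="median", od_std="std", od_max="max")
data = data.join(global_feats, on="run_id")

baseline = ["time", "od"]
enriched = baseline + list(global_feats.columns)
model = RandomForestRegressor(n_estimators=100, random_state=0)
for cols, name in ((baseline, "baseline"), (enriched, "with global features")):
    score = cross_val_score(model, data[cols], data["target"], cv=5).mean()
    print(f"{name}: mean R2 = {score:.3f}")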

CSAR: The Cross-Sectional Autoregression Model

Lehner, Wolfgang, Hartmann, Claudio, Hahmann, Martin, Habich, Dirk 18 January 2023 (has links)
The forecasting of time series data is an integral component of management, planning, and decision making. Following the Big Data trend, large amounts of time series data are available in many application domains. The highly dynamic and often noisy character of these domains, in combination with the logistic problems of collecting data from a large number of data sources, imposes new requirements on the forecasting process. A constantly increasing number of time series has to be forecasted, preferably with low latency and high accuracy. This is almost impossible when keeping the traditional focus on creating one forecast model for each individual time series. In addition, commonly used forecasting approaches like ARIMA need complete historical data to train forecast models and fail if time series are intermittent. A method that addresses all these new requirements is the cross-sectional forecasting approach. It utilizes available data from many time series of the same domain in one single model; thus, missing values can be compensated for and accurate forecast results can be calculated quickly. However, this approach is limited by a rigid training data selection, and existing forecasting methods show that adaptability of the model to the data increases forecast accuracy. Therefore, in this paper we present CSAR, a model that extends the cross-sectional paradigm by adding more flexibility and allowing fine-grained adaptations to the analyzed data. In this way, we achieve an increased forecast accuracy and thus a wider applicability.
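The paper's exact CSAR formulation is not reproduced here; the sketch below only illustrates the cross-sectional starting point it extends: a single autoregressive model fitted on lagged values pooled across many series of the same domain, so that series with missing values still receive forecasts. The function names and toy data are assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

def pooled_lag_matrix(series_list, lags):
    """Stack (lagged values -> next value) rows from every series into one
    training set; rows containing missing values are simply skipped."""
    X, y = [], []
    for s in series_list:
        for t in range(lags, len(s)):
            window = s[t - lags:t + 1]
            if not np.isnan(window).any():
                X.append(window[:-1])
                y.append(window[-1])
    return np.asarray(X), np.asarray(y)

rng = np.random.default_rng(3)
# Many short, noisy series from the same domain; some observations missing.
series = [np.cumsum(rng.normal(0, 1, 60)) for _ in range(50)]
for s in series[:10]:
    s[rng.integers(0, 60, 5)] = np.nan          # intermittent series

X, y = pooled_lag_matrix(series, lags=3)
model = LinearRegression().fit(X, y)            # one model for all series

# One-step-ahead forecast for any series with at least 3 recent values.
recent = series[-1][-3:]
print("next value forecast:", model.predict(recent.reshape(1, -1))[0])

Because the model is shared, a series with gaps simply contributes fewer training rows instead of breaking the fit, which is the property the abstract highlights for intermittent data.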

Modeling Credit Default Swap Spreads with Transformers : A Thesis in collaboration with Handelsbanken / Modellera Kreditswapp spreadar med Transformers : Ett projekt I samarbete med Handelsbanken

Luhr, Johan January 2023 (has links)
In the aftermath of the credit crisis in 2007, the importance of Credit Valuation Adjustment (CVA) rose in the Over The Counter (OTC) derivative pricing process. One important part of the pricing process is to determine the Probability of Default (PD) of the counterparty in question. The normal way of doing this is to use Credit Default Swap (CDS) spreads from the CDS market. In some cases there is no associated liquid CDS market, and in those cases it is market practice to use proxy CDS spreads. In this thesis, transformer models are used to generate proxy CDS spreads for a certain region, rating, and tenor from stand-alone CDS spread data. Two different models are created to do this. The first, simpler model is an encoder-based model that uses stand-alone CDS data from a single company to generate one proxy spread per inference. The second, more advanced model is an encoder-decoder model that uses stand-alone CDS data from three companies to generate one proxy spread per inference. The performance of the models is compared, and it is shown that the more advanced model outperforms the simpler model. It should be noted, however, that the simpler model is faster to train, and both models could be used for data validation. To create the transformer models, it was necessary to implement custom embeddings that embed company-specific information and temporal information about the CDS spreads. The importance of the different embeddings was also investigated, and it is clear that certain embeddings are more important than others.
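A minimal PyTorch sketch of the simpler, encoder-based idea described above: learned embeddings for corporate metadata (region, rating, tenor) and for the position in the spread history are added to a projection of the stand-alone CDS spreads before a transformer encoder regresses a proxy spread. Layer sizes, vocabulary sizes, and names are illustrative assumptions; the thesis's actual architectures and embeddings are not reproduced.

import torch
import torch.nn as nn

class ProxySpreadEncoder(nn.Module):
    """Encoder-only sketch: one stand-alone CDS spread history in, one proxy
    spread out, with learned embeddings for corporate and temporal metadata."""
    def __init__(self, n_regions=8, n_ratings=10, n_tenors=6, max_len=256, d=64):
        super().__init__()
        self.value_proj = nn.Linear(1, d)          # spread value at each day
        self.region_emb = nn.Embedding(n_regions, d)
        self.rating_emb = nn.Embedding(n_ratings, d)
        self.tenor_emb = nn.Embedding(n_tenors, d)
        self.time_emb = nn.Embedding(max_len, d)   # position in the history
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, 1)

    def forward(self, spreads, region, rating, tenor):
        # spreads: (batch, seq_len); categorical ids: (batch,)
        b, t = spreads.shape
        pos = torch.arange(t, device=spreads.device).expand(b, t)
        x = (self.value_proj(spreads.unsqueeze(-1))
             + self.time_emb(pos)
             + self.region_emb(region).unsqueeze(1)
             + self.rating_emb(rating).unsqueeze(1)
             + self.tenor_emb(tenor).unsqueeze(1))
        return self.head(self.encoder(x).mean(dim=1)).squeeze(-1)

model = ProxySpreadEncoder()
spreads = torch.randn(4, 120)                      # 4 companies, 120 days
region = torch.randint(0, 8, (4,))
rating = torch.randint(0, 10, (4,))
tenor = torch.randint(0, 6, (4,))
print(model(spreads, region, rating, tenor).shape)  # torch.Size([4])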

Restaurant Daily Revenue Prediction : Utilizing Synthetic Time Series Data for Improved Model Performance

Jarlöv, Stella, Svensson Dahl, Anton January 2023 (has links)
This study aims to enhance the accuracy of a demand forecasting model, XGBoost, by incorporating synthetic multivariate restaurant time series data during the training process. The research addresses the limited availability of training data by generating synthetic data using TimeGAN, a generative adversarial deep neural network tailored for time series data. A one-year daily time series dataset, comprising numerical and categorical features based on a real restaurant's sales history, supplemented by relevant external data, serves as the original data. TimeGAN learns from this dataset to create synthetic data that closely resembles the original data in terms of temporal and distributional dynamics. Statistical and visual analyses demonstrate a strong similarity between the synthetic and original data. To evaluate the usefulness of the synthetic data, an experiment is conducted where varying lengths of synthetic data are iteratively combined with the one-year real dataset. Each iteration involves retraining the XGBoost model and assessing its accuracy for a one-week forecast using the Root Mean Square Error (RMSE). The results indicate that incorporating 6 years of synthetic data improves the model's performance by 65%. The hyperparameter configurations suggest that deeper tree structures benefit the XGBoost model when synthetic data is added. Furthermore, the model exhibits improved feature selection with an increased amount of training data. This study demonstrates that incorporating synthetic data closely resembling the original data can effectively enhance the accuracy of predictive models, particularly when training data is limited.
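A sketch of the evaluation loop described above, assuming the TimeGAN-generated synthetic table is already available and shares the engineered columns of the real one-year dataset (both are replaced by toy frames here): varying amounts of synthetic data are appended to the real training data, an XGBoost model is retrained, and the one-week forecast is scored with RMSE. Column names, hyperparameters, and the toy data generator are illustrative assumptions.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)

def make_days(n, start="2022-01-01"):
    """Toy stand-in for both the real sales history and the TimeGAN output."""
    idx = pd.date_range(start, periods=n, freq="D")
    df = pd.DataFrame({
        "day_of_week": idx.dayofweek,
        "is_holiday": (rng.random(n) < 0.03).astype(int),
        "temperature": 10 + 15 * np.sin(2 * np.pi * idx.dayofyear / 365)
                       + rng.normal(0, 2, n)})
    df["revenue"] = (2000 + 400 * (df.day_of_week >= 4) + 30 * df.temperature
                     + rng.normal(0, 150, n))
    return df

def rmse_for_augmentation(real, synthetic, n_synth_days, feature_cols, target):
    """Train on the real history plus a slice of synthetic days and score a
    one-week-ahead forecast on the last 7 real days."""
    train_real, test = real.iloc[:-7], real.iloc[-7:]
    train = pd.concat([train_real, synthetic.iloc[:n_synth_days]])
    model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
    model.fit(train[feature_cols], train[target])
    pred = model.predict(test[feature_cols])
    return float(np.sqrt(mean_squared_error(test[target], pred)))

real = make_days(365)            # stands in for the real one-year dataset
synthetic = make_days(6 * 365)   # stands in for the TimeGAN-generated data
features = ["day_of_week", "is_holiday", "temperature"]
for years in (0, 2, 6):
    rmse = rmse_for_augmentation(real, synthetic, years * 365, features, "revenue")
    print(f"{years} synthetic years -> one-week RMSE {rmse:.1f}")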

From Data to Decision : Data Analysis for Optimal Office Development

Mattsson, Josefine January 2024 (has links)
The slow integration of digital tools in the real estate industry, particularly for analyzing building data, presents significant yet underexploited potential. This thesis explores the use of occupancy sensor data and room attributes from three office buildings, demonstrating how analytical methods can enhance architectural planning and inform design decisions. Room features such as size, floor plan placement, presence of screens, video solutions, whiteboards, windows, table shapes, restricted access, and proximity to amenities like coffee machines and printers were examined for their influence on space utilization. Two datasets were analyzed: one recording daily room usage and the other summarizing usage over a consistent timeframe. Analytical methods included centered moving averages, seasonal decomposition, panel data analysis models such as between and mixed effects models, various regression techniques, decision trees, random forests, Extreme Gradient Boosting (XGBoost), and K-means clustering. Results revealed consistent seasonal patterns and identified key room attributes affecting usage, such as proximity to amenities, screen availability, floor level, and room size. Basic techniques proved valuable for initial data exploration, while advanced models uncovered critical patterns, with random forest and XGBoost showing high predictive accuracy. The findings emphasize the importance of diverse analytical techniques in understanding room usage. This study underscores the value of further exploration in refining models, incorporating additional factors, and improving prediction accuracy. It highlights the significant potential for cost reduction, time savings, and innovative solutions in the real estate industry.
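A small sketch of two of the techniques named above, applied to an invented occupancy table: a centered moving average to expose the weekly pattern in one room's daily usage, and a random forest that ranks room attributes by their influence on utilization. The column names and data are assumptions; the buildings' sensor data and the full set of models from the thesis are not reproduced.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)

# Invented room-attribute table: one row per room.
rooms = pd.DataFrame({
    "size_m2": rng.integers(6, 40, 200),
    "floor": rng.integers(0, 8, 200),
    "has_screen": rng.integers(0, 2, 200),
    "near_coffee": rng.integers(0, 2, 200)})
usage = (0.3 + 0.2 * rooms.has_screen + 0.15 * rooms.near_coffee
         - 0.01 * rooms.floor + rng.normal(0, 0.05, 200)).clip(0, 1)

# Centered moving average over a weekly window for one room's daily usage.
daily = pd.Series(usage[0] + 0.1 * np.sin(np.arange(90) * 2 * np.pi / 7)
                  + rng.normal(0, 0.03, 90),
                  index=pd.date_range("2023-01-02", periods=90, freq="D"))
weekly_trend = daily.rolling(window=7, center=True).mean()

# Rank room attributes by their influence on average utilization.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(rooms, usage)
print(pd.Series(forest.feature_importances_, index=rooms.columns)
        .sort_values(ascending=False))
print(weekly_trend.dropna().head())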

Interactive Identification of Anomalies in Measurement Data from Circuit Board Testing / Interaktiv identifiering av avvikelser i mätdata från testning av kretskort

Berglund, Ebba, Kazemi, Baset January 2024 (has links)
Visualization is a powerful tool in data analysis, especially for identifying anomalies. Being able to identify faulty components in electronics efficiently can considerably improve and develop production processes. By clearly displaying the correlation between faulty and working components, analysts can identify key components that cause defective products. Multivariate data and multivariate time series data place high demands on visualizations because of their complexity. The high dimensionality can lead to issues such as overlapping and hidden patterns, depending on the visualization technique used. To achieve effective visualization of multivariate data and multivariate time series data, both trends over time and correlations between different variables must be shown. This study was conducted in cooperation with the consulting company Syntronic AB to identify suitable visualization techniques for data gathered when testing circuit boards. The methodology used is design science, which includes a literature study, the development of a prototype, and an evaluation of the prototype. The prototype consists of three visualization techniques: categorical heatmap, parallel coordinates, and scatterplot. These techniques were systematically compared to assess their effectiveness. The evaluation consists of quantitative methods, such as time measurements and a survey, and the qualitative method of interviews. The result of the study presents the developed prototype and the analysis of the evaluation. The study found categorical heatmaps effective for identifying relationships between anomalies in multivariate data: although all users found the visualization difficult to interpret at first glance, they considered it effective at displaying correlations between anomalies. Parallel coordinates were perceived as difficult to interpret and ineffective for high-dimensional datasets where all dimensions cannot be displayed simultaneously; interactive options, such as a tree view for selecting which test points to visualize, were suggested to further improve their usefulness. Scatterplots proved useful for analyzing individual test points and showed general trends in a clear and understandable way. Furthermore, the study showed that interactivity affects the perception of visualizations: limited interactivity caused users to find the visualizations less effective for distinguishing anomalies and less user-friendly.
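A minimal matplotlib/pandas sketch of the three techniques compared in the prototype, applied to an invented table of circuit-board test points with a pass/fail label; the real prototype, its interactivity, and the Syntronic data are not reproduced.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(6)

# Invented measurements: 40 circuit boards x 6 test points, plus a pass/fail label.
boards = pd.DataFrame(rng.normal(0, 1, (40, 6)),
                      columns=[f"tp_{i}" for i in range(6)])
boards["status"] = np.where(boards["tp_2"] > boards["tp_2"].median(),
                            "fail", "pass")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Categorical heatmap: value of every test point on the failing boards.
fails = boards[boards["status"] == "fail"].drop(columns="status")
axes[0].imshow(fails.values, aspect="auto", cmap="coolwarm")
axes[0].set_xticks(range(6))
axes[0].set_xticklabels(fails.columns)
axes[0].set_title("Categorical heatmap (failing boards)")

# 2. Parallel coordinates: one line per board, coloured by status.
parallel_coordinates(boards, class_column="status", ax=axes[1],
                     color=("tab:red", "tab:green"))
axes[1].set_title("Parallel coordinates")

# 3. Scatterplot: two test points plotted against each other.
for status, group in boards.groupby("status"):
    axes[2].scatter(group["tp_2"], group["tp_4"], label=status)
axes[2].set_xlabel("tp_2")
axes[2].set_ylabel("tp_4")
axes[2].legend()
axes[2].set_title("Scatterplot")

plt.tight_layout()
plt.show()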

Efficient Resource Management : A Comparison of Predictive Scaling Algorithms in Cloud-Based Applications

Dahl, Johanna, Strömbäck, Elsa January 2024 (has links)
This study aims to explore predictive scaling algorithms used to predict and manage workloads in a containerized system. The goal is to identify which predictive scaling approach delivers the most effective results, contributing to research on cloud elasticity and resource management. This potentially leads to reduced infrastructure costs while maintaining efficient performance, enabling a more sustainable cloud-computing technology. The work involved the development and comparison of three different autoscaling algorithms with an interchangeable prediction component. For the predictive part, three different time-series analysis methods were used: XGBoost, ARIMA, and Prophet. A simulation system with the necessary modules was developed, as well as a designated target service to experience the load. Each algorithm's scaling accuracy was evaluated by comparing its suggested number of instances to the optimal number, with each instance representing a simulated CPU core. The results showed varying efficiency: XGBoost and Prophet excelled with richer datasets, while ARIMA performed better with limited data. Although XGBoost and Prophet maintained 100% uptime, this could lead to resource wastage, whereas ARIMA's lower uptime percentage possibly suggested a more resource-efficient, though less reliable, approach. Further analysis, particularly experimental investigation, is required to deepen the understanding of these predictors' influence on resource allocation.
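A minimal sketch of one predictive scaling step along the lines described above: a time series model (ARIMA via statsmodels here; XGBoost or Prophet could be swapped into the same prediction component) forecasts the next interval's load, and the forecast is converted into a suggested instance count. The capacity per instance, headroom factor, and synthetic load are illustrative assumptions, not values from the study.

import math
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

CAPACITY_PER_INSTANCE = 100.0   # requests/s one simulated CPU core can absorb
HEADROOM = 1.2                  # safety margin applied on top of the forecast

def suggest_instances(load_history, steps_ahead=1):
    """Fit a small ARIMA model on the observed load and translate the
    forecast for the next interval into a number of instances."""
    model = ARIMA(load_history, order=(2, 1, 1)).fit()
    forecast = float(model.forecast(steps=steps_ahead)[-1])
    forecast = max(forecast, 0.0)
    return max(1, math.ceil(forecast * HEADROOM / CAPACITY_PER_INSTANCE))

# Synthetic diurnal-looking load in requests per second.
rng = np.random.default_rng(7)
t = np.arange(288)                                   # 24 h of 5-minute samples
load = 400 + 250 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 20, t.size)

print("suggested instances:", suggest_instances(load))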
