
The Compression of IoT operational data time series in vehicle embedded systems

Xing, Renzhi January 2018 (has links)
This thesis examines compression algorithms for time series operational data collected from the Controller Area Network (CAN) bus in an automotive Internet of Things (IoT) setting. The purpose of a compression algorithm is to decrease the size of a set of time series data (such as vehicle speed, wheel speed, etc.) so that the data transmitted from the vehicle is smaller, which lowers the cost of transmission while enabling potentially better offboard data analysis. The project helped improve the quality of the data collected by data analysts and reduced the cost of data transmission. Since time series compression mostly concerns data storage and transmission, the main difficulty in this project was deciding where to place the combination of compression and transmission within the limited performance of the onboard embedded systems. These embedded systems have limited hardware and software resources, so the efficiency of the compression algorithm becomes very important. Additionally, there is a tradeoff between compression ratio and real-time performance, and the error introduced by the compression algorithm must remain smaller than an expected value. The combined compression algorithm contains two phases: (1) an online lossy compression algorithm, piecewise approximation, which shrinks the total number of data samples while maintaining a guaranteed precision, and (2) a lossless compression algorithm, Delta-XOR encoding, which compresses the output of the lossy algorithm. The algorithm was tested with four typical time series samples from real CAN logs with different functions and properties. The similarities and differences between these logs are discussed; these differences helped determine which algorithms should be used in each phase. After experiments that compared the different algorithms and checked their performance, a simulation was implemented based on the experiment results. The results of this simulation show that the combined compression algorithm can meet a required compression ratio by controlling the error bound. Finally, the possibility of improving the compression algorithm in the future is discussed.
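The thesis text does not include an implementation here; the following Python sketch only illustrates the two-phase idea described in the abstract: a piecewise-constant lossy pass with a fixed error bound, followed by delta encoding of sample indices and XOR of the value bit patterns. The function names, the error bound, and the example signal are illustrative assumptions, not material from the thesis, whose exact piecewise-approximation and Delta-XOR variants are not reproduced.

import struct

def lossy_piecewise(samples, error_bound):
    """Phase 1 (assumed variant): keep a sample only when it drifts more than
    error_bound from the last kept value (piecewise-constant approximation)."""
    kept = [(0, samples[0])]
    for i, value in enumerate(samples[1:], start=1):
        if abs(value - kept[-1][1]) > error_bound:
            kept.append((i, value))
    return kept

def lossless_delta_xor(kept):
    """Phase 2 (assumed variant): delta-encode indices and XOR consecutive
    value bit patterns, so small changes yield many zero bits that pack well."""
    encoded = []
    prev_index, prev_bits = 0, 0
    for index, value in kept:
        bits = struct.unpack("<Q", struct.pack("<d", value))[0]
        encoded.append((index - prev_index, bits ^ prev_bits))
        prev_index, prev_bits = index, bits
    return encoded

# Example: a slowly varying "vehicle speed" signal with an error bound of 0.5.
speed = [50.0, 50.1, 50.2, 50.9, 51.0, 52.3, 52.4, 52.4, 53.0]
kept = lossy_piecewise(speed, error_bound=0.5)
encoded = lossless_delta_xor(kept)
print(len(speed), "samples reduced to", len(kept), "kept points")

Raising the error bound keeps fewer samples and increases the compression ratio, which mirrors the error-bound/ratio tradeoff the abstract describes.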

Experimental Study on Machine Learning with Approximation to Data Streams

Jiang, Jiani January 2019 (has links)
Real-time transfer of data streams enables many data analytics and machine learning applications in areas such as massive IoT and industrial automation. The large data volume of these streams is a significant burden not only on the transport network but also on the corresponding application servers. Therefore, researchers focus on reducing the amount of data that needs to be transferred through data compression and approximation. Data compression techniques such as lossy compression can significantly reduce data volume at the price of information loss, and how to compress the data is highly dependent on the corresponding application. However, when the decompressed data is used in a data analysis application such as machine learning, the results may be affected by the information loss. In this paper, the author studies the impact of data compression on machine learning applications. In particular, from an experimental perspective, it shows the tradeoff among the approximation error bound, the compression ratio, and the prediction accuracy of multiple machine learning methods. The author believes that, with a proper choice of parameters, data compression can dramatically reduce the amount of data transferred with limited impact on the machine learning applications.
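As a rough illustration of the tradeoff discussed in this abstract, the sketch below compresses a synthetic sensor stream under several error bounds, reconstructs it, and reports the compression ratio together with the accuracy of a simple scikit-learn regressor trained on the reconstructed data. The compression scheme, model choice, and data are assumptions for illustration; they are not the methods evaluated in the thesis.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def compress_reconstruct(signal, error_bound):
    """Piecewise-constant lossy compression plus reconstruction: a kept value
    is repeated until the signal drifts beyond the bound."""
    kept = 1
    recon = np.empty_like(signal)
    last = recon[0] = signal[0]
    for i in range(1, len(signal)):
        if abs(signal[i] - last) > error_bound:
            last = signal[i]
            kept += 1
        recon[i] = last
    return recon, len(signal) / kept   # reconstruction and compression ratio

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 2000)
x = np.sin(t) + 0.1 * rng.normal(size=t.size)   # sensor-like stream (assumed)
y = 2.0 * np.sin(t) + 0.5                       # target to predict (assumed)

for bound in (0.01, 0.05, 0.2):
    x_rec, ratio = compress_reconstruct(x, bound)
    X_train, X_test, y_train, y_test = train_test_split(
        x_rec.reshape(-1, 1), y, test_size=0.3, random_state=0)
    model = Ridge().fit(X_train, y_train)
    print(f"bound={bound}  ratio={ratio:.1f}  "
          f"R2={r2_score(y_test, model.predict(X_test)):.3f}")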

Implementation of Hierarchical and K-Means Clustering Techniques on the Trend and Seasonality Components of Temperature Profile Data

Ogedegbe, Emmanuel 01 December 2023 (has links) (PDF)
In this study, time series decomposition techniques are used in conjunction with K-means clustering and hierarchical clustering, two well-known clustering algorithms, applied to climate data, and their implementations and results are compared. The main objective is to identify similar climate trends and group geographical areas with similar environmental conditions. Climate data from specific places are collected and analyzed as part of the project, and each time series is split into trend, seasonality, and residual components. In order to categorize growing regions according to their climatic tendencies, the decomposed time series are then subjected to K-means clustering and hierarchical clustering with dynamic time warping. The resulting clusters are evaluated to understand how the climates of different regions compare and how regions cluster based on the general trend of the temperature profile over the full growing season, as opposed to the seasonality component, for the various locations.
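A minimal sketch of the workflow described above, assuming the climate data is arranged as one daily temperature column per location: decompose each series with statsmodels, keep the trend component, and cluster locations with K-means and hierarchical clustering. Note that the study pairs hierarchical clustering with dynamic time warping, while this sketch uses Euclidean (Ward) distance for brevity; the data, column names, and cluster counts are invented.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed layout: one daily temperature column per location, datetime index.
rng = np.random.default_rng(1)
idx = pd.date_range("2020-01-01", periods=3 * 365, freq="D")
temps = pd.DataFrame(
    {f"site_{i}": 15 + 10 * np.sin(2 * np.pi * idx.dayofyear / 365 + i)
                  + rng.normal(0, 1, len(idx))
     for i in range(6)},
    index=idx)

# Decompose each location's series and keep the trend component.
trends = pd.DataFrame({
    col: seasonal_decompose(temps[col], model="additive", period=365).trend
    for col in temps}).dropna()

# Cluster locations on their trend curves (rows = locations).
X = trends.T.values
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hc_labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
print("K-means:", dict(zip(trends.columns, km_labels)))
print("Hierarchical:", dict(zip(trends.columns, hc_labels)))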

Extraction of Global Features for enhancing Machine Learning Performance / Extraktion av Globala Egenskaper för förbättring av Maskininlärningsprestanda

Tesfay, Abyel January 2023 (has links)
Data science plays an essential role in helping many organizations and industries become data-driven in their decision-making and workflows, as models can provide relevant input in areas such as social media, the stock market, and manufacturing. To train models of good quality, data preparation methods such as feature extraction are used to extract relevant features. However, global features are often ignored when feature extraction is performed on time-series datasets. This thesis investigates how state-of-the-art tools and methods in data preparation and analytics can be used to extract global features, and evaluates whether such data can improve the performance of ML models. Global features refer to information that summarizes a full dataset, such as the mean and median values of a numeric dataset. They can be used as inputs to help models understand the dataset and generalize better to new data. The thesis began with a literature study of feature extraction methods, time-series data, the definition of global features, and their benefits in bioprocessing. Global features were then analyzed and extracted using tools and methods for data manipulation and feature extraction. The data used in the study consists of bioprocessing measurements of E. coli cell growth as time-series data. The global features were evaluated through a performance comparison between models trained on the dataset combined with the global features and models trained only on the full dataset. The study presents a method to extract global features with open-source tools and libraries, namely the Python language and the NumPy, Pandas, Matplotlib, and scikit-learn libraries. The quality of the global features depends on experience in data science, the complexity of the data structure, and domain knowledge. The results show that the best models, trained on the dataset and global features combined, perform on average 15-18% better than models trained only on the dataset. The performance depends on the type and number of global features combined with the dataset. Global features could be useful in manufacturing industries such as pharmaceuticals and chemicals by helping models predict the inputs that lead to the desired trends and outputs, which could help promote sustainable production in various industries.
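A minimal sketch of the global-feature idea under the assumptions above: summary statistics (mean, median, standard deviation, maximum) are computed per cultivation run with Pandas, broadcast back to every timestep, and compared against a baseline feature set using a scikit-learn model. The column names, toy data, and model are illustrative; the thesis's actual bioprocess dataset and pipeline are not reproduced.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Assumed layout: one row per timestep, a 'run_id' column grouping each
# cultivation run, a sensor reading, and a target to predict.
rng = np.random.default_rng(2)
frames = []
for run in range(20):
    t = np.arange(100)
    od = np.exp(0.03 * t) + rng.normal(0, 0.05, t.size)   # cell-density proxy
    frames.append(pd.DataFrame({
        "run_id": run, "time": t, "od": od,
        "target": 0.5 * od + rng.normal(0, 0.05, t.size)}))
data = pd.concat(frames, ignore_index=True)

# Global features: one summary row per run, broadcast back to every timestep.
global_feats = data.groupby("run_id")["od"].agg(
    od_mean="mean", od_median="median", od_std="std", od_max="max")
data = data.join(global_feats, on="run_id")

baseline = ["time", "od"]
enriched = baseline + list(global_feats.columns)
model = RandomForestRegressor(n_estimators=100, random_state=0)
for cols, name in ((baseline, "baseline"), (enriched, "with global features")):
    score = cross_val_score(model, data[cols], data["target"], cv=5).mean()
    print(f"{name}: mean R2 = {score:.3f}")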

CSAR: The Cross-Sectional Autoregression Model

Lehner, Wolfgang, Hartmann, Claudio, Hahmann, Martin, Habich, Dirk 18 January 2023 (has links)
The forecasting of time series data is an integral component of management, planning, and decision making. Following the Big Data trend, large amounts of time series data are available in many application domains. The highly dynamic and often noisy character of these domains, in combination with the logistic problems of collecting data from a large number of data sources, imposes new requirements on the forecasting process. A constantly increasing number of time series has to be forecasted, preferably with low latency and high accuracy. This is almost impossible when keeping the traditional focus on creating one forecast model for each individual time series. In addition, commonly used forecasting approaches like ARIMA need complete historical data to train forecast models and fail if time series are intermittent. A method that addresses all these new requirements is the cross-sectional forecasting approach. It utilizes available data from many time series of the same domain in one single model; thus, missing values can be compensated for and accurate forecast results can be calculated quickly. However, this approach is limited by a rigid training data selection, and existing forecasting methods show that adaptability of the model to the data increases forecast accuracy. Therefore, in this paper we present CSAR, a model that extends the cross-sectional paradigm by adding more flexibility and allowing fine-grained adaptations to the analyzed data. In this way, we achieve an increased forecast accuracy and thus a wider applicability.
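The paper's exact CSAR formulation is not reproduced here; the sketch below only illustrates the cross-sectional starting point it extends: a single autoregressive model fitted on lagged values pooled across many series of the same domain, so that series with missing values still receive forecasts. The function names and toy data are assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

def pooled_lag_matrix(series_list, lags):
    """Stack (lagged values -> next value) rows from every series into one
    training set; rows containing missing values are simply skipped."""
    X, y = [], []
    for s in series_list:
        for t in range(lags, len(s)):
            window = s[t - lags:t + 1]
            if not np.isnan(window).any():
                X.append(window[:-1])
                y.append(window[-1])
    return np.asarray(X), np.asarray(y)

rng = np.random.default_rng(3)
# Many short, noisy series from the same domain; some observations missing.
series = [np.cumsum(rng.normal(0, 1, 60)) for _ in range(50)]
for s in series[:10]:
    s[rng.integers(0, 60, 5)] = np.nan          # intermittent series

X, y = pooled_lag_matrix(series, lags=3)
model = LinearRegression().fit(X, y)            # one model for all series

# One-step-ahead forecast for any series with at least 3 recent values.
recent = series[-1][-3:]
print("next value forecast:", model.predict(recent.reshape(1, -1))[0])

Because the model is shared, a series with gaps simply contributes fewer training rows instead of breaking the fit, which is the property the abstract highlights for intermittent data.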

Modeling Credit Default Swap Spreads with Transformers : A Thesis in collaboration with Handelsbanken / Modellera Kreditswapp spreadar med Transformers : Ett projekt I samarbete med Handelsbanken

Luhr, Johan January 2023 (has links)
In the aftermath of the credit crisis in 2007, the importance of Credit Valuation Adjustment (CVA) rose in the Over The Counter (OTC) derivative pricing process. One important part of the pricing process is to determine the Probability of Default (PD) of the counterparty in question. The normal way of doing this is to use Credit Default Swap (CDS) spreads from the CDS market. In some cases there is no associated liquid CDS market, and in those cases it is market practice to use proxy CDS spreads. In this thesis, transformer models are used to generate proxy CDS spreads for a certain region, rating, and tenor from stand-alone CDS spread data. Two different models are created to do this. The first, simpler model is an encoder-based model that uses stand-alone CDS data from a single company to generate one proxy spread per inference. The second, more advanced model is an encoder-decoder model that uses stand-alone CDS data from three companies to generate one proxy spread per inference. The performance of the models is compared, and it is shown that the more advanced model outperforms the simpler model. It should be noted, however, that the simpler model is faster to train, and both models could be used for data validation. To create the transformer models, it was necessary to implement custom embeddings that embed company-specific information and temporal information about the CDS spreads. The importance of the different embeddings was also investigated, and it is clear that certain embeddings are more important than others.
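A minimal PyTorch sketch of the simpler, encoder-based idea described above: learned embeddings for corporate metadata (region, rating, tenor) and for the position in the spread history are added to a projection of the stand-alone CDS spreads before a transformer encoder regresses a proxy spread. Layer sizes, vocabulary sizes, and names are illustrative assumptions; the thesis's actual architectures and embeddings are not reproduced.

import torch
import torch.nn as nn

class ProxySpreadEncoder(nn.Module):
    """Encoder-only sketch: one stand-alone CDS spread history in, one proxy
    spread out, with learned embeddings for corporate and temporal metadata."""
    def __init__(self, n_regions=8, n_ratings=10, n_tenors=6, max_len=256, d=64):
        super().__init__()
        self.value_proj = nn.Linear(1, d)          # spread value at each day
        self.region_emb = nn.Embedding(n_regions, d)
        self.rating_emb = nn.Embedding(n_ratings, d)
        self.tenor_emb = nn.Embedding(n_tenors, d)
        self.time_emb = nn.Embedding(max_len, d)   # position in the history
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, 1)

    def forward(self, spreads, region, rating, tenor):
        # spreads: (batch, seq_len); categorical ids: (batch,)
        b, t = spreads.shape
        pos = torch.arange(t, device=spreads.device).expand(b, t)
        x = (self.value_proj(spreads.unsqueeze(-1))
             + self.time_emb(pos)
             + self.region_emb(region).unsqueeze(1)
             + self.rating_emb(rating).unsqueeze(1)
             + self.tenor_emb(tenor).unsqueeze(1))
        return self.head(self.encoder(x).mean(dim=1)).squeeze(-1)

model = ProxySpreadEncoder()
spreads = torch.randn(4, 120)                      # 4 companies, 120 days
region = torch.randint(0, 8, (4,))
rating = torch.randint(0, 10, (4,))
tenor = torch.randint(0, 6, (4,))
print(model(spreads, region, rating, tenor).shape)  # torch.Size([4])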

Restaurant Daily Revenue Prediction : Utilizing Synthetic Time Series Data for Improved Model Performance

Jarlöv, Stella, Svensson Dahl, Anton January 2023 (has links)
This study aims to enhance the accuracy of a demand forecasting model, XGBoost, by incorporating synthetic multivariate restaurant time series data during the training process. The research addresses the limited availability of training data by generating synthetic data using TimeGAN, a generative adversarial deep neural network tailored for time series data. A one-year daily time series dataset, comprising numerical and categorical features based on a real restaurant's sales history, supplemented by relevant external data, serves as the original data. TimeGAN learns from this dataset to create synthetic data that closely resembles the original data in terms of temporal and distributional dynamics. Statistical and visual analyses demonstrate a strong similarity between the synthetic and original data. To evaluate the usefulness of the synthetic data, an experiment is conducted where varying lengths of synthetic data are iteratively combined with the one-year real dataset. Each iteration involves retraining the XGBoost model and assessing its accuracy for a one-week forecast using the Root Mean Square Error (RMSE). The results indicate that incorporating 6 years of synthetic data improves the model's performance by 65%. The hyperparameter configurations suggest that deeper tree structures benefit the XGBoost model when synthetic data is added. Furthermore, the model exhibits improved feature selection with an increased amount of training data. This study demonstrates that incorporating synthetic data closely resembling the original data can effectively enhance the accuracy of predictive models, particularly when training data is limited.
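A sketch of the evaluation loop described above, assuming the TimeGAN-generated synthetic table is already available and shares the engineered columns of the real one-year dataset (both are replaced by toy frames here): varying amounts of synthetic data are appended to the real training data, an XGBoost model is retrained, and the one-week forecast is scored with RMSE. Column names, hyperparameters, and the toy data generator are illustrative assumptions.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)

def make_days(n, start="2022-01-01"):
    """Toy stand-in for both the real sales history and the TimeGAN output."""
    idx = pd.date_range(start, periods=n, freq="D")
    df = pd.DataFrame({
        "day_of_week": idx.dayofweek,
        "is_holiday": (rng.random(n) < 0.03).astype(int),
        "temperature": 10 + 15 * np.sin(2 * np.pi * idx.dayofyear / 365)
                       + rng.normal(0, 2, n)})
    df["revenue"] = (2000 + 400 * (df.day_of_week >= 4) + 30 * df.temperature
                     + rng.normal(0, 150, n))
    return df

def rmse_for_augmentation(real, synthetic, n_synth_days, feature_cols, target):
    """Train on the real history plus a slice of synthetic days and score a
    one-week-ahead forecast on the last 7 real days."""
    train_real, test = real.iloc[:-7], real.iloc[-7:]
    train = pd.concat([train_real, synthetic.iloc[:n_synth_days]])
    model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
    model.fit(train[feature_cols], train[target])
    pred = model.predict(test[feature_cols])
    return float(np.sqrt(mean_squared_error(test[target], pred)))

real = make_days(365)            # stands in for the real one-year dataset
synthetic = make_days(6 * 365)   # stands in for the TimeGAN-generated data
features = ["day_of_week", "is_holiday", "temperature"]
for years in (0, 2, 6):
    rmse = rmse_for_augmentation(real, synthetic, years * 365, features, "revenue")
    print(f"{years} synthetic years -> one-week RMSE {rmse:.1f}")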

From Data to Decision : Data Analysis for Optimal Office Development

Mattsson, Josefine January 2024 (has links)
The slow integration of digital tools in the real estate industry, particularly for analyzing building data, presents significant yet underexploited potential. This thesis explores the use of occupancy sensor data and room attributes from three office buildings, demonstrating how analytical methods can enhance architectural planning and inform design decisions. Room features such as size, floor plan placement, presence of screens, video solutions, whiteboards, windows, table shapes, restricted access, and proximity to amenities like coffee machines and printers were examined for their influence on space utilization. Two datasets were analyzed: one recording daily room usage and the other summarizing usage over a consistent timeframe. Analytical methods included centered moving averages, seasonal decomposition, panel data analysis models such as between and mixed effects models, various regression techniques, decision trees, random forests, Extreme Gradient Boosting (XGBoost), and K-means clustering. Results revealed consistent seasonal patterns and identified key room attributes affecting usage, such as proximity to amenities, screen availability, floor level, and room size. Basic techniques proved valuable for initial data exploration, while advanced models uncovered critical patterns, with random forest and XGBoost showing high predictive accuracy. The findings emphasize the importance of diverse analytical techniques in understanding room usage. This study underscores the value of further exploration in refining models, incorporating additional factors, and improving prediction accuracy. It highlights the significant potential for cost reduction, time savings, and innovative solutions in the real estate industry.
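A small sketch of two of the techniques named above, applied to an invented occupancy table: a centered moving average to expose the weekly pattern in one room's daily usage, and a random forest that ranks room attributes by their influence on utilization. The column names and data are assumptions; the buildings' sensor data and the full set of models from the thesis are not reproduced.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)

# Invented room-attribute table: one row per room.
rooms = pd.DataFrame({
    "size_m2": rng.integers(6, 40, 200),
    "floor": rng.integers(0, 8, 200),
    "has_screen": rng.integers(0, 2, 200),
    "near_coffee": rng.integers(0, 2, 200)})
usage = (0.3 + 0.2 * rooms.has_screen + 0.15 * rooms.near_coffee
         - 0.01 * rooms.floor + rng.normal(0, 0.05, 200)).clip(0, 1)

# Centered moving average over a weekly window for one room's daily usage.
daily = pd.Series(usage[0] + 0.1 * np.sin(np.arange(90) * 2 * np.pi / 7)
                  + rng.normal(0, 0.03, 90),
                  index=pd.date_range("2023-01-02", periods=90, freq="D"))
weekly_trend = daily.rolling(window=7, center=True).mean()

# Rank room attributes by their influence on average utilization.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(rooms, usage)
print(pd.Series(forest.feature_importances_, index=rooms.columns)
        .sort_values(ascending=False))
print(weekly_trend.dropna().head())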

Interactive Identification of Anomalies in Measurement Data from Circuit Board Testing / Interaktiv identifiering av avvikelser i mätdata från testning av kretskort

Berglund, Ebba, Kazemi, Baset January 2024 (has links)
Visualization is a powerful tool in data analysis, especially for identifying anomalies. Being able to identify faulty components in electronics efficiently can considerably improve and develop production processes. By clearly displaying the correlation between faulty and working components, analysts can identify key components that cause defective products. Multivariate data and multivariate time series data place high demands on visualizations because of their complexity. The high dimensionality can lead to issues such as overlapping and hidden patterns, depending on the visualization technique used. To achieve effective visualization of multivariate data and multivariate time series data, both trends over time and correlations between different variables must be shown. This study was conducted in cooperation with the consulting company Syntronic AB to identify suitable visualization techniques for data gathered when testing circuit boards. The methodology used is design science, which includes a literature study, the development of a prototype, and an evaluation of the prototype. The prototype consists of three visualization techniques: categorical heatmap, parallel coordinates, and scatterplot. These techniques were systematically compared to assess their effectiveness. The evaluation consists of quantitative methods, such as time measurements and a survey, and the qualitative method of interviews. The result of the study presents the developed prototype and the analysis of the evaluation. The study found categorical heatmaps effective for identifying relationships between anomalies in multivariate data: although all users found the visualization difficult to interpret at first glance, they considered it effective at displaying correlations between anomalies. Parallel coordinates were perceived as difficult to interpret and ineffective for high-dimensional datasets where all dimensions cannot be displayed simultaneously; interactive options, such as a tree view for selecting which test points to visualize, were suggested to further improve their usefulness. Scatterplots proved useful for analyzing individual test points and showed general trends in a clear and understandable way. Furthermore, the study showed that interactivity affects the perception of visualizations: limited interactivity caused users to find the visualizations less effective for distinguishing anomalies and less user-friendly.
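A minimal matplotlib/pandas sketch of the three techniques compared in the prototype, applied to an invented table of circuit-board test points with a pass/fail label; the real prototype, its interactivity, and the Syntronic data are not reproduced.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(6)

# Invented measurements: 40 circuit boards x 6 test points, plus a pass/fail label.
boards = pd.DataFrame(rng.normal(0, 1, (40, 6)),
                      columns=[f"tp_{i}" for i in range(6)])
boards["status"] = np.where(boards["tp_2"] > boards["tp_2"].median(),
                            "fail", "pass")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Categorical heatmap: value of every test point on the failing boards.
fails = boards[boards["status"] == "fail"].drop(columns="status")
axes[0].imshow(fails.values, aspect="auto", cmap="coolwarm")
axes[0].set_xticks(range(6))
axes[0].set_xticklabels(fails.columns)
axes[0].set_title("Categorical heatmap (failing boards)")

# 2. Parallel coordinates: one line per board, coloured by status.
parallel_coordinates(boards, class_column="status", ax=axes[1],
                     color=("tab:red", "tab:green"))
axes[1].set_title("Parallel coordinates")

# 3. Scatterplot: two test points plotted against each other.
for status, group in boards.groupby("status"):
    axes[2].scatter(group["tp_2"], group["tp_4"], label=status)
axes[2].set_xlabel("tp_2")
axes[2].set_ylabel("tp_4")
axes[2].legend()
axes[2].set_title("Scatterplot")

plt.tight_layout()
plt.show()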

Efficient Resource Management : A Comparison of Predictive Scaling Algorithms in Cloud-Based Applications

Dahl, Johanna, Strömbäck, Elsa January 2024 (has links)
This study aims to explore predictive scaling algorithms used to predict and manage workloads in a containerized system. The goal is to identify which predictive scaling approach delivers the most effective results, contributing to research on cloud elasticity and resource management. This potentially leads to reduced infrastructure costs while maintaining efficient performance, enabling a more sustainable cloud-computing technology. The work involved the development and comparison of three different autoscaling algorithms with an interchangeable prediction component. For the predictive part, three different time-series analysis methods were used: XGBoost, ARIMA, and Prophet. A simulation system with the necessary modules was developed, as well as a designated target service to experience the load. Each algorithm's scaling accuracy was evaluated by comparing its suggested number of instances to the optimal number, with each instance representing a simulated CPU core. The results showed varying efficiency: XGBoost and Prophet excelled with richer datasets, while ARIMA performed better with limited data. Although XGBoost and Prophet maintained 100% uptime, this could lead to resource wastage, whereas ARIMA's lower uptime percentage possibly suggested a more resource-efficient, though less reliable, approach. Further analysis, particularly experimental investigation, is required to deepen the understanding of these predictors' influence on resource allocation.
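A minimal sketch of one predictive scaling step along the lines described above: a time series model (ARIMA via statsmodels here; XGBoost or Prophet could be swapped into the same prediction component) forecasts the next interval's load, and the forecast is converted into a suggested instance count. The capacity per instance, headroom factor, and synthetic load are illustrative assumptions, not values from the study.

import math
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

CAPACITY_PER_INSTANCE = 100.0   # requests/s one simulated CPU core can absorb
HEADROOM = 1.2                  # safety margin applied on top of the forecast

def suggest_instances(load_history, steps_ahead=1):
    """Fit a small ARIMA model on the observed load and translate the
    forecast for the next interval into a number of instances."""
    model = ARIMA(load_history, order=(2, 1, 1)).fit()
    forecast = float(model.forecast(steps=steps_ahead)[-1])
    forecast = max(forecast, 0.0)
    return max(1, math.ceil(forecast * HEADROOM / CAPACITY_PER_INSTANCE))

# Synthetic diurnal-looking load in requests per second.
rng = np.random.default_rng(7)
t = np.arange(288)                                   # 24 h of 5-minute samples
load = 400 + 250 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 20, t.size)

print("suggested instances:", suggest_instances(load))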
