11 |
Initialization of the k-means algorithm : A comparison of three methods
Jorstedt, Simon January 2023
k-means is a simple and flexible clustering algorithm that has remained in common use for 50+ years. In this thesis, we discuss the algorithm in general, its advantages, weaknesses and how its ability to locate clusters can be enhanced with a suitable initialization method. We formulate appropriate requirements for the (batched) UnifRandom, k-means++ and Kaufman initialization methods and compare their performance on real and generated data through simulations. We find that all three methods (followed by the k-means procedure) are able to accurately locate at least up to nine well-separated clusters, but the appropriately batched UnifRandom and the Kaufman methods are both significantly more computationally expensive than the k-means++ method already for K = 5 clusters in a dataset of N = 1000 points.
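The effect of the initialization step can be illustrated with a minimal sketch, assuming scikit-learn is available. It runs k-means once from a uniformly random choice of data points and once from k-means++ seeding on synthetic blobs with K = 5 and N = 1000. The function name `compare_inits` and the synthetic data are illustrative assumptions; scikit-learn's `init="random"` is not necessarily the thesis's batched UnifRandom method, and the Kaufman method is not covered.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def compare_inits(n_samples=1000, n_clusters=5, seed=0):
    """Run k-means once per initialization scheme and report inertia and iterations."""
    # Synthetic, well-separated clusters (an assumption, not the thesis data).
    X, _ = make_blobs(n_samples=n_samples, centers=n_clusters,
                      cluster_std=0.5, random_state=seed)
    results = {}
    for init in ("random", "k-means++"):
        # n_init=1 means a single start, so differences come from the seeding alone.
        km = KMeans(n_clusters=n_clusters, init=init, n_init=1,
                    random_state=seed).fit(X)
        results[init] = (km.inertia_, km.n_iter_)
    return results


results = compare_inits()
for name, (inertia, n_iter) in results.items():
    print(f"{name:>10}: inertia={inertia:.1f}, iterations={n_iter}")
```

Repeating the run over many seeds, and timing each initialization, would be the natural next step for a comparison of accuracy against computational cost.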
|
12 |
Performance Interference Detection For Cloud-Native Applications Using Unsupervised Machine Learning Models
Bakshi, Eli 01 June 2024 (PDF)
Contemporary cloud-native applications frequently adopt the microservice architecture, where applications are deployed within multiple containers that run on cloud virtual machines (VMs). These applications are typically hosted on public cloud platforms, where VMs from multiple cloud subscribers compete for the same physical resources on a cloud server. When a subscriber's application competes for shared physical resources with other applications running on the same VM, or with other VMs co-located on the same cloud server, its performance may degrade because of the contention; this is known as performance interference. Detecting such interference is crucial for maintaining the Quality-of-Service of cloud-native Web applications. However, cloud subscribers lack access to the underlying host-level hardware metrics traditionally used for interference detection, which make it possible to detect interference without instrumenting high-overhead per-request response times. Machine learning (ML) techniques have proven effective in detecting performance interference using metrics available at the subscriber level, though these techniques have predominantly focused on supervised models with pre-existing labeled data sets that can distinguish between normal and interference conditions. In contrast, this work proposes an unsupervised clustering ML approach to identify performance interference in cloud-native applications. The proposed approach implements a lightweight method for collecting container metrics in normal and interference scenarios and applies a dimensionality reduction technique to mitigate redundancy and noise in the collected dataset. We then apply a density-based clustering approach to this unlabeled data set to classify interference in two applications running on the AWS EC2 cloud: a microbenchmark Web application called Acme Air and a large-scale production-realistic Web benchmark called DeathStarBench.
Results indicate that our density-based clustering approach effectively distinguishes between normal and interference conditions and achieves an average Density-Based Clustering Validation (DBCV) index of 0.781 and a cluster homogeneity of 0.875 across both applications.
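The pipeline described above (dimensionality reduction followed by density-based clustering, then a homogeneity check) can be sketched as follows. This is a minimal illustration assuming scikit-learn, with PCA and DBSCAN standing in for the unnamed reduction and clustering techniques; the synthetic "container metrics", the function name `reduce_and_cluster`, and all parameter values are assumptions. DBCV is not part of scikit-learn and is omitted.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import homogeneity_score
from sklearn.preprocessing import StandardScaler


def reduce_and_cluster(X, n_components=2, eps=0.5, min_samples=5):
    """Standardize, project onto principal components, then cluster by density."""
    Z = StandardScaler().fit_transform(X)
    Z = PCA(n_components=n_components).fit_transform(Z)
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Z)


# Synthetic stand-in for container metrics: 6 metrics per sample (assumption).
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.3, size=(200, 6))
interference = rng.normal(3.0, 0.3, size=(200, 6))
X = np.vstack([normal, interference])
y_true = np.array([0] * 200 + [1] * 200)  # used only for evaluation, never for fitting

labels = reduce_and_cluster(X)
kept = labels != -1  # DBSCAN marks noise points with -1; exclude them from the score
score = homogeneity_score(y_true[kept], labels[kept])
print(f"clusters found: {len(set(labels[kept]))}, homogeneity: {score:.3f}")
```

The ground-truth labels are used only to score the clusters after the fact, which mirrors how an unsupervised method can still be evaluated when labeled test scenarios exist.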
|
13 |
Anomaly Detection in Time Series Data using Unsupervised Machine Learning Methods: A Clustering-Based Approach / Anomalidetektering av tidsseriedata med hjälp av oövervakad maskininlärningsmetoder: En klusterbaserad tillvägagångssätt
Hanna, Peter, Swartling, Erik January 2020
For many companies in the manufacturing industry, finding damage in their products is a vital process, especially during the production phase. Because machine learning techniques can aid damage identification, they have become a popular choice among companies seeking to improve the production process further. In some industries, damage identification is closely linked to anomaly detection in measurements. The aim of this thesis is to construct unsupervised machine learning models that identify anomalies in unlabeled pump measurements, using current and voltage time series sampled at high frequency. Each measurement can be split into five phases: the startup phase, three duty-point phases, and the shutdown phase. The approach is based on clustering methods, and the main algorithms used are the density-based algorithms DBSCAN and LOF. Dimensionality reduction techniques, such as feature extraction and feature selection, are applied to the data. After constructing five models, one for each phase, it can be seen that the models identify anomalies in the given data set.
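A density-based outlier score of the kind mentioned above can be sketched with scikit-learn's `LocalOutlierFactor`. This is a minimal illustration, not the thesis's models: the three-dimensional feature vectors stand in for features extracted from one measurement phase, and the function name `lof_anomalies`, the synthetic data, and the parameter values are assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def lof_anomalies(features, n_neighbors=20, contamination=0.05):
    """Return a boolean mask that is True for samples LOF flags as outliers."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    labels = lof.fit_predict(features)  # -1 = outlier, 1 = inlier
    return labels == -1


# Hypothetical per-measurement features for one phase (synthetic data).
rng = np.random.default_rng(1)
normal_runs = rng.normal(0.0, 1.0, size=(95, 3))
faulty_runs = rng.uniform(6.0, 12.0, size=(5, 3))  # far from the normal runs
features = np.vstack([normal_runs, faulty_runs])

flagged = lof_anomalies(features)
print("flagged indices:", np.flatnonzero(flagged))
```

One model per phase, as in the abstract, would amount to calling such a function separately on the features of each of the five phases.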
|
14 |
Localização de danos em estruturas isotrópicas com a utilização de aprendizado de máquina / Localization of damages in isotropic structures with the use of machine learning
Oliveira, Daniela Cabral de [UNESP] 28 June 2017
Previous issue date: 2017-06-28
This work introduces a new Structural Health Monitoring (SHM) methodology that uses unsupervised machine learning algorithms to locate and detect damage. The approach was tested on an isotropic material (an aluminum plate). The experimental data were provided by Rosa (2016); the database is comprehensive and includes measurements in a variety of situations. Piezoelectric transducers, which act as sensors and actuators simultaneously, were bonded to the aluminum plate, which measures 500 x 500 x 2 mm. To process the data, the signals were analyzed by defining the first packet of each signal, considering only the time interval equal to the duration of the excitation force. In this interval there is no interference from signals reflected at the boundaries of the structure. Signals are acquired in the undamaged condition (baseline) and then in several damage conditions. To evaluate how much the damage affects each path, the following metrics were implemented: maximum peak, root-mean-square deviation (RMSD), correlation between signals, and the H2 and H∞ norms between the baseline and damaged signals. After the metrics were computed for the various damage conditions, the unsupervised K-Means machine learning algorithm was implemented in MATLAB and also tested in the Weka toolbox. K-Means requires the number of clusters to be specified in advance, which can hinder its use in real situations. An unsupervised machine learning algorithm based on affinity propagation was therefore implemented, in which the number of clusters is determined by the similarity matrix. The affinity propagation algorithm was developed for all metrics, separately for each damage case.
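The contrast between K-Means and affinity propagation can be sketched with scikit-learn (the thesis itself used MATLAB and Weka, so this is a stand-in, not the author's implementation). Affinity propagation is not given a cluster count; the number of exemplars follows from the pairwise similarities and the preference value. The synthetic two-dimensional points below are an assumption standing in for damage metrics, and the damping and iteration settings are illustrative choices.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic stand-in for per-path damage metrics (assumption, not the thesis data).
X, _ = make_blobs(n_samples=60, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.4, random_state=0)

# No n_clusters argument: exemplars emerge from the similarity matrix
# (negative squared Euclidean distance by default).
ap = AffinityPropagation(damping=0.9, max_iter=500, random_state=0).fit(X)
n_clusters = len(ap.cluster_centers_indices_)
print("clusters chosen by affinity propagation:", n_clusters)
```

The `preference` parameter (by default the median similarity) still influences how many clusters appear, so the method replaces an explicit cluster count with a softer tuning knob rather than removing tuning altogether.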
|
15 |
Unsupervised topic modeling for customer support chat : Comparing LDA and K-means
Andersson, Fredrik, Idemark, Alexander January 2021
Fortnox receives many requests through its support chat. Some of the questions are hard to interpret, which makes it difficult to know where to delegate them. It would be beneficial to automate this process rather than spend time analyzing each question before delegating it. The main task is therefore to find an unsupervised model that can take questions and group them into topics. A literature review of NLP and clustering was carried out to find the most suitable models and techniques for the problem. The models and techniques were then implemented and evaluated on support chat questions received by Fortnox. The unsupervised models tested in this thesis were LDA and K-means. The trained models are analyzed, and some of the clusters are given a label; the authors assign each label after inspecting the most relevant words for the cluster. Three different sets of labels are analyzed and tested. The models are evaluated using five score metrics: Silhouette, Adjusted Rand Index, Recall, Precision, and F1 score. K-means performs best on these metrics, with an F1 score of 0.417, but cannot handle very small documents. LDA does not perform well: it achieves an F1 score of 0.137 and is unable to group related documents together.
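The two model families compared above can be sketched side by side with scikit-learn. This is a minimal illustration on a handful of invented support questions; the example texts, the function names `kmeans_topics` and `lda_topics`, and the choice of TF-IDF features for K-means are assumptions, not the thesis's preprocessing.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented example questions (not Fortnox data).
docs = [
    "cannot log in to my account",
    "password reset email not received",
    "invoice shows wrong amount",
    "how do I send an invoice to a customer",
    "login fails after password change",
    "invoice payment is missing",
]


def kmeans_topics(texts, k=2, seed=0):
    """Cluster TF-IDF vectors with K-means; returns one cluster id per text."""
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)


def lda_topics(texts, k=2, seed=0):
    """Fit LDA on word counts; returns the most probable topic per text."""
    X = CountVectorizer(stop_words="english").fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=k, random_state=seed)
    return lda.fit_transform(X).argmax(axis=1)


km_labels = kmeans_topics(docs)
lda_labels = lda_topics(docs)
print("K-means:", km_labels.tolist())
print("LDA:    ", lda_labels.tolist())
```

Very short documents yield sparse vectors with few shared terms, which is one plausible reason both approaches can struggle on chat messages.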
|
16 |
Some contributions to the clustering of financial time series and applications to credit default swaps / Quelques contributions aux méthodes de partitionnement automatique des séries temporelles financières, et applications aux couvertures de défaillance
Marti, Gautier 10 November 2017
In this thesis we first review the scattered literature on clustering financial time series. We then try to shed as much light as possible on the credit default swap market, which is relatively unknown to the general public except for its role in the contagion of bank failures during the global financial crisis of 2007-2008, while introducing the datasets used in the empirical studies. Unlike the existing body of literature, which mostly offers descriptive studies, we aim to build models and large information systems on top of clusters, which are treated as basic building blocks; these foundations must be stable. That is why the work undertaken and described here aims to put the clustering methodologies on firmer ground. To that end, we discuss their consistency and their stability under perturbations, and propose alternative distances between financial time series that better account for their stochastic nature and can be plugged into existing clustering methodologies. We study their impact on the clusters empirically. Results of the empirical studies can be explored at www.datagrapple.com.
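How a distance between time series plugs into a clustering method can be sketched with a common baseline: the correlation distance sqrt(2(1 - rho)) fed to hierarchical clustering. This is explicitly not one of the alternative distances proposed in the thesis; it only shows the plug-in point. The function name `correlation_clusters`, the synthetic two-factor returns, and the average-linkage choice are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def correlation_clusters(returns, n_clusters):
    """Cluster series (rows of `returns`) using the distance sqrt(2 * (1 - rho))."""
    rho = np.corrcoef(returns)
    dist = np.sqrt(np.clip(2.0 * (1.0 - rho), 0.0, None))
    np.fill_diagonal(dist, 0.0)
    # Any other pairwise distance matrix could be substituted at this point.
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Synthetic returns: two independent factors, three noisy series per factor.
rng = np.random.default_rng(0)
factors = rng.normal(size=(2, 250))
returns = np.vstack([factors[i // 3] + 0.3 * rng.normal(size=250) for i in range(6)])

labels = correlation_clusters(returns, n_clusters=2)
print(labels.tolist())
```

Stability in the sense discussed above could be probed by resampling the observation window and checking whether the same grouping reappears.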
|
17 |
Advanced Electricity Meter Anomaly Detection : A Machine Learning Approach
Svensson, Robin, Shalabi, Saleh January 2023
The increasing volume of smart electricity meter readings presents a challenge for electricity providers in accurately validating and correcting the associated data. This thesis attempts to find a possible solution by applying unsupervised machine learning to detect anomalous readings. This could reduce the manual labor required each month to determine which meters need to be investigated. A solution to this problem could benefit both the companies and their customers: it could increase the number of abnormalities detected and allow issues to be resolved before they have a significant impact. Two algorithms for detecting anomalies in these meters are investigated, Isolation Forest and an autoencoder, of which the autoencoder produced results within expectations. The results show a large reduction, of up to 96%, in the manual labor required.
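One of the two algorithms named above, Isolation Forest, can be sketched with scikit-learn. This is a minimal illustration on synthetic monthly readings; the function name `flag_meters`, the data shape (one row per meter, twelve monthly values), and the contamination setting are assumptions, not the thesis's setup.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def flag_meters(readings, contamination=0.02, seed=0):
    """Return a boolean mask that is True for meters the forest isolates as anomalous."""
    model = IsolationForest(contamination=contamination, random_state=seed)
    return model.fit_predict(readings) == -1  # -1 = anomaly, 1 = normal


# Synthetic data: 196 ordinary meters and 4 with implausibly high readings (kWh).
rng = np.random.default_rng(0)
usual = rng.normal(300.0, 30.0, size=(196, 12))
odd = rng.uniform(700.0, 1500.0, size=(4, 12))
readings = np.vstack([usual, odd])

flagged = flag_meters(readings)
print("meters to investigate:", np.flatnonzero(flagged).tolist())
```

Only the flagged meters would then need manual review, which is how such a model could reduce the monthly validation workload.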
|
18 |
Fraud Detection on Unlabeled Data with Unsupervised Machine Learning / Bedrägeridetektering på omärkt data med oövervakad maskininlärning
Renström, Martin, Holmsten, Timothy January 2018
A common problem in systems that handle user interaction is the risk of fraudulent behaviour. For example, in a system with credit card transactions it could be a person using another user's account for purchases, and in a system with advertisements it could be bots clicking on ads. These malicious attacks are often disguised as normal interactions and can be difficult to detect. Detection is especially challenging for datasets that do not contain so-called labels, which indicate whether a data point is fraudulent. Without data previously classified as fraud, it is difficult to develop an algorithm that can distinguish between normal and fraudulent behavior. In this thesis, anomaly detection was explored with the aim of detecting fraudulent behavior without labeled data. Three neural-network-based prototypes were developed, all of them variations of autoencoders. The first prototype, which served as a baseline, was a simple three-layer autoencoder; the second was a novel architecture called a stacked autoencoder; the third was a variational autoencoder. The prototypes were then trained and evaluated on two different datasets, both of which contained non-fraudulent and fraudulent data. The study found that the proposed stacked autoencoder achieved better recall, accuracy and NPV in the tests designed to simulate a real-world scenario.
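The three scores named above can be computed from predicted and true labels as in the following sketch. It assumes predictions have already been obtained, for example by thresholding an autoencoder's reconstruction error; the function name `detection_scores` and the toy labels are illustrative assumptions.

```python
def detection_scores(y_true, y_pred):
    """Recall, accuracy and negative predictive value (NPV); 1 = fraud, 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,    # share of frauds that were caught
        "accuracy": (tp + tn) / len(pairs),              # share of all predictions that were right
        "npv": tn / (tn + fn) if tn + fn else 0.0,       # share of "normal" verdicts that were right
    }


# Toy labels: 3 frauds among 8 interactions, with one miss and one false alarm.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]
scores = detection_scores(y_true, y_pred)
print(scores)
```

On heavily imbalanced fraud data, accuracy and NPV tend to be high even for weak detectors, so recall is usually the more informative of the three.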
|
19 |
Decoding communication of non-human species - Unsupervised machine learning to infer syntactical and temporal patterns in fruit-bats vocalizations.
Assom, Luigi January 2023
Decoding non-human species communication offers a unique chance to explore alternative intelligence forms using machine learning. This master thesis focuses on discreteness and grammar, two of five linguistic areas machine learning can support, and tackles inferring syntax and temporal structures from bioacoustics data annotated with animal behavior. The problem lies in a lack of species-specific linguistic knowledge, time-consuming feature extraction and availability of limited data; additionally, unsupervised clustering struggles to discretize vocalizations continuous to human perception due to unclear parameter tuning to preprocess audio. This thesis investigates unsupervised learning to generalize deciphering syntax and short-range temporal patterns in continuous-type vocalizations, specifically fruit-bats, to address the research questions: How does dimensionality reduction affect unsupervised manifold learning to quantify size and diversity of the animal repertoire? and How do syntax and temporal structure encode contextual information? An experimental strategy is designed to improve effectiveness of unsupervised clustering for quantifying the repertoire and to investigate linguistic properties with classifiers and sequence mining; acoustic segments are collected from a dataset of fruit-bat vocalizations annotated with behavior. The methodology keeps clustering methods constant while varying dimensionality reduction techniques on spectrograms and their latent representations learnt by Autoencoders. Uniform Manifold Approximation and Projection (UMAP) embeds data into a manifold; density-based clusterings are applied to its embeddings and compared with agglomerative-based labels, used as ground-truth proxy to test robustness of models. Vocalizations are encoded into label sequences. Syntactic rules and short-range patterns in sequences are investigated with classifiers (Support Vector Machines, Random Forests); graph-analytics and prefix-suffix trees. 
Reducing the temporal dimension of Mel-spectrograms outperformed previous clustering baseline (Silhouette score > 0.5, 95% assignment accuracy). UMAP embeddings from sequential autoencoders showed potential advantages over convolutional autoencoders. The study revealed a repertoire between seven and approximately 20 vocal-units characterized by combinatorial patterns: context-classification achieved F1-score > 0.9 also with permuted sequences; repetition characterized vocalizations of isolated pups. Vocal-unit distributions were significantly different (p < 0.05) across contexts; a truncated-power law (alpha < 2) described the distribution of maximal repetitions. This thesis contributed to unsupervised machine learning in bioacoustics for decoding non-human communication, aiding research in language evolution and animal cognition.
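Once vocalizations are encoded as label sequences, repetition patterns can be extracted with a few lines of standard-library Python. The sketch below treats a "maximal repetition" as a maximal run of identical consecutive labels, which is an assumption about the definition; the function name `maximal_repetitions` and the toy sequence are illustrative.

```python
from itertools import groupby


def maximal_repetitions(sequence):
    """Return (label, run_length) for each maximal run of identical consecutive labels."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(sequence)]


# Toy sequence of vocal-unit labels (cluster ids rendered as letters).
sequence = ["A", "A", "B", "A", "A", "A", "C"]
runs = maximal_repetitions(sequence)
print(runs)
```

Collecting run lengths over all sequences gives the empirical distribution to which a law such as the truncated power law mentioned above could be fitted.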
|
20 |
Analyzing Music Improvisations Using Unsupervised Machine Learning : Towards Automatically Discovering Creative Cognition Principles / Analysera musikaliska improvisationer utan tillsyn Maskininlärning : Mot automatisk upptäckt av principer för kreativ kognition
Jorda i Custal, Cristina January 2024
In the field of musical expression, the complex relationship between improvisation and the cognitive processes that underlie creativity presents a fascinating yet challenging puzzle, prompting this thesis to explore the connection between musical improvisation and creative cognition among musicians. Focusing on the development of robust methods for feature extraction and representation, it utilizes unsupervised Machine Learning (ML) techniques to project improvisations from a prime melody into a high-level latent space. The methodology involves iterative analysis employing Variational Autoencoder (VAE) models, initially pre-trained with a larger dataset and fine-tuned with a musical improvisation dataset provided by the Max Planck Institute. Evaluation encompasses the Evidence Lower Bound (ELBO) loss metric and dimensionality reduction techniques such as Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), Multidimensional Scaling (MDS), and Uniform Manifold Approximation and Projection (UMAP) to explore latent space representations. The results reveal that experienced musicians exhibit a greater divergence from the prime melody than amateurs. Moreover, professionals' samples demonstrate more refined clustering and nuanced adjustments between improvisations projected in the latent space.
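The ELBO loss used for evaluation can be written out for the common case of a diagonal Gaussian encoder and a standard normal prior. The sketch below is a generic NumPy formulation, assuming a squared-error reconstruction term; the thesis's actual decoder likelihood and weighting are not stated in the abstract, and the function name `negative_elbo` is illustrative.

```python
import numpy as np


def negative_elbo(x, x_recon, mu, log_var):
    """Negative ELBO for one sample: reconstruction error plus KL(q(z|x) || N(0, I))."""
    # Squared error corresponds to a Gaussian decoder, up to additive constants (assumption).
    recon = np.sum((x - x_recon) ** 2)
    # Closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I).
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl


x = np.ones(4)
perfect = negative_elbo(x, x, mu=np.zeros(2), log_var=np.zeros(2))
shifted = negative_elbo(x, x, mu=np.array([1.0, 0.0]), log_var=np.zeros(2))
print(perfect, shifted)
```

A perfect reconstruction with a posterior equal to the prior gives a loss of zero, and moving the posterior mean away from the origin adds a KL penalty.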
|