51

Comparing unsupervised clustering algorithms to locate uncommon user behavior in public travel data : A comparison between the K-Means and Gaussian Mixture Model algorithms

Andrésen, Anton, Håkansson, Adam January 2020 (has links)
Clustering machine learning algorithms have existed for a long time, and there is a multitude of variations available to implement. Each has its advantages and disadvantages, which makes selecting one for a particular problem and application challenging. This study compares two algorithms, K-Means and the Gaussian Mixture Model, for outlier detection within public travel data from the travel planning mobile application MobiTime [1]. The purpose of this study was to compare the two algorithms against each other to identify differences in their outlier detection results. The comparisons were mainly done by comparing the number of outliers located by each model with respect to the outlier threshold and the number of clusters. The study found that the algorithms differ greatly in their ability to detect outliers. These differences depend heavily on the type of data used, but one major finding was that K-Means was more restrictive than the Gaussian Mixture Model in classifying data points as outliers. The results of this study could help determine which algorithm to implement for a specific application and use case.
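As a rough illustration of the comparison described above, the sketch below fits both models to the same data and counts the points each flags as outliers: K-Means by distance to the assigned centroid, the GMM by likelihood under the fitted mixture. The features, cluster count, and percentile thresholds are invented placeholders, not the thesis's actual setup.

```python
# Hypothetical sketch: counting outliers flagged by K-Means vs. a GMM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # stand-in for normalized travel data

k = 5
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=k, random_state=0).fit(X)

# K-Means: flag points far from their assigned centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
km_outliers = dist > np.percentile(dist, 99)

# GMM: flag points with low log-likelihood under the fitted mixture.
loglik = gmm.score_samples(X)
gmm_outliers = loglik < np.percentile(loglik, 1)

print(km_outliers.sum(), gmm_outliers.sum())   # compare outlier counts
```

Sweeping the percentile threshold and `k` in a sketch like this reproduces the kind of comparison the study reports, with the two models generally disagreeing on which and how many points are anomalous.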
52

Speech to Text for Swedish using KALDI / Tal till text, utvecklandet av en svensk taligenkänningsmodell i KALDI

Kullmann, Emelie January 2016 (has links)
The field of speech recognition has during the last decade left the research stage and found its way into the public market. Most computers and mobile phones sold today support dictation and transcription in a number of chosen languages; Swedish is often not one of them. In this thesis, which was carried out on behalf of Swedish Radio (Sveriges Radio), an Automatic Speech Recognition model for Swedish is trained and its performance evaluated. The model is built using the open-source toolkit Kaldi. Two approaches to training the acoustic part of the model are investigated: first, using Hidden Markov Models and Gaussian Mixture Models, and second, using Hidden Markov Models and Deep Neural Networks. The latter approach, using deep neural networks, is found to achieve better performance in terms of Word Error Rate.
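Word Error Rate, the metric used for the evaluation above, is the word-level edit distance between reference and hypothesis, normalized by the reference length. A minimal version is sketched below; the example sentences are invented, and this is not the thesis's scoring code.

```python
# Minimal WER: Levenshtein distance over words / reference length.
def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level edit distance.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("tal till text på svenska", "tal till test på svenska"))  # 0.2
```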
53

Neural probabilistic topic modeling of short and messy text / Neuronprobabilistisk ämnesmodellering av kort och stökig text

Harrysson, Mattias January 2016 (has links)
Exploring massive amounts of user-generated data with topics offers a new way to find useful information. The topics are assumed to be "hidden" and must be "uncovered" by statistical methods such as topic modeling. However, user-generated data is typically short and messy, e.g. informal chat conversations, heavy use of slang words, and "noise" such as URLs or other forms of pseudo-text. This type of data is difficult to process for most natural language processing methods, including topic modeling. This thesis attempts to find, in a comparative study, the approach that objectively gives better topics from short and messy text. The compared approaches are latent Dirichlet allocation (LDA), Re-organized LDA (RO-LDA), a Gaussian Mixture Model (GMM) with distributed representations of words, and a new approach based on previous work, named Neural Probabilistic Topic Modeling (NPTM). It could only be concluded that NPTM has a tendency to achieve better topics on short and messy text than LDA and RO-LDA. GMM, on the other hand, could not produce any meaningful results at all. The results are less conclusive since NPTM suffers from long running times, which prevented enough samples from being obtained for a statistical test.
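The GMM baseline mentioned above clusters distributed word representations and reads each mixture component as a "topic" whose top words are those nearest the component mean. A hedged sketch follows; the vocabulary and embeddings are random placeholders, whereas the thesis presumably used trained vectors (e.g. word2vec).

```python
# Sketch: GMM over word embeddings, components read as topics.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
vocab = ["chat", "slang", "lol", "url", "http", "link",
         "topic", "model", "word", "text", "data", "noise"]
emb = rng.normal(size=(len(vocab), 8))        # placeholder word vectors

gmm = GaussianMixture(n_components=3, covariance_type="diag",
                      random_state=1).fit(emb)
for t, mean in enumerate(gmm.means_):
    # Top words of a "topic": nearest embeddings to the component mean.
    nearest = np.argsort(np.linalg.norm(emb - mean, axis=1))[:3]
    print(f"topic {t}:", [vocab[i] for i in nearest])
```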
54

Automatic Speech Recognition in Somali

Gabriel, Naveen January 2020 (has links)
The field of speech recognition has, during the last decade, left the research stage and found its way into the public market, and today speech recognition software is ubiquitous around us. An automatic speech recognizer understands human speech and represents it as text. Most current speech recognition software employs variants of deep neural networks. Before the deep learning era, the hybrid of hidden Markov model and Gaussian mixture model (HMM-GMM) was a popular statistical model for speech recognition. In this thesis, an automatic speech recognizer using HMM-GMM was trained on Somali data consisting of voice recordings and their transcriptions. HMM-GMM is a hybrid system whose framework is composed of an acoustic model and a language model. The acoustic model represents the time-variant aspect of the speech signal, and the language model determines how probable the observed sequence of words is. The thesis begins with background on speech recognition, and a literature survey covers some of the work that has been done in this field. The thesis evaluates how different language models and discounting methods affect the performance of speech recognition systems. Log scores were also calculated for the top 5 predicted sentences, along with confidence measures of the predicted sentences. The model was trained on 4.5 hours of voice data and its corresponding transcription and evaluated on 3 minutes of test data. The performance of the trained model on the test set was good, given that the data was devoid of background noise and lacked variability. Performance is measured using word error rate (WER) and sentence error rate (SER), and the results are compared with those of other research work. The thesis also discusses why the log and confidence scores of a sentence might not be a good way to measure the performance of the resulting model, as well as the shortcomings of the HMM-GMM model, how the existing model can be improved, and different alternatives to solve the problem.
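One axis the thesis varies is the discounting method in the language model. The sketch below shows absolute discounting for a bigram model: a fixed amount D is subtracted from every seen bigram count and the collected mass is redistributed through a unigram back-off. The corpus and discount value are invented for illustration.

```python
# Absolute-discounting bigram probability with unigram back-off.
from collections import Counter

corpus = "the cat sat the cat ran the dog sat".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
D = 0.75  # mass discounted from each seen bigram

def p_bigram(w1: str, w2: str) -> float:
    c12, c1 = bigrams[(w1, w2)], unigrams[w1]
    seen = sum(1 for (a, _) in bigrams if a == w1)  # distinct continuations of w1
    lam = D * seen / c1                             # back-off weight
    return max(c12 - D, 0) / c1 + lam * unigrams[w2] / len(corpus)

print(round(p_bigram("the", "cat"), 3))  # seen bigram keeps most of its mass
print(round(p_bigram("the", "ran"), 3))  # unseen bigram gets back-off mass only
```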
55

Optimalizace modelování gaussovských směsí v podprostorech a jejich skórování v rozpoznávání mluvčího / Optimization of Gaussian Mixture Subspace Models and Related Scoring Algorithms in Speaker Verification

Glembek, Ondřej January 2012 (has links)
This thesis deals with Gaussian mixture subspace modelling for speaker recognition and consists of three parts. The first part is devoted to scoring methods when joint factor analysis is used for speaker modelling. The studied methods differ mainly in how they deal with the channel variability of the test recordings. The methods are presented in the context of the general form of the joint factor analysis likelihood function and compared in terms of both accuracy and speed. It is shown that a linear approximation of the likelihood function gives results comparable to the standard likelihood evaluation while dramatically simplifying the mathematical expression and thus speeding up the evaluation. The second part deals with the extraction of so-called i-vectors, i.e. low-dimensional representations of recordings. Two approaches to simplifying the extraction are presented. The motivations for this part were to speed up i-vector extraction, to deploy this successful technique on simple devices such as mobile phones, and to achieve a mathematical simplification that allows numerical optimization methods to be used for discriminative training. The results show that on long recordings the speed-up is paid for by a drop in recognition accuracy, whereas on short recordings, where recognition accuracy is low, the differences in accuracy fade away. The third part deals with discriminative training in speaker recognition and summarizes findings from previous work on the topic. Building on the previous two parts, it addresses discriminative training of the i-vector extractor parameters. The results show that, with classical training of the extractor followed by discriminative retraining, these methods improve recognition accuracy.
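The thesis operates on joint factor analysis and i-vectors, but the mixture-likelihood evaluation at the heart of all such scoring can be illustrated with a much simpler stand-in: a GMM log-likelihood ratio of a test utterance under a speaker model versus a universal background model (UBM). The sketch below is only that stand-in, with synthetic features, not the thesis's method.

```python
# Simplified GMM speaker scoring: average log-likelihood ratio vs. a UBM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
ubm_feats = rng.normal(0.0, 1.0, size=(2000, 13))   # stand-in MFCC features
spk_feats = rng.normal(0.4, 1.0, size=(300, 13))    # one speaker's features

ubm = GaussianMixture(n_components=8, covariance_type="diag").fit(ubm_feats)
spk = GaussianMixture(n_components=8, covariance_type="diag").fit(spk_feats)

test = rng.normal(0.4, 1.0, size=(100, 13))         # trial utterance
llr = spk.score(test) - ubm.score(test)             # mean log-likelihood ratio
print("accept" if llr > 0.0 else "reject", round(llr, 3))
```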
56

Improved Methodologies for the Simultaneous Study of Two Motor Systems: Reticulospinal and Corticospinal Cooperation and Competition for Motor Control

Ortiz-Rosario, Alexis 31 October 2016 (has links)
No description available.
57

Unsupervised Anomaly Detection and Root Cause Analysis in HFC Networks : A Clustering Approach

Forsare Källman, Povel January 2021 (has links)
Following the significant transition from the traditional production industry to an information-based economy, the telecommunications industry was faced with an explosion of innovation, resulting in continuous change in user behaviour. The industry has made efforts to adapt to a more data-driven future, which has given rise to larger and more complex systems. Therefore, troubleshooting systems such as anomaly detection and root cause analysis are essential features for maintaining service quality and facilitating daily operations. This study aims to explore the possibilities, benefits, and drawbacks of implementing cluster analysis for anomaly detection in hybrid fiber-coaxial networks. Based on the literature review on unsupervised anomaly detection and an assumption regarding the anomalous behaviour in hybrid fiber-coaxial network data, k-means, the Self-Organizing Map, and the Gaussian Mixture Model were implemented both with and without Principal Component Analysis. Analysis of the results demonstrated an increase in performance for all models when Principal Component Analysis was applied, with k-means outperforming both the Self-Organizing Map and the Gaussian Mixture Model. On this basis, it is recommended to apply Principal Component Analysis for clustering-based anomaly detection. Further research is necessary to identify whether cluster analysis is the most appropriate unsupervised anomaly detection approach.
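A sketch of the PCA-then-cluster pipeline described above is shown below with k-means only; the telemetry matrix and the 3-standard-deviation cutoff are invented placeholders.

```python
# Sketch: standardize, project with PCA, cluster, flag far-from-centroid points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 20))                # stand-in HFC telemetry matrix

pipe = make_pipeline(StandardScaler(), PCA(n_components=5))
Z = pipe.fit_transform(X)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(Z)
dist = np.linalg.norm(Z - km.cluster_centers_[km.labels_], axis=1)
anomalies = np.where(dist > dist.mean() + 3 * dist.std())[0]
print(len(anomalies), "suspected anomalies")
```

Swapping `KMeans` for `GaussianMixture` or a SOM implementation, and fitting with and without the PCA step, reproduces the kind of comparison the study reports.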
58

A multi-wavelength study of a sample of galaxy clusters / Susan Wilson

Wilson, Susan January 2012 (has links)
In this dissertation we aim to perform a multi-wavelength analysis of galaxy clusters. We discuss various methods for clustering in order to determine the physical parameters of galaxy clusters required for this type of study. A selection of galaxy clusters was chosen from 4 papers (Popesso et al. 2007b, Yoon et al. 2008, Loubser et al. 2008, Brownstein & Moffat 2006) and restricted by redshift and galactic latitude to yield a sample of 40 galaxy clusters with 0.0 < z < 0.15. Data mining using the Virtual Observatory (VO) and a literature survey provided background information about each of the galaxy clusters in our sample with respect to optical, radio and X-ray data. Using Kaye's Mixture Model (KMM) and the Gaussian Mixture Model (GMM), we determine the most likely cluster member candidates for each source in our sample. We compare the results obtained to SIMBAD's method of hierarchy. We show that the GMM provides a very robust method to determine member candidates, but in order to ensure that the right candidates are chosen we apply a selection of outlier tests to our sources. We determine a method based on a combination of the GMM, the Q-Q plot and the Rosner test that provides a robust and consistent way of determining galaxy cluster members. Comparison between the calculated physical parameters (velocity dispersion, radius, mass and temperature) and values obtained from the literature shows that the majority of our galaxy clusters agree within a 3σ range. Inconsistencies are thought to be due to dynamically active clusters that have substructure or are undergoing mergers, making galaxy member identification difficult. Six correlations between different physical parameters in the optical and X-ray wavelengths were consistent with published results. Comparing the velocity dispersion with the X-ray temperature, we found a relation of σ ∝ T^0.43, as compared to σ ∝ T^0.5 obtained by Bird et al. (1995). The X-ray luminosity–temperature and X-ray luminosity–velocity dispersion relations gave the results L_X ∝ T^2.44 and L_X ∝ σ^2.40, which lie within the uncertainty of the results given by Rozgacheva & Kuvshinova (2010). These results all suggest that our method for determining galaxy cluster members is efficient, and application to higher-redshift sources can be considered. Further studies on galaxy clusters with substructure must be performed in order to improve this method. In future work, the physical parameters obtained here will be further compared to X-ray and radio properties in order to determine a link between bent radio sources and the galaxy cluster environment. / MSc (Space Physics), North-West University, Potchefstroom Campus, 2013
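A hedged sketch of the mixture-based member selection described above: fit a two-component GMM to line-of-sight velocities and keep galaxies assigned with high posterior probability to the dominant component. The velocities are synthetic, and the dissertation additionally applies the Q-Q plot and Rosner outlier tests before finalizing membership.

```python
# Sketch: GMM membership selection on line-of-sight velocities.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
v = np.concatenate([rng.normal(15000, 800, 180),    # cluster members (km/s)
                    rng.normal(21000, 3000, 20)])   # fore/background galaxies

V = v.reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(V)
post = gmm.predict_proba(V)
main = np.argmax(np.bincount(gmm.predict(V)))       # dominant component
members = v[post[:, main] > 0.9]                    # high-confidence members
print(len(members), "candidate members; sigma =", members.std().round(1))
```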
59

Développement d’un modèle de classification probabiliste pour la cartographie du couvert nival dans les bassins versants d’Hydro-Québec à l’aide de données de micro-ondes passives / Development of a probabilistic classification model for mapping snow cover in Hydro-Québec watersheds using passive microwave data

Teasdale, Mylène 09 1900 (has links)
Every day, decisions must be made about the amount of hydroelectricity to produce in Quebec. These decisions rest on forecasts of water inflow to the watersheds, produced using hydrological models. These models take several factors into account, notably the presence or absence of snow on the ground. This information is critical during the spring melt for anticipating future inflows, since between 30 and 40% of the flood volume can come from melting of the snow cover. Forecasters therefore need to be able to monitor the snow cover daily in order to adjust their forecasts to the melting process. Methods for mapping snow on the ground are currently used at the Institut de recherche d'Hydro-Québec (IREQ), but they have some shortcomings. The main goal of this master's thesis is to use passive microwave remote sensing data (the vertically polarized brightness temperature gradient ratio, GTV) with a statistical approach to produce snow/no-snow maps and to quantify the classification uncertainty. To do this, the GTV was used to compute a daily probability of snow via Gaussian mixture models in a Bayesian framework. These probabilities were then modelled using linear regression on their logits, and snow cover maps were produced. The models' results were validated qualitatively and quantitatively, and their integration at Hydro-Québec was discussed.
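An illustrative version of the two-step model described above: a two-component Gaussian mixture yields a posterior probability of snow from the GTV signal, and a linear regression fitted on the logit of that probability smooths it. The GTV values, the snow-is-lower-GTV assumption, and the day-of-year covariate are all invented placeholders; the thesis's actual regressors are not given here.

```python
# Sketch: mixture posterior of snow from GTV, then regression on logits.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
gtv = np.concatenate([rng.normal(-0.4, 0.2, 300),   # snow-like GTV values
                      rng.normal(0.3, 0.2, 300)])   # snow-free values

G = gtv.reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(G)
snow_comp = np.argmin(gmm.means_.ravel())           # assume snow = lower GTV
p_snow = gmm.predict_proba(G)[:, snow_comp]         # daily probability of snow

# Logit-scale smoothing: regress logit(p) on a covariate, map back to [0, 1].
p = np.clip(p_snow, 1e-6, 1 - 1e-6)
logit = np.log(p / (1 - p))
day = np.tile(np.arange(300), 2).reshape(-1, 1)     # placeholder covariate
fit = LinearRegression().fit(day, logit)
p_hat = 1 / (1 + np.exp(-fit.predict(day)))
print(p_hat[:5].round(3))
```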
