Global ETD Search

161	Probabilistic Diagnostic Model for Handling Classifier Degradation in Machine Learning Gustavo A. Valencia-Zapata (8082655) 04 December 2019 Several studies point out different causes of performance degradation in supervised machine learning. Problems such as class imbalance, overlapping, small-disjuncts, noisy labels, and sparseness limit accuracy in classification algorithms. Even though a number of approaches either in the form of a methodology or an algorithm try to minimize performance degradation, they have been isolated efforts with limited scope. This research consists of three main parts: In the first part, a novel probabilistic diagnostic model based on identifying signs and symptoms of each problem is presented. Secondly, the behavior and performance of several supervised algorithms are studied when training sets have such problems. Therefore, prediction of success for treatments can be estimated across classifiers. Finally, a probabilistic sampling technique based on training set diagnosis for avoiding classifier degradation is proposed. Statistics Pattern Recognition and Data Mining Class imbalance Overlapping Small-disjuncts Noisy labels Sparseness Gaussian Mixture Models Separation index Classifier degradation Bayesian Information Criterion (BIC)
162	Speech to Text for Swedish using KALDI / Tal till text, utvecklandet av en svensk taligenkänningsmodell i KALDI Kullmann, Emelie January 2016 The field of speech recognition has during the last decade left the research stage and found its way into the public market. Most computers and mobile phones sold today support dictation and transcription in a number of chosen languages. Swedish is often not one of them. In this thesis, which is executed on behalf of the Swedish Radio, an Automatic Speech Recognition model for Swedish is trained and its performance evaluated. The model is built using the open source toolkit Kaldi. Two approaches to training the acoustic part of the model are investigated: the first uses Hidden Markov Models and Gaussian Mixture Models, and the second uses Hidden Markov Models and Deep Neural Networks. The latter approach, using deep neural networks, is found to achieve better performance in terms of Word Error Rate. / In recent years, various applications within human-computer interaction, and speech recognition in particular, have found their way to the general market. Many systems and technical products today support transcribing speech and dictating text. However, this applies mainly to the major languages, and the same support is rarely available for smaller languages such as Swedish. In this degree project, a model for speech recognition in Swedish has been developed. It was carried out on behalf of Sveriges Radio, which would benefit greatly from a working speech recognition model for Swedish. The model is developed in the Kaldi framework. Two approaches to the acoustic training of the model are implemented, and their performance is evaluated and compared. First, a model is trained using Hidden Markov Models and Gaussian Mixture Models, and then a model using Hidden Markov Models and Deep Neural Networks; the latter turns out to achieve a better result in terms of Word Error Rate.
Automatic Speech Recognition Kaldi Hidden Markov Model Gaussian Mixture Model Deep Neural Network Taligenkänning Kaldi Hidden Markov Model Gaussian Mixture Models Deep Neural Networks Mathematics Matematik
163	Classification of Glioblastoma Multiforme Patients Based on an Integrative Multi-Layer Finite Mixture Model System Campos Valenzuela, Jaime Alberto 26 November 2018 Glioblastoma multiforme (GBM) is an extremely aggressive and invasive brain cancer with a median survival of less than one year. In addition, due to its anaplastic nature, the histological classification of this cancer is not simple. These characteristics make this disease an interesting and important target for new methodologies of analysis and classification. In recent years, molecular information has been used to segregate and analyze GBM patients, but this methodology generally uses single-'omic' data for the classification, or multi-'omic' data in a sequential manner. In this project, a novel approach for the classification and analysis of patients with GBM is presented. The main objective of this work is to find clusters of patients with distinctive profiles using multi-'omic' data with a truly integrative methodology. In recent years, the TCGA consortium has made publicly available thousands of multi-'omic' samples for multiple cancer types. Thanks to this, it was possible to obtain numerous GBM samples (> 300) with data for gene and microRNA expression, CpG site methylation and copy-number variation (CNV). To achieve our objective, a mixture of linear models was built for each gene using its expression as output and a mixture of multi-'omic' data as covariates. Each model was coupled with a lasso penalization scheme, and thanks to the mixture nature of the model, it was possible to fit multiple submodels to discover different linear relationships in the same model. This complex but interpretable method was used to train over 10,000 models. For approximately 2,400 cases, two or more submodels were obtained. Using the models and their submodels, 6 different clusters of patients were discovered.
The clusters were profiled based on clinical information and gene mutations. Through this analysis, a clear separation was found between younger patients with a higher survival rate (Clusters 1, 2 and 3) and older patients with a lower survival rate (Clusters 4, 5 and 6). Mutations in the gene IDH1 were found almost exclusively in Cluster 2; additionally, Cluster 5 presented a hypermutated profile. Finally, several genes not previously related to GBM showed a significant presence in the clusters, such as C15orf2 and CHEK2. The most significant models for each cluster were studied, with a special focus on their covariates. It was discovered that the number of shared significant models was very small and that well-known GBM-related genes, such as EGFR1 and TP53, appeared as significant covariates in many models. Along with them, ubiquitin-related genes (UBC and UBD) and NRF1, which have not been linked to GBM previously, had a very significant role. This work showed the potential of using a mixture of linear models to integrate multi-'omic' data and to group patients in order to profile them and find novel markers. The resulting clusters showed unique profiles, and their significant models and covariates comprised well-known GBM-related genes and novel markers, which present the possibility for new approaches to study and attack this disease. The next step of the project is to improve several elements of the methodology to achieve a more detailed analysis of the models and covariates, in particular taking into account the regression coefficients of the submodels. info:eu-repo/classification/ddc/610 ddc:610
164	Determining the number of classes in latent class regression models / A Monte Carlo simulation study on class enumeration Luo, Sherry January 2021 A Monte Carlo simulation study on class enumeration with latent class regression models. / Latent class regression (LCR) is a statistical method used to identify qualitatively different groups, or latent classes, within a heterogeneous population, and is commonly used in the behavioural, health, and social sciences. Despite its many applications, which fit index correctly determines the number of latent classes remains hotly debated. In addition, there are conflicting views on whether covariates should be included in the class enumeration process. We conduct a simulation study to determine the impact of covariates on class enumeration accuracy and to study the performance of several commonly used fit indices under different population models and modelling conditions. Our results indicate that, of the eight fit indices considered, the aBIC and BLRT were the best performing for class enumeration. Furthermore, we found that covariates should not be included in the enumeration procedure. Our results illustrate that an unconditional LCA model can enumerate as well as a conditional LCA model with its true covariate specification. Even in the presence of large covariate effects in the population, the unconditional model is capable of enumerating with high accuracy. As noted by Nylund and Gibson (2016), a misspecified covariate specification can easily lead to an overestimation of latent classes. Therefore, we recommend performing class enumeration without covariates and determining a set of candidate latent class models with the aBIC. Once that set is determined, the BLRT can be applied to the candidate models to confirm whether its results match those of the aBIC.
Applying the BLRT only to the candidate models, rather than throughout the enumeration procedure, still allows one to use the BLRT while reducing the heavy computational burden associated with this fit index. Subsequent analysis can then be pursued accordingly after the number of latent classes is determined. / Thesis / Master of Science (MSc) latent class analysis class enumeration latent variable models mplus simulations classification mixture models categorical data model selection latent class regression latent classes covariates measurement non-invariance direct effects
165	Understanding people movement and detecting anomalies using probabilistic generative models / Att förstå personförflyttningar och upptäcka anomalier genom att använda probabilistiska generativa modeller Hansson, Agnes January 2020 As intelligent access solutions begin to dominate the world, the statistical learning methods that account for their behavior need attention, as there is no clear answer to how an algorithm could learn and predict exactly how people move. This project investigates whether, with the help of unsupervised learning methods, it is possible to distinguish anomalies from normal events in an access system, and whether the cylinder most likely to be unlocked by a user can be calculated. The available data are a set of previous events in an access system together with the access configurations, and the algorithms used were an auto-encoder and a probabilistic generative model. The auto-encoder successfully encoded the high-dimensional data set into one of significantly lower dimension, and the probabilistic generative model, which was chosen to be a Gaussian mixture model, identified clusters in the data and assigned a measure of unexpectedness to the events. Lastly, the probabilistic generative model was used to compute the conditional probability that the user would choose a certain cylinder, given all the details of an event except which cylinder was chosen. The result was a correct guess in 65.7 % of the cases, which can be seen as a satisfactory number for something originating from an unsupervised problem.
/ As intelligent access solutions take over in society, it is necessary to give the statistical learning methods behind them sufficient attention, since there is no obvious answer to how an algorithm could learn and predict people's exact movement patterns. This project aims to investigate, using unsupervised learning, whether it is possible to distinguish anomalies from normal observations, and whether the lock cylinder that a user is most likely to try to unlock can be calculated. The data available for this project are a set of events from an access system, together with the associated access configurations. The algorithms used in the project were an auto-encoder and a probabilistic generative model. The auto-encoder succeeded, with satisfactory results, in encoding the high-dimensional data into data of considerably lower dimension, and the probabilistic generative model, which was chosen to be a Gaussian mixture model, succeeded in identifying clusters in the data and in assigning each observation a measure of its unexpectedness. Finally, the probabilistic generative model was used to compute the conditional probability that the user would choose a given lock cylinder, given all attributes of an event except the cylinder the user tried to open. The result was a correct guess in 65.7 % of the cases, which can be seen as a satisfactory figure for something originating from an unsupervised problem. Machine Learning Unsupervised Learning Generative Models Auto-encoders Gaussian Mixture Models Maskininlärning Oövervakad inlärning generativa modeller auto-encoders gaussiska mixtur-modeller Probability Theory and Statistics Sannolikhetsteori och statistik
166	A study about Active Semi-Supervised Learning for Generative Models / En studie om Aktivt Semi-Övervakat Lärande för Generativa Modeller Fernandes de Almeida Quintino, Elisio January 2023 In many relevant scenarios, there is an imbalance between abundant unlabeled data and scarce labeled data to train predictive models. Semi-Supervised Learning and Active Learning are two distinct approaches to deal with this issue. The first one directly uses the unlabeled data to improve model parameter learning, while the second performs a smart choice of unlabeled points to be sent to an annotator, or oracle, which can label these points and increase the labeled training set. In this context, Generative Models are highly appropriate, since they internally represent the data generating process, naturally benefiting from data samples independently of the presence of labels. This thesis proposes Expectation-Maximization with Density-Weighted Entropy, a novel active semi-supervised learning framework tailored towards generative models. The method is theoretically explored and experiments are conducted to evaluate its application to Gaussian Mixture Models and Multinomial Mixture Models. Based on its partial success, several questions are raised and discussed in order to identify possible improvements and decide which shortcomings need to be dealt with before the method is considered robust and generally applicable. / In many relevant scenarios, there is an imbalance between plentiful unannotated data and scarcer annotated data for training predictive models. Semi-Supervised Learning and Active Learning are two distinct methods for handling this issue. The first uses unannotated data directly to improve the learning of model parameters, while the second makes a smart choice of unannotated points to send to an annotator, or oracle, which can annotate these points and enlarge the annotated training set.
In this context, Generative Models are highly suitable, since they internally represent the data-generating process and benefit naturally from data examples regardless of the presence of labels. This Master's thesis proposes Expectation-Maximization with Density-Weighted Entropy, a new active semi-supervised learning method tailored to generative models. The method is explored theoretically, and experiments are carried out to evaluate its application to Gaussian Mixture Models and Multinomial Mixture Models. Based on its partial success, several questions are raised and discussed in order to identify possible improvements and determine which shortcomings must be addressed before the method can be considered robust and generally applicable. Semi-Supervised Learning Active Learning Generative Models Mixture Models Semi-Övervakad Inlärning Aktiv Inlärning Generativa Modeller Mixturmodeller Probability Theory and Statistics Sannolikhetsteori och statistik
167	Assessment of Modern Statistical Modelling Methods for the Association of High-Energy Neutrinos to Astrophysical Sources / Bedömning av moderna statistiska modelleringsmetoder för associering av högenergetiska neutroner till astrofysiska källor Minoz, Valentin January 2021 The search for the sources of astrophysical neutrinos is a central open question in particle astrophysics. Thanks to substantial experimental efforts, we now have large-scale neutrino detectors in the oceans and polar ice. The neutrino sky seems mostly isotropic, but hints of possible source-neutrino associations have started to emerge, leading to much excitement within the astrophysics community. As more data are collected and future experiments planned, the question of how to statistically quantify point source detection in a robust way becomes increasingly pertinent. The standard approach to null-hypothesis testing leads to reporting the results in terms of a p-value, with detection typically corresponding to surpassing the coveted 5-sigma threshold. While widely used, p-values and significance thresholds are notorious in the statistical community as challenging to interpret and potentially misleading. We explore an alternative Bayesian approach to reporting point source detection and the connections and differences with the frequentist view. In this thesis, two methods for associating neutrino events to candidate sources are implemented on data from a simplified simulation of high-energy neutrino generation and detection. One is a maximum likelihood-based method that has been used in some high-profile articles, and the alternative uses Bayesian hierarchical modelling with Hamiltonian Monte Carlo to sample the joint posterior of key parameters. Both methods are applied to a set of test cases to gauge their differences and similarities when applied to identical data.
The comparisons suggest the applicability of this Bayesian approach as an alternative or complement to the frequentist one, and illustrate how the two approaches differ. A discussion is also conducted on the applicability and validity of the study itself as well as some potential benefits of incorporating a Bayesian framework, with suggestions for additional aspects to analyze. / The search for the sources of astrophysical neutrinos is a central open question in astroparticle physics. Thanks to extensive experimental efforts, we now have large-scale neutrino detectors in the oceans and the polar ice. The neutrino sky appears mostly isotropic, but hints of possible source-neutrino associations have begun to emerge, which has led to much excitement in the astrophysics community. As more data are collected and future experiments are planned, the question of how to statistically quantify point source detection in a robust way becomes increasingly relevant. The standard method for null-hypothesis testing often leads to results being reported in terms of p-values, with a specific significance threshold being sought. While widely used, p-values and significance thresholds are much debated in the statistical community with regard to their interpretation. We explore an alternative Bayesian approach to evaluating point source detection and compare it with the frequentist viewpoint. In this thesis, two methods for associating neutrino events with candidate sources are applied to simulated data. The first uses a maximum likelihood method adapted from certain high-profile reports, while the second uses Hamiltonian Monte Carlo to approximate the joint posterior distribution of the model's parameters. Both methods are applied to a set of test cases to assess their differences and similarities when applied to identical data.
The comparisons suggest the applicability of the Bayesian approach as an alternative or complement to the classical one, and illustrate how the two methods differ. A discussion is also held on the validity of the study itself, as well as some potential advantages of using a Bayesian framework, with suggestions for further aspects to analyze. Statistics Astrophysics Neutrino sources Mixture models Monte Carlo methods Maximum likelihood estimation Statistik Astrofysik Neutrinokällor Blandningsmodeller Monte Carlo-metoder Maximum likelihood-metoden Probability Theory and Statistics Sannolikhetsteori och statistik
168	Adaptive Mixture Estimation and Subsampling PCA Liu, Peng January 2009 No description available. Statistics large data data mining mixture models Gaussian mixtures parameter estimation adaptive procedure partial EM high-dimensional data large p small n dimension reduction feature selection subsampling
169	Assessment of Soil Corrosion in Underground Pipelines via Statistical Inference Yajima, Ayako 10 September 2015 No description available. Civil Engineering Soil corrosion Corrosion assessment ECDA ILI Reliability Gaussian mixture models Clustering analysis Missing data analysis truncated distribution Generalized exponential distribution Bayesian inference MCMC
170	Deep Learning Framework for Trajectory Prediction and In-time Prognostics in the Terminal Airspace Varun S Sudarsanan (13889826) 06 October 2022 Terminal airspace around an airport is the biggest bottleneck for commercial operations in the National Airspace System (NAS). In order to prognosticate the safety status of the terminal airspace, effective prediction of the airspace evolution is necessary. While there are fixed procedural structures for managing operations at an airport, the confluence of a large number of aircraft and the complex interactions between the pilots and air traffic controllers make it challenging to predict its evolution. Modeling the high-dimensional spatio-temporal interactions in the airspace given different environmental and infrastructural constraints is necessary for effective predictions of future aircraft trajectories that characterize the airspace state at any given moment. A novel deep learning architecture using Graph Neural Networks is proposed to predict trajectories of aircraft 10 minutes into the future and estimate prognostic metrics for the airspace. The uncertainty in the future is quantified by predicting distributions of future trajectories instead of point estimates. The framework’s viability for trajectory prediction and prognosis is demonstrated with terminal airspace data from Dallas Fort Worth International Airport (DFW). Deep Learning Applications Terminal airspace Aviation Safety prognostics prediction model Graph Neural Network (GNN) Uncertainty Quantification Gaussian Mixture Models

Search results