351 |
Multivariate analysis of high-throughput sequencing data / Analyses multivariées de données de séquençage à haut débit. Durif, Ghislain, 13 December 2016
The statistical analysis of next-generation sequencing (NGS) data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension-reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection. Developments are made in two directions: the sparse Partial Least Squares (PLS) regression framework for supervised classification, and the sparse matrix factorization framework for unsupervised exploration. In both situations, the main purpose is the reconstruction and visualization of the data. First, we present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression to predict the label of a discrete outcome. Such a method is used for prediction (fate of patients or specific type of unidentified single cells) based on gene expression profiles. The main issue in this framework is to account for the response in order to discard irrelevant variables. We highlight the direct link between the derivation of the algorithms and the reliability of the results.
Then, motivated by questions regarding single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), and we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. The interest of our procedure for data reconstruction, visualization and clustering is illustrated by simulation experiments and by preliminary results on single-cell data analysis. All proposed methods are implemented in two R packages, plsgenomics and CMF, relying on high-performance computing.
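As a rough illustration of the compression-plus-selection idea behind sparse PLS, the Python sketch below computes a single sparse PLS component by soft-thresholding the covariance-driven weight vector. It uses a fixed threshold on synthetic data; it is not the adaptive penalty, the logistic-regression extension or the plsgenomics implementation described in the abstract.

```python
import numpy as np

def sparse_pls_component(X, y, lam):
    """One sparse PLS direction: soft-threshold the X^T y weights, then score.

    A generic sketch with a fixed threshold `lam`, not the adaptive penalty
    developed in the thesis.
    """
    w = X.T @ y                                         # covariance-driven weight vector
    w = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)   # soft-thresholding induces sparsity
    if not np.any(w):
        raise ValueError("lam too large: all weights were shrunk to zero")
    w /= np.linalg.norm(w)                              # unit-norm sparse loading
    t = X @ w                                           # latent component (score)
    return w, t

# toy data: 100 samples, 50 genes, only the first 5 genes carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=100)
Xc, yc = X - X.mean(axis=0), y - y.mean()
w, t = sparse_pls_component(Xc, yc, lam=50.0)
print("variables kept in the first component:", np.nonzero(w)[0])
```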
|
352 |
Neuronové sítě pro doporučování knih / Deep Book Recommendation. Gráca, Martin, January 2018
This thesis deals with the field of recommendation systems using deep neural networks and their use in book recommendation. The main traditional recommender systems are analysed and summarized, as well as systems based on more advanced machine-learning techniques. The core of the thesis is to use convolutional neural networks for natural language processing and to create a hybrid book recommendation system. The suggested system includes matrix factorization and makes recommendations based on user ratings and book metadata, including text descriptions. I designed two models, one using a bag-of-words technique and one using a convolutional neural network. Both of them outperform baseline methods. On the data set created from Goodreads, the CNN model outperforms the BOW model.
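For illustration, here is a minimal sketch of the collaborative-filtering half of such a hybrid system: plain matrix factorization trained by stochastic gradient descent on toy ratings (Python/numpy). The text branch (BOW or a CNN over book descriptions) and the thesis's actual models are not reproduced here.

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, k=8, lr=0.02, reg=0.05, epochs=200, seed=0):
    """Plain matrix factorization trained by SGD on (user, item, rating) triples.

    Illustrates only the collaborative-filtering part of a hybrid recommender;
    the text branch over book descriptions is left out.
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.normal(size=(n_users, k))   # user latent factors
    Q = 0.1 * rng.normal(size=(n_items, k))   # book latent factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]             # prediction error on one rating
            pu = P[u].copy()                  # keep old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# toy interactions: 4 users, 5 books, ratings on a 1-5 scale
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 2.0),
           (2, 3, 5.0), (2, 4, 4.0), (3, 2, 1.0), (3, 4, 5.0)]
P, Q = mf_sgd(ratings, n_users=4, n_items=5)
print("predicted rating of book 1 by user 1:", round(float(P[1] @ Q[1]), 2))
```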
|
353 |
Neuronové sítě pro doporučování knih / Deep Book Recommendation. Gráca, Martin, January 2018
This thesis deals with the field of recommendation systems using deep neural networks and their use in book recommendation. The main traditional recommender systems are analysed and summarized, as well as systems based on more advanced machine-learning techniques. The core of the thesis is the use of convolutional neural networks for natural language processing and the creation of a book recommendation system. The suggested system makes recommendations based on user data, including user reviews, and book data, including full texts.
|
354 |
Méthodes rapides de traitement d’images hyperspectrales. Application à la caractérisation en temps réel du matériau bois / Fast methods for hyperspectral images processing. Application to the real-time characterization of wood material. Nus, Ludivine, 12 December 2019
This PhD dissertation addresses the on-line unmixing of hyperspectral images acquired by a pushbroom imaging system, for the real-time characterization of wood. The first part of this work proposes an on-line mixing model based on non-negative matrix factorization. Based on this model, three algorithms for on-line sequential unmixing are developed, using multiplicative update rules, the Nesterov optimal gradient and ADMM (Alternating Direction Method of Multipliers) optimization, respectively. These algorithms are specially designed to perform the unmixing in real time, at the acquisition rate of the pushbroom imager. In order to regularize the (generally ill-posed) estimation problem, two types of constraints on the endmembers are used: a minimum-dispersion constraint and a minimum-volume constraint. A method for the unsupervised estimation of the regularization parameter is also proposed, by reformulating the on-line hyperspectral unmixing problem as a bi-objective optimization problem. In the second part of this manuscript, we propose an approach for handling the variation of the number of sources, i.e. the rank of the decomposition, during the processing. The previously developed on-line algorithms are thus modified by introducing a hyperspectral library learning stage as well as sparsity penalties that select only the active sources. Finally, the third part of this work applies these approaches to the detection and classification of singularities in wood.
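A bare-bones sketch of the multiplicative-update flavour of such a scheme is given below (Python/numpy, synthetic data). It processes one pushbroom line at a time, estimating abundances with Lee-Seung multiplicative updates and refreshing the endmembers from accumulated sufficient statistics; the minimum-volume and minimum-dispersion regularizations, the automatic tuning of the regularization parameter and the rank adaptation of the thesis are all omitted, and the on-line endmember update shown is a generic stand-in, not the thesis's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bands, n_pixels, R = 64, 32, 3     # spectral bands, pixels per pushbroom line, endmembers

# ground-truth endmembers and a stream of acquired lines (toy data)
W_true = np.abs(rng.normal(size=(n_bands, R)))
lines = [W_true @ np.abs(rng.normal(size=(R, n_pixels))) for _ in range(50)]

eps = 1e-12
W = np.abs(rng.normal(size=(n_bands, R)))   # current endmember estimates
A = np.zeros((n_bands, R))                  # accumulated X H^T
B = np.zeros((R, R))                        # accumulated H H^T

for X in lines:                             # one pushbroom line at a time
    H = np.abs(rng.normal(size=(R, n_pixels)))
    for _ in range(30):                     # abundance step (W fixed): multiplicative updates
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    A += X @ H.T                            # sufficient statistics for the endmember step
    B += H @ H.T
    W *= A / (W @ B + eps)                  # on-line multiplicative update of the endmembers

print("relative reconstruction error on the last acquired line:",
      np.linalg.norm(lines[-1] - W @ H) / np.linalg.norm(lines[-1]))
```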
|
355 |
Méthodes symboliques pour les systèmes différentiels linéaires à singularité irrégulière / Symbolic methods for linear differential systems with irregular singularity. Saade, Joelle, 05 November 2019
This thesis is devoted to symbolic methods for the local resolution of linear differential systems with coefficients in K = C((x)), the field of Laurent series, over an effective field C. More specifically, we are interested in effective algorithms for formal reduction. During the reduction, we are led to introduce algebraic extensions of the field of coefficients K (algebraic extensions of C, ramifications of the variable x) in order to obtain a finer structure. From an algorithmic point of view, it is preferable to delay the introduction of these extensions as much as possible. To this end, we develop a new formal reduction algorithm that uses the ring of endomorphisms of the system, called the "eigenring", in order to reduce to the case of a system that is indecomposable over K. Using the formal classification given by Balser-Jurkat-Lutz, we deduce the structure of the eigenring of an indecomposable system. These theoretical results allow us to construct a decomposition over the base field K that separates the different exponential parts of the system, and thus to isolate, in subsystems that are indecomposable over K, the different field extensions that may appear, so that they can be treated separately. In a second part, we are interested in Miyake's algorithm for formal reduction, which is based on the computation of the Volevic weight and numbers of the valuation matrix of the system.
We give interpretations of the Volevic weight and numbers in graph theory and tropical algebra, and thus obtain practically efficient computation methods based on linear programming. This completes a fundamental step in Miyake's reduction algorithm. These algorithms are implemented as libraries for the computer algebra software Maple. Finally, we discuss the performance of the eigenring-based reduction algorithm and compare, in terms of computation time, our linear-programming implementation of Miyake's reduction algorithm with the algorithms of Barkatou and Pflügel.
|
356 |
The Main Diagonal of a Permutation Matrix. Lindner, Marko; Strang, Gilbert, January 2011
By counting 1's in the "right half" of 2w consecutive rows, we locate the main diagonal of any doubly infinite permutation matrix with bandwidth w. Then the matrix can be correctly centered and factored into block-diagonal permutation matrices.
Part II of the paper discusses the same questions for the much larger class of band-dominated matrices. The main diagonal is determined by the Fredholm index of a singly infinite submatrix. Thus the main diagonal is determined "at infinity" in general, but from only 2w rows for banded permutations.
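To make the counting rule concrete, the toy Python check below builds a banded permutation of the integers that is the identity outside a small reshuffled block, shifts it off-centre by s columns, and counts the 1's falling in the "right half" of windows of 2w consecutive rows. Reading the offset of the main diagonal as count minus w is our interpretation of the rule for this finite example; the doubly infinite setting and the factorization result of the paper are not reproduced.

```python
import numpy as np

# A banded permutation of the integers, written as a map i -> pi(i):
# identity outside a small reshuffled block, then shifted s columns to the right.
n, s = 100, 2
sigma = np.arange(n)
sigma[[40, 41]] = [41, 40]          # local swap, displacement <= 1
pi = sigma + s                      # shift every 1 two columns to the right
w = int(np.max(np.abs(pi - np.arange(n))))   # bandwidth of the shifted matrix (here w = 3)

def right_half_count(pi, k, w):
    """Count the 1's of P[i, pi(i)] = 1 in rows k+1..k+2w that land in columns j > k+w."""
    rows = range(k + 1, k + 2 * w + 1)
    return sum(1 for i in rows if pi[i] > k + w)

for k in (30, 37, 39, 41, 60):      # windows of 2w consecutive rows, away from the array ends
    c = right_half_count(pi, k, w)
    print(f"window starting after row {k}: count = {c}, count - w = {c - w}")
# In this toy example every printed window reports count - w = 2, the shift s applied above,
# i.e. the column offset of the main diagonal.
```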
|
357 |
Block SOR Preconditioned Projection Methods for Kronecker Structured Markovian Representations. Buchholz, Peter; Dayar, Tuğrul, 15 January 2013
Kronecker structured representations are used to cope with the state space explosion problem in Markovian modeling and analysis. A currently open research problem is that of devising strong preconditioners to be used with projection methods for the computation of the stationary vector of Markov chains (MCs) underlying such representations. This paper proposes a block SOR (BSOR) preconditioner for Hierarchical Markovian Models (HMMs) that are composed of multiple low level models and a high level model that defines the interaction among the low level models. The Kronecker structure of an HMM yields nested block partitionings in its underlying continuous-time MC which may be used in the BSOR preconditioner. The computation of the BSOR preconditioned residual in each iteration of a preconditioned projection method becomes the problem of solving multiple nonsingular linear systems whose coefficient matrices are the diagonal blocks of the chosen partitioning. The proposed BSOR preconditioner solves these systems using sparse LU or real Schur factors of the diagonal blocks. The fill-in of sparse LU factorized diagonal blocks is reduced using the column approximate minimum degree algorithm (COLAMD). A set of numerical experiments is presented to show the merits of the proposed BSOR preconditioner.
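The computational pattern described here (nested blocks induced by a Kronecker structure, sparse LU factors of the diagonal blocks, and a forward block-SOR sweep applied as a preconditioner inside a projection method) can be sketched in Python with scipy. The example below preconditions GMRES on a small two-component Kronecker-sum matrix; it is a nonsingular toy system, so the singularity and normalization issues of an actual stationary-vector computation, and the HMM machinery of the paper, are deliberately left out.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(0)
n1, n2, omega = 6, 40, 1.0   # n1 blocks of size n2; omega = 1 gives block Gauss-Seidel

# A Kronecker-structured matrix A = A1 (x) I + I (x) A2 (diagonally dominant, hence nonsingular)
A1 = sp.random(n1, n1, density=0.5, random_state=0) + n1 * sp.eye(n1)
A2 = sp.random(n2, n2, density=0.1, random_state=1) + n2 * sp.eye(n2)
A = (sp.kron(A1, sp.eye(n2)) + sp.kron(sp.eye(n1), A2)).tocsr()

# Diagonal blocks of the partitioning induced by the first Kronecker factor,
# factorized once with sparse LU (a stand-in for the LU/real-Schur factors of the paper)
A1d = A1.toarray()
diag_lu = [spla.splu((A1d[i, i] * sp.eye(n2) + A2).tocsc()) for i in range(n1)]

def bsor_apply(r):
    """Forward block-SOR sweep: solve (D/omega + L) z = r block by block."""
    r = r.reshape(n1, n2)
    z = np.zeros_like(r)
    for i in range(n1):
        rhs = r[i] - sum(A1d[i, j] * z[j] for j in range(i))   # subtract strictly lower blocks
        z[i] = omega * diag_lu[i].solve(rhs)
    return z.ravel()

M = spla.LinearOperator(A.shape, matvec=bsor_apply)
b = rng.normal(size=n1 * n2)
x, info = spla.gmres(A, b, M=M)
print("gmres info:", info, " residual norm:", np.linalg.norm(b - A @ x))
```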
|
358 |
Block SOR for Kronecker structured representations. Buchholz, Peter; Dayar, Tuğrul, 15 January 2013
Hierarchical Markovian Models (HMMs) are composed of multiple low level models (LLMs) and a high level model (HLM) that defines the interaction among the LLMs. The essence of the HMM approach is to model the system at hand in the form of interacting components so that its (larger) underlying continuous-time Markov chain (CTMC) is not generated but implicitly represented as a sum of Kronecker products of (smaller) component matrices. The Kronecker structure of an HMM induces nested block partitionings in its underlying CTMC. These partitionings may be used in block versions of classical iterative methods based on splittings, such as block SOR (BSOR), to solve the underlying CTMC for its stationary vector. Therein the problem becomes that of solving multiple nonsingular linear systems whose coefficient matrices are the diagonal blocks of a particular partitioning. This paper shows that in each HLM state there may be diagonal blocks with identical off-diagonal parts and diagonals differing from each other by a multiple of the identity matrix. Such diagonal blocks are named candidate blocks. The paper explains how candidate blocks can be detected and how they can mutually benefit from a single real Schur factorization. It gives sufficient conditions for the existence of diagonal blocks with real eigenvalues and shows how these conditions can be checked using component matrices. It describes how the sparse real Schur factors of candidate blocks satisfying these conditions can be constructed from component matrices and their real Schur factors. It also demonstrates how the fill-in of LU factorized (non-candidate) diagonal blocks can be reduced using the column approximate minimum degree algorithm (COLAMD). Then it presents a three-level BSOR solver in which the diagonal blocks at the first level are solved using block Gauss-Seidel (BGS) at the second level and the real Schur and LU factorization methods at the third level. Finally, a set of numerical experiments shows how these ideas can be used to reduce the storage required by the factors of the diagonal blocks at the third level and to improve the solution time compared to an all-LU-factorization implementation of the three-level BSOR solver.
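The observation behind candidate blocks can be checked numerically: if B = A + cI and A = Q T Q^T is a real Schur factorization, then B = Q (T + cI) Q^T, so one factorization serves both blocks. A small sketch (Python/scipy, on a random stand-in block rather than an actual HMM diagonal block):

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
n, c = 8, 2.5

A = rng.normal(size=(n, n))    # stands in for one diagonal block
B = A + c * np.eye(n)          # a candidate partner: same off-diagonal part,
                               # diagonal shifted by a multiple of the identity

T, Q = schur(A, output='real')              # single real Schur factorization A = Q T Q^T
shared = Q @ (T + c * np.eye(n)) @ Q.T      # reuse the same Q and a shifted T for B
print("factor reuse error for B:", np.linalg.norm(B - shared))   # close to machine precision

# When the block is known to have real eigenvalues (the sufficient conditions checked
# in the paper), T is genuinely triangular, so solving B x = b with the shared factors
# costs two orthogonal multiplications and one triangular solve.
```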
|
359 |
A Confirmatory Analysis for Automating the Evaluation of Motivation Letters to Emulate Human Judgment. Mercado Salazar, Jorge Anibal; Rana, S M Masud, January 2021
Manually reading, evaluating, and scoring motivation letters as part of the admissions process is a time-consuming and tedious task for Dalarna University's program managers. An automated scoring system would relieve them of this work and allow much faster decisions when selecting applicants for admission. The aim of this thesis was to analyse current human judgment and attempt to emulate it using machine learning techniques. We used various topic modelling methods, such as Latent Dirichlet Allocation and Non-Negative Matrix Factorization, to find the most interpretable topics, build a bridge between topics and human-defined factors, and finally evaluate model performance by predicting scoring values and measuring accuracy using logistic regression, discriminant analysis, and other classification algorithms. Although we were able to recover the meaning of almost all human-defined factors, the topic models' accuracy in predicting the overall score was unexpectedly low. Setting a threshold on the overall score to select applicants for admission yielded good overall accuracy, but not consistently good precision or recall. During our investigation, we attempted to determine the possible causes of these unexpected results and found that not only the limitations of topic modelling but also human bias play a role.
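A minimal sketch of the modelling pipeline described here, in Python with scikit-learn: vectorize the letters, extract topics with NMF, and feed the per-document topic weights to a logistic regression classifier. The texts and labels below are invented placeholders, not the actual motivation letters or human scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# placeholder corpus and admit/reject labels standing in for motivation letters and human scores
letters = [
    "I am motivated to study data science and apply machine learning",
    "My background in statistics and programming prepares me for this program",
    "I like travelling and want to live in Sweden",
    "I need a visa and this program seems easy",
    "Research experience in data analysis motivates my application",
    "I want to study because my friend studies here",
]
labels = [1, 1, 0, 0, 1, 0]

pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    NMF(n_components=3, init="nndsvda", random_state=0),   # per-letter topic weights as features
    LogisticRegression(max_iter=1000),                     # predict the human decision
)
pipeline.fit(letters, labels)
print(pipeline.predict(["I am motivated by research in machine learning and statistics"]))
```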
|
360 |
Identification of Sources of Air Pollution Using Novel Analytical Techniques and Instruments. Bhardwaj, Nitish, 31 March 2022
This dissertation is a collection of studies that investigate the issue of air pollution in the field of environmental chemistry. My thesis consists of research work done to measure the concentration of particulate matter (PM) and gas-phase species in ambient air. High concentrations of PM are a significant problem in Utah and in other regions of the world. Particles having an aerodynamic diameter of 2.5 micrometers and smaller (PM2.5) play a crucial role in air pollution and pose serious health risks when inhaled. PM is composed of both organic and inorganic components; the organic fraction ranges from 10% to 90% of the total particle mass. Several methods have been employed to measure the organic fraction of PM, but these techniques require extensive laboratory analysis and expensive benchtop equipment, and they do a poor job of capturing diurnal variations in the concentrations of ambient organic compounds. The Hansen Lab has developed a new instrument, the Organic Aerosol Monitor (OAM), a platform based on gas chromatography followed by mass spectrometry detection, for measuring the carbonaceous component of PM2.5 on an hourly averaged basis. Organic marker data collected in 2016 using the OAM were used in a Positive Matrix Factorization (PMF) analysis to identify the sources of PM in West Valley City, Utah. Additionally, data were collected in Richfield and Vernal, UT in 2017-2018 to quantitatively monitor the composition of organic markers in PM2.5. Some previously unidentified organic compounds in PM were successfully identified during this study, including terpenes, polycyclic aromatic hydrocarbons (PAHs), diethyl phthalate, and some herbicides and pesticides. Gas-phase species play a significant role in driving the formation of air pollutants in Earth's atmosphere. Traditional gas detection methods do not provide data with high temporal and spatial resolution; it is therefore important to detect and measure gas-phase species both qualitatively and quantitatively to better understand the sources of air pollution. An incoherent broadband cavity enhanced absorption spectrometer (IBBCEAS) combines a broadband incoherent light source, a stable optical cavity formed by two highly reflective mirrors, and a charge-coupled device (CCD) detector to quantitatively measure gas-phase compounds present in the atmosphere. The concentrations of formaldehyde (HCHO) were measured using IBBCEAS to investigate the sources of this compound in Bountiful, Utah during 2019. Another important species is the OH radical, one of the most predominant oxidizing species in the atmosphere. It is found at very low concentrations, 0.1 ppt, which makes its detection challenging. A new IBBCEAS instrument has been designed, and elements of this instrument were tested by measuring OH overtones in a variety of short-chain alcohols. A set of experiments was conducted to measure the absorption cross-sections of the 5th and 6th OH vibrational overtones in a series of short-chain alcohols by IBBCEAS. Because the OH radical's lowest-energy electronic transition occurs in the same wavelength region (i.e., 308 nm) in which SO2 absorbs (300-310 nm), a study was conducted in which the concentrations of SO2 were measured using an IBBCEAS instrument and compared with a commercially available SO2 monitor.
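Positive Matrix Factorization decomposes a time-by-species concentration matrix X into nonnegative source contributions G and source profiles F (X is approximately GF), with residuals weighted by measurement uncertainties. The sketch below uses plain NMF (Python/scikit-learn) as an unweighted stand-in on synthetic organic-marker data; the marker names, source profiles and factor interpretations are invented for illustration and are not the study's measurements or its PMF analysis.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
markers = ["levoglucosan", "pyrene", "benzo[a]pyrene", "diethyl_phthalate", "pinene_SOA"]

# two invented source profiles (rows: wood smoke, vehicle exhaust) over the five markers
profiles = np.array([[0.80, 0.05, 0.05, 0.02, 0.30],
                     [0.02, 0.40, 0.35, 0.10, 0.05]])
contributions = np.abs(rng.normal(size=(240, 2)))          # hourly source strengths, 10 days
X = contributions @ profiles + 0.01 * np.abs(rng.normal(size=(240, 5)))

model = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
G = model.fit_transform(X)   # estimated source contributions (time x factors)
F = model.components_        # estimated source profiles (factors x markers)

for k, profile in enumerate(F):
    top = [markers[j] for j in np.argsort(profile)[::-1][:2]]
    print(f"factor {k}: dominant markers {top}")
# Unlike a full PMF analysis, this ignores per-measurement uncertainty weighting and
# rotational ambiguity; it only illustrates the factor-analytic idea behind source apportionment.
```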
|