1 |
Ensembles des modeles en fMRI : l'apprentissage stable à grande échelle / Ensembles of models in fMRI : stable learning in large-scale settings
Hoyos-Idrobo, Andrés, 20 January 2017 (has links)
In medical imaging, collaborative worldwide initiatives have begun the acquisition of hundreds of terabytes of data, in particular functional Magnetic Resonance Imaging (fMRI) data, that are made available to the scientific community. However, this signal requires extensive fitting and noise reduction steps to extract useful information. The complexity of these analysis pipelines yields results that are highly dependent on the chosen parameters. The computation cost of this data deluge is worse than linear: as datasets no longer fit in cache, standard computational architectures cannot be used efficiently. To speed up computation, we considered dimensionality reduction by feature grouping, using clustering methods to perform this task. We introduce a linear-time agglomerative clustering scheme, Recursive Nearest Agglomeration (ReNA). Unlike existing fast agglomerative schemes, it avoids the creation of giant clusters. We then show empirically that this clustering algorithm yields very fast and accurate models, making it possible to process large datasets with limited resources. In neuroimaging, machine learning can be used to understand the cognitive organization of the brain. The idea is to build predictive models that identify the brain regions involved in the cognitive processing of an external stimulus. However, training such estimators is a high-dimensional problem, and one needs to impose a prior to find a suitable model. To handle large datasets and increase the stability of results, we propose to use ensembles of models in combination with clustering. We study the empirical performance of this pipeline on a large number of brain imaging datasets. The method is highly parallelizable, has lower computation time than state-of-the-art methods, and, as we show, requires fewer data samples to achieve better prediction accuracy. Finally, we show that ensembles of models improve the stability of the weight maps and reduce the variance of prediction accuracy.
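To make the pipeline above more concrete, here is a minimal Python sketch of the general idea (feature grouping followed by an ensemble of linear models), assuming random stand-in data and using scikit-learn's FeatureAgglomeration in place of ReNA, which is not part of scikit-learn; the cluster count, estimator choices and data shapes are illustrative assumptions, not the thesis implementation.

import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(200, 5000)            # 200 samples, 5000 stand-in "voxel" features
y = rng.randint(0, 2, size=200)     # toy binary condition labels

pipeline = make_pipeline(
    FeatureAgglomeration(n_clusters=500),            # group correlated features (stand-in for ReNA)
    BaggingClassifier(                               # ensemble of linear models for stability
        estimator=LogisticRegression(max_iter=1000), # "base_estimator" in older scikit-learn
        n_estimators=20,
        random_state=0,
    ),
)
scores = cross_val_score(pipeline, X, y, cv=5)       # each fold: cluster, then fit the ensemble
print("mean cross-validated accuracy:", scores.mean())

Fitting many cheap linear models on the clustered features is what makes this kind of pipeline easy to parallelize: each ensemble member can be trained independently on the reduced representation.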
|
2 |
LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification
Mungre, Surbhi, January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / Doina Caragea / Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms is largely dependent on the availability of labeled data and the approach used to represent DNA sequences as feature vectors. For many organisms, the labeled DNA data is scarce, while the unlabeled data is easily available. However, for a small number of well-studied model organisms, large amounts of labeled data are available. This calls for domain adaptation approaches, which can transfer knowledge from a source domain, for which labeled data is available, to a target domain, for which large amounts of unlabeled data are available.
Intuitively, one approach to domain adaptation can be obtained by extracting and representing the features that the source domain and the target domain sequences share. Latent Dirichlet Allocation (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches for DNA sequences, LDA Words (LDAW) and LDA Distribution (LDAD). LDA is a probabilistic model, generative in nature, used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a "document" and the k-mers obtained from a sequence are the "document words". We use LDA to model our sequence collection. Given the LDA model, each document can be represented as a distribution over topics (where a topic can be seen as a distribution over k-mers). In the LDAW method, we use the top k-mers in each topic as our features (i.e., the k-mers with the highest probability), while in the LDAD method, we use the topic distribution to represent a document as a feature vector. We study LDA-based dimensionality reduction approaches for both supervised DNA sequence classification and domain adaptation. We apply the proposed approaches to the splice site prediction problem, an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of the LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distance between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used to perform dimensionality reduction and domain adaptation for DNA sequence classification problems.
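As a rough illustration of the LDAD representation described above (each sequence is a "document", its k-mers are "words", and the topic distribution becomes the feature vector), the following Python sketch uses scikit-learn's LatentDirichletAllocation on toy sequences; the k-mer size, topic count and sequences are assumptions for illustration, not the settings used in this work.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def to_kmers(seq, k=3):
    # Turn a sequence into overlapping k-mers, space-separated for CountVectorizer.
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

sequences = ["ACGTACGTGGCA", "TTGACGTACGAA", "GGCATTGACGTT"]   # toy DNA sequences
docs = [to_kmers(s) for s in sequences]

counts = CountVectorizer().fit_transform(docs)      # k-mer count matrix (documents x k-mers)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
features = lda.fit_transform(counts)                # LDAD-style features: topic distributions
print(features)                                     # one row of topic weights per sequence

# LDAW-style features would instead keep the top k-mers of each topic,
# e.g. via lda.components_.argsort(axis=1)[:, -top_n:].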
|
3 |
Redução dimensional de dados de alta dimensão e poucas amostras usando Projection Pursuit / Dimension reduction of datasets with large dimensionalities and few samples using Projection Pursuit
Espezua Llerena, Soledad, 30 July 2013 (has links)
Reducing the dimension of datasets is an important step in pattern recognition and machine learning processes. Projection Pursuit (PP) has emerged as a relevant technique for that purpose: it aims to find projections of the data in low-dimensional spaces where interesting structures are revealed. Despite the success of PP in many dimension reduction problems, the literature shows limited application of it to datasets with large numbers of features and few samples, such as those obtained in molecular biology. In this work we study ways to take advantage of the potential of PP in problems of large dimensionality and few samples, so as to ease the subsequent construction of classifiers. Among the main contributions of this work are: i) Sequential Projection Pursuit Modified (SPPM), an improved method for searching projections, based on a genetic algorithm and specialized crossover operators; and ii) Block Sequential Projection Pursuit Modified (Block-SPPM) and Whitened Sequential Projection Pursuit Modified (W-SPPM), two strategies for applying SPPM to problems with more attributes than samples. The first strategy is based on partitioning the attribute space, while the latter is based on a pre-compaction of the data followed by a projection search. Experimental evaluations on public gene-expression datasets showed the efficacy of the proposals in improving the accuracy of popular classifiers with respect to several representative dimension reduction methods, covering both feature selection and feature extraction, with W-SPPM offering the best compromise between accuracy and computational cost.
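As a loose, assumption-laden illustration of the projection pursuit idea behind SPPM (searching for a projection that maximizes an "interestingness" index), the Python sketch below uses plain random search and absolute excess kurtosis as the index; the actual SPPM method relies on a genetic algorithm with specialized crossover operators, which is not reproduced here.

import numpy as np
from scipy.stats import kurtosis

rng = np.random.RandomState(0)
X = rng.randn(50, 2000)                      # few samples, many features (toy data)
X -= X.mean(axis=0)                          # center the data

def projection_index(w, X):
    # "Interestingness" of the 1-D projection X @ w: deviation from Gaussianity.
    return abs(kurtosis(X @ w))

best_w, best_score = None, -np.inf
for _ in range(500):                         # crude random search over directions
    w = rng.randn(X.shape[1])
    w /= np.linalg.norm(w)
    score = projection_index(w, X)
    if score > best_score:
        best_w, best_score = w, score

projected = X @ best_w                       # data projected onto the best direction found
print("best projection index value:", best_score)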
|
4 |
E-noses equipped with Artificial Intelligence Technology for diagnosis of dairy cattle disease in veterinary / E-nose utrustad med Artificiell intelligens teknik avsedd för diagnos av mjölkboskap sjukdom i veterinär
Haselzadeh, Farbod, January 2021 (has links)
The main goal of this project, carried out at Neurofy AB, was to develop an AI recognition algorithm, also known as a gas-sensing algorithm or simply a recognition algorithm, based on Artificial Intelligence (AI) technology, able to detect or predict dairy cattle diseases using odor signal data gathered, measured and provided by a Gas Sensor Array (GSA), also known as an Electronic Nose or simply E-nose, developed by the company. Two major challenges in this project were, first, to overcome the noise and errors in the odor signal data, as the E-nose is meant to be used in an environment with different conditions than a laboratory, for instance in a bail (a stall for milking cows) with varying humidity and temperature, and second, to find a feature extraction method appropriate for the GSA. Normalization and Principal Component Analysis (PCA) are two classic methods that are not only intended for re-scaling and reducing the features of a dataset in the pre-processing phase of developing an odor identification algorithm, but are also thought to reduce the effect of noise in odor signal data. Applying classic approaches such as PCA for feature extraction and dimensionality reduction led to a loss of valuable information, which made odor classification difficult. Instead of PCA, a new method was therefore developed for the feature extraction stage, to handle noise in the odor signal data and to perform dimensionality reduction without losing valuable information. This method, consisting of signal segmentation and an autoencoder with an encoder-decoder structure, made it possible to overcome the noise issues in the datasets, and it proved to be a more appropriate feature extraction method, as the AI gas recognition algorithm achieved better prediction accuracy with it than with PCA. To evaluate the autoencoder, its learning rate was monitored. For classification and prediction of odors, several classifiers were investigated, among others Logistic Regression (LR), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Random Forest Classifier (RFC) and MultiLayer Perceptron (MLP). The best predictions were obtained with the MLP classifier. To validate the predictions obtained by the new AI recognition algorithm, several validation methods were applied, such as cross-validation, accuracy score, balanced accuracy score, precision score, recall score, and learning curve. This new AI recognition algorithm is able to diagnose 3 different dairy cattle diseases with an accuracy of 96%, despite the limited number of samples.
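As a toy Python sketch of the kind of pipeline described above (autoencoder-style feature extraction followed by a classifier), the code below trains a single-hidden-layer MLP to reconstruct random stand-in sensor data and feeds the hidden-layer activations to an MLP classifier; all shapes, layer sizes and labels are assumptions, and this is not the project's actual segmentation or autoencoder code.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier, MLPRegressor

rng = np.random.RandomState(0)
X = rng.randn(120, 300)                      # 120 odor samples, 300 stand-in signal features
y = rng.randint(0, 3, size=120)              # 3 toy disease classes

# Autoencoder: a single-hidden-layer MLP trained to reconstruct its own input.
ae = MLPRegressor(hidden_layer_sizes=(16,), activation="relu",
                  max_iter=2000, random_state=0)
ae.fit(X, X)

# Encoder step: apply the learned input-to-hidden mapping to get compressed features.
hidden = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])   # ReLU hidden activations

# Classify odors from the compressed representation (MLP gave the best results above).
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
scores = cross_val_score(clf, hidden, y, cv=5)
print("mean cross-validated accuracy on toy data:", scores.mean())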
|