131
Τμηματοποίηση εικόνων υφής με χρήση πολυφασματικής ανάλυσης και ελάττωσης διαστάσεων / Texture image segmentation using multispectral analysis and dimensionality reduction. Θεοδωρακόπουλος, Ηλίας (16 June 2010)
Texture segmentation is the process of partitioning an image into multiple segments (regions) based on their texture, with many applications in computer vision, image retrieval, robotics, satellite imagery, etc. The objective of this thesis is to investigate the ability of non-linear dimensionality reduction algorithms, and especially of the Laplacian Eigenmaps (LE) algorithm, to produce an efficient representation of data derived from multi-spectral image analysis with Gabor filters, for solving the texture segmentation problem. For this purpose, we introduce a new supervised texture segmentation method, which exploits a low-dimensional representation of the feature vectors together with well-known clustering algorithms, such as Fuzzy C-means and K-means, to produce the final segmentation. The effectiveness of this method is compared with that of similar methods proposed in the literature, which use the initial high-dimensional representation of the feature vectors. Experiments were performed on the Brodatz texture database. During the evaluation stage, the Rand index was used as a similarity measure between each produced segmentation and the corresponding ground-truth segmentation.
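A minimal sketch of the kind of pipeline this abstract describes, assuming a 2D grayscale image and its ground-truth label map: Gabor filter responses as per-pixel texture features, Laplacian Eigenmaps (scikit-learn's SpectralEmbedding) for dimensionality reduction, K-means for clustering, and the Rand index for evaluation. Filter frequencies, orientations, and all other settings are illustrative, and the supervised elements of the thesis's method are omitted.

```python
import numpy as np
from skimage.filters import gabor
from sklearn.cluster import KMeans
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics import rand_score

def segment_texture(image, ground_truth, n_segments=2, n_pixels=2000):
    # Multi-spectral analysis: stack Gabor magnitude responses per pixel.
    feats = []
    for frequency in (0.1, 0.2, 0.3):
        for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
            real, imag = gabor(image, frequency=frequency, theta=theta)
            feats.append(np.hypot(real, imag).ravel())
    X = np.stack(feats, axis=1)                    # (pixels, filters)

    # Subsample pixels: the spectral embedding is costly in memory.
    idx = np.random.default_rng(0).choice(len(X), n_pixels, replace=False)

    # Laplacian Eigenmaps representation, then K-means clustering.
    low = SpectralEmbedding(n_components=3).fit_transform(X[idx])
    labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(low)

    # Rand index against the ground-truth segmentation.
    return rand_score(ground_truth.ravel()[idx], labels)
```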
132
Développement d'outils statistiques pour l'analyse de données transcriptomiques par les réseaux de co-expression de gènes / A systemic approach to statistical analysis of transcriptomic data through co-expression network analysis. Brunet, Anne-Claire (17 June 2016)
Today, new biotechnologies offer the opportunity to collect a large variety and volume of biological data (genomic, proteomic, metagenomic...), thus opening up new avenues for research into biological processes. In this thesis, we are specifically interested in transcriptomic data, which characterize the activity or expression level of several tens of thousands of genes in a given cell. The aim was to propose statistical tools suited to analyzing such high-dimensional data (n << p), collected on samples whose size is very small relative to the very large number of variables (here, gene expression variables). The first part of the thesis is devoted to supervised learning methods, such as Breiman's random forests and penalized regression models, used in the high-dimensional setting to select the genes (expression variables) that are most relevant to the pathology of interest. We discuss the limits of these methods for selecting genes that are relevant not only statistically but also biologically, in particular when selecting within groups of highly correlated variables, that is, within groups of co-expressed genes. Classical supervised learning methods consider that each gene can act in isolation in the model, which is rarely realistic: an observable biological trait results from a set of reactions within a complex system in which genes interact with one another, and genes involved in the same biological function tend to be co-expressed (correlated expression). In the second part, we therefore turn to gene co-expression networks, in which two genes are linked if they are co-expressed. More precisely, we seek to identify communities of genes on these networks, that is, groups of co-expressed genes, and then to select the communities that are most relevant to the pathology, together with the "key genes" of these communities. This aids biological interpretation, since a biological function can often be associated with a community of genes. We propose an original and efficient approach that simultaneously addresses the modeling of the gene co-expression network and the detection of gene communities on the network. We demonstrate the performance of our approach by comparing it with existing, popular methods for analyzing gene co-expression networks (WGCNA and spectral methods). Finally, through the analysis of a real data set, the last part of the thesis shows that the proposed approach yields biologically convincing results that are more amenable to interpretation and more robust than those obtained with classical supervised learning methods.
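The following is a generic baseline for this kind of analysis, not the thesis's original joint modeling-and-detection approach: a co-expression network built by thresholding pairwise expression correlations, with communities found by modularity maximization and a simple degree-based notion of "key genes". The 0.7 threshold is an arbitrary illustrative choice.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def coexpression_communities(expr, gene_names, threshold=0.7):
    """expr: (n_samples, n_genes) expression matrix."""
    corr = np.corrcoef(expr, rowvar=False)         # gene-by-gene correlations
    G = nx.Graph()
    G.add_nodes_from(gene_names)
    n = len(gene_names)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:       # co-expressed genes are linked
                G.add_edge(gene_names[i], gene_names[j])
    communities = list(greedy_modularity_communities(G))
    # Crude "key gene" per community: the most connected member.
    key_genes = [max(c, key=G.degree) for c in communities if len(c) > 1]
    return communities, key_genes
```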
133
Statistical and Dynamical Modeling of Riemannian Trajectories with Application to Human Movement Analysis (January 2016)
The data explosion of the past decade is in part due to the widespread use of rich sensors that measure various physical phenomena: gyroscopes that measure orientation in phones and fitness devices, the Microsoft Kinect, which measures depth information, etc. A typical application requires inferring the underlying physical phenomenon from data, which is done using machine learning. A fundamental assumption in training models is that the data are Euclidean, i.e. the metric is the standard Euclidean distance governed by the L2 norm. However, in many cases this assumption is violated, as when the data lie on non-Euclidean spaces such as Riemannian manifolds. While the underlying geometry accounts for the non-linearity, accurate analysis of human activity also requires temporal information to be taken into account. Human movement has a natural interpretation as a trajectory on the underlying feature manifold, as it evolves smoothly in time. A commonly occurring theme in many emerging problems is the need to represent, compare, and manipulate such trajectories in a manner that respects the geometric constraints. This dissertation is a comprehensive treatise on modeling Riemannian trajectories to understand and exploit their statistical and dynamical properties. Such properties allow us to formulate novel representations for Riemannian trajectories. For example, the physical constraints on human movement are rarely considered, which results in an unnecessarily large space of features, making search, classification and other applications more complicated. Exploiting statistical properties can help us understand the true space of such trajectories. In applications such as stroke rehabilitation, where there is a need to differentiate between very similar kinds of movement, dynamical properties can be much more effective. In this regard, we propose a generalization of the Lyapunov exponent to Riemannian manifolds and show its effectiveness for human activity analysis. The theory developed in this thesis naturally leads to several benefits in areas such as data mining, compression, dimensionality reduction, classification, and regression. / Dissertation/Thesis / Doctoral Dissertation Electrical Engineering 2016
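A toy illustration of the central geometric idea, assuming points evolving smoothly on the unit sphere compared with the geodesic rather than the Euclidean distance; the dissertation's trajectory representations and the Riemannian Lyapunov exponent are far richer than this sketch.

```python
import numpy as np

def sphere_geodesic(p, q):
    """Geodesic (arc-length) distance between unit vectors on the sphere."""
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))

def trajectory_distance(traj_a, traj_b):
    """Mean pointwise geodesic distance between two equal-length
    trajectories, each of shape (n_steps, 3) with unit-norm rows."""
    return np.mean([sphere_geodesic(p, q) for p, q in zip(traj_a, traj_b)])

# Two smooth toy trajectories on S^2 (both rows have unit norm).
t = np.linspace(0, np.pi / 2, 50)
traj_a = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)
traj_b = np.stack([np.cos(t), np.sin(t) * np.cos(0.1),
                   np.sin(t) * np.sin(0.1)], axis=1)
print(trajectory_distance(traj_a, traj_b))
```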
134
Distinct Feature Learning and Nonlinear Variation Pattern Discovery Using Regularized Autoencoders (January 2016)
Feature learning and the discovery of nonlinear variation patterns in high-dimensional data are important tasks in many problem domains, such as imaging, streaming data from sensors, and manufacturing. This dissertation presents several methods for learning and visualizing nonlinear variation in high-dimensional data. First, an automated method for discovering nonlinear variation patterns using deep learning autoencoders is proposed. The approach provides a functional mapping from a low-dimensional representation to the original spatially-dense data that is both interpretable and efficient with respect to preserving information. Experimental results indicate that deep learning autoencoders outperform manifold learning and principal component analysis in reproducing the original data from the learned variation sources.
A key issue in using autoencoders for nonlinear variation pattern discovery is to encourage the learning of solutions where each feature represents a unique variation source, which we define as distinct features. This problem of learning distinct features is also referred to as disentangling factors of variation in the representation learning literature. The remainder of this dissertation highlights and provides solutions for this important problem.
An alternating autoencoder training method is presented and a new measure motivated by orthogonal loadings in linear models is proposed to quantify feature distinctness in the nonlinear models. Simulated point cloud data and handwritten digit images illustrate that standard training methods for autoencoders consistently mix the true variation sources in the learned low-dimensional representation, whereas the alternating method produces solutions with more distinct patterns.
Finally, a new regularization method for learning distinct nonlinear features using autoencoders is proposed. Motivated in part by the properties of linear solutions, a series of learning constraints are implemented via regularization penalties during stochastic gradient descent training. These include the orthogonality of tangent vectors to the manifold, the correlation between learned features, and the distributions of the learned features. This regularized learning approach yields low-dimensional representations which can be better interpreted and used to identify the true sources of variation impacting a high-dimensional feature space. Experimental results demonstrate the effectiveness of this method for nonlinear variation pattern discovery on both simulated and real data sets. / Dissertation/Thesis / Doctoral Dissertation Industrial Engineering 2016
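A hedged sketch of one of the regularization ideas named above, penalizing correlation between learned features during training; the architecture, penalty weight, and training loop are illustrative assumptions, not the dissertation's exact formulation.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in=50, d_latent=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 32), nn.Tanh(),
                                 nn.Linear(32, d_latent))
        self.dec = nn.Sequential(nn.Linear(d_latent, 32), nn.Tanh(),
                                 nn.Linear(32, d_in))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def decorrelation_penalty(z):
    # Sum of squared off-diagonal entries of the latent correlation matrix,
    # pushing each learned feature toward a distinct variation source.
    zc = z - z.mean(dim=0)
    cov = (zc.T @ zc) / (z.shape[0] - 1)
    std = torch.sqrt(torch.diag(cov)).clamp_min(1e-8)
    corr = cov / torch.outer(std, std)
    off_diag = corr - torch.diag(torch.diag(corr))
    return (off_diag ** 2).sum()

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 50)                           # stand-in for real data
for _ in range(200):
    recon, z = model(x)
    loss = nn.functional.mse_loss(recon, x) + 0.1 * decorrelation_penalty(z)
    opt.zero_grad()
    loss.backward()
    opt.step()
```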
135
3D Patch-Based Machine Learning Systems for Alzheimer’s Disease Classification via 18F-FDG PET Analysis (January 2017)
Alzheimer’s disease (AD) is a chronic neurodegenerative disease that usually starts slowly and worsens over time. It is the cause of 60% to 70% of cases of dementia. There is growing interest in identifying brain image biomarkers that help evaluate AD risk pre-symptomatically. High-dimensional non-linear pattern classification methods have been applied to structural magnetic resonance images (MRIs) and used to discriminate between clinical groups along the Alzheimer’s progression. Using fluorodeoxyglucose (FDG) positron emission tomography (PET) as the preferred imaging modality, this thesis develops two independent machine-learning-based patch analysis methods and uses them to perform six binary classification experiments across different AD diagnostic categories. Specifically, features were extracted and learned using dimensionality reduction and dictionary learning with sparse coding, by taking overlapping patches in and around the cerebral cortex and using them as features. Using AdaBoost as the classifier of choice, both methods try to utilize 18F-FDG PET as a biological marker in the early diagnosis of Alzheimer’s disease. Additionally, we investigate the contribution of rich demographic features (ApoE3, ApoE4 and Functional Activities Questionnaire (FAQ) scores) to classification. Experimental results on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate the effectiveness of both proposed systems. The use of 18F-FDG PET may offer a new sensitive biomarker and enrich the brain imaging analysis toolset for studying the diagnosis and prognosis of AD. / Dissertation/Thesis / Thesis Defense Presentation / Masters Thesis Computer Science 2017
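A simplified sketch of such a patch-based pipeline, assuming synthetic stand-in volumes: overlapping 3D patches are flattened, reduced with PCA, pooled into one feature vector per scan, and classified with AdaBoost. The thesis additionally uses dictionary learning and sparse coding and restricts patches to the cerebral cortex; none of that is reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier

def extract_patches_3d(volume, size=8, stride=4):
    # Overlapping cubic patches, flattened to vectors.
    patches = []
    nx, ny, nz = volume.shape
    for i in range(0, nx - size + 1, stride):
        for j in range(0, ny - size + 1, stride):
            for k in range(0, nz - size + 1, stride):
                patches.append(volume[i:i+size, j:j+size, k:k+size].ravel())
    return np.array(patches)

def volume_features(volume, pca):
    # One feature vector per scan: the mean of its PCA-reduced patches.
    return pca.transform(extract_patches_3d(volume)).mean(axis=0)

# Stand-ins for PET volumes and diagnostic labels (0 = control, 1 = AD).
rng = np.random.default_rng(0)
volumes = [rng.random((32, 32, 32)) for _ in range(20)]
labels = np.array([0, 1] * 10)

pca = PCA(n_components=16).fit(np.vstack([extract_patches_3d(v) for v in volumes]))
X = np.array([volume_features(v, pca) for v in volumes])
clf = AdaBoostClassifier(n_estimators=100).fit(X, labels)
```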
136
Investigating Gene-Gene and Gene-Environment Interactions in the Association Between Overnutrition and Obesity-Related Phenotypes. Tessier, François (January 2017)
Introduction – Animal studies have suggested that the NFKB1, SOCS3 and IKBKB genes could be involved in the association between overnutrition and obesity. This study aims to investigate interactions involving these genes and nutrition that affect obesity-related phenotypes.
Methods – We used multifactor dimensionality reduction (MDR) and penalized logistic regression (PLR) to better detect gene/environment interactions in data from the Toronto Nutrigenomics and Health Study (n=1639) using dichotomized body mass index (BMI) and waist circumference (WC) as obesity-related phenotypes. Exposure variables included genotypes on 54 single nucleotide polymorphisms, dietary factors and ethnicity.
Results – MDR identified interactions between SOCS3 rs6501199 and rs4969172, and IKBKB rs3747811 affecting BMI in whites; SOCS3 rs6501199 and NFKB1 rs1609798 affecting WC in whites; and SOCS3 rs4436839 and IKBKB rs3747811 affecting WC in South Asians. PLR found a main effect of SOCS3 rs12944581 on BMI among South Asians.
Conclusion – MDR and PLR gave different results, but support some results from previous studies.
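A sketch of the PLR side of such an analysis, on hypothetical stand-in data: L1-penalized logistic regression over genotype and dietary predictors with a dichotomized phenotype as outcome. MDR's exhaustive interaction search is not shown, and all variable encodings here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
snps = rng.integers(0, 3, size=(n, 54))   # 54 SNPs coded 0/1/2 (minor-allele count)
diet = rng.normal(size=(n, 5))            # dietary factor scores
X = np.hstack([snps, diet])
y = rng.integers(0, 2, size=n)            # dichotomized BMI (stand-in outcome)

plr = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
selected = np.flatnonzero(plr.coef_[0])   # predictors surviving the penalty
print(f"{selected.size} predictors retained")
```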
137
Metric Learning via Linear Embeddings for Human Motion Recognition. Kong, ByoungDoo (18 December 2020)
We consider the application of few-shot learning (FSL) and dimensionality reduction to the problem of human motion recognition (HMR). Human motion has unique characteristics, such as its dynamic and high-dimensional nature. Recent research on human motion recognition uses deep neural networks with multiple layers, and, most importantly, large datasets must be collected to use such networks to analyze human motion. This process is both time-consuming and expensive, since a large motion-capture database must be collected and labeled. Despite significant progress, state-of-the-art algorithms still misclassify actions, in part because of the difficulty of obtaining large-scale labeled human motion datasets. To address these limitations, we use metric-based FSL methods that work with small amounts of data, in conjunction with dimensionality reduction. We also propose a modified dimensionality reduction scheme based on the preservation of secants, tailored to arbitrary useful distances such as the geodesic distance learned by ISOMAP. We provide multiple experimental results that demonstrate improvements in human motion classification.
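The sketch below illustrates the two ingredients being combined, geodesic-aware dimensionality reduction (ISOMAP) and a metric-based few-shot classifier (nearest class centroid), on stand-in feature vectors; the secant-preserving projection proposed in the thesis is not reproduced.

```python
import numpy as np
from sklearn.manifold import Isomap

def few_shot_predict(support_x, support_y, query_x, n_components=5):
    # Embed support and query sets together so distances respect the
    # geodesic structure ISOMAP estimates from a neighborhood graph.
    low = Isomap(n_components=n_components).fit_transform(
        np.vstack([support_x, query_x]))
    sup, qry = low[:len(support_x)], low[len(support_x):]
    classes = np.unique(support_y)
    centroids = np.stack([sup[support_y == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(qry[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]               # nearest class centroid

# Stand-in motion descriptors: 3 classes, 5 labeled examples each.
rng = np.random.default_rng(1)
offsets = rng.normal(scale=5.0, size=(3, 60))
support_y = np.repeat(np.arange(3), 5)
support_x = offsets[support_y] + rng.normal(size=(15, 60))
query_x = offsets + rng.normal(size=(3, 60))
print(few_shot_predict(support_x, support_y, query_x))
```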
138
Efektivní tagování fotografií / Efficient Image Tagging. Procházka, Václav (January 2013)
This thesis investigates efficient approaches to manual image tagging. It specifically focuses on organising images into clusters depending on their content, and thus on simplifying the selection of similar photos; such selections can then be efficiently tagged with common tags. The thesis surveys known techniques for visualising image collections according to image content, together with dimensionality reduction methods, and evaluates the most suitable ones. It proposes a novel method for presenting image collections on 2D displays which combines a timeline with similarity grouping (Timeline projection). This method uses t-Distributed Stochastic Neighbour Embedding (t-SNE) to optimally project groupings in high-dimensional feature spaces onto the low-dimensional screen. Various modifications of t-SNE and ways to combine it with the timeline are discussed; the chosen combination is implemented as a web interface and qualitatively evaluated in a user study. Possible directions for further research on the subject are suggested.
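A naive rendering of the Timeline projection idea, assuming precomputed content descriptors: t-SNE supplies a similarity-based layout whose horizontal axis is then blended with each photo's timestamp, so similar images still group together while time flows left to right. The blending rule is an invented simplification, not the thesis's actual modification of t-SNE.

```python
import numpy as np
from sklearn.manifold import TSNE

def timeline_projection(features, timestamps, alpha=0.8):
    """features: (n_images, d) content descriptors; timestamps: (n_images,)."""
    xy = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(features)
    # Normalize both layouts to [0, 1] before mixing.
    xy = (xy - xy.min(axis=0)) / (np.ptp(xy, axis=0) + 1e-12)
    t = (timestamps - timestamps.min()) / (np.ptp(timestamps) + 1e-12)
    x = alpha * t + (1 - alpha) * xy[:, 0]         # time dominates the x axis
    return np.column_stack([x, xy[:, 1]])

# Toy collection: 40 images with random descriptors and sorted timestamps.
rng = np.random.default_rng(0)
coords = timeline_projection(rng.normal(size=(40, 64)),
                             np.sort(rng.uniform(0, 1e9, size=40)))
```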
139
Optimizing Deep Neural Networks for Classification of Short Texts. Pettersson, Fredrik (January 2019)
This master's thesis investigates how a state-of-the-art (SOTA) deep neural network (NN) model can be created for a specific natural language processing (NLP) dataset, the effects of using different dimensionality reduction techniques on common pre-trained word embeddings, and how well such a model generalizes to a secondary dataset. The research is motivated by two factors. One is that the construction of a machine learning (ML) text classification (TC) model is typically done around a specific dataset and often requires a lot of manual intervention, so it is hard to know exactly which procedures to implement for a given dataset and how the result will be affected. The other is that, if the dimensionality of pre-trained embedding vectors can be lowered without losing accuracy, thus saving execution time, other techniques can be applied in the time saved to achieve even higher accuracy. A handful of deep neural network architectures are used, namely a convolutional neural network (CNN), a long short-term memory network (LSTM) and a bidirectional LSTM (Bi-LSTM). These architectures are combined with four different word embeddings: GoogleNews-vectors-negative300, glove.840B.300d, paragram_300_sl999 and wiki-news-300d-1M. Three main experiments are conducted. In the first, a top-performing TC model is created for a recent NLP competition held at Kaggle.com, and each implemented procedure is benchmarked by how it affects the model's accuracy and execution time. In the second, principal component analysis (PCA) and random projection (RP) are applied to the pre-trained word embeddings used in the top-performing model, to investigate how accuracy and execution time are affected when creating lower-dimensional embedding vectors. In the third, the same model is benchmarked on a separate dataset (Sentiment140) to investigate how well it generalizes to other data and how each implemented procedure affects accuracy compared to the original dataset. The first experiment results in a bidirectional LSTM model and a concatenation of three embeddings: glove, paragram and wiki-news. The model gives predictions with an F1 score of 71%, good enough to reach 9th place out of 1,401 participating teams in the competition. In the second experiment, PCA improves execution time by 13% while lowering the dimensionality of the embeddings by 66% and losing only half a percentage point of F1 accuracy; RP gives a constant accuracy of 66-67% regardless of the projected dimensions, compared to over 70% with PCA. In the third experiment, the model gains around 12% accuracy from the initial to the final benchmarks, compared to 19% on the competition dataset. The best accuracy achieved on the Sentiment140 dataset is 86%, higher than the 71% achieved on the Quora dataset.
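A sketch of the embedding-compression step from the second experiment: PCA fitted on a pretrained embedding matrix, replacing each 300-dimensional word vector with a 100-dimensional one (the roughly 66% reduction quoted above) before the vectors are fed to the network's embedding layer. The random matrix below is a stand-in for a real file such as glove.840B.300d.

```python
import numpy as np
from sklearn.decomposition import PCA

def compress_embeddings(embedding_matrix, n_components=100):
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(embedding_matrix)
    print(f"variance retained: {pca.explained_variance_ratio_.sum():.1%}")
    return reduced.astype(np.float32)          # (vocab, 100) instead of (vocab, 300)

# Random stand-in for a real (vocab, 300) pretrained embedding matrix.
embedding_matrix = np.random.default_rng(0).normal(size=(5000, 300))
reduced = compress_embeddings(embedding_matrix)
# `reduced` would then initialize the network's embedding layer exactly as
# the full-size vectors would.
```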
140
Mera sličnosti između modela Gausovih smeša zasnovana na transformaciji prostora parametara / A similarity measure between Gaussian mixture models based on a transformation of the parameter space. Krstanović, Lidija (25 September 2017)
This thesis studies the possibility that the parameters of the Gaussian components of a given Gaussian mixture model (GMM) lie approximately on a lower-dimensional surface embedded in the cone of positive definite matrices. For that case, we propose a novel, much more efficient similarity measure between GMMs, obtained by an LPP-like projection of the component parameters from the high-dimensional original parameter space into a space of significantly lower dimensionality. Finding the distance between two GMMs in the original space is thus reduced to finding the distance between two sets of lower-dimensional Euclidean vectors, weighted by the corresponding component weights. The proposed measure is suitable for applications that require high-dimensional feature spaces and/or a large overall number of Gaussian components. The methodology is validated on both synthetic and real experimental data.
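A simplified stand-in for such a measure: each Gaussian component is embedded as a Euclidean vector (here just its mean concatenated with the log-diagonal of its covariance, reduced by a joint PCA rather than the thesis's LPP-like projection), and two GMMs are compared by an optimal weighted matching of their projected components.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def component_vectors(gmm):
    # One vector per component: mean plus log of the covariance diagonal.
    log_diags = np.log([np.diagonal(c) for c in gmm.covariances_])
    return np.hstack([gmm.means_, log_diags])

def gmm_distance(gmm_a, gmm_b, n_dims=2):
    va, vb = component_vectors(gmm_a), component_vectors(gmm_b)
    pca = PCA(n_components=n_dims).fit(np.vstack([va, vb]))
    la, lb = pca.transform(va), pca.transform(vb)
    cost = np.linalg.norm(la[:, None, :] - lb[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)       # optimal component matching
    w = (gmm_a.weights_[rows] + gmm_b.weights_[cols]) / 2
    return float(np.sum(w * cost[rows, cols]))

rng = np.random.default_rng(0)
gmm_a = GaussianMixture(n_components=3).fit(rng.normal(size=(300, 10)))
gmm_b = GaussianMixture(n_components=3).fit(rng.normal(loc=0.5, size=(300, 10)))
print(gmm_distance(gmm_a, gmm_b))
```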