181 |
Classify part of day and snow on the load of timber stacks : A comparative study between partitional clustering and competitive learning. Nordqvist, My. January 2021.
In today's society, companies are trying to find ways to utilize all the data they have, which contains valuable information and insights for making better decisions. This includes data used to keep track of the timber that flows between forest and industry. The growth of Artificial Intelligence (AI) and Machine Learning (ML) has enabled the development of ML models that automate the measurement of timber on timber trucks, based on images. However, to improve the results there is a need to extract information from unlabeled images in order to determine weather and lighting conditions. The objective of this study is to perform an extensive investigation into classifying unlabeled images into the categories daylight, darkness, and snow on the load. A comparative study between partitional clustering and competitive learning is conducted to investigate which method gives the best results in terms of different clustering performance metrics. It also examines how dimensionality reduction affects the outcome. The algorithms K-means and the Kohonen Self-Organizing Map (SOM) are selected for the clustering. Each model is investigated with respect to the number of clusters, size of dataset, clustering time, clustering performance, and manual samples from each cluster. The results indicate a noticeable clustering performance discrepancy between the algorithms concerning the number of clusters, dataset size, and manual samples. The use of dimensionality reduction led to shorter clustering time but slightly worse clustering performance. The evaluation results further show that the clustering time of the Kohonen SOM is significantly higher than that of K-means.
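A minimal sketch of the kind of comparison described in this abstract, assuming image feature vectors have already been extracted; the arrays, cluster counts and the third-party MiniSom package are illustrative choices, not details taken from the thesis:

```python
# Hedged sketch: K-means vs. a Kohonen SOM on placeholder image features,
# with optional PCA dimensionality reduction beforehand.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from minisom import MiniSom  # third-party SOM implementation (assumption)

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))  # placeholder for extracted image features

# Optional dimensionality reduction; the thesis reports faster clustering but
# slightly worse scores when such a step is used.
reduced = PCA(n_components=50, random_state=0).fit_transform(features)

# Partitional clustering with K-means.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# Competitive learning with a 1x3 SOM: each output node acts as one cluster.
som = MiniSom(1, 3, reduced.shape[1], sigma=0.5, learning_rate=0.5, random_seed=0)
som.train(reduced, num_iteration=5000)
som_labels = np.array([som.winner(x)[1] for x in reduced])

print("K-means silhouette:", silhouette_score(reduced, kmeans_labels))
print("SOM silhouette:    ", silhouette_score(reduced, som_labels))
```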
|
182 |
Multidimensionality of the models and the data in the side-channel domain / Multidimensionnalité des modèles et des données dans le domaine des canaux auxiliaires. Marion, Damien. 05 December 2018.
Since the publication in 1999 of the seminal paper by Paul C. Kocher, Joshua Jaffe and Benjamin Jun, entitled "Differential Power Analysis", side-channel attacks have proved to be an efficient way of attacking cryptographic algorithms. Indeed, it has been shown that information extracted from side channels such as execution time, power consumption or electromagnetic emanations can be used to recover secret keys. In this context, this thesis first treats the problem of dimensionality reduction. Over the past twenty years, the complexity and the size of the data extracted from side channels have kept growing; reducing the dimension of these data shortens attack time and increases attack efficiency. The proposed dimension-reduction methods apply to complex leakage models and to data of any dimension. Second, a software leakage assessment methodology is proposed; it is based on the analysis of all the data manipulated during the execution of the evaluated software. The methodology provides features that speed up and increase the efficiency of the analysis, especially in the context of evaluating white-box cryptography implementations.
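A hedged sketch of the dimensionality-reduction step discussed above, using plain PCA on synthetic traces before a simple profiled classifier; the trace sizes, labels and the PCA/LDA combination are assumptions for illustration, not the methods developed in the thesis:

```python
# Hedged sketch: reduce long side-channel traces with PCA, then fit a simple
# profiled classifier on the much smaller vectors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
traces = rng.normal(size=(1000, 5000))      # raw power traces (placeholder)
labels = rng.integers(0, 9, size=1000)      # e.g. Hamming weight of an intermediate value

# Project the traces onto a few principal directions; the attack then runs on
# 20-dimensional vectors instead of 5000-sample traces.
pca = PCA(n_components=20).fit(traces)
reduced = pca.transform(traces)

model = LinearDiscriminantAnalysis().fit(reduced, labels)
print("training accuracy on reduced traces:", model.score(reduced, labels))
```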
|
183 |
Efficient learning on high-dimensional operational data. Zhang, Hongyi. January 2019.
In a networked system, operational data collected by sensors or extracted from system logs can be used for target performance prediction, anomaly detection, etc. However, the number of metrics collected from a networked system is very large and can reach about 10^6 for a medium-sized system. This project analyzes and compares different unsupervised machine learning methods, such as unsupervised feature selection, Principal Component Analysis and autoencoders, which can enable efficient learning from high-dimensional data. The objective is to reduce the dimensionality of the input space while maintaining prediction performance compared with learning on the full feature space. The data used in this project is collected from a KTH testbed which runs a Video-on-Demand service and a Key-Value store under different types of traffic load. The findings confirm the manifold hypothesis, which states that real-world high-dimensional data lie on low-dimensional manifolds embedded within the high-dimensional space. In addition, this project investigates data visualization of infrastructure measurements through two-dimensional plots. The results show that data separation can be achieved by using different mapping methods.
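A hedged sketch of the experiment described above: shrink a high-dimensional metrics matrix and check that a downstream predictor keeps its accuracy. The synthetic data, Ridge regressor and component count are illustrative assumptions, not the project's actual setup:

```python
# Hedged sketch: compare prediction quality on the full metric space vs. a
# PCA-reduced space, standing in for the testbed measurements.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 1000))                              # infrastructure metrics
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=2000)    # service-level target

full_score = cross_val_score(Ridge(), X, y, cv=3).mean()
X_small = PCA(n_components=50, random_state=0).fit_transform(X)
reduced_score = cross_val_score(Ridge(), X_small, y, cv=3).mean()

print(f"R^2 on full space:    {full_score:.3f}")
print(f"R^2 on 50 components: {reduced_score:.3f}")
```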
|
184 |
Machine learning methods for genomic high-content screen data analysis applied to deduce organization of endocytic network. Nikitina, Kseniia. 13 July 2023.
High-content screens are widely used to gain insight into the mechanistic organization of biological systems. Chemical and/or genomic interferences are used to modulate the molecular machinery; light microscopy and quantitative image analysis then yield a large number of parameters describing the phenotype. However, extracting functional information from such high-content datasets (e.g. links between cellular processes or functions of unknown genes) remains challenging. This work is devoted to the analysis of a multi-parametric image-based genomic screen of endocytosis, the process whereby cells take up cargoes (signals and nutrients) and distribute them into different subcellular compartments. The complexity of the quantitative endocytic data was approached using different machine learning techniques, namely clustering methods, Bayesian networks, principal and independent component analysis, and artificial neural networks. The main goal of such an analysis is to predict possible modes of action of screened genes and also to find candidate genes that can be involved in a process of interest. The degrees of freedom of the multidimensional phenotypic space were identified from the data distributions, and the high-content data were then deconvolved into separate signals from different cellular modules (a hedged sketch of this deconvolution follows the contents listing below). Some of those basic signals (phenotypic traits) were straightforward to interpret in terms of known molecular processes; the other components gave insight into interesting directions for further research. The phenotypic profiles of perturbations of individual genes are sparse in the coordinates of the basic signals and therefore intrinsically suggest their functional roles in cellular processes. Being a very fundamental process, endocytosis is specifically modulated by a variety of different pathways in the cell; therefore, endocytic phenotyping can be used for analysis of non-endocytic modules in the cell. The proposed approach can also be generalized to the analysis of other high-content screens.

Contents
Objectives
Chapter 1 Introduction
1.1 High-content biological data
1.1.1 Different perturbation types for HCS
1.1.2 Types of observations in HTS
1.1.3 Goals and outcomes of MP HTS
1.1.4 An overview of the classical methods of analysis of biological HT- and HCS data
1.2 Machine learning for systems biology
1.2.1 Feature selection
1.2.2 Unsupervised learning
1.2.3 Supervised learning
1.2.4 Artificial neural networks
1.3 Endocytosis as a system process
1.3.1 Endocytic compartments and main players
1.3.2 Relation to other cellular processes
Chapter 2 Experimental and analytical techniques
2.1 Experimental methods
2.1.1 RNA interference
2.1.2 Quantitative multiparametric image analysis
2.2 Detailed description of the endocytic HCS dataset
2.2.1 Basic properties of the endocytic dataset
2.2.2 Control subset of genes
2.3 Machine learning methods
2.3.1 Latent variables models
2.3.2 Clustering
2.3.3 Bayesian networks
2.3.4 Neural networks
Chapter 3 Results
3.1 Selection of labeled data for training and validation based on KEGG information about gene pathways
3.2 Clustering of genes
3.2.1 Comparison of clustering techniques on control dataset
3.2.2 Clustering results
3.3 Independent components as basic phenotypes
3.3.1 Algorithm for identification of the best number of independent components
3.3.2 Application of ICA on the full dataset and on separate assays of the screen
3.3.3 Gene annotation based on revealed phenotypes
3.3.4 Searching for genes with target function
3.4 Bayesian network on endocytic parameters
3.4.1 Prediction of pathway based on parameters values using Naïve Bayesian Classifier
3.4.2 General Bayesian Networks
3.5 Neural networks
3.5.1 Autoencoders as nonlinear ICA
3.5.2 siRNA sequence motif discovery with deep NN
3.6 Biological results
3.6.1 Rab11 ZNF-specific phenotype found by ICA
3.6.2 Structure of BN revealed dependency between endocytosis and cell adhesion
Chapter 4 Discussion
4.1 Machine learning approaches for discovery of phenotypic patterns
4.1.1 Functional annotation of unknown genes based on phenotypic profiles
4.1.2 Candidate genes search
4.2 Adaptation to other HCS data and generalization
Chapter 5 Outlook and future perspectives
5.1 Handling sequence-dependent off-target effects with neural networks
5.2 Transition between machine learning and systems biology models
Acknowledgements
References
Appendix
A.1 Full list of cellular and endocytic parameters
A.2 Description of independent components of the full dataset
A.3 Description of independent components extracted from separate assays of the HCS
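As referenced in the abstract above, a hedged sketch of deconvolving a multi-parametric phenotype matrix into a few "basic phenotypes" with independent component analysis; the matrix sizes and the use of scikit-learn's FastICA are illustrative assumptions, not details of the screen:

```python
# Hedged sketch: ICA recovers a small number of independent "basic phenotypes"
# from a synthetic gene-by-parameter matrix.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(7)
n_genes, n_parameters = 2000, 40
sources = rng.laplace(size=(n_genes, 5))            # 5 hidden basic phenotypes
mixing = rng.normal(size=(5, n_parameters))
phenotypes = sources @ mixing + 0.1 * rng.normal(size=(n_genes, n_parameters))

ica = FastICA(n_components=5, random_state=0)
profiles = ica.fit_transform(phenotypes)            # sparse per-gene profiles
print("per-gene profile matrix:", profiles.shape)   # (2000, 5)
```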
|
185 |
Critical Analysis of Dimensionality Reduction Techniques and Statistical Microstructural Descriptors for Mesoscale Variability Quantification. Galbincea, Nicholas D. January 2017.
No description available.
|
186 |
Canonical Correlation and Clustering for High Dimensional Data. Ouyang, Qing. January 2019.
Multi-view datasets arise naturally in statistical genetics when the genetic and trait profile of an individual is portrayed by two feature vectors. A motivating problem concerning the Skin Intrinsic Fluorescence (SIF) study on the Diabetes Control and Complications Trial (DCCT) subjects is presented. A widely applied quantitative method to explore the correlation structure between two domains of a multi-view dataset is Canonical Correlation Analysis (CCA), which seeks the canonical loading vectors such that the transformed canonical covariates are maximally correlated. In the high-dimensional case, regularization of the dataset is required before CCA can be applied. Furthermore, the nature of genetic research suggests that sparse output is more desirable. In this thesis, two regularized CCA (rCCA) methods and a sparse CCA (sCCA) method are presented. When correlation sub-structure exists, a stand-alone CCA method will not perform well. To tackle this limitation, a mixture of local CCA models can be employed. In this thesis, I review a correlation clustering algorithm proposed by Fern, Brodley and Friedl (2005), which seeks to group subjects into clusters such that features are identically correlated within each cluster. An evaluation study is performed to assess the effectiveness of CCA and correlation clustering algorithms using artificial multi-view datasets. Both sCCA and sCCA-based correlation clustering exhibited superior performance compared to rCCA and rCCA-based correlation clustering. The sCCA and sCCA-clustering methods are applied to the multi-view dataset consisting of PrediXcan-imputed gene expression and SIF measurements of DCCT subjects. The stand-alone sparse CCA method identified 193 among 11538 genes as being correlated with SIF#7. Further investigation of these 193 genes with simple linear regression and t-tests revealed that only two genes, ENSG00000100281.9 and ENSG00000112787.8, were significant in association with SIF#7. No plausible clustering scheme was detected by the sCCA-based correlation clustering method. / Thesis / Master of Science (MSc)
|
187 |
Development of Hybrid Optimization Techniques of Mechanical Components Employing the Cartesian Grid Finite Element Method. Muñoz Pellicer, David. 15 February 2024.
This thesis explores innovative approaches for structural optimization, encompassing a variety of optimization algorithms commonly used in the field. It focuses specifically on shape optimization (SO) and topology optimization (TO). The first contribution revolves around guaranteeing and maintaining a desired level of accuracy throughout the TO process and in the proposed solution. By establishing confidence in the components suggested by the TO algorithm, our attention can then shift to the subsequent contribution.
The second contribution of this thesis aims to establish effective communication between TO and SO algorithms. To achieve this, our goal is to directly convert the optimal material distribution proposed by the TO algorithm into geometry. Subsequently, we optimize the geometry using SO algorithms. Facilitating seamless communication between these two algorithms presents a non-trivial challenge, which we address by proposing a machine learning-based methodology. This approach seeks to extract a reduced number of geometric modes that can serve as a parameterization for the geometry, enabling further optimization by SO algorithms.
Lastly, the third contribution builds upon the previous idea, taking it a step forward. The proposed methodology aims to derive new components through knowledge-based approaches instead of relying solely on physics-based TO processes. We argue that this knowledge can be acquired from the historical designs employed by a given company as they retain invaluable immaterial know-how. This methodology also relies on machine learning algorithms, but we also consider techniques for analyzing high-dimensional data and more suitable interpolation strategies. / The authors gratefully acknowledge the financial support of Conselleria d’Educació, Investigació, Cultura i Esport, Generalitat Valenciana, project Prometeo/2016/007, Prometeo/2021/046 and CIAICO/2021/226. Ministerio de Economía, Industria y Competitividad project DPI2017-89816-R and Ministerio de Educación FPU16/07121. / Muñoz Pellicer, D. (2024). Development of Hybrid Optimization Techniques of Mechanical Components Employing the Cartesian Grid Finite Element Method [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/202661 / Compendio
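A hedged sketch of the mode-extraction idea from the second contribution: a PCA/SVD over a family of existing designs yields a few geometric modes that a shape optimizer can then tune. All array names and sizes are illustrative, not taken from the thesis:

```python
# Hedged sketch: extract "geometric modes" from flattened design vectors so a
# shape optimizer works on a handful of mode coefficients instead of thousands
# of nodal coordinates.
import numpy as np

rng = np.random.default_rng(4)
n_designs, n_nodes = 60, 3000
designs = rng.normal(size=(n_designs, n_nodes))    # flattened boundary coordinates

mean_shape = designs.mean(axis=0)
U, S, Vt = np.linalg.svd(designs - mean_shape, full_matrices=False)
modes = Vt[:5]                                     # first 5 geometric modes

# Any new candidate geometry is parameterized by 5 coefficients.
coeffs = rng.normal(size=5)
candidate = mean_shape + coeffs @ modes
print("candidate shape vector:", candidate.shape)
print("variance captured by 5 modes:", (S[:5] ** 2).sum() / (S ** 2).sum())
```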
|
188 |
Étude et conception d'un système automatisé de contrôle d'aspect des pièces optiques basé sur des techniques connexionnistes / Investigation and design of an automatic system for optical devices' defects detection and diagnosis based on connexionist approach. Voiry, Matthieu. 15 July 2008.
In various industrial fields, the problem of diagnosis is of great interest. For example, checking surface imperfections on an optical device is necessary to guarantee its operational performance. The conventional control method, based on visual inspection by a human expert, suffers from limitations which become critical for some high-performance components. In this context, this thesis deals with the design of an automatic system able to carry out the diagnosis of appearance flaws. To meet the time constraints, the suggested solution uses two sensors working at different scales. We present one of them, based on Nomarski microscopy, together with the image processing methods which allow, from the data it provides, defects to be detected and roughness to be determined in a robust and repeatable way. The development of an operational prototype, able to inspect small optical components, validates the proposed techniques. The final diagnosis also requires a classification phase: if the permanent defects are detected, so are many "false" defects (dust, cleaning marks, etc.). This complex problem is solved by an MLP artificial neural network using an invariant description of the defects. This representation, resulting from the Fourier-Mellin transform, is a high-dimensional vector, which raises problems linked to the "curse of dimensionality". In order to limit these harmful effects, various dimensionality reduction techniques (Self-Organizing Map, Curvilinear Component Analysis and Curvilinear Distance Analysis) are investigated. On the one hand, we show that CCA and CDA outperform SOM in terms of projection quality; on the other hand, these methods allow simpler classifiers to be used with equal performance. Finally, a modular neural network exploiting local models is developed. We propose a new scheme for decomposing classification problems, based on the concept of intrinsic dimension. The resulting groups of data of homogeneous dimensionality have a physical meaning and considerably reduce the training phase of the classifier while improving its generalization performance.
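A hedged sketch of the defect versus false-defect classification pipeline described above. Curvilinear Component and Curvilinear Distance Analysis are not available in scikit-learn, so plain PCA stands in for the reduction step; the descriptors and labels are synthetic placeholders:

```python
# Hedged sketch: reduce high-dimensional invariant descriptors, then train an
# MLP to separate permanent defects from false ones.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
descriptors = rng.normal(size=(1500, 300))   # Fourier-Mellin-style feature vectors
is_defect = rng.integers(0, 2, size=1500)    # 1 = permanent defect, 0 = false defect

reduced = PCA(n_components=20, random_state=0).fit_transform(descriptors)
X_tr, X_te, y_tr, y_te = train_test_split(reduced, is_defect, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```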
|
189 |
Towards on-line domain-independent big data learning : novel theories and applications. Malik, Zeeshan. January 2015.
Feature extraction is an extremely important pre-processing step for pattern recognition and machine learning problems. This thesis shows how features can best be extracted from data in a fully online and purely adaptive manner. The solution is given for both labeled and unlabeled datasets by presenting a number of novel on-line learning approaches. Specifically, the differential equation method for solving the generalized eigenvalue problem is used to derive a number of novel machine learning and feature extraction algorithms. The incremental eigen-solution method is used to derive a novel incremental extension of linear discriminant analysis (LDA). The proposed incremental version is further combined with an extreme learning machine (ELM), where the ELM is used as a preprocessor before learning. In this first key contribution, the dynamic random expansion characteristic of ELM is combined with the proposed incremental LDA technique and shown to offer a significant improvement in maximizing the discrimination between points in two different classes, while minimizing the distance within each class, in comparison with other standard state-of-the-art incremental and batch techniques. In the second contribution, the differential equation method for solving the generalized eigenvalue problem is used to derive a novel, purely incremental version of the slow feature analysis (SFA) algorithm, termed the generalized eigenvalue based slow feature analysis (GENEIGSFA) technique. Further, the time-series expansions of an echo state network (ESN) and radial basis functions (RBF) are used as a pre-processor before learning, and higher-order derivatives are used as a smoothing constraint on the output signal. Finally, an online extension of the generalized eigenvalue problem, derived from James Stone's criterion, is tested, evaluated and compared with the standard batch version of slow feature analysis to demonstrate its comparative effectiveness. In the third contribution, lightweight extensions of the statistical technique known as canonical correlation analysis (CCA), for both twinned and multiple data streams, are derived using the same method of solving the generalized eigenvalue problem. The proposed method is enhanced by maximizing the covariance between data streams while simultaneously maximizing the rate of change of variances within each data stream. A recurrent set of connections used by the ESN serves as a pre-processor between the inputs and the canonical projections in order to capture shared temporal information in two or more data streams. A solution to the problem of identifying a low-dimensional manifold in a high-dimensional data space is then presented in an incremental and adaptive manner. Finally, an online, locally optimized extension of Laplacian Eigenmaps is derived, termed the generalized incremental Laplacian Eigenmaps technique (GENILE). Apart from benefiting from the incremental nature of the proposed manifold-based dimensionality reduction technique, the projections produced by this method are, most of the time, shown to yield better classification accuracy than standard batch versions of these techniques on both artificial and real datasets.
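A hedged sketch of the batch formulation that the incremental methods above build on: LDA posed as the generalized eigenvalue problem S_b w = λ S_w w, solved here directly with SciPy on synthetic data (the incremental, differential-equation solvers of the thesis are not reproduced):

```python
# Hedged sketch: batch LDA via the generalized symmetric eigenproblem.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=m, size=(100, 10)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 100)

mean_all = X.mean(axis=0)
S_w = np.zeros((10, 10))   # within-class scatter
S_b = np.zeros((10, 10))   # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_w += (Xc - mc).T @ (Xc - mc)
    d = (mc - mean_all)[:, None]
    S_b += len(Xc) * (d @ d.T)

# The largest generalized eigenvectors give the discriminant directions.
vals, vecs = eigh(S_b, S_w + 1e-6 * np.eye(10))
W = vecs[:, np.argsort(vals)[::-1][:2]]
print("projected data shape:", (X @ W).shape)
```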
|
190 |
Triangular similarity metric learning : A siamese architecture approach / Apprentissage métrique de similarité triangulaire : Une approche d'architecture siamoise. Zheng, Lilei. 10 May 2016.
In many machine learning and pattern recognition tasks, there is a need for appropriate metric functions to measure pairwise distance or similarity between data, where a metric function defines a distance or similarity between each pair of elements of a set. In this thesis, we propose Triangular Similarity Metric Learning (TSML) for automatically specifying a metric from data. A TSML system is embedded in a siamese architecture consisting of two identical sub-systems sharing the same set of parameters. Each sub-system processes a single data sample, so the whole system receives a pair of data samples as input. The TSML system includes a cost function parameterizing the pairwise relationship between data and a mapping function allowing the system to learn high-level features from the training data. For the cost function, we first propose the Triangular Similarity, a novel similarity metric which is equivalent to the well-known Cosine Similarity in measuring a data pair. Based on a simplified version of the Triangular Similarity, we further develop the triangular loss function in order to perform metric learning, i.e. to increase the similarity between two vectors of the same class and to decrease the similarity between two vectors of different classes. Compared with other distance or similarity metrics, the triangular loss and its gradient naturally offer an intuitive and interesting geometrical interpretation of the metric learning objective. For the mapping function, we introduce three different options: a linear mapping realized by a simple transformation matrix, a nonlinear mapping realized by Multi-layer Perceptrons (MLP) and a deep nonlinear mapping realized by Convolutional Neural Networks (CNN). With these mapping functions, we present three different TSML systems for various applications, namely pairwise verification, object identification, dimensionality reduction and data visualization. For each application, we carry out extensive experiments on popular benchmarks and datasets to demonstrate the effectiveness of the proposed systems.
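A hedged sketch of a siamese setup trained with a cosine-similarity-based pairwise loss, standing in for the triangular loss (which the abstract describes as equivalent to cosine similarity); the network size, data and exact loss form are illustrative assumptions, not the thesis formulation:

```python
# Hedged sketch: one shared mapping processes both elements of each pair
# (siamese architecture); the loss pushes same-class pairs towards cosine +1
# and different-class pairs towards cosine -1.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
mapping = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))  # shared weights
opt = torch.optim.Adam(mapping.parameters(), lr=1e-3)

x1 = torch.randn(256, 64)                    # first element of each pair
x2 = torch.randn(256, 64)                    # second element of each pair
same = torch.randint(0, 2, (256,)).float()   # 1 = same class, 0 = different

for _ in range(100):
    opt.zero_grad()
    cos = F.cosine_similarity(mapping(x1), mapping(x2))
    target = 2.0 * same - 1.0                # +1 for similar pairs, -1 for dissimilar
    loss = ((cos - target) ** 2).mean()
    loss.backward()
    opt.step()
print("final pairwise loss:", loss.item())
```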
|