1

Similarity Search in Continuous Data with Evolving Distance Metric

Zhang, Hauyi 12 December 2018 (has links)
Similarity search is a task fundamental to many machine learning and data analytics applications, where distance metric learning plays an important role. However, modern online applications continuously produce objects with new characteristics that tend to change over time, so state-of-the-art similarity search based on distance metric learning tends to fail when deployed in such applications without taking this change into consideration. In this work, we propose a Distance Metric Learning-based Continuous Similarity Search approach (CSS for short) to account for the dynamic nature of such data. The CSS system adopts an online metric learning model that evolves the distance metric to adapt to the changing nature of continuous data without incurring large latency. To improve the accuracy of the online metric learning model, a compact labeled dataset that remains representative of the incoming data is dynamically maintained. CSS also maintains an online Locality Sensitive Hashing (LSH) index to accelerate similarity search. One, our labeled data update strategy progressively enriches the labeled data to assure continued representativeness, yet without excessively growing its size, so that the computation costs of metric learning remain bounded. Two, our continuous distance metric learning strategy ensures that each update requires only one linear-time k-NN search, in contrast to the cubic time complexity of relearning the distance metric from scratch. Three, our LSH update mechanism leverages our theoretical insight that the LSH index built for the original distance metric is equally effective in supporting similarity search under the new distance metric as long as the transform matrix learned for the new metric is invertible. This important observation allows CSS to avoid modifying the LSH index in most cases. Our experimental study using real-world public datasets and large synthetic datasets confirms that CSS improves the accuracy of classification and information retrieval tasks, and that our incremental distance metric learning strategy (and its three underlying components) achieves a speedup of three orders of magnitude over state-of-the-art methods.
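The third point rests on a standard linear-algebra fact: a learned Mahalanobis metric M = AᵀA is just the Euclidean metric after applying the linear transform A, and an invertible A is a bijection of the space. The sketch below (illustrative NumPy, not the thesis code; the matrix and points are random) verifies the identity that this argument builds on.

```python
# A minimal sketch, not the thesis implementation: it illustrates the identity
# behind metric-learning-based search -- the Mahalanobis metric M = A^T A equals
# the Euclidean metric after the linear transform A. If A is invertible, the
# transform is a bijection of the space, which is the property the abstract's
# LSH argument relies on.
import numpy as np

rng = np.random.default_rng(0)
d = 5
x, y = rng.normal(size=d), rng.normal(size=d)

A = rng.normal(size=(d, d))             # hypothetical learned transform
M = A.T @ A                             # induced Mahalanobis matrix

diff = x - y
d_mahalanobis = np.sqrt(diff @ M @ diff)          # distance under the learned metric
d_transformed = np.linalg.norm(A @ x - A @ y)     # Euclidean distance after transform

assert np.isclose(d_mahalanobis, d_transformed)
print(d_mahalanobis, d_transformed)

# Invertibility check: a non-singular A means no two distinct points collapse,
# so neighbourhood structure in the transformed space is well defined.
print("A invertible:", np.linalg.matrix_rank(A) == d)
```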
2

Sparse distance metric learning

Choy, Tze Leung January 2014 (has links)
A good distance metric can improve the accuracy of a nearest neighbour classifier. Xing et al. (2002) proposed distance metric learning to find a linear transformation of the data so that observations of different classes are better separated. For high-dimensional problems where many uninformative variables are present, it is attractive to select a sparse distance metric, both to increase predictive accuracy and to aid interpretation of the result. In this thesis, we investigate three different types of sparsity assumption for distance metric learning and show that sparse recovery is possible under each type of sparsity assumption with an appropriate choice of L1-type penalty. We show that a lasso penalty promotes learning a transformation matrix with many zero entries, a group lasso penalty recovers a transformation matrix with zero rows/columns, and a trace norm penalty allows us to learn a low-rank transformation matrix. The regularization allows us to consider a large number of covariates, and we apply the technique to an expanded set of basis functions known as a rule ensemble to allow for a more flexible fit. Finally, we illustrate an application of the metric learning problem via a document retrieval example and discuss how similarity-based information can be applied to learn a classifier.
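For reference, the three penalties mentioned above can be written down directly for a transformation matrix L, with induced metric M = LᵀL. The sketch below shows only the penalty terms; the full penalized metric-learning objective and optimiser used in the thesis are not reproduced here.

```python
# A sketch of the three sparsity-inducing penalties discussed in the abstract,
# applied to a transformation matrix L (the metric is then M = L^T L). This only
# illustrates the penalty terms themselves, not the thesis algorithm.
import numpy as np

def lasso_penalty(L):
    # elementwise L1 norm: drives individual entries of L to zero
    return np.abs(L).sum()

def group_lasso_penalty(L):
    # sum of row L2 norms: drives entire rows (i.e. input variables) to zero
    return np.linalg.norm(L, axis=1).sum()

def trace_norm_penalty(L):
    # nuclear norm (sum of singular values): encourages a low-rank transform
    return np.linalg.norm(L, ord="nuc")

L = np.random.default_rng(1).normal(size=(10, 10))
print(lasso_penalty(L), group_lasso_penalty(L), trace_norm_penalty(L))
```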
3

Evolutionary Algorithms for Data Transformation

Švec, Ondřej January 2017 (has links)
In this work, we propose a novel method for supervised dimensionality reduction, which learns the weights of a neural network using an evolutionary algorithm, CMA-ES, optimising the success rate of the k-NN classifier. If no activation functions are used in the neural network, the algorithm essentially performs a linear transformation, which can also be used inside the Mahalanobis distance. Therefore our method can be considered a metric learning algorithm. By adding activations to the neural network, the algorithm can learn non-linear transformations as well. We consider reductions to low-dimensional spaces, which are useful for data visualisation, and demonstrate that the resulting projections provide better performance than other dimensionality reduction techniques, and also that the visualisations provide better distinctions between the classes in the data thanks to the locality of the k-NN classifier.
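A minimal sketch of this scheme, not the thesis implementation: CMA-ES searches the weights of a single linear layer and scores each candidate projection by cross-validated k-NN accuracy. It assumes the `cma` (pycma) and scikit-learn packages; the dataset, target dimensionality, and CMA-ES settings are illustrative.

```python
# Sketch: evolve the weights of a linear projection with CMA-ES, scoring each
# candidate by k-NN cross-validated accuracy on the projected data.
import cma
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n_features, n_components = X.shape[1], 2   # project 4-D data to 2-D for visualisation

def negative_knn_accuracy(weights):
    W = weights.reshape(n_features, n_components)
    Z = X @ W                               # linear transformation of the data
    knn = KNeighborsClassifier(n_neighbors=5)
    acc = cross_val_score(knn, Z, y, cv=3).mean()
    return -acc                             # CMA-ES minimises, so negate accuracy

es = cma.CMAEvolutionStrategy(np.zeros(n_features * n_components), 0.5, {"maxiter": 50})
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [negative_knn_accuracy(w) for w in candidates])

W_best = es.result.xbest.reshape(n_features, n_components)
print("best CV accuracy:", -es.result.fbest)
```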
4

Information fusion and decision-making using belief functions: application to therapeutic monitoring of cancer

Lian, Chunfeng 27 January 2017 (has links)
Radiation therapy is one of the principal options used in the treatment of malignant tumors. To enhance its effectiveness, two critical issues should be carefully dealt with: reliably predicting therapy outcomes, so that the ongoing treatment plan can be adapted for individual patients, and accurately segmenting tumor volumes, so as to maximize radiation delivery in tumor tissue while minimizing side effects in adjacent organs at risk. Positron emission tomography with the radioactive tracer fluorine-18 fluorodeoxyglucose (FDG-PET) can noninvasively provide significant information about the functional activity of tumor cells. The goal of this thesis is twofold: 1) to propose reliable therapy outcome prediction systems using primarily features extracted from FDG-PET images; 2) to propose automatic and accurate algorithms for tumor segmentation in PET and PET-CT images. The theory of belief functions is adopted in our study to model and reason with the uncertain and imprecise knowledge quantified from noisy and blurred PET images. In the framework of belief functions, a sparse feature selection method and a low-rank metric learning method are proposed to improve the classification accuracy of the evidential K-nearest-neighbor (EK-NN) classifier learned from high-dimensional data containing unreliable features.
Based on these two theoretical studies, a robust prediction system is then proposed, in which the small and imbalanced nature of clinical data is effectively tackled. To automatically delineate tumors in PET images, an unsupervised 3-D segmentation method based on evidential clustering and spatial information is proposed. This mono-modality segmentation method is then extended to co-segment tumors in PET-CT images, considering that the two distinct modalities contain complementary information that can further improve accuracy. All proposed methods have been evaluated on clinical data, giving better results than state-of-the-art methods.
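The evidential K-nearest-neighbour (EK-NN) rule referred to above combines, for a query point, one simple belief mass per neighbour, with strength decaying in the neighbour's distance. The following NumPy sketch shows only the classical rule (Denoeux, 1995); the alpha and gamma values and the toy data are illustrative, and the thesis additionally learns the metric and selects features within this framework.

```python
# Sketch of the evidential K-nearest-neighbour (EK-NN) rule in plain NumPy.
import numpy as np

def eknn_predict(X_train, y_train, x, k=5, alpha=0.95, gamma=1.0):
    classes = np.unique(y_train)
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]

    # Each neighbour contributes a simple mass function:
    #   s_i on the singleton {class of neighbour i}, (1 - s_i) on the whole frame.
    s = alpha * np.exp(-gamma * dists[nn] ** 2)

    # Dempster-combined (unnormalised) masses on the singletons and the frame.
    m = np.zeros(len(classes))
    for j, c in enumerate(classes):
        same = s[y_train[nn] == c]
        other = s[y_train[nn] != c]
        m[j] = (1 - np.prod(1 - same)) * np.prod(1 - other)
    m_frame = np.prod(1 - s)

    total = m.sum() + m_frame
    return classes[np.argmax(m)], m / total   # predicted class, normalised singleton masses

rng = np.random.default_rng(2)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
print(eknn_predict(X_train, y_train, np.array([2.5, 2.5])))
```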
5

Analysis and Reconstruction of the Hematopoietic Stem Cell Differentiation Tree: A Linear Programming Approach for Gene Selection

Ghadie, Mohamed A. January 2015 (has links)
Stem cells differentiate through an organized hierarchy of intermediate cell types to terminally differentiated cell types. This process is largely guided by master transcriptional regulators, but it also depends on the expression of many other types of genes. The discrete cell types in the differentiation hierarchy are often identified based on the expression or non-expression of certain marker genes. Historically, these have often been various cell-surface proteins, which are fairly easy to assay biochemically but are not necessarily causative of the cell type, in the sense of being master transcriptional regulators. This raises important questions about how gene expression across the whole genome controls or reflects cell state, and in particular, differentiation hierarchies. Traditional approaches to understanding gene expression patterns across multiple conditions, such as principal components analysis or K-means clustering, can group cell types based on gene expression, but they do so without knowledge of the differentiation hierarchy. Hierarchical clustering and maximization of parsimony can organize the cell types into a tree, but in general this tree is different from the differentiation hierarchy. Using hematopoietic differentiation as an example, we demonstrate how many genes other than marker genes are able to discriminate between different branches of the differentiation tree by proposing two models for detecting genes that are up-regulated or down-regulated in distinct lineages. We then propose a novel approach to solving the following problem: Given the differentiation hierarchy and gene expression data at each node, construct a weighted Euclidean distance metric such that the minimum spanning tree with respect to that metric is precisely the given differentiation hierarchy. We provide a set of linear constraints that are provably sufficient for the desired construction and a linear programming framework to identify sparse sets of weights, effectively identifying genes that are most relevant for discriminating different parts of the tree. We apply our method to microarray gene expression data describing 38 cell types in the hematopoiesis hierarchy, constructing a sparse weighted Euclidean metric that uses just 175 genes. These 175 genes are different from the marker genes that were used to identify the 38 cell types, hence offering a novel alternative way of discriminating different branches of the tree. A DAVID functional annotation analysis shows that the 175 genes reflect major processes and pathways active in different parts of the tree. However, we find that there are many alternative sets of weights that satisfy the linear constraints. Thus, in the style of random-forest training, we also construct metrics based on random subsets of the genes and compare them to the 175-gene metric. Our results show that the 175 genes frequently appear in the random metrics, supporting their significance from an empirical point of view as well. Finally, we show how our linear programming method is able to identify columns that were selected to build minimum spanning trees on the nodes of random variable-size matrices.
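One way to picture the linear-programming construction: with a weighted squared Euclidean distance, requiring each non-tree pair of cell types to be farther apart than every tree edge on the path joining them is linear in the gene weights, and minimising the sum of the weights favours sparsity. The sketch below is an illustration under those assumptions, not the thesis implementation; its margin, constraint set, and synthetic data are hypothetical.

```python
# Sketch: find sparse non-negative gene weights w such that, under the weighted
# squared Euclidean distance d_w(u, v) = sum_g w_g (u_g - v_g)^2, each tree edge
# on the path between a non-adjacent pair is shorter than that pair's distance.
# These constraints are linear in w; minimising sum(w) (an L1 objective) favours
# few selected genes. The margin and synthetic data are illustrative.
import numpy as np
from scipy.optimize import linprog
import networkx as nx

def sparse_tree_metric(X, tree_edges, margin=1e-3):
    """X: (n_cells, n_genes) expression matrix; tree_edges: edges of the known hierarchy."""
    n, g = X.shape
    T = nx.Graph(tree_edges)
    sq = lambda u, v: (X[u] - X[v]) ** 2          # per-gene squared differences

    A_ub, b_ub = [], []
    for u in range(n):
        for v in range(u + 1, n):
            if T.has_edge(u, v):
                continue
            path = nx.shortest_path(T, u, v)
            for a, b in zip(path, path[1:]):
                # require d_w(a, b) + margin <= d_w(u, v)  (linear in w)
                A_ub.append(sq(a, b) - sq(u, v))
                b_ub.append(-margin)

    return linprog(c=np.ones(g), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                   bounds=[(0, None)] * g, method="highs")

rng = np.random.default_rng(3)
tree = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]    # hypothetical differentiation tree
X = np.zeros((6, 30))                              # 6 cell types, 30 hypothetical genes
for parent, child in tree:                         # expression diffuses along the tree
    X[child] = X[parent] + rng.normal(size=30)

res = sparse_tree_metric(X, tree)
if res.success:
    print("non-zero gene weights:", int(np.count_nonzero(res.x > 1e-8)))
else:
    print("no feasible weighting for this synthetic data and tree")
```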
