1

Manifold Integration: Data Integration on Multiple Manifolds

Choi, Hee Youl (May 2010)
In data analysis, data points are usually analyzed based on their relations to other points (e.g., distance or inner product). This kind of relation can be analyzed on the manifold of the data set. Manifold learning is an approach to understand such relations. Various manifold learning methods have been developed and their effectiveness has been demonstrated in many real-world problems in pattern recognition and signal processing. However, most existing manifold learning algorithms only consider one manifold based on one dissimilarity matrix. In practice, multiple measurements may be available, and could be utilized. In pattern recognition systems, data integration has been an important consideration for improved accuracy given multiple measurements. Some data integration algorithms have been proposed to address this issue. These integration algorithms mostly use statistical information from the data set such as uncertainty of each data source, but they do not use the structural information (i.e., the geometric relations between data points). Such a structure is naturally described by a manifold. Even though manifold learning and data integration have been successfully used for data analysis, they have not been considered in a single integrated framework. When we have multiple measurements generated from the same data set and mapped onto different manifolds, those measurements can be integrated using the structural information on these multiple manifolds. Furthermore, we can better understand the structure of the data set by combining multiple measurements in each manifold using data integration techniques. In this dissertation, I present a new concept, manifold integration, a data integration method using the structure of data expressed in multiple manifolds. In order to achieve manifold integration, I formulated the manifold integration concept, and derived three manifold integration algorithms. 
Experimental results showed the algorithms' effectiveness in classification and dimension reduction. Moreover, I showed that manifold integration has promising theoretical and neuroscientific applications. I expect the manifold integration approach to serve as an effective framework for analyzing multimodal data sets on multiple manifolds, and I expect that my research on manifold integration will catalyze both manifold learning and data integration research.
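The core idea of combining structural information from several manifolds can be illustrated with a minimal sketch. The dissertation's three algorithms are not reproduced here; instead, this hypothetical `integrate_manifolds` function shows one simple instance of the concept: each dissimilarity matrix is double-centered into a Gram matrix (as in classical MDS), the Gram matrices are averaged, and a single joint embedding is extracted from the combined matrix.

```python
import numpy as np

def gram_from_dissimilarity(D):
    """Double-center a dissimilarity matrix into a Gram matrix (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ (D ** 2) @ J

def integrate_manifolds(dissimilarities, dim=2):
    """Embed points using the average of the Gram matrices induced by
    several dissimilarity measurements of the same data set."""
    G = sum(gram_from_dissimilarity(D) for D in dissimilarities) / len(dissimilarities)
    vals, vecs = np.linalg.eigh(G)
    order = np.argsort(vals)[::-1][:dim]           # largest eigenvalues first
    vals = np.clip(vals[order], 0, None)           # guard against tiny negatives
    return vecs[:, order] * np.sqrt(vals)          # n x dim joint embedding
```

Averaging Gram matrices is only one possible combination rule; the point is that the integration happens at the level of geometric structure rather than raw statistics.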
2

Kernel Machine Methods for Risk Prediction with High Dimensional Data

Sinnott, Jennifer Anne (22 October 2012)
Understanding the relationship between genomic markers and complex disease could have a profound impact on medicine, but the large number of potential markers can make it hard to differentiate true biological signal from noise and false positive associations. A standard approach for relating genetic markers to complex disease is to test each marker for its association with disease outcome by comparing disease cases to healthy controls. It would be cost-effective to use control groups across studies of many different diseases; however, this can be problematic when the controls are genotyped on a platform different from the one used for cases. Since different platforms genotype different SNPs, imputation is needed to provide full genomic coverage, but introduces differential measurement error. In Chapter 1, we consider the effects of this differential error on association tests. We quantify the inflation in Type I Error by comparing two healthy control groups drawn from the same cohort study but genotyped on different platforms, and assess several methods for mitigating this error. Analyzing genomic data one marker at a time can effectively identify associations, but the resulting lists of significant SNPs or differentially expressed genes can be hard to interpret. Integrating prior biological knowledge into risk prediction with such data by grouping genomic features into pathways reduces the dimensionality of the problem and could improve models by making them more biologically grounded and interpretable. The kernel machine framework has been proposed to model pathway effects because it allows nonlinear associations between the genes in a pathway and disease risk. In Chapter 2, we propose kernel machine regression under the accelerated failure time model. We derive a pseudo-score statistic for testing and a risk score for prediction using genes in a single pathway. 
We propose omnibus procedures that alleviate the need to prespecify the kernel and allow the data to drive the complexity of the resulting model. In Chapter 3, we extend the single-pathway risk prediction methods to multiple pathways, using a multiple kernel learning approach to select important pathways and efficiently combine information across them.
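The pathway-kernel idea can be sketched in a few lines. This is not the dissertation's accelerated failure time formulation or its pseudo-score test; it is a simplified kernel ridge regression in which each pathway's genes contribute a separate kernel and the kernels are combined with fixed weights (a full multiple kernel learning approach would learn these weights). The function names and the RBF kernel choice are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian kernel on the rows of X, allowing nonlinear pathway effects."""
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def pathway_kernel_ridge(pathway_blocks, y, weights=None, lam=1.0):
    """Kernel ridge regression on a weighted sum of per-pathway kernels.

    pathway_blocks: list of (n x p_k) feature matrices, one per pathway.
    Returns fitted risk scores for the n subjects."""
    if weights is None:
        weights = np.ones(len(pathway_blocks)) / len(pathway_blocks)
    K = sum(w * rbf_kernel(X) for w, X in zip(weights, pathway_blocks))
    n = K.shape[0]
    alpha = np.linalg.solve(K + lam * np.eye(n), y)   # dual coefficients
    return K @ alpha                                   # fitted risk scores
```

Grouping features by pathway before kernelizing is what reduces the dimensionality: the model sees a handful of kernels rather than thousands of individual markers.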
3

Scaling Up Support Vector Machines with Application to Plankton Recognition

Luo, Tong (10 February 2005)
Learning a predictive model for a large-scale real-world problem presents several challenges: choosing a good feature set and a scalable machine learning algorithm with small generalization error. A support vector machine (SVM), based on statistical learning theory, obtains good generalization by restricting the capacity of its hypothesis space. An SVM outperforms classical learning algorithms on many benchmark data sets, and its excellent performance makes it an ideal choice for pattern recognition problems. However, training an SVM involves constrained quadratic programming, which leads to poor scalability. In this dissertation, we propose several methods to improve an SVM's scalability. The evaluation is done mainly in the context of a plankton recognition problem. One approach is active learning, which selectively asks a domain expert to label a subset of examples drawn from a large pool of unlabeled data. Active learning minimizes the number of labeled examples needed to build an accurate model and reduces the human effort of manually labeling the data. We propose a new active learning method, "Breaking Ties" (BT), for multi-class SVMs. After developing a probability model for multi-class SVMs, BT selectively labels examples for which the difference in probabilities between the predicted most likely class and the second most likely class is smallest. In our plankton recognition system, this simple strategy required several times fewer labeled plankton images to reach a given recognition accuracy than random sampling. To speed up an SVM's training and prediction, we show how to apply bit reduction to compress the examples into several bins. Weights are assigned to the bins based on the number of examples each contains. Treating each bin as a weighted example, an SVM builds a model using the reduced set of weighted examples.
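The Breaking Ties selection rule itself is simple enough to sketch. Assuming the multi-class probability model has already produced a matrix of class probabilities (the probability model and the surrounding SVM training loop are not shown), a hypothetical `breaking_ties` helper picks the examples whose top two class probabilities are closest:

```python
import numpy as np

def breaking_ties(probs, n_queries):
    """Select the examples whose top-two class probabilities are closest.

    probs: (n_samples, n_classes) predicted class probabilities.
    Returns indices of the n_queries most ambiguous examples."""
    sorted_probs = np.sort(probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]  # best minus second-best
    return np.argsort(margin)[:n_queries]               # smallest gaps first
```

Examples with a small gap are the ones the current model is least sure about, so labeling them is expected to be the most informative next step.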
4

Semi-supervised structured prediction models

Brefeld, Ulf (14 March 2008)
Learning mappings between arbitrary structured input and output variables is a fundamental problem in machine learning.
It covers many natural learning tasks and challenges the standard model of learning a mapping from independently drawn instances to a small set of labels. Potential applications include classification with a class taxonomy, named entity recognition, and natural language parsing. In these structured domains, labeled training instances are generally expensive to obtain while unlabeled inputs are readily available and inexpensive. This thesis deals with semi-supervised learning of discriminative models for structured output variables. The analytical techniques and algorithms of classical semi-supervised learning are lifted to the structured setting. Several approaches based on different assumptions of the data are presented. Co-learning, for instance, maximizes the agreement among multiple hypotheses while transductive approaches rely on an implicit cluster assumption. Furthermore, in the framework of this dissertation, a case study on email batch detection in message streams is presented. The involved tasks exhibit an inherent cluster structure and the presented solution exploits the streaming nature of the data. The different approaches are developed into semi-supervised structured prediction models and efficient optimization strategies thereof are presented. The novel algorithms generalize state-of-the-art approaches in structural learning such as structural support vector machines. Empirical results show that the semi-supervised algorithms lead to significantly lower error rates than their fully supervised counterparts in many application areas, including multi-class classification, named entity recognition, and natural language parsing.
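The semi-supervised principle underlying these methods can be illustrated with a deliberately simplified sketch. The thesis works with structured outputs and structural SVMs; the following hypothetical `self_train` function instead shows the basic pseudo-labeling idea on a plain classification task with a nearest-centroid model: in each round, the current model labels the unlabeled points it is most confident about, and those pseudo-labeled points are added to the training set.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Fit class centroids; the simplest possible stand-in for a classifier."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(0) for c in classes])

def self_train(X_lab, y_lab, X_unlab, rounds=5, batch=2):
    """Grow the labeled set by pseudo-labeling the most confident points."""
    X, y = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        classes, cents = nearest_centroid_fit(X, y)
        d = np.linalg.norm(pool[:, None] - cents[None], axis=-1)
        conf = np.sort(d, axis=1)[:, 0]        # distance to nearest centroid
        pick = np.argsort(conf)[:batch]        # closest points = most confident
        X = np.vstack([X, pool[pick]])
        y = np.concatenate([y, classes[d[pick].argmin(1)]])
        pool = np.delete(pool, pick, axis=0)
    return X, y
```

Co-learning replaces the single model here with multiple hypotheses trained on different views and uses their agreement as the confidence signal; transductive approaches instead exploit the cluster structure of the unlabeled pool directly.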
