181

Application of supervised and unsupervised learning to analysis of the arterial pressure pulse

Walsh, Andrew Michael, Graduate School of Biomedical Engineering, UNSW January 2006 (has links)
This thesis presents an investigation of statistical methods applied to the analysis of the shape of the arterial pressure waveform. The arterial pulse is analysed by a selection of both supervised and unsupervised learning methods. Supervised learning methods are generally better known as regression; unsupervised learning methods seek patterns in data without the specification of a target variable. The theoretical relationship between arterial pressure and wave shape is first investigated by study of a transmission line model of the arterial tree. A meta-database of pulse waveforms obtained by the SphygmoCor device is then analysed by the unsupervised learning technique of Self Organising Maps (SOM). The map patterns indicate that the observed arterial pressures affect the wave shape in a similar way to that predicted by the theoretical model. A database of continuous arterial pressure obtained by catheter line during sleep is used to derive supervised models that enable estimation of arterial pressures based on the measured wave shapes. Independent component analysis (ICA) is also used in a supervised learning methodology to show the theoretical plausibility of separating the pressure signals from unwanted noise components. The accuracy and repeatability of the SphygmoCor device are measured and discussed. Alternative regression models are introduced that improve on the existing models in the estimation of central cardiovascular parameters from peripheral arterial wave shapes. Results of this investigation show that, from the information in the wave shape, it is possible in theory to estimate the continuous underlying pressures within the artery to a degree of accuracy acceptable to the Association for the Advancement of Medical Instrumentation. This could facilitate a new role for non-invasive sphygmographic devices: to be used not only for feature estimation but as alternatives to invasive arterial pressure sensors in the measurement of continuous blood pressure.
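To make the unsupervised step concrete, the following is a minimal sketch of a Self-Organising Map trained on waveform shape vectors. It is not the thesis's code: the input array `pulses`, the map size and the training schedule are invented placeholders standing in for normalised SphygmoCor pulse recordings.

```python
# Minimal SOM sketch (not the thesis code). `pulses` is a hypothetical stand-in
# for normalised arterial pressure waveform vectors (n_samples x n_points).
import numpy as np

rng = np.random.default_rng(0)
pulses = rng.standard_normal((500, 128))          # placeholder waveform vectors

rows, cols, dim = 8, 8, pulses.shape[1]
weights = rng.standard_normal((rows, cols, dim)) * 0.1
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

def train_som(data, weights, epochs=20, lr0=0.5, sigma0=3.0):
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            lr = lr0 * np.exp(-step / n_steps)            # decaying learning rate
            sigma = sigma0 * np.exp(-step / n_steps)      # shrinking neighbourhood
            # best-matching unit for this waveform
            bmu = np.unravel_index(
                np.argmin(((weights - x) ** 2).sum(axis=-1)), (rows, cols))
            # Gaussian neighbourhood pull of nearby units towards the sample
            d2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
            h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)
            step += 1
    return weights

weights = train_som(pulses, weights)
# Each map unit now holds a prototype waveform; inspecting how the prototypes vary
# across the grid is the kind of map-pattern analysis the abstract describes.
```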
182

Mapping land-use in north-western Nigeria (Case study of Dutse)

Anavberokhai, Isah January 2007 (has links)
This project analyzes satellite images from 1976, 1985 and 2000 of Dutse, Jigawa state, in north-western Nigeria. The analyzed satellite images were used to determine the land-use and vegetation changes that occurred between 1976 and 2000; understanding these changes will help recommend possible planning measures in order to protect the vegetation from further deterioration. Studying land-use change in north-western Nigeria is essential for analyzing various ecological and developmental consequences over time. The north-western region of Nigeria is of great environmental and economic importance, having land cover rich in agricultural production and livestock grazing. The increase of population over time has affected the land-use and hence agricultural and livestock production. On completion of this project, the possible land-use changes that have taken place in Dutse will be analyzed for future recommendation. The use of supervised classification and change detection of satellite images has provided an economical way to quantify different types of land-use and the changes that have occurred over time. The percentage difference in land-use between 1976 and 2000 was 37%, which is considered a high level of land-use change within the period of study. The results of this project are being used to propose planning strategies that could help in planning sustainable land-use and diversity in Dutse.
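As a rough illustration of the post-classification change detection described above, the sketch below cross-tabulates two already-classified land-use rasters into a from-to change matrix. The class list and the random "maps" are hypothetical stand-ins for classified 1976 and 2000 Landsat scenes.

```python
# Illustrative sketch only: post-classification change detection between two
# classified land-use rasters (e.g. 1976 and 2000). The arrays are hypothetical
# class maps; real work would start from classified Landsat scenes.
import numpy as np

rng = np.random.default_rng(1)
classes = ["urban", "agriculture", "vegetation", "bare soil", "water"]
lu_1976 = rng.integers(0, len(classes), size=(400, 400))
lu_2000 = rng.integers(0, len(classes), size=(400, 400))

# Cross-tabulation ("from-to" change matrix): rows = 1976 class, cols = 2000 class.
change_matrix = np.zeros((len(classes), len(classes)), dtype=int)
np.add.at(change_matrix, (lu_1976.ravel(), lu_2000.ravel()), 1)

changed = (lu_1976 != lu_2000).mean() * 100
print(f"Pixels that changed class: {changed:.1f}%")
for i, name in enumerate(classes):
    losses = change_matrix[i].sum() - change_matrix[i, i]
    print(f"{name:>12}: lost {losses} pixels to other classes")
```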
183

Weakly supervised methods for learning actions and objects

Prest, Alessandro 04 September 2012 (has links) (PDF)
Modern Computer Vision systems learn visual concepts through examples (i.e. images) which have been manually annotated by humans. While this paradigm allowed the field to progress tremendously in the last decade, it has now become one of its major bottlenecks. Teaching a new visual concept requires an expensive human annotation effort, limiting the ability of systems to scale from the few dozen visual concepts that work today to thousands. The exponential growth of visual data available on the net represents an invaluable resource for visual learning algorithms and calls for new methods able to exploit this information to learn visual concepts without the need for major human annotation effort. As a first contribution, we introduce an approach for learning human actions as interactions between persons and objects in realistic images. By exploiting the spatial structure of human-object interactions, we are able to learn action models automatically from a set of still images annotated only with the action label (weakly-supervised). Extensive experimental evaluation demonstrates that our weakly-supervised approach achieves the same performance as popular fully-supervised methods despite using substantially less supervision. In the second part of this thesis we extend this reasoning to human-object interactions in realistic video and feature-length movies. Popular methods represent actions with low-level features such as image gradients or optical flow. In our approach, instead, interactions are modeled as the trajectory of the object with respect to the person position, providing a rich and natural description of actions. Our interaction descriptor is an informative cue on its own and is complementary to traditional low-level features. Finally, in the third part we propose an approach for learning object detectors from real-world web videos (i.e. YouTube). As opposed to the standard paradigm of learning from still images annotated with bounding-boxes, we propose a technique to learn from videos known only to contain objects of a target class. We demonstrate that learning detectors from video alone already delivers good performance while requiring much less supervision compared to training from images annotated with bounding boxes. We additionally show that training from a combination of weakly annotated videos and fully annotated still images improves over training from still images alone.
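The following sketch illustrates one plausible form of the relative-trajectory idea described above: the object's position expressed in the person's coordinate frame, normalised by the person's scale, frame by frame. The box coordinates and the helper `interaction_descriptor` are illustrative assumptions, not the thesis's actual descriptor.

```python
# A minimal sketch of a relative-trajectory interaction descriptor: the object's
# centre is expressed relative to the person's centre in each frame and scaled by
# the person's height. All boxes below are hypothetical.
import numpy as np

def box_center(box):
    """box = (x1, y1, x2, y2); returns (cx, cy)."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def interaction_descriptor(person_boxes, object_boxes):
    """Per-frame object position relative to the person, scale-normalised."""
    feats = []
    for p, o in zip(person_boxes, object_boxes):
        pc, oc = box_center(p), box_center(o)
        height = max(p[3] - p[1], 1e-6)        # person height as the scale unit
        feats.append((oc - pc) / height)
    return np.concatenate(feats)               # one fixed-length vector per track pair

# Example: a 5-frame track of a person and a small hand-held object.
person = [(100, 50, 160, 250)] * 5
obj = [(150 + 5 * t, 80 - 3 * t, 170 + 5 * t, 100 - 3 * t) for t in range(5)]
print(interaction_descriptor(person, obj))
```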
184

Statistical Feature Selection: With Applications in Life Science

Nilsson, Roland January 2007 (has links)
The sequencing of the human genome has changed life science research in many ways. Novel measurement technologies such as microarray expression analysis, genome-wide SNP typing and mass spectrometry are now producing experimental data of extremely high dimensions. While these techniques provide unprecedented opportunities for exploratory data analysis, the increase in dimensionality also introduces many difficulties. A key problem is to discover the most relevant variables, or features, among the tens of thousands of parallel measurements in a particular experiment. This is referred to as feature selection. For feature selection to be principled, one needs to decide exactly what it means for a feature to be "relevant". This thesis considers relevance from a statistical viewpoint, as a measure of statistical dependence on a given target variable. The target variable might be continuous, such as a patient's blood glucose level, or categorical, such as "smoker" vs. "non-smoker". Several forms of relevance are examined and related to each other to form a coherent theory. Each form of relevance then defines a different feature selection problem. The predictive features are those that allow an accurate predictive model, for example for disease diagnosis. I prove that finding predictive features is a tractable problem, in that consistent estimates can be computed in polynomial time. This is a substantial improvement upon current theory. However, I also demonstrate that selecting features to optimize prediction accuracy does not control feature error rates. This is a severe drawback in life science, where the selected features per se are important, for example as candidate drug targets. To address this problem, I propose a statistical method which to my knowledge is the first to achieve error control. Moreover, I show that in high dimensions, feature sets can be impossible to replicate in independent experiments even with controlled error rates. This finding may explain the lack of agreement among genome-wide association studies and molecular signatures of disease. The most predictive features may not always be the most relevant ones from a biological perspective, since the predictive power of a given feature may depend on measurement noise rather than biological properties. I therefore consider a wider definition of relevance that avoids this problem. The resulting feature selection problem is shown to be asymptotically intractable in the general case; however, I derive a set of simplifying assumptions which admit an intuitive, consistent polynomial-time algorithm. Moreover, I present a method that controls error rates also for this problem. This algorithm is evaluated on microarray data from case studies in diabetes and cancer. In some cases however, I find that these statistical relevance concepts are insufficient to prioritize among candidate features in a biologically reasonable manner. Therefore, effective feature selection for life science requires both a careful definition of relevance and a principled integration of existing biological knowledge. / The sequencing of the human genome in the early 2000s, together with the later sequencing projects for various model organisms, has enabled revolutionary new genome-wide biological measurement methods. Microarrays, mass spectrometry and SNP typing are examples of such methods. These methods generate very high-dimensional data.
A central problem in modern biological research is thus to identify the relevant variables among these thousands of measurements. This is called feature selection. To study feature selection systematically, an exact definition of the concept of "relevance" is necessary. In this thesis, relevance is treated from a statistical point of view: "relevance" means a statistical dependence on a target variable; this variable can be continuous, for example a blood pressure measurement on a patient, or discrete, for example an indicator variable such as "smoker" or "non-smoker". Different forms of relevance are treated and a coherent theory is presented. Each definition of relevance then gives rise to a specific feature selection problem. Predictive variables are those that can be used to construct prediction models, which is important, for example, in clinical diagnosis systems. It is proved that a consistent estimate of such variables can be computed in polynomial time, so that feature selection is feasible within reasonable computation time. This is a breakthrough compared with earlier research. However, it is also shown that methods that optimize prediction models often yield high proportions of irrelevant variables, which is very problematic in biological research. A new feature selection method is therefore presented with which the relevance of the selected variables is statistically guaranteed. In this context it is also shown that feature selection methods are not reproducible in the usual sense in high dimensions, even when relevance is statistically guaranteed. This partly explains why genome-wide genetic association studies have so far been difficult to reproduce. The case where all relevant variables are sought is also treated. This problem is proved to require exponential computation time in the general case, but a method is presented that solves the problem in polynomial time under certain statistical assumptions, which can be considered reasonable for biological data. Here too the problem of false positives is taken into account, and a statistical method is presented that guarantees relevance. This method is applied to case studies in type 2 diabetes and cancer. In some cases, however, the set of relevant variables is very large and statistical treatment of a single data type is insufficient. In such situations it is important to use different data sources and existing biological knowledge to sort out the most important findings.
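The error-control theme can be illustrated with a generic, widely used procedure; the sketch below is not the thesis's method, but shows univariate feature screening with Benjamini-Hochberg false-discovery-rate control on synthetic data.

```python
# Illustration only (not the thesis's specific procedure): select features by
# statistical dependence on a class label while controlling the false discovery
# rate with the Benjamini-Hochberg step-up rule. Data are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, informative = 80, 1000, 20
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)
X[:, :informative] += 1.5 * y[:, None]          # first 20 features carry signal

# Two-sample t-test per feature.
pvals = np.array([stats.ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue
                  for j in range(p)])

def benjamini_hochberg(pvals, q=0.05):
    order = np.argsort(pvals)
    ranked = pvals[order]
    thresh = q * np.arange(1, len(pvals) + 1) / len(pvals)
    below = np.nonzero(ranked <= thresh)[0]
    k = below.max() + 1 if below.size else 0
    return order[:k]                             # indices of selected features

selected = benjamini_hochberg(pvals)
print(f"selected {len(selected)} features, "
      f"{np.sum(selected < informative)} of them truly informative")
```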
185

A data-assisted approach to supporting instructional interventions in technology enhanced learning environments

2012 December 1900 (has links)
The design of intelligent learning environments requires significant up-front resources and expertise. These environments generally maintain complex and comprehensive knowledge bases describing pedagogical approaches, learner traits, and content models. This has limited the influence of these technologies in higher education, which instead largely uses learning content management systems in order to deliver non-classroom instruction to learners. This dissertation puts forth a data-assisted approach to embedding intelligence within learning environments. In this approach, instructional experts are provided with summaries of the activities of learners who interact with technology enhanced learning tools. These experts, who may include instructors, instructional designers, educational technologists, and others, use this data to gain insight into the activities of their learners. These insights lead experts to form instructional interventions which can be used to enhance the learning experience. The novel aspect of this approach is that the actions of the intelligent learning environment are now not just those of the learners and software constructs, but also those of the educational experts who may be supporting the learning process. The kinds of insights and interventions that come from application of the data-assisted approach vary with the domain being taught, the epistemology and pedagogical techniques being employed, and the particulars of the cohort being instructed. In this dissertation, three investigations using the data-assisted approach are described. The first of these demonstrates the effects of making available to instructors novel sociogram-based visualizations of online asynchronous discourse. By making instructors aware of the discussion habits of both themselves and learners, the instructors are better able to measure the effect of their teaching practice. This enables them to change their activities in response to the social networks that form between their learners, allowing them to react to deficiencies in the learning environment. Through these visualizations it is demonstrated that instructors can effectively change their pedagogy based on seeing data about their students' interactions. The second investigation described in this dissertation is the application of unsupervised machine learning to the viewing habits of learners using lecture capture facilities. By clustering learners into groups based on behaviour and correlating groups with academic outcome, a model of positive learning activity can be described. This is particularly useful for instructional designers who are evaluating the role of learning technologies in programs, as it contextualizes how technologies enable success in learners. Through this investigation it is demonstrated that the viewership data of learners can be used to assist designers in building higher level models of learning that can be used for evaluating the use of specific tools in blended learning situations. Finally, the results of applying supervised machine learning to the indexing of lecture video are described. Usage data collected from software is increasingly being used by software engineers to make technologies that are more customizable and adaptable. In this dissertation, it is demonstrated that supervised machine learning can provide human-like indexing of lecture videos that is more accurate than current techniques.
Further, these indices can be customized for groups of learners, increasing the level of personalization in the learning environment. This investigation demonstrates that the data-assisted approach can also be used by application developers who are building software features for personalization into intelligent learning environments. Through this work, it is shown that a data-assisted approach to supporting instructional interventions in technology enhanced learning environments is both possible and can positively impact the teaching and learning process. By making available to instructional experts the online activities of learners, experts can better understand and react to patterns of use that develop, making for a more effective and personalized learning environment. This approach differs from traditional methods of building intelligent learning environments, which apply learning theories a priori to instructional design, and do not leverage the in situ data collected about learners.
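A hedged sketch of the second investigation's general recipe follows: cluster learners by lecture-viewing behaviour and compare academic outcomes across the resulting groups. The feature columns, data and choice of k-means are illustrative assumptions.

```python
# Sketch only: cluster learners by viewing behaviour, then compare outcomes per
# cluster. Feature names and all data are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Columns: hours watched, fraction of lectures opened, mean replays per lecture.
viewing = np.column_stack([
    rng.gamma(2.0, 5.0, 300),
    rng.uniform(0, 1, 300),
    rng.poisson(1.5, 300).astype(float),
])
grades = 50 + 20 * viewing[:, 1] + rng.normal(0, 10, 300)   # synthetic outcome

# Standardise features, then cluster into behavioural groups.
z = (viewing - viewing.mean(axis=0)) / viewing.std(axis=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(z)

for k in range(3):
    print(f"cluster {k}: n={np.sum(labels == k):3d}, "
          f"mean grade={grades[labels == k].mean():.1f}")
```

Comparing mean outcomes across clusters is the step that turns raw viewership logs into the kind of higher-level model of positive learning activity the abstract describes.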
186

Generative manifold learning for the exploration of partially labeled data

Cruz Barbosa, Raúl 01 October 2009 (has links)
In many real-world application problems, the availability of data labels for supervised learning is rather limited. Incompletely labeled datasets are common in many of the databases generated in some of the currently most active areas of research. It is often the case that a limited number of labeled cases is accompanied by a larger number of unlabeled ones. This is the setting for semi-supervised learning, in which unsupervised approaches assist the supervised problem and vice versa. A manifold learning model, namely Generative Topographic Mapping (GTM), is the basis of the methods developed in this thesis. The non-linearity of the mapping that GTM generates makes it prone to trustworthiness and continuity errors that would reduce the faithfulness of the data representation, especially for datasets of convoluted geometry. In this thesis, a variant of GTM that uses a graph approximation to the geodesic metric is first defined. This model is capable of representing data of convoluted geometries. The standard GTM is here modified to prioritize neighbourhood relationships along the generated manifold. This is accomplished by penalizing the possible divergences between the Euclidean distances from the data points to the model prototypes and the corresponding geodesic distances along the manifold. The resulting Geodesic GTM (Geo-GTM) model is shown to improve the continuity and trustworthiness of the representation generated by the model, as well as to behave robustly in the presence of noise. The thesis then leads towards the definition and development of semi-supervised versions of GTM for partially-labeled data exploration. As a first step in this direction, a two-stage clustering procedure that uses class information is presented. A class information-enriched variant of GTM, namely class-GTM, yields a first cluster description of the data. The number of clusters defined by GTM is usually large for visualization purposes and does not necessarily correspond to the overall class structure. Consequently, in a second stage, clusters are agglomerated using the K-means algorithm with different novel initialization strategies that benefit from the probabilistic definition of GTM. We evaluate if the use of class information influences cluster-wise class separability. A robust variant of GTM that detects outliers while effectively minimizing their negative impact in the clustering process is also assessed in this context. We then proceed to the definition of a novel semi-supervised model, SS-Geo-GTM, that extends Geo-GTM to deal with semi-supervised problems. In SS-Geo-GTM, the model prototypes are linked by the nearest neighbour to the data manifold constructed by Geo-GTM. The resulting proximity graph is used as the basis for a class label propagation algorithm. The performance of SS-Geo-GTM is experimentally assessed, comparing positively with that of a Euclidean distance-based counterpart and that of the alternative Laplacian Eigenmaps method. Finally, the developed models (the two-stage clustering procedure and the semi-supervised models) are applied to the analysis of a human brain tumour dataset (obtained by Nuclear Magnetic Resonance Spectroscopy), where the tasks are, in turn, data clustering and survival prognostic modeling. / In many real-world application problems, the availability of data labels for supervised learning is rather limited. The existence of incompletely labeled datasets is common in many of the databases generated in some of the currently most active research areas. A limited number of labeled cases frequently comes accompanied by a much larger number of unlabeled data. This is the context in which semi-supervised learning operates, in which unsupervised approaches lend help to supervised problems and vice versa. A manifold learning model, called Generative Topographic Mapping (GTM), is the basis of the methods developed in this thesis. The non-linearity of the mapping that GTM generates makes it prone to trustworthiness and continuity errors, which can reduce the faithfulness of the data representation, especially for datasets of convoluted geometry. In this thesis, an extension of GTM that uses a graph approximation to the geodesic metric is defined first. This model is capable of representing data with convoluted geometries. In it, the standard GTM is modified to prioritize neighbourhood relationships along the generated manifold. This is achieved by penalizing the divergences between the Euclidean distances from the data to the model prototypes and the corresponding geodesic distances along the manifold. The resulting Geo-GTM model is shown to improve the continuity and trustworthiness of the generated representation and to behave robustly in the presence of noise. Later, the thesis leads to the definition and development of semi-supervised versions of GTM for the exploration of partially labeled datasets. As a first step in this direction, a two-stage clustering procedure that uses class membership information is presented. An extension of GTM enriched with class membership information, called class-GTM, produces a first cluster description of the data. The number of clusters defined by GTM is usually large for visualization purposes and does not necessarily correspond to the overall class structure. For that reason, in a second stage, the clusters are agglomerated using the K-means algorithm with different novel initialization strategies that benefit from the probabilistic definition of GTM. We evaluate whether the use of class information influences cluster-wise class separability. A robust extension of GTM that detects atypical data while effectively minimizing their negative impact on the clustering process is also evaluated in this context. The definition of a new semi-supervised model, SS-Geo-GTM, which extends Geo-GTM to deal with semi-supervised problems, then follows. In SS-Geo-GTM, the model prototypes are linked to the nearest neighbour of the manifold constructed by Geo-GTM. The resulting proximity graph is used as the basis for a class label propagation algorithm. The performance of SS-Geo-GTM is assessed experimentally, comparing positively both with this model's counterpart based on the Euclidean distance and with the alternative Laplacian Eigenmaps method. Finally, the developed models (the two-stage clustering procedure and the semi-supervised models) are applied to the analysis of a human brain tumour dataset (obtained by Nuclear Magnetic Resonance Spectroscopy), where the tasks to be carried out are data clustering and survival prognosis modeling.
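The graph approximation to the geodesic metric that Geo-GTM relies on can be sketched as follows, assuming scikit-learn and SciPy: a k-nearest-neighbour graph over the data with graph shortest paths standing in for distances along the manifold. The Swiss-roll dataset and the value k=8 are illustrative choices, not taken from the thesis.

```python
# Minimal sketch of a graph approximation to geodesic distances: build a weighted
# k-NN graph over the data and use graph shortest paths as manifold distances.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

X, _ = make_swiss_roll(n_samples=600, random_state=0)   # convoluted-geometry data

# Weighted k-NN graph; edge weights are Euclidean distances between neighbours.
knn = kneighbors_graph(X, n_neighbors=8, mode="distance")

# All-pairs graph shortest paths approximate geodesic distances on the manifold.
geodesic = shortest_path(knn, method="D", directed=False)

i, j = 0, 300
print(f"Euclidean distance : {np.linalg.norm(X[i] - X[j]):.2f}")
print(f"Geodesic estimate  : {geodesic[i, j]:.2f}")
```

For points on opposite folds of the roll, the geodesic estimate is much larger than the straight-line Euclidean distance, which is exactly the discrepancy the Geo-GTM penalty exploits.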
187

Learning from Partially Labeled Data: Unsupervised and Semi-supervised Learning on Graphs and Learning with Distribution Shifting

Huang, Jiayuan January 2007 (has links)
This thesis focuses on two fundamental machine learning problems: unsupervised learning, where no label information is available, and semi-supervised learning, where a small number of labels is given in addition to unlabeled data. These problems arise in many real-world applications, such as Web analysis and bioinformatics, where a large amount of data is available but no, or only a small amount of, labeled data exists. Obtaining classification labels in these domains is usually quite difficult because it involves either manual labeling or physical experimentation. This thesis approaches these problems from two perspectives: graph-based and distribution-based. First, I investigate a series of graph-based learning algorithms that are able to exploit information embedded in different types of graph structures. These algorithms allow label information to be shared between nodes in the graph, ultimately communicating information globally to yield effective unsupervised and semi-supervised learning. In particular, I extend existing graph-based learning algorithms, currently based on undirected graphs, to more general graph types, including directed graphs, hypergraphs and complex networks. These richer graph representations allow one to more naturally capture the intrinsic data relationships that exist, for example, in Web data, relational data, bioinformatics and social networks. For each of these generalized graph structures I show how information propagation can be characterized by distinct random walk models, and then use this characterization to develop new unsupervised and semi-supervised learning algorithms. Second, I investigate a more statistically oriented approach that explicitly models a learning scenario where the training and test examples come from different distributions. This is a difficult situation for standard statistical learning approaches, since they typically incorporate an assumption that the distributions for training and test sets are similar, if not identical. To achieve good performance in this scenario, I utilize unlabeled data to correct the bias between the training and test distributions. A key idea is to produce resampling weights for bias correction by working directly in a feature space and bypassing the problem of explicit density estimation. The technique can be easily applied to many different supervised learning algorithms, automatically adapting their behavior to cope with distribution shifting between training and test data.
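The distribution-shift part can be illustrated with a simplified stand-in for the thesis's approach. The sketch below does not implement kernel mean matching in feature space; it estimates importance weights with a probabilistic classifier that separates training from test inputs, which likewise avoids explicit density estimation. All data are synthetic.

```python
# Not the thesis's formulation: a common, simpler way to obtain resampling weights
# for covariate shift without explicit density estimation, via a classifier that
# distinguishes training inputs from test inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # training distribution
X_test = rng.normal(loc=1.0, scale=1.0, size=(500, 2))    # shifted test distribution

# Label 0 = drawn from the training set, 1 = drawn from the test set.
X_all = np.vstack([X_train, X_test])
d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
clf = LogisticRegression().fit(X_all, d)

# w(x) ~ P(test | x) / P(train | x) approximates p_test(x) / p_train(x); these
# weights can then be plugged into any weighted supervised learner.
proba = clf.predict_proba(X_train)
weights = proba[:, 1] / proba[:, 0]
print("mean weight:", round(float(weights.mean()), 2),
      "(larger weights go to training points that resemble test points)")
```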
188

Detecting Land Cover Change over a 20 Year Time Period in the Niagara Escarpment Plan Using Satellite Remote Sensing

Waite, Holly January 2009 (has links)
The Niagara Escarpment is one of Southern Ontario’s most important landscapes. Due to the nature of the landform and its location, the Escarpment is subject to various development pressures including urban expansion, mineral resource extraction, agricultural practices and recreation. In 1985, Canada’s first large-scale environmentally based land-use plan was put in place to ensure that only development that is compatible with the Escarpment occurred within the Niagara Escarpment Plan (NEP). The southern extent of the NEP is of particular interest in this study, since a portion of the Plan is located within the rapidly expanding Greater Toronto Area (GTA). The Plan areas located in the Regional Municipalities of Hamilton and Halton represent urban and rural geographical areas respectively, and both are experiencing development pressures and subsequent changes in land cover. Monitoring initiatives on the NEP have been established, but have done little to identify consistent techniques for monitoring land cover on the Niagara Escarpment. Land cover information is an important part of planning and environmental monitoring initiatives. Remote sensing has the potential to provide frequent and accurate land cover information over various spatial scales. The goal of this research was to examine land cover change in the portions of the NEP located in the Regional Municipalities of Hamilton and Halton. This was achieved through the creation of land cover maps for each region using Landsat 5 Thematic Mapper (TM) remotely sensed data. These maps aided in determining the qualitative and quantitative changes that had occurred in the Plan area over a 20-year period from 1986 to 2006. Change was also examined based on the NEP’s land use designations, to determine if the Plan policy has been effective in protecting the Escarpment. To obtain land cover maps, five different supervised classification methods were explored: Minimum Distance, Mahalanobis Distance, Maximum Likelihood, Object-oriented and Support Vector Machine (SVM). Seven land cover classes were mapped (forest, water, recreation, bare agricultural fields, vegetated agricultural fields, urban and mineral resource extraction areas) at a regional scale. SVM proved most successful at mapping land cover on the Escarpment, providing classification maps with an average accuracy of 86.7%. Land cover change analysis showed promising results with an increase in the forested class and only slight increases to the urban and mineral resource extraction classes. On the negative side, there was a decrease in agricultural land overall. An examination of land cover change based on the NEP land use designations showed little change, other than change that is regulated under Plan policies, demonstrating the success of the NEP in protecting vital Escarpment lands insofar as this can be revealed through remote sensing. Land cover should be monitored in the NEP consistently over time to ensure changes in the Plan area are compatible with the Niagara Escarpment. Remote sensing is a tool that can provide this information to the Niagara Escarpment Commission (NEC) in a timely, comprehensive and cost-effective way. The information gained from remotely sensed data can aid in environmental monitoring and policy planning into the future.
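A minimal sketch of the pixel-based supervised classification step is given below, assuming scikit-learn: an SVM trained on labelled pixel spectra and applied to every pixel of a scene. The six-band spectra, class list and parameter values are synthetic placeholders, not the study's actual training data.

```python
# Sketch only: train an SVM on labelled pixel spectra (Landsat TM has six
# reflective bands) and classify a whole scene pixel by pixel. Data are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
classes = ["forest", "water", "urban", "agriculture"]
n_bands = 6

# Training samples: band values for pixels inside labelled training polygons.
X_train = rng.uniform(0, 255, size=(800, n_bands))
y_train = rng.integers(0, len(classes), size=800)

clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_train, y_train)

# Classify a (small, synthetic) image: rows x cols x bands -> rows x cols labels.
scene = rng.uniform(0, 255, size=(100, 100, n_bands))
predicted = clf.predict(scene.reshape(-1, n_bands)).reshape(100, 100)
print("class counts:", dict(zip(classes, np.bincount(predicted.ravel(),
                                                     minlength=len(classes)))))
```

In practice the accuracy comparison across classifiers (Minimum Distance, Maximum Likelihood, SVM, etc.) is then made against independent validation pixels, which is how the 86.7% figure quoted above would be obtained.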
189

Fundamental Limitations of Semi-Supervised Learning

Lu, Tyler (Tian) 30 April 2009 (has links)
The emergence of a new paradigm in machine learning known as semi-supervised learning (SSL) has seen benefits to many applications where labeled data is expensive to obtain. However, unlike supervised learning (SL), which enjoys a rich and deep theoretical foundation, semi-supervised learning, which uses additional unlabeled data for training, still remains a theoretical mystery lacking a sound fundamental understanding. The purpose of this research thesis is to take a first step towards bridging this theory-practice gap. We focus on investigating the inherent limitations of the benefits SSL can provide over SL. We develop a framework under which one can analyze the potential benefits, as measured by the sample complexity of SSL. Our framework is utopian in the sense that an SSL algorithm trains on a labeled sample and an unlabeled distribution, as opposed to an unlabeled sample in the usual SSL model. Thus, any lower bound on the sample complexity of SSL in this model implies lower bounds in the usual model. Roughly, our conclusion is that unless the learner is absolutely certain there is some non-trivial relationship between labels and the unlabeled distribution (an "SSL type assumption"), SSL cannot provide significant advantages over SL. Technically speaking, we show that the sample complexity of SSL is no more than a constant factor better than SL for any unlabeled distribution, under a no-prior-knowledge setting (i.e. without SSL type assumptions). We prove that for the class of thresholds in the realizable setting the sample complexity of SL is at most twice that of SSL. Also, we prove that in the agnostic setting for the classes of thresholds and union of intervals the sample complexity of SL is at most a constant factor larger than that of SSL. We conjecture this to be a general phenomenon applying to any hypothesis class. We also discuss issues regarding SSL type assumptions, and in particular the popular cluster assumption. We give examples showing that even in the most accommodating circumstances, learning under the cluster assumption can be hazardous and lead to prediction performance much worse than simply ignoring the unlabeled data and doing supervised learning. We conclude with a look into future research directions that build on our investigation.
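For reference, the quoted bounds can be restated informally in the following notation; the symbols m_SL and m_SSL (sample complexity of supervised and semi-supervised learning at accuracy epsilon and confidence delta) are assumed here, not taken from the thesis.

```latex
% Informal restatement of the bounds described in the abstract above.
% Thresholds, realizable setting:
\[
  m_{\mathrm{SL}}(\epsilon,\delta) \;\le\; 2\, m_{\mathrm{SSL}}(\epsilon,\delta).
\]
% Thresholds and unions of intervals, agnostic setting, for some constant c:
\[
  m_{\mathrm{SL}}(\epsilon,\delta) \;\le\; c\, m_{\mathrm{SSL}}(\epsilon,\delta).
\]
```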
190

Contributions to Unsupervised and Semi-Supervised Learning

Pal, David 21 May 2009 (has links)
This thesis studies two problems in theoretical machine learning. The first part of the thesis investigates the statistical stability of clustering algorithms. In the second part, we study the relative advantage of having unlabeled data in classification problems. Clustering stability was proposed and used as a model selection method in clustering tasks. The main idea of the method is that from a given data set two independent samples are taken. Each sample individually is clustered with the same clustering algorithm, with the same setting of its parameters. If the two resulting clusterings turn out to be close in some metric, it is concluded that the clustering algorithm and the setting of its parameters match the data set, and that the clusterings obtained are meaningful. We study asymptotic properties of this method for certain types of cost-minimizing clustering algorithms and relate their asymptotic stability to the number of optimal solutions of the underlying optimization problem. In classification problems, it is often expensive to obtain labeled data, but on the other hand, unlabeled data are often plentiful and cheap. We study how access to unlabeled data can decrease the amount of labeled data needed in the worst-case sense. We propose an extension of the probably approximately correct (PAC) model in which this question can be naturally studied. We show that for certain basic tasks, access to unlabeled data might, at best, halve the amount of labeled data needed.
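The clustering-stability procedure described above can be sketched as follows; the choice of k-means, the adjusted Rand index as the comparison metric, and the synthetic blob data are illustrative assumptions rather than the thesis's exact setup.

```python
# Sketch of the stability idea: split the data into two independent samples,
# cluster each with the same algorithm and parameters, and measure how similar
# the two induced clusterings are on common points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(6)
# Three well-separated Gaussian blobs.
data = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in ((0, 0), (5, 0), (0, 5))])

def stability(data, k, n_splits=10):
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(len(data))
        a, b = data[idx[: len(data) // 2]], data[idx[len(data) // 2:]]
        km_a = KMeans(n_clusters=k, n_init=10).fit(a)
        km_b = KMeans(n_clusters=k, n_init=10).fit(b)
        # Compare the two clusterings on a common evaluation set.
        labels_a, labels_b = km_a.predict(data), km_b.predict(data)
        scores.append(adjusted_rand_score(labels_a, labels_b))
    return float(np.mean(scores))

for k in (2, 3, 4, 5):
    print(f"k={k}: mean stability {stability(data, k):.2f}")
```

High average agreement across splits for a given k is then read as evidence that the algorithm and its parameter setting match the data, which is the model-selection use of stability discussed above.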
