Global ETD Search

1231	Extraction de caractéristiques et apprentissage statistique pour l'imagerie biomédicale cellulaire et tissulaire / Feature extraction and machine learning for cell and tissue biomedical imaging Zubiolo, Alexis 11 December 2015 The purpose of this Ph.D. thesis is to study the classification of cells and tissues in biomedical images based on morphological features. 
The goal is to help medical doctors and biologists better understand the laws governing certain biological phenomena. This work is divided into three main parts, corresponding to three typical problems in the areas of biomedical imaging it addresses. The first part analyzes endomicroscopic videos of the colon, in which the pathological class of the observed polyps has to be determined automatically. This task is performed using a supervised multiclass machine learning algorithm combining support vector machines and graph theory tools. The second part concerns the morphology of mouse neurons observed by fluorescence confocal microscopy. To obtain rich information, the neurons are imaged at two magnifications: the higher one, where the soma appears in detail, and the lower one, which shows the whole cortex, including the apical dendrites. On these images, morphological features are automatically extracted with a view to classification. The last part is about the multi-scale processing of digital histology images in the context of kidney cancer. The vascular network is extracted and modeled as a graph to establish a link between the vascular architecture of the tumor and its pathological class. Machine learning Biomedical image processing
1232	Generalised Bayesian matrix factorisation models Mohamed, Shakir January 2011 Factor analysis and related models for probabilistic matrix factorisation are of central importance to the unsupervised analysis of data, with a colourful history more than a century long. Probabilistic models for matrix factorisation allow us to explore the underlying structure in data, and have relevance in a vast number of application areas including collaborative filtering, source separation, missing data imputation, gene expression analysis, information retrieval, computational finance and computer vision, amongst others. This thesis develops generalisations of matrix factorisation models that advance our understanding and enhance the applicability of this important class of models. The generalisation of models for matrix factorisation focuses on three concerns: widening the applicability of latent variable models to the diverse types of data that are currently available; considering alternative structural forms in the underlying representations that are inferred; and including higher order data structures into the matrix factorisation framework. These three issues reflect the reality of modern data analysis and we develop new models that allow for a principled exploration and use of data in these settings. We place emphasis on Bayesian approaches to learning and the advantages that come with the Bayesian methodology. Our port of departure is a generalisation of latent variable models to members of the exponential family of distributions. This generalisation allows for the analysis of data that may be real-valued, binary, counts, non-negative or a heterogeneous set of these data types. The model unifies various existing models and constructs for unsupervised settings, the complementary framework to the generalised linear models in regression. Moving to structural considerations, we develop Bayesian methods for learning sparse latent representations. 
We define ideas of weakly and strongly sparse vectors and investigate the classes of prior distributions that give rise to these forms of sparsity, namely the scale-mixture of Gaussians and the spike-and-slab distribution. Based on these sparsity favouring priors, we develop and compare methods for sparse matrix factorisation and present the first comparison of these sparse learning approaches. As a second structural consideration, we develop models with the ability to generate correlated binary vectors. Moment-matching is used to allow binary data with specified correlation to be generated, based on dichotomisation of the Gaussian distribution. We then develop a novel and simple method for binary PCA based on Gaussian dichotomisation. The third generalisation considers the extension of matrix factorisation models to multi-dimensional arrays of data that are increasingly prevalent. We develop the first Bayesian model for non-negative tensor factorisation and explore the relationship between this model and the previously described models for matrix factorisation. 006.3
1233	Generative probabilistic models of goal-directed users in task-oriented dialogs Eshky, Aciel January 2014 A longstanding objective of human-computer interaction research is to develop better dialog systems for end users. The subfield of user modelling, specifically, aims to provide dialog researchers with models of user behaviour to aid the design and improvement of dialog systems. Where dialog systems are commercially deployed, they are often used by a vast number of users, so sub-optimal performance could lead to an immediate financial loss for the service provider, and even to user alienation. Thus, there is a strong incentive to make dialog systems as functional as possible immediately and, crucially, prior to their release to the public. Models of user behaviour fill this gap by simulating the role of human users in the lab, without the losses associated with sub-optimal system performance. User models can also tremendously aid design decisions, by serving as tools for exploratory analysis of real user behaviour, prior to designing dialog software. User modelling is the central problem of this thesis. We focus on a particular kind of dialog, termed task-oriented dialogs (those centred around solving an explicit task), because they represent the frontier of current dialog research and commercial deployment. Users taking part in these dialogs behave according to a set of user goals, which specify what they wish to accomplish from the interaction, and tend to exhibit variability of behaviour given the same set of goals. Our objective is to capture and reproduce (at the semantic utterance level) the range of behaviour that users exhibit while being consistent with their goals. We approach the problem as an instance of generative probabilistic modelling, with explicit user goals, and induced entirely from data. 
We argue that doing so has numerous practical and theoretical benefits over previous approaches to user modelling, which have either lacked a model of user goals or have not been driven by real dialog data. A principal problem with user modelling development thus far has been the difficulty in evaluation. We demonstrate how treating user models as probabilistic models alleviates some of these problems through the ability to leverage a whole raft of techniques and insights from machine learning for evaluation. We demonstrate the efficacy of our approach by applying it to two different kinds of task-oriented dialog domains, which exhibit two different sub-problems encountered in real dialog corpora. The first are informational (or slot-filling) domains, specifically those concerning flight and bus route information. In slot-filling domains, user goals take categorical values which allow multiple surface realisations, and are corrupted by speech recognition errors. We address this issue by adopting a topic model representation of user goals which allows us to capture both synonymy and phonetic confusability in a unified model. We first evaluate our model intrinsically using held-out probability and perplexity, and demonstrate substantial gains over an alternative string-goal representation, and over a non-goal-directed model. We then show in an extrinsic evaluation that features derived from our model lead to substantial improvements over a strong baseline in the task of discriminating between real dialogs (consistent dialogs) and dialogs composed of real turns sampled from different dialogs (inconsistent dialogs). We then move on to a spatial navigational domain in which user goals are spatial trajectories across a landscape. The disparity between the representation of spatial routes as raw pixel coordinates and their grounding as semantic utterances creates an interesting challenge compared to conventional slot-filling domains. 
We derive a feature-based representation of spatial goals which facilitates reasoning and admits generalisation to new routes not encountered at training time. The probabilistic formulation of our model allows us to capture variability of behaviour given the same underlying goal, a property frequently exhibited by human users in the domain. We first evaluate intrinsically using held-out probability and perplexity, and find a substantial reduction in uncertainty brought by our spatial representation. We further evaluate extrinsically in a human judgement task and find that our model’s behaviour does not differ significantly from the behaviour of real users. We conclude by sketching two novel ideas for future work: the first is to deploy the user models as transition functions for MDP-based dialog managers; the second is to use the models as a means of restricting the search space for optimal policies, by treating optimal behaviour as a subset of the (distributions over) plausible behaviour which we have induced. 006.3
1234	Personalized Medicine through Automatic Extraction of Information from Medical Texts Frunza, Oana Magdalena January 2012 The wealth of medical-related information available today gives rise to a multidimensional source of knowledge. Research discoveries published in prestigious venues, electronic health records data, discharge summaries, clinical notes, etc., all represent important medical information that can assist in the medical decision-making process. The challenge that comes with accessing and using such vast and diverse sources of data lies in the ability to distil and extract reliable and relevant information. Computer-based tools that use natural language processing and machine learning techniques have proven to help address such challenges. This work proposes reliable automatic solutions for tasks that can help achieve personalized medicine, a medical practice that brings together general medical knowledge and case-specific medical information. Phenotypic medical observations, along with data coming from test results, are not enough when assessing and treating a medical case. Genetic, life-style, background and environmental data also need to be taken into account in the medical decision process. This thesis’s goal is to prove that natural language processing and machine learning techniques represent reliable solutions for solving important medical-related problems. Of the numerous research problems that need to be answered when implementing personalized medicine, the scope of this thesis is restricted to four, as follows: 1. Automatic identification of obesity-related diseases by using only textual clinical data; 2. Automatic identification of relevant abstracts of published research to be used for building systematic reviews; 3. Automatic identification of gene functions based on textual data of published medical abstracts; 4. Automatic identification and classification of important medical relations between medical concepts in clinical and technical data. 
This thesis’s investigation into automatic solutions for achieving personalized medicine through information identification and extraction focused on individual, specific problems that can later be linked in a puzzle-building manner. A diverse representation technique that follows a divide-and-conquer methodological approach proves to be the most reliable solution for building automatic models that solve the above-mentioned tasks. The methodologies that I propose are supported by in-depth research experiments and thorough discussions and conclusions. Natural Language Processing Machine Learning Text Mining Medical Informatics
1235	Active Learning for One-class Classification Barnabé-Lortie, Vincent January 2015 Active learning is a common solution for reducing labeling costs and maximizing the impact of human labeling efforts in binary and multi-class classification settings. However, when we are faced with extreme levels of class imbalance, a situation in which it is not safe to assume that we have a representative sample of the minority class, it has been shown to be effective to replace the binary classifiers with one-class classifiers. In such a setting, traditional active learning methods, and many previously proposed in the literature for one-class classifiers, prove to be inappropriate, as they rely on assumptions about the data that no longer hold. In this thesis, we propose a novel approach to active learning designed for one-class classification. The proposed method does not rely on many of the inappropriate assumptions of its predecessors and leads to more robust classification performance. The gist of this method consists of labeling, in priority, the instances considered to fit the learned class the least by previous iterations of a one-class classification model. Throughout the thesis, we provide evidence for the merits of our method, then deepen our understanding of these merits by exploring the properties of the method that allow it to outperform the alternatives. active learning one-class classification class imbalance problem machine learning
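The selection rule summarised in the abstract above (label first the instances that fit the learned class the least) can be sketched roughly as follows. This is only an illustration under stated assumptions: the one-class model is a hypothetical stand-in that scores an instance by its squared distance to the centroid of the labelled target instances, and `oracle`, `budget`, and all function names are invented for the example rather than taken from the thesis.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def misfit_score(x, centre):
    """Squared Euclidean distance to the centre: larger means a worse fit."""
    return sum((a - b) ** 2 for a, b in zip(x, centre))


def select_query(labelled_target, unlabelled):
    """Index of the unlabelled instance that fits the current model the least."""
    centre = centroid(labelled_target)
    return max(range(len(unlabelled)),
               key=lambda i: misfit_score(unlabelled[i], centre))


def active_learning_loop(labelled_target, unlabelled, oracle, budget):
    """Query the oracle `budget` times, always on the current worst-fitting instance.

    `oracle(x)` is a hypothetical labelling function returning True when x
    belongs to the target class. The stand-in model is refit (its centroid is
    recomputed) at every iteration from the target instances labelled so far.
    """
    labelled_target = list(labelled_target)
    unlabelled = list(unlabelled)
    outliers = []
    for _ in range(min(budget, len(unlabelled))):
        i = select_query(labelled_target, unlabelled)
        x = unlabelled.pop(i)
        if oracle(x):
            labelled_target.append(x)
        else:
            outliers.append(x)
    return labelled_target, outliers


# Toy usage: points far from the origin are treated as outliers by the oracle.
target, rejected = active_learning_loop(
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.5, 0.5), (10.0, 10.0), (2.0, 2.0)],
    oracle=lambda x: max(x) < 5,
    budget=2,
)
```

Any one-class model that yields a fit score could replace the centroid distance here; the query rule itself only needs a ranking of the unlabelled instances.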
1236	k-Nearest Neighbour Classification of Datasets with a Family of Distances Hatko, Stan January 2015 The k-nearest neighbour (k-NN) classifier is one of the oldest and most important supervised learning algorithms for classifying datasets. Traditionally the Euclidean norm is used as the distance for the k-NN classifier. In this thesis we investigate the use of alternative distances for the k-NN classifier. We start by introducing some background notions in statistical machine learning. We define the k-NN classifier and discuss Stone's theorem and the proof that k-NN is universally consistent on the normed space R^d. We then prove that k-NN is universally consistent if we take a sequence of random norms (that are independent of the sample and the query) from a family of norms that satisfies a particular boundedness condition. We extend this result by replacing norms with distances based on uniformly locally Lipschitz functions that satisfy certain conditions. We discuss the limitations of Stone's lemma and Stone's theorem, particularly with respect to quasinorms and adaptively choosing a distance for k-NN based on the labelled sample. We show the universal consistency of a two stage k-NN type classifier where we select the distance adaptively based on a split labelled sample and the query. We conclude by giving some examples of improvements of the accuracy of classifying various datasets using the above techniques. Machine Learning k-Nearest Neighbour Classifier Universal Consistency Data Science
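To make the idea of swapping the distance in k-NN concrete, here is a minimal sketch of a k-NN classifier parameterised by a norm. It assumes the l_p family of norms purely for illustration; the specific families, random norm sequences, and adaptive two-stage procedure studied in the thesis are not reproduced, and the function names are hypothetical.

```python
from collections import Counter


def minkowski(x, y, p):
    """l_p distance between two equal-length vectors.

    p = float('inf') gives the max norm; p = 2 gives the Euclidean norm.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def knn_predict(train, query, k, p=2.0):
    """Majority vote among the k training points closest to `query` under l_p.

    `train` is a list of (vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: minkowski(item[0], query, p))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((3.0, 3.0), "b"), ((4.0, 3.0), "b")]
prediction_l1 = knn_predict(train, (0.5, 0.5), k=3, p=1.0)
prediction_linf = knn_predict(train, (3.5, 3.0), k=3, p=float("inf"))
```

Only the `minkowski` call depends on the choice of distance, so any other distance function with the same signature could be substituted without touching the voting step.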
1237	Beyond the Boundaries of SMOTE: A Framework for Manifold-based Synthetic Oversampling Bellinger, Colin January 2016 Within machine learning, the problem of class imbalance refers to the scenario in which one or more classes are significantly outnumbered by the others. In the most extreme case, the minority class is not only significantly outnumbered by the majority class, but is also considered to be rare, or absolutely imbalanced. Class imbalance appears in a wide variety of important domains, ranging from oil spill and fraud detection, to text classification and medical diagnosis. Given this, it has been deemed one of the ten most important research areas in data mining, and for more than a decade now the machine learning community has been coming together in an attempt to unequivocally solve the problem. The fundamental challenge in the induction of a classifier from imbalanced training data is in managing the prediction bias. The current state-of-the-art methods deal with this by readjusting misclassification costs or by applying resampling methods. In cases of absolute imbalance, these methods are insufficient; rather, it has been observed that we need more training examples. The nature of class imbalance, however, dictates that additional examples cannot be acquired, and thus, synthetic oversampling becomes the natural choice. We recognize the importance of selecting algorithms with assumptions and biases that are appropriate for the properties of the target data, and argue that this is of absolute importance when it comes to developing synthetic oversampling methods because a large generative leap must be made from a relatively small training set. In particular, our research into gamma-ray spectral classification has demonstrated the benefits of incorporating prior knowledge of conformance to the manifold assumption into the synthetic oversampling algorithms. 
We empirically demonstrate the negative impact of the manifold property on the state-of-the-art methods, and propose a framework for manifold-based synthetic oversampling. We algorithmically present the generic form of the framework and demonstrate formalizations of it with PCA and the denoising autoencoder. Through use of the helix and swiss roll datasets, which are standards in the manifold learning community, we visualize and qualitatively analyze the benefits of our proposed framework. Moreover, we unequivocally show the framework to be superior on three real-world gamma-ray spectral datasets and on sixteen benchmark UCI datasets in general. Specifically, our results demonstrate that the framework for manifold-based synthetic oversampling produces higher area under the ROC curve than the current state-of-the-art and degrades less on data that conforms to the manifold assumption. machine learning class imbalance synthetic oversampling manifold learning
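For context on the baseline named in the title, the sketch below shows SMOTE-style synthetic oversampling in its simplest form: each synthetic point is an interpolation between a minority instance and one of its nearest minority neighbours in the input space. It is a simplified illustration, not the manifold-based framework proposed in the thesis (which generates samples via a learned representation such as PCA or a denoising autoencoder); the function name and parameters are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic minority points by neighbour interpolation.

    `minority` is a list of at least two equal-length tuples. For each new
    point, a minority instance x and one of its k nearest minority neighbours
    are chosen, and the synthetic point is placed at a random position on the
    straight segment between them.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Candidate neighbours: every other minority instance, nearest first.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Toy usage: three minority points lying on the line y = x.
new_points = smote_like([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)], n_new=4)
```

Because the interpolation follows straight segments in the input space, synthetic points can fall off a curved data manifold, which is the kind of limitation the abstract's manifold-based framework is motivated by.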
1238	Monitoring Tweets for Depression to Detect At-Risk Users Jamil, Zunaira January 2017 According to the World Health Organization, mental health is an integral part of health and well-being. Mental illness can affect anyone, rich or poor, male or female. One example of mental illness is depression. In Canada, 5.3% of the population had experienced a depressive episode in the past 12 months. Depression is difficult to diagnose, resulting in high under-diagnosis. Diagnosing depression is often based on self-reported experiences, behaviors reported by relatives, and a mental status examination. Currently, authorities use surveys and questionnaires to identify individuals who may be at risk of depression. This process is time-consuming and costly. We propose an automated system that can identify at-risk users from their public social media activity. More specifically, we identify at-risk users from Twitter. To achieve this goal we trained a user-level classifier using Support Vector Machine (SVM) that can detect at-risk users with a recall of 0.8750 and a precision of 0.7778. We also trained a tweet-level classifier that predicts if a tweet indicates distress. This task was much more difficult due to the imbalanced data. In the dataset that we labeled, we came across 5% distress tweets and 95% non-distress tweets. To handle this class imbalance, we used undersampling methods. The resulting classifier uses SVM and performs with a recall of 0.8020 and a precision of 0.1237. Our system can be used by authorities to find a focused group of at-risk users. It is not a platform for labeling an individual as a patient with depression, but only a platform for raising an alarm so that the relevant authorities can take the necessary steps to further analyze the predicted user and confirm his/her state of mental health. 
We respect the ethical boundaries relating to the use of social media data and therefore do not use any user identification information in our research. NLP Machine Learning Tweets text mining social media sentiment analysis
1239	Composing Recommendations Using Computer Screen Images: A Deep Learning Recommender System for PC Users Shapiro, Daniel January 2017 A new way to train a virtual assistant with unsupervised learning is presented in this thesis. Rather than integrating with a particular set of programs and interfaces, this new approach involves shallow integration between the virtual assistant and computer through machine vision. In effect the assistant interprets the computer screen in order to produce helpful recommendations to assist the computer user. In developing this new approach, called AVRA, the following methods are described: an unsupervised learning algorithm which enables the system to watch and learn from user behavior, a method for fast filtering of the text displayed on the computer screen, a deep learning classifier used to recognize key onscreen text in the presence of OCR translation errors, and a recommendation filtering algorithm to triage the many possible action recommendations. AVRA is compared to a similar commercial state-of-the-art system, to highlight how this work adds to the state of the art. AVRA is a deep learning image processing and recommender system that can collaborate with the computer user to accomplish various tasks. This document presents a comprehensive overview of the development and possible applications of this novel virtual assistant technology. It detects onscreen tasks based upon the context it perceives by analyzing successive computer screen images with neural networks. AVRA is a recommender system, as it assists the user by producing action recommendations regarding onscreen tasks. In order to simplify the interaction between the user and AVRA, the system was designed to only produce action recommendations that can be accepted with a single mouse click. These action recommendations are produced without integration into each individual application executing on the computer. 
Furthermore, the action recommendations are personalized to the user’s interests utilizing a history of the user’s interaction. Deep Learning Machine Learning Artificial Intelligence Recommender System
1240	Unsupervised Segmentation and Labeling for Smartphone Acquired Gait Data Martinez, Matthew, De Leon, Phillip L. 11 1900 As the population ages, prediction of falls risk is becoming an increasingly important research area. Due to built-in inertial sensors and ubiquity, smartphones provide an attractive data collection and computing platform for falls risk prediction and continuous gait monitoring. One challenge in continuous gait monitoring is that significant signal variability exists between individuals with a high falls risk and those with a low falls risk. This variability increases the difficulty of building a universal system which segments and labels changes in signal state. This paper presents a method which uses unsupervised learning techniques to automatically segment a gait signal by computing the dissimilarity between two consecutive windows of data, applying an adaptive threshold algorithm to detect changes in signal state, and using a rule-based gait recognition algorithm to label the data. Using inertial data, the segmentation algorithm is compared against manually segmented data and is capable of achieving recognition rates greater than 71.8%. falls risk gait accelerometer machine learning smartphone app
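The first two steps of the pipeline described in the abstract above (dissimilarity between consecutive windows, then a data-driven threshold to flag state changes) might look roughly like this. The dissimilarity measure (absolute difference of window means), the threshold rule (mean plus a multiple of the standard deviation), and all names are assumptions made for illustration; the paper's actual measure, adaptive threshold algorithm, and rule-based labelling step are not specified in the abstract and are not reproduced here.

```python
from statistics import mean, pstdev


def window_dissimilarity(signal, size):
    """Dissimilarity between each pair of consecutive, non-overlapping windows.

    Assumed measure: absolute difference of the two window means.
    """
    windows = [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]
    return [abs(mean(b) - mean(a)) for a, b in zip(windows, windows[1:])]


def detect_changes(dissimilarity, scale=1.0):
    """Indices of windows that start a new signal state.

    Assumed rule: a boundary is flagged when its dissimilarity exceeds
    mean + scale * standard deviation of all dissimilarities, so the
    threshold adapts to the overall variability of the recording.
    """
    if not dissimilarity:
        return []
    threshold = mean(dissimilarity) + scale * pstdev(dissimilarity)
    # dissimilarity[i] compares window i with window i + 1,
    # so a flagged boundary means window i + 1 begins a new segment.
    return [i + 1 for i, d in enumerate(dissimilarity) if d > threshold]


# Toy usage: a signal that jumps from one level to another halfway through.
signal = [0.0] * 8 + [5.0] * 8
change_points = detect_changes(window_dissimilarity(signal, size=4))
```

Deriving the threshold from the recording itself, rather than fixing it in advance, is one simple way to cope with the between-subject variability the abstract mentions.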