891

Effect of cognitive biases on human understanding of rule-based machine learning models

Kliegr, Tomas January 2017 (has links)
This thesis investigates to what extent cognitive biases affect human understanding of interpretable machine learning models, in particular of rules discovered from data. Twenty cognitive biases (illusions, effects) are analysed in detail, including identification of possibly effective debiasing techniques that can be adopted by designers of machine learning algorithms and software. This qualitative research is complemented by multiple experiments aimed at verifying whether, and to what extent, selected cognitive biases influence human understanding of actual rule learning results. Two experiments were performed: the first focused on eliciting plausibility judgments for pairs of inductively learned rules; the second replicated the Linda experiment, and two of its modifications, using crowdsourcing. Altogether nearly 3,000 human judgments were collected. We obtained empirical evidence for the insensitivity-to-sample-size effect. There is also limited evidence for the disjunction fallacy, misunderstanding of "and", the weak evidence effect and the availability heuristic. While there seems to be no universal approach for eliminating all the identified cognitive biases, it follows from our analysis that the effect of many biases can be ameliorated by making rule-based models more concise. To this end, in the second part of the thesis we propose a novel machine learning framework which postprocesses rules on the output of the seminal association rule classification algorithm CBA [Liu et al., 1998]. The framework uses the original undiscretized numerical attributes to optimize the discovered association rules, refining the boundaries of literals in the antecedents of the rules produced by CBA. Some rules, as well as literals within rules, can consequently be removed, which makes the resulting classifier smaller. A benchmark of our approach on 22 UCI datasets shows an average 53% decrease in the total size of the model, as measured by the total number of conditions in all rules, while model accuracy remains on the same level as for CBA.
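As an illustration of the boundary-refinement step described above, here is a minimal sketch under stated assumptions: a rule's interval literal over one undiscretized numeric attribute is tightened to whichever pair of observed values maximizes the rule's accuracy on the examples it covers. The `refine_interval` helper, the brute-force search and the accuracy criterion are illustrative stand-ins, not the thesis's actual algorithm.

```python
# Hypothetical sketch: tighten the interval literal of one rule using the
# original (undiscretized) numeric attribute values, in the spirit of
# refining the boundaries of CBA rule antecedents.
import numpy as np

def refine_interval(values, labels, rule_class, lo, hi):
    """Search observed values for the [lo, hi] maximizing accuracy on covered points."""
    best = (lo, hi)
    covered = (values >= lo) & (values <= hi)
    best_acc = (labels[covered] == rule_class).mean() if covered.any() else 0.0
    # candidate boundaries are the observed attribute values themselves
    for new_lo in np.unique(values):
        for new_hi in np.unique(values):
            if new_lo >= new_hi:
                continue
            covered = (values >= new_lo) & (values <= new_hi)
            if not covered.any():
                continue
            acc = (labels[covered] == rule_class).mean()
            if acc > best_acc:
                best_acc, best = acc, (new_lo, new_hi)
    return best, best_acc

# toy data: one numeric attribute with class labels
vals = np.array([1.0, 2.5, 3.0, 4.2, 5.1, 6.0])
labs = np.array(["a", "a", "a", "b", "b", "a"])
print(refine_interval(vals, labs, "a", 0.0, 7.0))  # -> ((1.0, 3.0), 1.0)
```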
892

Maximum margin learning under uncertainty

Tzelepis, Christos January 2018 (has links)
In this thesis we study the problem of learning under uncertainty using the statistical learning paradigm. We first propose a linear maximum margin classifier that deals with uncertainty in data input. More specifically, we reformulate the standard Support Vector Machine (SVM) framework such that each training example can be modeled by a multi-dimensional Gaussian distribution described by its mean vector and its covariance matrix, the latter modeling the uncertainty. We address the classification problem and define a cost function that is the expected value of the classical SVM cost when data samples are drawn from the multi-dimensional Gaussian distributions that form the set of the training examples. Our formulation approximates the classical SVM formulation when the training examples are isotropic Gaussians with variance tending to zero. We arrive at a convex optimization problem, which we solve efficiently in the primal form using a stochastic gradient descent approach. The resulting classifier, which we name SVM with Gaussian Sample Uncertainty (SVM-GSU), is tested on synthetic data and five publicly available and popular datasets; namely, the MNIST, WDBC, DEAP, TV News Channel Commercial Detection, and TRECVID MED datasets. Experimental results verify the effectiveness of the proposed method. Next, we extend the aforementioned linear classifier so as to obtain non-linear decision boundaries, using the RBF kernel. This extension, in which we use isotropic input uncertainty and which we name Kernel SVM with Isotropic Gaussian Sample Uncertainty (KSVM-iGSU), is applied to the problems of video event detection and video aesthetic quality assessment. The experimental results show that exploiting input uncertainty, especially in problems where only a limited number of positive training examples are provided, can lead to better classification, detection, or retrieval performance. Finally, we present a preliminary study on how the above ideas can be used under the deep convolutional neural network learning paradigm so as to exploit inherent sources of uncertainty, such as the spatial pooling operations usually used in deep networks.
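As a worked illustration of the expected-cost construction described above (a standard computation of the expected hinge loss under a Gaussian; the notation is assumed here rather than quoted from the thesis): for a training example modeled as $x \sim \mathcal{N}(\mu_i, \Sigma_i)$ with label $y_i \in \{\pm 1\}$, the signed margin $y_i(w^\top x + b)$ is itself Gaussian with mean $d_i = y_i(w^\top \mu_i + b)$ and standard deviation $\sigma_i = \sqrt{w^\top \Sigma_i w}$, so the expected hinge loss has the closed form

$$ \mathbb{E}\left[\max\left(0,\; 1 - y_i(w^\top x + b)\right)\right] = (1 - d_i)\,\Phi\!\left(\frac{1 - d_i}{\sigma_i}\right) + \sigma_i\,\varphi\!\left(\frac{1 - d_i}{\sigma_i}\right), $$

where $\Phi$ and $\varphi$ are the standard normal CDF and density. As $\Sigma_i \to 0$ this tends to the classical hinge loss, consistent with the limiting behavior noted in the abstract.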
893

Towards the Automatic Classification of Student Answers to Open-ended Questions

Alvarado Mantecon, Jesus Gerardo 24 April 2019 (has links)
One of the main research challenges in the context of Massive Open Online Courses (MOOCs) is the effective automation of the evaluation of text-based assessments. Text-based assessments, such as essay writing, have proven to be better indicators of a higher level of understanding than machine-scored assessments (e.g., multiple-choice questions). Nonetheless, due to the rapid growth of MOOCs, text-based evaluation has become a difficult task for human markers, creating the need for automated grading systems. In this thesis, we focus on the automated short answer grading (ASAG) task, which automatically classifies natural language answers to open-ended questions as correct or incorrect. We propose an ensemble supervised machine learning approach that relies on two types of classifiers: a response-based classifier, which centers on feature extraction from the available responses, and a reference-based classifier, which considers the relationships between responses, model answers and questions. For each classifier, we explored a set of features based on words and entities. For the response-based classifier, we tested and compared five features: traditional n-gram models; entity URIs (Uniform Resource Identifiers) and entity mentions, both extracted using a semantic annotation API; entity mention embeddings based on GloVe; and entity URI embeddings extracted from Wikipedia. For the reference-based classifier, we explored fourteen features: cosine similarity between sentence embeddings of student answers and model answers; the number of overlapping elements (words, entity URIs, entity mentions) between student answers and model answers or question text; the Jaccard similarity coefficient between student answers and model answers or question text (based on words, entity URIs or entity mentions); and a sentence embedding representation. We evaluated our classifiers on three datasets, two of which belong to the SemEval ASAG competition (Dzikovska et al., 2013). Our results show that, in general, reference-based features perform much better than response-based features in terms of accuracy and macro-average F1-score. Within the reference-based approach, we observe that the S6 embedding representation, which considers the question text, student answer and model answer, generated the best-performing models; nonetheless, their combination with other similarity features helped build more accurate classifiers. As for response-based classifiers, models based on traditional n-gram features remained the best. Finally, we combined our best reference-based and response-based classifiers using an ensemble learning model. Our ensemble classifiers combining both approaches achieved the best results for one of the evaluation datasets, but underperformed on the remaining two. We also compared the best two classifiers with some of the main state-of-the-art results of the SemEval competition. Our final embedded meta-classifier outperformed the top-ranking result on the SemEval Beetle dataset, and our top classifier on SemEval SciEntsBank, trained on reference-based features, obtained second position. In conclusion, the reference-based approach, powered mainly by sentence-level embeddings and other similarity features, proved to generate the most efficient models on two out of three datasets, and the ensemble model was the best on the SemEval Beetle dataset.
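To make two of the reference-based features above concrete, here is a hedged sketch: cosine similarity between sentence embeddings and word-level Jaccard overlap between a student answer and a model answer. The random vectors stand in for real sentence embeddings; none of this is the thesis's actual pipeline.

```python
# Hypothetical sketch of two reference-based similarity features.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard(student: str, reference: str) -> float:
    """Word-level Jaccard coefficient between two answers."""
    s, r = set(student.lower().split()), set(reference.lower().split())
    return len(s & r) / len(s | r)

# toy vectors standing in for sentence embeddings of the two answers
rng = np.random.default_rng(0)
student_vec, model_vec = rng.random(50), rng.random(50)

features = {
    "cos_sim": cosine_similarity(student_vec, model_vec),
    "jaccard_words": jaccard("gravity pulls objects down",
                             "objects fall because gravity pulls them down"),
}
print(features)
```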
894

Machine learning and statistical analysis of complex mathematical models: an application to epilepsy

Ferrat, L. January 2019 (has links)
The electroencephalogram (EEG) is a commonly used tool for studying the emergent electrical rhythms of the brain. It has wide utility in psychology, as well as providing a useful diagnostic aid for neurological conditions such as epilepsy. It is of growing importance to better understand the emergence of these electrical rhythms and, in the case of diagnosis of neurological conditions, to find mechanistic differences between healthy individuals and those with a disease. Mathematical models are an important tool that offer the potential to reveal these otherwise hidden mechanisms. In particular, Neural Mass Models (NMMs), which describe the macroscopic activity of large populations of neurons, are increasingly used to uncover large-scale mechanisms of brain rhythms in both health and disease. The dynamics of these models depend upon the choice of parameters, and therefore it is crucial to understand how the dynamics change when parameters are varied. Although NMMs are considered low-dimensional in comparison to micro-scale neural network models, with regard to understanding the relationship between parameters and dynamics they are still prohibitively high-dimensional for classical approaches such as numerical continuation. We need alternative methods to characterise the dynamics of NMMs in high-dimensional parameter spaces. The primary aim of this thesis is to develop a method to explore and analyse the high-dimensional parameter space of these mathematical models. We develop an approach based on statistics and machine learning methods called decision tree mapping (DTM). This method analyses the parameter space of a mathematical model by studying all the parameters simultaneously; with this approach, the parameter space can be mapped efficiently in high dimension. We use measures linked with this method to determine which parameters play a key role in the output of the model. The approach recursively splits the parameter space into smaller subspaces with increasing homogeneity of dynamics. The concepts of decision tree learning, random forests, measures of importance, statistical tests and visual tools are introduced to explore and analyse the parameter space, and we formally introduce the theoretical background and the methods with examples. The DTM approach is used in three distinct studies to:
• identify the role of parameters in the model dynamics — for example, which parameters have a role in the emergence of seizure dynamics?
• constrain the parameter space, such that regions which give implausible dynamics are removed;
• compare parameter sets fitted to different groups — for example, how does the thalamocortical connectivity of people with and without epilepsy differ?
We demonstrate that classical studies have not taken into account the complexity of the parameter space. DTM can easily be extended to other fields using mathematical models, and we advocate its use in the future to constrain high-dimensional parameter spaces in order to enable more efficient, person-specific model calibration.
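A minimal sketch of the DTM idea under stated assumptions: sample parameter vectors, simulate the model and label its dynamics (the `simulate_dynamics` stand-in below replaces a real NMM simulation), then let a decision tree recursively split the parameter space into homogeneous subspaces and read off per-parameter importances. This is illustrative only, not the thesis's implementation.

```python
# Hypothetical sketch: map a model's parameter space with a decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def simulate_dynamics(theta):
    # Stand-in for simulating an NMM and classifying its output;
    # here, "seizure-like" when two parameters interact strongly.
    return int(theta[0] * theta[1] > 0.25)

# sample the high-dimensional parameter space
n_params, n_samples = 6, 5000
Theta = rng.uniform(0, 1, size=(n_samples, n_params))
labels = np.array([simulate_dynamics(t) for t in Theta])

# the tree recursively splits the space into subspaces of homogeneous dynamics
tree = DecisionTreeClassifier(max_depth=4).fit(Theta, labels)
print("importance per parameter:", tree.feature_importances_.round(3))
```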
895

Geometric modeling with primitives

Angles, Baptiste 29 April 2019 (has links)
Both man-made and natural objects contain repeated geometric elements that can be interpreted as primitive shapes. Plants, trees, living organisms and even crystals showcase primitives that repeat themselves. Primitives are also commonly found in man-made environments, because architects tend to reuse the same patterns across a building and typically employ simple shapes, such as rectangular windows and doors. During my PhD I studied geometric primitives from three points of view: their composition, simulation and autonomous discovery. In the first part I present a method to reverse-engineer the function by which some primitives are combined. Our system is based on a composition-function template that is represented by a parametric surface. The parametric surface is deformed via a non-rigid alignment that, once converged, represents the desired operator. This enables the interactive modeling of operators via a simple sketch, solving a major usability gap of composition modeling. In the second part I introduce a novel primitive for real-time physics simulations. This primitive is suitable for efficiently modeling volume-preserving deformations of rods, but also of more complex structures such as muscles. One of the core advantages of our approach is that our primitive can serve as a unified representation for collision detection, simulation and surface skinning. In the third part I present an unsupervised deep learning framework to learn and detect primitives. In a signal containing a repetition of elements, the method is able to automatically identify the structure of these elements (i.e., primitives) with minimal supervision. In order to train the network, which contains a non-differentiable operation, a novel multi-step training process is presented.
896

The emergence of the data science profession

Brandt, Philipp Soeren January 2016 (has links)
This thesis studies the formation of a novel expert role—the data scientist—in order to ask how arcane knowledge becomes publicly salient. This question responds to the two-sided public debate wherein data science is associated with problems such as discriminatory consequences and privacy infringements, but has also become linked with opportunities related to new forms of work. A puzzle also arises, as institutional boundaries have obscured earlier instances of quantitative expertise. Even a broader perspective reveals few expert groups that have gained lay salience on the basis of arcane knowledge, other than lawyers and doctors. This empirical puzzle exposes a gap in the literature between two main lines of argument. An institutionalist view has developed ways of understanding expert work with respect to formal features such as licensing, associations and training. A constructivist view identifies limitations in those arguments, highlighting their failure to explain many instances in which arcane knowledge emerges through informal processes, including the integration of lay knowledge through direct collaboration. Consistent with this critique, data nerds largely define their work on an informal basis. Yet they also draw heavily on a formalized stock of knowledge. In order to reconcile the two sides, this thesis proposes viewing data science as an emerging "thought community." Such a perspective leads to an analytical strategy that scrutinizes the contours that emerge as data nerds define arcane expertise as theirs. The analysis unfolds across three empirical settings that complement each other. The first setting considers data nerds as they define their expertise in the context of public events in New York City's technology scene. This part draws on observations beginning in 2012, shortly after data science's first lay recognition, and covers three years of its early emergence. Two further studies comparatively test whether, and in what ways, contours of data science's abstract knowledge are associated with its lay salience. They respectively consider economic and academic settings, which are most relevant to the data nerds in part one. Both studies leverage specifically designed quantitative datasets consisting of traces of lay knowledge recognition and arcane knowledge construction. Together the three studies reveal distinctive contours of data science. The main argument that follows suggests that data science gains lay salience because it relies on informal practices for recombining formal principles of knowledge construction and application, in a collective effort. Data nerds define their thought community on the basis of illustrative and persuasive tactics that combine formal ideas with informal interpretations. This form of improvisation leads data nerds to connect diverse substantive problems through an array of formal representations. They thereby undermine the bureaucratic control that otherwise defines tasks in the contexts where data scientists mostly apply their arcane knowledge. Despite its name and arcane content, moreover, data science differs from scientific principles of knowledge construction. The main contribution of this thesis is a first detailed and multifaceted analysis of data science. The results of this study address the main public concerns. This thesis demonstrates that data science creates new opportunities for work, provided that data nerds are willing to embrace the uncertainty associated with a formally undefined area of problems.
The first perspective, focusing on community identification principles, furthermore makes it possible to identify new forms of work in the ongoing technological transformation of which data science is part. At the same time, the main argument also gives reason for concern, precisely because data nerds often operate on an individually anonymous basis, despite their association with formal organizations. It has remained unclear how to address the social consequences of their work because data nerds undermine those conventional forms of control and oversight. The findings of this thesis suggest that although data nerds depart from scientific principles for identifying relevant problems, they coordinate those deviant activities through forms of discipline that qualitatively resemble those common in academic fields. Data nerds define their knowledge as a community. It follows that embedding public concerns in data science's disciplinary forms of coordination, and enhancing those forms, offers the most effective mechanism for preserving the utility of data science applications while limiting their potentially harmful consequences. Finally, conceptual and methodological contributions follow as well. The focus on thought communities reveals new leverage for understanding social processes that unfold as a combination of informal activities in local settings and institutional dynamics that are largely removed from individual actors. This problem is common to many instances of skilled work. This additional leverage is the result of an integrated methodological design that relies as much on qualitative observations as on formal analyses. As part of this integration, the thesis directly encodes phenomenologically salient contours into a quantitative design, effectively leading to an analysis of data science through data science.
897

ML4JIT - a framework for research on machine learning in JIT compilers

Mignon, Alexandre dos Santos 27 June 2017 (has links)
Determining the best set of optimizations to apply to a program has been the focus of research on compiler optimization for decades. In general, the set of optimizations is manually defined by compiler developers and applied to all programs. Supervised machine learning techniques have been used to develop code optimization heuristics; they aim to determine the best set of optimizations with minimal human intervention. This work presents ML4JIT, a framework for research on machine learning in JIT compilers for the Java language. The framework allows research on finding a better tuning of the optimizations specific to each method of a program. Experiments were performed to validate the framework, with the objective of verifying whether its use reduced both the compilation time of the methods and the execution time of the program.
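A hedged sketch of the per-method tuning idea: learn, from simple method features, which optimization set a JIT compiler should apply to a given method. The features, labels and classifier below are illustrative assumptions, not ML4JIT's actual design.

```python
# Hypothetical sketch: predict a per-method optimization set from simple
# method features (bytecode size, loop count, call count).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# rows: [bytecode_size, n_loops, n_calls]; labels: index of an optimization set
X = np.array([[120, 0, 3], [900, 4, 10], [60, 1, 0], [1500, 6, 22]])
y = np.array([0, 2, 1, 2])  # e.g. 0 = minimal, 1 = default, 2 = aggressive

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.predict([[800, 3, 8]]))  # optimization set chosen for a new method
```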
898

Statistical Learning Methods for Personalized Medicine

Qiu, Xin January 2018 (has links)
The theme of this dissertation is to develop simple and interpretable individualized treatment rules (ITRs) using statistical learning methods to assist personalized decision making in clinical practice. Considerable heterogeneity in treatment response is observed among individuals with mental disorders. Administering an individualized treatment rule according to patient-specific characteristics offers an opportunity to tailor treatment strategies to improve response. Black-box machine learning methods for estimating ITRs may produce treatment rules that have optimal benefit but lack transparency and interpretability. Barriers to implementing personalized treatments in clinical psychiatry include a lack of evidence-based, clinically interpretable, individualized treatment rules; a lack of diagnostic measures to evaluate candidate ITRs; a lack of power to detect treatment modifiers from a single study; and a lack of reproducibility of treatment rules estimated from single studies. This dissertation contains three parts to tackle these barriers: (1) methods to estimate the best linear ITR with guaranteed performance among the class of linear rules; (2) a tree-based method to improve the performance of a linear ITR fitted from the overall sample and identify subgroups with a large benefit; and (3) an integrative learning method combining information across trials to provide an integrative ITR with improved efficiency and reproducibility. In the first part of the dissertation, we propose a machine learning method to estimate optimal linear individualized treatment rules for data collected from single-stage randomized controlled trials (RCTs). In clinical practice, an informative and practically useful treatment rule should be simple and transparent. However, because simple rules are likely to be far from optimal, effective methods to construct such rules must guarantee performance, in terms of yielding the best clinical outcome (highest reward) among the class of simple rules under consideration. Furthermore, it is important to evaluate the benefit of the derived rules on the whole sample and in pre-specified subgroups (e.g., vulnerable patients). To achieve both goals, we propose a robust machine learning algorithm replacing the zero-one loss with an approximation loss (the ramp loss) for value maximization, referred to as the asymptotically best linear O-learning (ABLO), which estimates a linear treatment rule that is guaranteed to achieve optimal reward among the class of all linear rules. We then develop a diagnostic measure and inference procedure to evaluate the benefit of the obtained rule and compare it with the rules estimated by other methods. We provide theoretical justification for the proposed method and its inference procedure, and we demonstrate via simulations its superior performance when compared to existing methods. Lastly, we apply the proposed method to the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial on major depressive disorder (MDD) and show that the estimated optimal linear rule provides a large benefit for mildly depressed and severely depressed patients but manifests a lack of fit for moderately depressed patients. The second part of the dissertation is motivated by the results of the real-data analysis in the first part, where the global linear rule estimated by ABLO from the overall sample performs inadequately on the subgroup of moderately depressed patients.
Therefore, we aim to derive a simple and interpretable piece-wise linear ITR that maintains certain optimality and leads to improved benefit in subgroups of patients as well as in the overall sample. In this work, we propose a tree-based robust learning method to estimate optimal piece-wise linear ITRs and identify subgroups of patients with a large benefit. We achieve these goals by simultaneously identifying qualitative and quantitative interactions through a tree model, referred to as the composite interaction tree (CITree). We show via extensive simulation studies that it has improved performance compared to existing methods on both the overall sample and subgroups. Lastly, we fit CITree to the Research Evaluating the Value of Augmenting Medication with Psychotherapy (REVAMP) trial for treating major depressive disorder, where we identified both qualitative and quantitative interactions and subgroups of patients with a large benefit. The third part deals with the low power for identifying and replicating ITRs due to the small sample sizes of single randomized controlled trials. In this work, a novel integrative learning method is developed to synthesize evidence across trials and provide an integrative ITR that improves efficiency and reproducibility. Our method does not require all studies to collect a common set of variables, and thus allows information to be combined from ITRs identified from randomized controlled trials with heterogeneous sets of baseline covariates collected from different domains at different resolutions. Depending on the research goal, the integrative learning can be used to enhance a high-resolution ITR by borrowing information from coarsened ITRs, or to improve a coarsened ITR using a high-resolution ITR. With a simple modification, the proposed integrative learning can also be applied to improve the estimation of ITRs for studies with blockwise missing feature variables. We conduct extensive simulation studies to show that our method has improved performance compared to existing methods in which only single-trial ITRs are used to learn personalized treatment rules. Lastly, we apply the proposed method to RCTs of major depressive disorder and other comorbid mental disorders. We found that by combining information from two studies, the integrated ITR has a greater benefit and improved efficiency compared to single-trial rules or a universal, non-personalized treatment rule.
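As a hedged illustration of the ramp-loss surrogate mentioned above (standard definitions; the notation is assumed here, not taken from the dissertation): the discontinuous zero-one loss $\mathbf{1}[z \le 0]$ is replaced by the bounded, continuous ramp loss

$$ R(z) \;=\; \min\bigl(1,\ \max(0,\, 1 - z)\bigr) \;=\; \max(0,\, 1-z) \;-\; \max(0,\, -z), $$

which caps the hinge loss at 1 so that single observations cannot dominate the value estimate; the difference-of-convex form on the right is what typically makes optimization tractable in ramp-loss-based value maximization.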
899

Statistical Machine Learning Methods for the Large Scale Analysis of Neural Data

Mena, Gonzalo Esteban January 2018 (has links)
Modern neurotechnologies enable the recording of neural activity at the scale of entire brains and with single-cell resolution. However, the lack of principled approaches to extract structure from these massive data streams prevents us from fully exploiting the potential of these technologies. This thesis, divided into three parts, introduces new statistical machine learning methods to enable the large-scale analysis of some of these complex neural datasets. In the first part, I present a method that leverages Gaussian quadrature to accelerate inference of neural encoding models from a certain type of observed neural point process (spike trains), resulting in substantial improvements over existing methods. The second part focuses on the simultaneous electrical stimulation and recording of neurons using large electrode arrays. There, identification of neural activity is hindered by stimulation artifacts that are much larger than spikes and overlap temporally with them. To surmount this challenge, I develop an algorithm to infer and cancel this artifact, enabling inference of the neural signal of interest. This algorithm is based on a Bayesian generative model for recordings, where a structured Gaussian process is used to represent prior knowledge of the artifact. The algorithm achieves near-perfect accuracy and enables the analysis of data hundreds of times faster than previous approaches. The third part is motivated by the problem of inference of neural dynamics in the worm C. elegans: when taking a data-driven approach to this question, e.g., when using whole-brain calcium imaging data, one is faced with the need to match neural recordings to canonical neural identities, in practice resolved by tedious human labor. Alternatively, in a Bayesian setup this problem may be cast as posterior inference of a latent permutation. I introduce methods that enable gradient-based approximate posterior inference of permutations, overcoming the difficulties imposed by the combinatorial and discrete nature of this object. Results suggest the feasibility of automating neural identification, and demonstrate that variational inference over permutations is a sensible alternative to MCMC.
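A minimal sketch of the relaxation idea behind gradient-based inference over permutations (a standard Sinkhorn-style construction is assumed here; the thesis's exact algorithm may differ): iterated row and column normalization turns a matrix of scores into a doubly stochastic matrix, which approaches a hard permutation as a temperature parameter decreases.

```python
# Hypothetical sketch: relax a permutation with the Sinkhorn operator so it
# can be optimized by gradient descent.
import numpy as np

def sinkhorn(log_scores, n_iters=50):
    """Iterative row/column normalization in log space -> doubly stochastic matrix."""
    log_p = log_scores.copy()
    for _ in range(n_iters):
        log_p -= np.logaddexp.reduce(log_p, axis=1, keepdims=True)  # rows sum to 1
        log_p -= np.logaddexp.reduce(log_p, axis=0, keepdims=True)  # cols sum to 1
    return np.exp(log_p)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
temperature = 0.1  # lower temperature -> closer to a hard permutation
P = sinkhorn(scores / temperature)
print(P.round(2))
print(P.sum(axis=0).round(2), P.sum(axis=1).round(2))  # both ~[1 1 1 1]
```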
900

Genomic and machine-learning analysis of germline variants in cancer

Madubata, Chioma January 2018 (has links)
Cancer often develops from specific DNA alterations, and these cancer-associated mutations influence precision cancer treatment. These alterations can be specific to the tumor DNA (somatic mutations) or they can be heritable and present in normal and tumor DNA (germline mutations). Germline variants can affect how patients respond to therapy and can influence clinical surveillance of patients and their families. While identifying cancer-associated germline variants traditionally required studying families with inherited cancer predispositions, large-scale cancer sequencing cohorts enable alternative analyses of germline variants. In this dissertation, we develop and apply multiple strategies for analyzing germline DNA from cancer sequencing cohorts. First, we develop the Tumor-Only Boosting Identification framework (TOBI) to learn biological features of true somatic mutations and generate a classification model that identifies DNA variants with somatic characteristics. TOBI has high sensitivity in identifying true somatic variants across several cancer types, particularly in known driver genes. After predicting somatic variants with TOBI, we assess the identified somatic-like germline variants for known oncogenic germline variants and enrichment in biological pathways. We find germline and somatic variants inactivating the Fanconi anemia pathway in 11% of patients with bladder cancer. Finally, we investigate germline, diagnosis, and relapse variants in a large cohort of patients with pediatric acute lymphoblastic leukemia (ALL). Our somatic analysis captures known ALL driver genes, and we describe the sequential order of diagnosis and relapse mutations, including late events in NT5C2. We apply both the TOBI framework and the guidelines of the American College of Medical Genetics and Genomics to identify potentially cancer-associated germline variants, and nominate nonsynonymous variants in TERT and ATM.
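A hedged sketch of the classification step described above, assuming variant-level features such as allele fraction, read depth and population-database membership feed a boosting model that scores variants as somatic-like; the features, the toy labeling rule and the classifier are illustrative stand-ins, not TOBI's actual feature set.

```python
# Hypothetical sketch: a boosting classifier that flags variants with somatic
# characteristics from tumor-only calls.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
# columns: [variant allele fraction, read depth, in population database (0/1)]
X = rng.random((500, 3))
X[:, 1] *= 200            # read depths up to ~200x
X[:, 2] = X[:, 2] > 0.5   # database membership flag
# toy labeling rule: low-frequency variants absent from databases look somatic
y = ((X[:, 0] < 0.35) & (X[:, 2] == 0)).astype(int)

clf = GradientBoostingClassifier().fit(X, y)
print("P(somatic-like):", clf.predict_proba([[0.2, 80, 0]])[0, 1].round(3))
```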
