211

Advances in kernel methods : towards general-purpose and scalable models

Samo, Yves-Laurent Kom January 2017 (has links)
A wide range of statistical and machine learning problems involve learning one or multiple latent functions, or properties thereof, from datasets. Examples include regression, classification, principal component analysis, optimisation, learning intensity functions of point processes, and reinforcement learning, to name but a few. For all these problems, positive semi-definite kernels (or simply kernels) provide a powerful tool for postulating flexible nonparametric hypothesis spaces over functions. Despite recent work on such kernel methods, parametric alternatives, such as deep neural networks, have been at the core of most artificial intelligence breakthroughs in recent years. In this thesis, both theoretical and methodological foundations are presented for constructing fully automated, scalable, and general-purpose kernel machines that perform very well over a wide range of input dimensions and sample sizes. The thesis aims to contribute towards bridging the gap between kernel methods and deep learning, and to propose methods that, unlike deep learning, perform well on both small-scale and large-scale problems. In Part I we provide a gentle introduction to kernel methods, review recent work, identify remaining gaps, and outline our contributions. In Part II we develop flexible and scalable Bayesian kernel methods to address gaps in methods capable of dealing with the special case of datasets exhibiting locally homogeneous patterns. We begin with two motivating applications. First, in Chapter 2, we consider inferring the intensity function of an inhomogeneous point process. This application illustrates that, by carefully adding some mild asymmetry to the dependency structure in Bayesian kernel methods, one may often considerably scale up inference while improving flexibility and accuracy. In Chapter 3 we propose a scalable scheme for online forecasting of time series and fully online learning of related model parameters, under a kernel-based generative model that is provably sufficiently flexible. This application illustrates that, for one-dimensional input spaces, restricting the degree of differentiability of the latent function of interest may considerably speed up inference without resorting to approximations and without any adverse effect on flexibility or accuracy. Chapter 4 generalizes these approaches and proposes a novel class of stochastic processes, which we refer to as string Gaussian processes (string GPs), that, when used as a functional prior in a Bayesian nonparametric framework, allow for inference with linear time complexity and linear memory requirements, without resorting to approximations. More importantly, the corresponding inference scheme, which we derive in Chapter 5, also allows flexible learning of locally homogeneous patterns and automated learning of model complexity: that is, automated learning of whether there are local patterns in the data in the first place, how pronounced those patterns are, and where they are located. In Part III we provide a broader discussion covering all types of patterns (homogeneous, locally homogeneous, and heterogeneous) and both Bayesian and frequentist kernel methods. In Chapter 6 we begin by discussing what properties a family of kernels should possess to enable fully automated kernel methods applicable to any type of dataset. In this chapter we introduce a novel mathematical formalism for the notion of 'general-purpose' families of kernels, and we argue that existing families of kernels are not general-purpose. In Chapter 7 we derive weak sufficient conditions for families of kernels to be general-purpose, and we exhibit tractable families satisfying these conditions that enjoy a suitable parametrisation, which we refer to as generalized spectral kernels (GSKs). In Chapter 8 we provide a scalable inference scheme for automated kernel learning using general-purpose families of kernels. The proposed inference scheme scales linearly with the sample size and enables automated learning of nonstationarity and model complexity from the data in virtually any kernel method. Finally, we conclude in Chapter 9 with a discussion in which we show that deep learning can be regarded as a particular type of kernel learning method, and we discuss possible extensions in Chapter 10.
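[Editorial sketch, not drawn from the thesis: the baseline that string GPs improve upon is exact Gaussian process regression with a standard positive semi-definite kernel, whose cubic-time Cholesky step is precisely the scaling bottleneck the abstract refers to. All names and values below are illustrative.]

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel: a canonical positive semi-definite kernel."""
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Exact GP regression posterior mean/variance; O(n^3) time, O(n^2) memory."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    L = np.linalg.cholesky(K)                      # the cubic-cost step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = rbf_kernel(X_test, X_test) - v.T @ v
    return mean, np.diag(var)

X = np.linspace(0, 1, 50)[:, None]
y = np.sin(6 * X[:, 0]) + 0.1 * np.random.randn(50)
mu, s2 = gp_posterior(X, y, np.linspace(0, 1, 100)[:, None])
```

The string GP construction described in the abstract avoids this global factorisation, which is how it achieves linear time and memory without approximations.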
212

Non-parametric Bayesian models for structured output prediction

Bratières, Sébastien January 2018 (has links)
Structured output prediction is a machine learning task in which an input object is assigned not just a single class, as in classification, but multiple, interdependent labels. This means that the presence or value of a given label affects the other labels, as in text labelling problems, where output labels are applied to each word and their interdependencies must be modelled. Non-parametric Bayesian (NPB) techniques are probabilistic modelling techniques with the attractive property of allowing model capacity to grow, in a controllable way, with data complexity, while maintaining the advantages of Bayesian modelling. In this thesis, we develop NPB algorithms to solve structured output problems. We first study a map-reduce implementation of a stochastic inference method designed for the infinite hidden Markov model, applied to a computational linguistics task, part-of-speech tagging. We show that mainstream map-reduce frameworks do not easily support highly iterative algorithms. The main contribution of this thesis is a conceptually novel discriminative model, GPstruct. It is motivated by labelling tasks, and combines attractive properties of conditional random fields (CRFs), structured support vector machines, and Gaussian process (GP) classifiers. In probabilistic terms, GPstruct combines a CRF likelihood with a GP prior on factors; it can also be described as a Bayesian kernelized CRF. To train this model, we develop a Markov chain Monte Carlo algorithm based on elliptical slice sampling and investigate its properties. We then validate it in experiments on real data, exploring two topologies: sequence output, with text labelling tasks, and grid output, with semantic segmentation of images. The latter case poses scalability issues, which are addressed using likelihood approximations and an ensemble method that allows distributed inference and prediction. The experimental validation demonstrates that: (a) the model is flexible and its constituent parts are modular and easy to engineer; (b) predictive performance and, most crucially, the probabilistic calibration of predictions are better than or equal to those of competitor models; and (c) model hyperparameters can be learnt from data.
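[Editorial sketch: elliptical slice sampling (Murray, Adams and MacKay, 2010), the MCMC building block named above, can be written in a few lines. The Cholesky factor of the prior covariance and the log-likelihood here are placeholders, not GPstruct's CRF likelihood.]

```python
import numpy as np

def elliptical_slice(f, prior_L, log_lik):
    """One elliptical slice sampling step for f ~ N(0, Sigma), Sigma = L L^T.

    Proposes points on the ellipse through the current state f and a fresh
    prior draw nu; the bracket shrinks until a point above the slice height
    is found, so the step always accepts and has no step size to tune.
    """
    n = len(f)
    nu = prior_L @ np.random.randn(n)              # auxiliary draw from the prior
    log_y = log_lik(f) + np.log(np.random.rand())  # slice height
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_y:
            return f_new
        # shrink the bracket towards the current state and retry
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = np.random.uniform(lo, hi)
```

Because the proposal mixes the current state with a prior sample, the GP prior is respected exactly and only the likelihood enters the accept test, which is what makes the method a natural fit for GP-prior models such as GPstruct.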
213

Performance modelling and optimization for video-analytic algorithms in a cloud-like environment using machine learning

Al-Rawahi, Manal N. K. January 2016 (has links)
CCTV cameras produce a large amount of video surveillance data per day, and analysing it requires significant computing resources that often need to be scalable. The emergence of the Hadoop distributed processing framework has had a significant impact on various data-intensive applications, as distributed processing increases the processing capability of the applications it serves. Hadoop is an open-source implementation of the MapReduce programming model: it automates the creation of tasks for each function, distributes data, parallelises execution and handles machine failures, relieving users of the complexity of managing the underlying processing so that they can focus on building their applications. In a practical deployment, the challenge of a Hadoop-based architecture is that it requires several scalable machines for effective processing, which in turn adds hardware investment cost to the infrastructure. Although a cloud infrastructure offers scalable and elastic utilisation of resources, where users can scale the number of Virtual Machines (VMs) up or down as required, a user such as a CCTV system operator intending to use a public cloud would aspire to know what cloud resources (i.e. the number of VMs) need to be deployed so that the processing can be done in the fastest (or within a known time constraint) and most cost-effective manner. Often such resources will also have to satisfy practical, procedural and legal requirements. The capability to model a distributed processing architecture in which resource requirements can be effectively and optimally predicted would thus be a useful tool. The literature offers no clear and comprehensive modelling framework that provides proactive resource allocation mechanisms to satisfy a user's target requirements, especially for a processing-intensive application such as video analytics. In this thesis, with the aim of closing the above research gap, novel research is first initiated by understanding the current legal practices and requirements of implementing a video surveillance system within a distributed processing and data storage environment, since the legal validity of data gathered or processed within such a system is vital for its applicability in such domains. Subsequently, the thesis presents a comprehensive framework for the performance modelling and optimisation of resource allocation when deploying a scalable distributed video analytics application in a Hadoop-based framework running on a virtualised cluster of machines. The proposed modelling framework investigates the use of several machine learning algorithms, such as decision trees (M5P, RepTree), linear regression, the Multi-Layer Perceptron (MLP) and the ensemble bagging classifier, to model and predict the execution time of video analytics jobs based on infrastructure-level as well as job-level parameters. Further, in order to allocate resources under constraints and obtain optimal performance in terms of job execution time, we propose a Genetic Algorithm (GA) based optimisation technique. Experimental results demonstrate the proposed framework's capability to predict the job execution time of a given video analytics task from infrastructure and input-data-related parameters, and its ability to determine the minimum job execution time given constraints on these parameters.
Given the above, the thesis contributes to the state of the art in distributed video analytics design, implementation, performance analysis and optimisation.
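[Editorial sketch of how a learned execution-time predictor couples to a GA, as described above. The predictor below is a stand-in for any trained regressor (an M5P, MLP or bagging model would take its place), and the single-gene encoding, parameter ranges and GA operators are all illustrative.]

```python
import random

def predict_exec_time(num_vms, input_size_gb):
    """Stand-in for a trained regression model mapping (infrastructure, job)
    parameters to predicted job execution time in seconds."""
    return 50.0 + 3600.0 * input_size_gb / num_vms

def ga_minimise(input_size_gb, max_vms=32, pop=20, gens=50, mut=0.2):
    """Toy GA over a single gene (the VM count) minimising predicted time."""
    population = [random.randint(1, max_vms) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda v: predict_exec_time(v, input_size_gb))
        parents = population[: pop // 2]             # selection: keep the fittest half
        children = []
        for _ in range(pop - len(parents)):
            a, b = random.sample(parents, 2)
            child = (a + b) // 2                     # crossover: average the parents
            if random.random() < mut:                # mutation: perturb by +/- 1 VM
                child = min(max_vms, max(1, child + random.choice((-1, 1))))
            children.append(child)
        population = parents + children
    return min(population, key=lambda v: predict_exec_time(v, input_size_gb))

print(ga_minimise(input_size_gb=40))
```

A real deployment would encode several genes (VM count, VM size, block size, job parameters) and add constraint handling, but the predict-then-search structure is the same.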
214

Classification of plants in corn fields using machine learning techniques

Dhodda, Pruthvidhar Reddy January 1900 (has links)
Master of Science / Department of Computer Science / William H. Hsu / This thesis addresses the tasks of detecting vegetation and classifying plants into target crops and weeds using combinations of machine learning and pattern recognition algorithms and models. Solutions to these problems have many useful applications in precision agriculture, such as estimating the yield of a target crop or identifying weeds to help automate the selective application of weedicides, thereby reducing cost and pollution. The novel contribution of this work includes the development and application of image processing and computer vision techniques to create training data with minimal human intervention, saving substantial human time and effort. All of the data used in this work was collected from corn fields and is in RGB format. I first discuss the steps of a general methodology and data science pipeline for these tasks: vegetation detection, feature engineering, crop row detection, training data generation, training, and testing. Next, I develop software components for the segmentation and classification subtasks based on existing image processing and machine learning algorithms. I then present a comparison of different classifier models developed through this process using their Receiver Operating Characteristic (ROC) curves; the difference between the models lies in the way they are trained, locally or globally. I also investigate the effect of the altitude at which data is collected on the performance of classifiers. Scikit-learn, a Python library for machine learning, is used to train decision trees and other classification models. Finally, I compare the precision, recall, and accuracy attained by segmenting (recognizing the boundary of) plants using the excess green index (ExG) with that of a learned Gaussian mixture model. All image processing tasks were performed using OpenCV, an open source computer vision library.
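[Editorial sketch: the excess green index used for segmentation above is the standard chromaticity-based formula ExG = 2g - r - b. A minimal OpenCV version of threshold-based vegetation masking follows; the file name and threshold are illustrative, and the thesis's learned Gaussian mixture alternative is not shown.]

```python
import cv2
import numpy as np

def exg_mask(bgr_image, threshold=0.1):
    """Segment vegetation via the excess green index ExG = 2g - r - b,
    computed on chromaticity-normalised channels."""
    img = bgr_image.astype(np.float64)
    b, g, r = cv2.split(img)
    total = b + g + r + 1e-9                 # avoid division by zero
    bn, gn, rn = b / total, g / total, r / total
    exg = 2.0 * gn - rn - bn
    return (exg > threshold).astype(np.uint8) * 255

image = cv2.imread("corn_field.jpg")         # illustrative path
if image is not None:
    mask = exg_mask(image)
    cv2.imwrite("vegetation_mask.png", mask)
```

Pixels passing the threshold are labelled vegetation; downstream steps such as crop row detection and crop/weed classification operate on this mask.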
215

Computational Natural Language Inference: Robust and Interpretable Question Answering

Sharp, Rebecca January 2017 (has links)
We address the challenging task of computational natural language inference, by which we mean bridging two or more natural language texts while also providing an explanation of how they are connected. In the context of question answering (i.e., finding short answers to natural language questions), this inference connects the question with its answer, and we learn to approximate it with machine learning. In particular, we present four approaches to question answering, each of which shows a significant improvement in performance over baseline methods. In our first approach, we make use of the discourse structure inherent in free text (i.e., whether the text contains an explanation, elaboration, contrast, etc.) to increase the amount of training data for, and subsequently the performance of, a monolingual alignment model. In our second approach, we propose a framework for training customized lexical semantics models such that each one represents a single semantic relation. We use causality as a use case, and demonstrate that our customized model is able both to identify causal relations and to significantly improve our ability to answer causal questions. We then propose two approaches that seek to answer questions by learning to rank human-readable justifications for the answers, such that the model selects the answer with the best justification. The first uses a graph-structured representation of the background knowledge and performs information aggregation to construct multi-sentence justifications. The second reduces pre-processing costs by limiting itself to a single sentence and using a neural network to learn a latent representation of the background knowledge. For each of these, we show that in addition to a significant improvement in correctly answering questions, we also outperform a strong baseline in terms of the quality of the answer justifications given.
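[Editorial sketch of the answer-by-justification-ranking idea described above: score every candidate justification and return the answer attached to the best one. The featurizer and weights below are toy stand-ins, not the graph-based or neural representations the dissertation learns.]

```python
import numpy as np

def answer_by_ranking(question, candidates, featurise, w):
    """Pick the answer whose best justification scores highest.

    candidates: list of (answer, [justification, ...]) pairs.
    featurise:  maps (question, justification) to a feature vector.
    w:          ranking weights (learned in the real system; toy here).
    """
    best_answer, best_score = None, -np.inf
    for answer, justifications in candidates:
        for j in justifications:
            score = w @ featurise(question, j)
            if score > best_score:
                best_answer, best_score = answer, score
    return best_answer

def overlap_features(question, justification):
    """Toy features: lexical overlap with the question, and length."""
    q, j = set(question.lower().split()), set(justification.lower().split())
    return np.array([len(q & j), len(j)], dtype=float)

w = np.array([1.0, -0.05])                    # toy weights, not learned ones
cands = [("Paris", ["Paris is the capital of France"]),
         ("Lyon",  ["Lyon is a large city in France"])]
print(answer_by_ranking("What is the capital of France?", cands, overlap_features, w))
```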
216

Supervised machine learning for email thread summarization

Ulrich, Jan 11 1900 (has links)
Email has become a part of most people's lives, and the ever-increasing number of messages people receive can lead to email overload. We attempt to mitigate this problem using email thread summarization. Summaries can be used for more than just replacing an incoming email message: in the business world they can serve as a form of corporate memory, or give a new team member an easy way to catch up on an ongoing conversation. Email threads are of particular interest for summarization because their conversational nature gives them considerable structural redundancy. Our email thread summarization approach uses machine learning to pick which sentences from the email thread to use in the summary. A machine learning summarizer must be trained on previously labeled data, i.e. manually created summaries. Once trained, our summarization algorithm generates summaries that on average contain over 70% of the same sentences as those chosen by human annotators. We show that labeling some key features, such as speech acts, meta sentences, and subjectivity, can improve performance to over 80% weighted recall. To create such email summarization software, an email dataset is needed for training and evaluation. Since email communication is a private matter, it is hard to get access to real emails for research, and these emails must furthermore be annotated with human-generated summaries. As such annotated datasets are rare, we have created one and made it publicly available: the BC3 corpus contains annotations for 40 email threads, including extractive summaries, abstractive summaries with links, and labeled speech acts, meta sentences, and subjective sentences. While previous research has shown that machine learning algorithms are a promising approach to email summarization, there has been no study of the impact of the choice of algorithm. We explore new techniques in email thread summarization using several different kinds of regression, and the results show that the choice of classifier is critical. We also present a novel feature set for email summarization and carry out analysis on two email corpora: the BC3 corpus and the Enron corpus. / Science, Faculty of / Computer Science, Department of / Graduate
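[Editorial sketch of the supervised extractive approach described above, written with scikit-learn (an assumption, not necessarily the thesis's toolkit): train a regressor to score sentences against human-annotated extractive summaries, then keep the top-scoring sentences. The features shown are illustrative; the thesis's feature set (speech acts, meta sentences, subjectivity, etc.) is richer.]

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def sentence_features(sentence, position, thread_length):
    """Illustrative features only: length, relative position, capitalised words."""
    words = sentence.split()
    return [len(words), position / thread_length,
            sum(w[0].isupper() for w in words if w)]

def train_summarizer(threads, gold_scores):
    """threads: list of sentence lists; gold_scores: per-sentence relevance
    labels derived from human-annotated extractive summaries."""
    X, y = [], []
    for sents, scores in zip(threads, gold_scores):
        for i, s in enumerate(sents):
            X.append(sentence_features(s, i, len(sents)))
            y.append(scores[i])
    return RandomForestRegressor(n_estimators=100).fit(X, y)

def summarize(model, sentences, k=3):
    """Score every sentence and return the top k in document order."""
    feats = [sentence_features(s, i, len(sentences))
             for i, s in enumerate(sentences)]
    ranked = np.argsort(model.predict(feats))[::-1][:k]
    return [sentences[i] for i in sorted(ranked)]
```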
217

Information systems for tactical decision making

Fairley, Andrew January 1994 (has links)
No description available.
218

Extending AdaBoost: Varying the Base Learners and Modifying the Weight Calculation

Neves de Souza, Erico January 2014 (has links)
AdaBoost has been considered one of the best classifiers ever developed, but two important problems have not yet been addressed. The first is the dependency on the "weak" learner, and the second is the failure to maintain the performance of learners with small error rates (i.e. "strong" learners). To solve the first problem, this work proposes using a different learner in each iteration, an approach known as AdaBoost Dynamic (AD), thereby ensuring that the performance of the algorithm is almost equal to that of the best "weak" learner executed with AdaBoost.M1. The work then further modifies the procedure to vary the learner in each iteration so as to locate the learner with the smallest error rate on its training data, using the same weight calculation as the original AdaBoost; this version is known as AdaBoost Dynamic with Exponential Loss (AB-EL). The results were poor because AdaBoost does not perform well with strong learners, so in this sense the work confirmed previous results. To improve performance, the weight calculation is then modified to use the sigmoid function, with the algorithm output being the derivative of the same sigmoid function, rather than the logistic regression weight calculation originally used by AdaBoost; this version is known as AdaBoost Dynamic with Logistic Loss (AB-DL). This work presents a convergence proof that the binomial weight calculation works, and shows that this approach improves the results for strong learners, both theoretically and empirically. AB-DL also has some disadvantages, such as the cost of searching for the "best" classifier, a search that also reduces the diversity among the classifiers. To attack these issues, another algorithm is proposed that combines AD's "weak" learner execution policy with a small modification of AB-DL's weight calculation, called AdaBoost Dynamic with Added Cost (AD-AC). AD-AC also has a theoretical upper bound on error, and the algorithm offers a small accuracy improvement over AB-DL and traditional AdaBoost approaches. Lastly, this work adapts AD-AC's weight calculation to the data stream setting, where classifiers must deal with very large data sets (on the order of millions of instances) and limited memory.
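[Editorial sketch of the AD idea of varying the base learner per iteration, grafted onto the standard AdaBoost.M1 update; the sigmoid-based AB-DL and AD-AC weight calculations are not reproduced here. Labels are assumed to be in {-1, +1} and the learner pool is illustrative.]

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def adaboost_dynamic(X, y, learner_pool, T=50):
    """AdaBoost.M1 with a dynamically chosen base learner per round: each
    iteration trains every learner in the pool on the weighted data and
    keeps the one with the smallest weighted error (the AD idea).

    X: numpy feature matrix; y: numpy array of labels in {-1, +1}.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(T):
        best_err, best_h = np.inf, None
        for proto in learner_pool:
            h = clone(proto).fit(X, y, sample_weight=w)
            err = np.sum(w * (h.predict(X) != y))
            if err < best_err:
                best_err, best_h = err, h
        if best_err >= 0.5 or best_err <= 0.0:   # M1 stopping conditions
            break
        alpha = 0.5 * np.log((1 - best_err) / best_err)
        w *= np.exp(-alpha * y * best_h.predict(X))  # exponential-loss update
        w /= w.sum()
        ensemble.append((alpha, best_h))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * h.predict(X) for a, h in ensemble))

pool = [DecisionTreeClassifier(max_depth=1), GaussianNB(), LogisticRegression()]
```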
219

Predicting drug target proteins and their properties

Bull, Simon January 2015 (has links)
The discovery of drug targets is a vital component in the development of therapeutic treatments, as it is only through the modulation of a target's activity that a drug can alleviate symptoms or cure. Accurate identification of drug targets is therefore an important part of any development program, and has an outsized impact on the program's success due to its position as the first step in the pipeline. This makes the stringent selection of potential targets all the more vital when attempting to control the increasing cost and time needed to successfully complete a development program, and in order to increase the throughput of the entire drug discovery pipeline. In this work, a computational approach was taken to the investigation of protein drug targets. First, a new heuristic, Leaf, for the approximation of a maximum independent set was developed, and evaluated in terms of its ability to remove redundancy from protein datasets, the goal being to generate the largest possible non-redundant dataset. The ability of Leaf to remove redundancy was compared to that of pre-existing heuristics and an optimal algorithm, Cliquer. Not only did Leaf find unbiased non-redundant sets around 10% larger than those produced by the commonly used PISCES algorithm, the sets it found were no more than one protein smaller than the maximum possible found by Cliquer. Following this, the human proteome was mined to discover properties of proteins that may be important in determining their suitability for pharmaceutical modulation. Data was gathered on each protein's sequence, post-translational modifications, secondary structure, germline variants, expression profile and target status. The data was then analysed to determine features for which the target and non-target proteins had significantly different values. This analysis was repeated for subsets of the proteome consisting of all GPCRs, ion channels, kinases and proteases, as well as for a subset of all proteins implicated in cancer. Next, machine learning was used to quantify the proteins in each dataset in terms of their potential to serve as a drug target. For each dataset, this was accomplished by first inducing a random forest that could distinguish between its targets and non-targets, and then using the random forest to quantify the drug target likeness of the non-targets. The properties that best differentiate targets from non-targets were primarily found to be those directly related to a protein's sequence (e.g. secondary structure). Germline variants, expression levels and interactions between proteins had minimal discriminative power. Overall, the best indicators of drug target likeness were found to be the proteins' hydrophobicities, in vivo half-lives, propensity for being membrane bound and the fraction of non-polar amino acids in their sequences. In terms of predicting potential targets, the datasets of proteases, ion channels and cancer proteins were able to induce random forests highly capable of distinguishing between targets and non-targets. The non-target proteins predicted to be targets by these random forests comprise the set of the most suitable potential future drug targets, and are therefore likely to produce the best results if used as the basis for a drug development programme.
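[Editorial sketch of the "induce a random forest, then score the non-targets" step; the feature matrix would hold the sequence-derived properties described above (hydrophobicity, in vivo half-life, membrane propensity, and so on), and all names are illustrative.]

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_candidate_targets(X, is_target, feature_names):
    """Train on known targets vs. non-targets, then score the non-targets:
    a high predicted target probability flags a promising candidate.

    X: numpy feature matrix; is_target: boolean numpy array of labels.
    """
    forest = RandomForestClassifier(n_estimators=500, oob_score=True)
    forest.fit(X, is_target)
    non_target_idx = np.flatnonzero(~is_target)
    likeness = forest.predict_proba(X[non_target_idx])[:, 1]   # P(target)
    order = non_target_idx[np.argsort(likeness)[::-1]]         # best first
    importances = sorted(zip(feature_names, forest.feature_importances_),
                         key=lambda t: -t[1])                  # discriminative features
    return order, importances
```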
220

Machine learning-based approaches to data quality improvement in mobile crowdsensing and crowdsourcing

Jiang, Jinghan 13 September 2021 (has links)
With the wide popularity of smart devices such as smartphones, smartwatches, and smart cameras, Mobile Crowdsensing (MCS) and Crowdsourcing (CS) have been broadly applied to collect data from large groups of ordinary participants. The quality of participants' contributed data, however, is hard to guarantee, and as such it is critical to develop efficient and effective methods to automatically improve data quality on MCS/CS platforms. In this thesis, we propose three machine learning-based solutions for data quality enhancement in different participatory MCS/CS scenarios. Our solutions target the data extraction phase as well as the data collection phase of participatory MCS/CS: (1) trustworthy information extraction from conflicting data, (2) recognition of learning patterns, and (3) worker recruitment based on interactive training and learning pattern extraction. The first is designed for the data extraction phase and the other two for the data collection phase. First, to derive reliable data from diverse or even conflicting labels from the crowd, we design a mechanism that infuses knowledge from domain experts into the labels from the crowd to automatically make correct decisions on classification-based MCS tasks. Our solution, named EFusion, utilizes a probabilistic graphical model and the expectation maximization (EM) algorithm to infer the most likely expertise level of each crowd worker, the difficulty level of each task, and the ground truth answers. Furthermore, we introduce a method to extend EFusion from binary classification problems to multi-class classification problems. We evaluate EFusion using real-world case studies as well as simulations; the results demonstrate that EFusion returns more accurate and stable classification results than majority voting and state-of-the-art methods. Second, we propose Goldilocks, an interactive learning pattern recognition framework that can identify suitable participants whose performance follows desired learning patterns. To accurately extract a participant's learning pattern, we first estimate the impact of previous training questions on the participant before she answers a new question. After the participant answers each new question, we adjust the estimate of her capability by considering a quantitative measure of the impact of previous questions and her answer to the new question. Based on the extracted learning curve of each participant, we recruit the candidates who have shown good learning capability and desired learning patterns for the formal MCS/CS task. We further develop a web service over Amazon Web Services (AWS) that automatically adjusts questions to maximize individual participants' learning performance. This website also profiles the participants' learning patterns, which can be used for task assignment in MCS/CS. Third, we present HybrTraining, a hybrid deep learning framework that captures each candidate's capability from a long-term perspective and excludes undesired candidates in the early stage of the training phase. Using two collaborative deep learning networks, HybrTraining dynamically matches participants and MCS/CS tasks: we build a deep Q-network (DQN) to match candidates and training batches in the training phase, and develop a long short-term memory (LSTM) model that extracts the learning patterns of different candidates and helps the DQN make better worker-task matching decisions. We build HybrTraining on Compute Canada and evaluate it on two scientific datasets. For each dataset, the learning data of candidates is collected with a Python-based Django website on Amazon Elastic Compute Cloud (Amazon EC2). Evaluation results show that HybrTraining can increase data collection efficiency and improve data quality in MCS/CS. / Graduate / 2022-08-19
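[Editorial sketch: the abstract does not spell out EFusion's graphical model, but the EM template it builds on, jointly inferring ground truth and worker reliability from conflicting binary labels in the spirit of Dawid and Skene, looks like the following. Task difficulty and expert knowledge, which EFusion also models, are omitted.]

```python
import numpy as np

def em_truth_inference(L, iters=50):
    """L: (n_tasks, n_workers) matrix of binary labels in {0, 1}, NaN where
    a worker did not answer. Returns P(truth = 1) per task and estimated
    per-worker accuracy."""
    n_tasks, n_workers = L.shape
    answered = ~np.isnan(L)
    p = np.nanmean(L, axis=1)                   # init truth by majority vote
    acc = np.full(n_workers, 0.8)               # init worker accuracy
    for _ in range(iters):
        # E-step: posterior over each task's true label (uniform prior)
        for i in range(n_tasks):
            w = np.flatnonzero(answered[i])
            lik1 = np.prod(np.where(L[i, w] == 1, acc[w], 1 - acc[w]))
            lik0 = np.prod(np.where(L[i, w] == 0, acc[w], 1 - acc[w]))
            p[i] = lik1 / (lik1 + lik0 + 1e-12)
        # M-step: re-estimate each worker's accuracy as expected agreement
        for j in range(n_workers):
            t = np.flatnonzero(answered[:, j])
            agree = np.where(L[t, j] == 1, p[t], 1 - p[t])
            acc[j] = np.clip(agree.mean(), 1e-3, 1 - 1e-3)
    return p, acc
```

Majority voting corresponds to stopping after the initialisation step; iterating the two steps is what lets reliable workers outvote unreliable ones, which is the behaviour the EFusion evaluation above measures against.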
