Global ETD Search

121	Robot developmental learning of an object ontology grounded in sensorimotor experience Modayil, Joseph Varughese. January 1900 Thesis (Ph. D.)--University of Texas at Austin, 2007. / Vita. Includes bibliographical references.
122	Data mining logic explanations from numerical data / Riehl, Katrina. January 2006 Thesis. / Includes vita. Includes bibliographical references (leaves 79-86) Data mining. Machine learning. Logic.
123	Performance modelling and optimization for video-analytic algorithms in a cloud-like environment using machine learning Al-Rawahi, Manal N. K. January 2016 CCTV cameras produce a large amount of video surveillance data per day, and analysing it requires significant computing resources that often need to be scalable. The emergence of the Hadoop distributed processing framework has had a significant impact on various data intensive applications, as distributed computing based processing increases the processing capability of the applications it serves. Hadoop is an open source implementation of the MapReduce programming model. It automates creating tasks for each function, distributing data, parallelizing execution and handling machine failures, which relieves users of the complexity of managing the underlying processing and lets them focus on building their application. In a practical deployment, the challenge of a Hadoop based architecture is that it requires several scalable machines for effective processing, which in turn adds hardware investment cost to the infrastructure. Although a cloud infrastructure offers scalable and elastic utilization of resources, where users can scale the number of Virtual Machines (VMs) up or down as required, a user such as a CCTV system operator intending to use a public cloud would want to know what cloud resources (i.e. number of VMs) need to be deployed so that the processing can be done in the fastest (or within a known time constraint) and most cost effective manner. Often such resources will also have to satisfy practical, procedural and legal requirements. The capability to model a distributed processing architecture so that resource requirements can be effectively and optimally predicted would thus be a useful tool.
In the literature there is no clear and comprehensive modelling framework that provides proactive resource allocation mechanisms to satisfy a user's target requirements, especially for a processing intensive application such as video analytics. In this thesis, with the hope of closing this research gap, the research begins by examining the current legal practices and requirements for implementing a video surveillance system within a distributed processing and data storage environment, since the legal validity of data gathered or processed within such a system is vital for a distributed system's applicability in such domains. Subsequently the thesis presents a comprehensive framework for the performance modelling and optimization of resource allocation in deploying a scalable distributed video analytic application in a Hadoop based framework, running on a virtualized cluster of machines. The proposed modelling framework investigates the use of several machine learning algorithms, such as decision trees (M5P, RepTree), Linear Regression, Multi Layer Perceptron (MLP) and the Ensemble Classifier Bagging model, to model and predict the execution time of video analytic jobs, based on infrastructure level as well as job level parameters. Further, in order to allocate resources under constraints so as to obtain optimal performance in terms of job execution time, we propose a Genetic Algorithm (GA) based optimization technique. Experimental results are provided to demonstrate the proposed framework's capability to successfully predict the job execution time of a given video analytic task based on infrastructure and input data related parameters, and its ability to determine the minimum job execution time given constraints on these parameters. Given the above, the thesis contributes to the state of the art in distributed video analytics design, implementation, performance analysis and optimisation.
006.3 CCTV ; Algorithms ; Machine learning
124	Computational Natural Language Inference: Robust and Interpretable Question Answering Sharp, Rebecca January 2017 We address the challenging task of computational natural language inference, by which we mean bridging two or more natural language texts while also providing an explanation of how they are connected. In the context of question answering (i.e., finding short answers to natural language questions), this inference connects the question with its answer and we learn to approximate this inference with machine learning. In particular, here we present four approaches to question answering, each of which shows a significant improvement in performance over baseline methods. In our first approach, we make use of the underlying discourse structure inherent in free text (i.e. whether the text contains an explanation, elaboration, contrast, etc.) in order to increase the amount of training data for (and subsequently the performance of) a monolingual alignment model. In our second work, we propose a framework for training customized lexical semantics models such that each one represents a single semantic relation. We use causality as a use case, and demonstrate that our customized model is able both to identify causal relations and to significantly improve our ability to answer causal questions. We then propose two approaches that seek to answer questions by learning to rank human-readable justifications for the answers, such that the model selects the answer with the best justification. The first uses a graph-structured representation of the background knowledge and performs information aggregation to construct multi-sentence justifications. The second reduces pre-processing costs by limiting itself to a single sentence and using a neural network to learn a latent representation of the background knowledge.
For each of these, we show that in addition to significant improvement in correctly answering questions, we also outperform a strong baseline in terms of the quality of the answer justification given. Inference Machine Learning Question Answering
125	Supervised machine learning for email thread summarization Ulrich, Jan 11 1900 Email has become a part of most people's lives, and the ever increasing number of messages people receive can lead to email overload. We attempt to mitigate this problem using email thread summarization. Summaries can be used for things other than just replacing an incoming email message. They can be used in the business world as a form of corporate memory, or to allow a new team member an easy way to catch up on an ongoing conversation. Email threads are of particular interest to summarization because they contain much structural redundancy due to their conversational nature. Our email thread summarization approach uses machine learning to pick which sentences from the email thread to use in the summary. A machine learning summarizer must be trained using previously labeled data, i.e. manually created summaries. After being trained, our summarization algorithm can generate summaries that on average contain over 70% of the same sentences as those chosen by human annotators. We show that labeling some key features such as speech acts, meta sentences, and subjectivity can improve performance to over 80% weighted recall. To create such email summarization software, an email dataset is needed for training and evaluation. Since email communication is a private matter, it is hard to get access to real emails for research. Furthermore, these emails must be annotated with human generated summaries as well. As these annotated datasets are rare, we have created one and made it publicly available. The BC3 corpus contains annotations for 40 email threads which include extractive summaries, abstractive summaries with links, and labeled speech acts, meta sentences, and subjective sentences. While previous research has shown that machine learning algorithms are a promising approach to email summarization, there has not been a study on the impact of the choice of algorithm.
We explore new techniques in email thread summarization using several different kinds of regression, and the results show that the choice of classifier is critical. We also present a novel feature set for email summarization and analyse two email corpora: the BC3 corpus and the Enron corpus. / Science, Faculty of / Computer Science, Department of / Graduate Email Summarization Machine learning Corpus
126	Information systems for tactical decision making Fairley, Andrew January 1994 No description available. 003.5
127	Extending AdaBoost: Varying the Base Learners and Modifying the Weight Calculation Neves de Souza, Erico January 2014 AdaBoost has been considered one of the best classifiers ever developed, but two important problems have not yet been addressed. The first is the dependency on the "weak" learner, and the second is the failure to maintain the performance of learners with small error rates (i.e. "strong" learners). To solve the first problem, this work proposes using a different learner in each iteration - known as AdaBoost Dynamic (AD) - thereby ensuring that the performance of the algorithm is almost equal to that of the best "weak" learner executed with AdaBoost.M1. The work then further modifies the procedure to vary the learner in each iteration, in order to locate the learner with the smallest error rate in its training data. This is done using the same weight calculation as in the original AdaBoost; this version is known as AdaBoost Dynamic with Exponential Loss (AB-EL). The results were poor, because AdaBoost does not perform well with strong learners, so, in this sense, the work confirmed previous works' results. To determine how to improve the performance, the weight calculation is modified to use the sigmoid function with algorithm output being the derivative of the same sigmoid function, rather than the logistic regression weight calculation originally used by AdaBoost; this version is known as AdaBoost Dynamic with Logistic Loss (AB-DL). This work presents the convergence proof that binomial weight calculation works, and that this approach improves the results for the strong learner, both theoretically and empirically. AB-DL also has some disadvantages, like the search for the "best" classifier and that this search reduces the diversity among the classifiers.
In order to attack these issues, another algorithm is proposed that combines the AD "weak" learner execution policy with a small modification of AB-DL's weight calculation, called AdaBoost Dynamic with Added Cost (AD-AC). AD-AC also has a theoretical upper bound error, and the algorithm offers a small accuracy improvement when compared with AB-DL and traditional AdaBoost approaches. Lastly, this work also adapts AD-AC's weight calculation approach to deal with the data stream problem, where classifiers must deal with very large data sets (in the order of millions of instances) and limited memory availability. AdaBoost Machine Learning Data Stream
128	Predicting drug target proteins and their properties Bull, Simon January 2015 The discovery of drug targets is a vital component in the development of therapeutic treatments, as it is only through the modulation of a target’s activity that a drug can alleviate symptoms or cure. Accurate identification of drug targets is therefore an important part of any development program, and has an outsized impact on the program’s success due to its position as the first step in the pipeline. This makes the stringent selection of potential targets all the more vital when attempting to control the increasing cost and time needed to successfully complete a development program, and in order to increase the throughput of the entire drug discovery pipeline. In this work, a computational approach was taken to the investigation of protein drug targets. First, a new heuristic, Leaf, for the approximation of a maximum independent set was developed, and evaluated in terms of its ability to remove redundancy from protein datasets, the goal being to generate the largest possible non-redundant dataset. The ability of Leaf to remove redundancy was compared to that of pre-existing heuristics and an optimal algorithm, Cliquer. Not only did Leaf find unbiased non-redundant sets that were around 10% larger than those found by the commonly used PISCES algorithm, it found ones that were no more than one protein smaller than the maximum possible found by Cliquer. Following this, the human proteome was mined to discover properties of proteins that may be important in determining their suitability for pharmaceutical modulation. Data was gathered concerning each protein’s sequence, post-translational modifications, secondary structure, germline variants, expression profile and target status. The data was then analysed to determine features for which the target and non-target proteins had significantly different values.
This analysis was repeated for subsets of the proteome consisting of all GPCRs, ion channels, kinases and proteases, as well as for a subset consisting of all proteins that are implicated in cancer. Next, machine learning was used to quantify the proteins in each dataset in terms of their potential to serve as a drug target. For each dataset, this was accomplished by first inducing a random forest that could distinguish between its targets and non-targets, and then using the random forest to quantify the drug target likeness of the non-targets. The properties that can best differentiate targets from non-targets were primarily found to be those that are directly related to a protein’s sequence (e.g. secondary structure). Germline variants, expression levels and interactions between proteins had minimal discriminative power. Overall, the best indicators of drug target likeness were found to be the proteins’ hydrophobicities, in vivo half-lives, propensity for being membrane bound and the fraction of non-polar amino acids in their sequences. In terms of predicting potential targets, datasets of proteases, ion channels and cancer proteins were able to induce random forests that were highly capable of distinguishing between targets and non-targets. The non-target proteins predicted to be targets by these random forests comprise the set of the most suitable potential future drug targets, and are therefore likely to produce the best results if used as the basis for building a drug development programme. 615.7
129	Machine learning-based approaches to data quality improvement in mobile crowdsensing and crowdsourcing Jiang, Jinghan 13 September 2021 With the wide popularity of smart devices such as smartphones, smartwatches, and smart cameras, Mobile Crowdsensing (MCS) and Crowdsourcing (CS) have been broadly applied for collecting data from a large group of ordinary participants. The quality of participants' contributed data, however, is hard to guarantee, and as such it is critical to develop efficient and effective methods to automatically improve data quality over MCS/CS platforms. In this thesis, we propose three machine learning-based solutions for data quality enhancement in different participatory MCS/CS scenarios. Our solutions aim at the data extraction phase as well as the data collection phase of participatory MCS/CS, including (1) trustworthy information extraction from conflicting data, (2) recognition of learning patterns, and (3) worker recruitment based on interactive training and learning pattern extraction. The first one is designed for the data extraction phase and the other two for the data collection phase. First, to derive reliable data from diverse or even conflicting labels from the crowd, we design a mechanism to infuse knowledge from domain experts into the labels from the crowd to automatically make correct decisions on classification-based MCS tasks. Our solution, named EFusion, utilizes a probabilistic graphical model and the expectation maximization (EM) algorithm to infer the most likely expertise level of each crowd worker, the difficulty level of tasks, and the ground truth answers. Furthermore, we introduce a method to extend EFusion from solving binary classification problems to handling multi-class classification problems. We evaluate EFusion using real-world case studies as well as simulations.
Evaluation results demonstrate that EFusion can return more accurate and stable classification results than the majority voting method and state-of-the-art methods. Second, we propose Goldilocks, an interactive learning pattern recognition framework that can identify suitable participants whose performance follows desired learning patterns. To accurately extract a participant's learning pattern, we first estimate the impact of previous training questions on the participant before she answers a new question. After the participant answers each new question, we adjust the estimation of her capability by considering a quantitative measure of the impact of previous questions and her answer to the new question. Based on the extracted learning curve of each participant, we recruit the candidates who have shown good learning capability and desired learning patterns for the formal MCS/CS task. We further develop a web service over Amazon Web Services (AWS) that automatically adjusts questions to maximize individual participants' learning performance. This website also profiles the participants' learning patterns, which can be used for task assignment in MCS/CS. Third, we present HybrTraining, a hybrid deep learning framework that captures each candidate’s capability from a long-term perspective and excludes the undesired candidates in the early stage of the training phase. Using two collaborative deep learning networks, HybrTraining can dynamically match participants and MCS/CS tasks. In detail, we build a deep Q-network (DQN) to match the candidates and training batches in the training phase, and develop a long short-term memory (LSTM) model that extracts the learning patterns of different candidates and helps the DQN make better worker-task matching decisions. We build HybrTraining on Compute Canada and evaluate it over two scientific datasets.
For each dataset, the learning data of candidates is collected with a Python-based Django website over Amazon Elastic Compute Cloud (Amazon EC2). Evaluation results show that HybrTraining can increase data collection efficiency and improve data quality in MCS/CS. / Graduate / 2022-08-19 Crowdsourcing Mobile Crowdsensing Machine Learning
130	Genetic Programming Approach for Nonstationary Data Analytics Kuranga, Cry 16 February 2021 Nonstationary data with concept drift occurring is usually made up of different underlying data generating processes. Therefore, if the knowledge of the existence of different segments in the dataset is not taken into consideration, then the induced predictive model is distorted by the past existing patterns. Thus, the challenge posed to a regressor is to select an appropriate segment that depicts the current underlying data generating process to be used in a model induction. The proposed genetic programming approach for nonstationary data analytics (GPANDA) provides a piecewise nonlinear regression model for nonstationary data. GPANDA consists of three components: a dynamic differential evolution-based clustering algorithm to split the parameter space into subspaces that resemble different data generating processes present in the dataset; a dynamic particle swarm optimization-based model induction technique to induce nonlinear models that describe each generated cluster; and dynamic genetic programming that evolves model trees that define the boundaries of nonlinear models which are expressed as terminal nodes. If an environmental change is detected in a nonstationary dataset, the dynamic differential evolution-based clustering algorithm clusters the data. For the clusters that change, the dynamic particle swarm optimization-based model induction approach adapts nonlinear models or induces new models to create an updated genetic programming terminal set, and then the genetic programming evolves a piecewise predictive model to fit the dataset. To evaluate the effectiveness of GPANDA, experimental evaluations were conducted on both artificial and real-world datasets. Two stock market datasets, GDP and CPI were selected to benchmark the performance of the proposed model against the leading studies.
GPANDA outperformed the genetic programming algorithms designed for dynamic environments and was competitive with state-of-the-art techniques. / Thesis (PhD)--University of Pretoria, 2020. / UP Postgraduate Research Bursary / Computer Science / PhD / Unrestricted Computational Intelligence Machine learning UCTD
