  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
401

Scoring rules, divergences and information in Bayesian machine learning

Huszár, Ferenc January 2013 (has links)
No description available.
402

Investigating machine learning methods in chemistry

Lowe, Robert Alexander January 2012 (has links)
No description available.
403

Anomaly Detection Through Statistics-Based Machine Learning For Computer Networks

Zhu, Xuejun January 2006 (has links)
Intrusion detection in computer networks is a complex research problem, which requires an understanding of computer networks and the mechanism of intrusions, the configuration of sensors and the collected data, the selection of relevant attributes, and monitoring algorithms for online detection. It is critical to develop general methods for data dimension reduction, effective monitoring algorithms for intrusion detection, and means for their performance improvement. This dissertation is motivated by the timely need to develop statistics-based machine learning methods for effective detection of computer network anomalies. Three fundamental research issues related to data dimension reduction, control chart design and performance improvement have been addressed accordingly. The major research activities and corresponding contributions are summarized as follows: (1) Filter and Wrapper models are integrated to extract a small number of informative attributes for computer network intrusion detection. A two-phase analysis method is proposed for the integration of Filter and Wrapper models. The proposed method successfully reduced the original 41 attributes to 12 informative attributes while increasing the accuracy of the model; the comparison of results in each phase shows the effectiveness of the proposed method. (2) Supervised kernel-based control charts for anomaly intrusion detection. We propose constructing control charts in a feature space. The first contribution is the use of a multi-objective Genetic Algorithm for parameter pre-selection in SVM-based control charts; the second is the performance evaluation of supervised kernel-based control charts. (3) Unsupervised kernel-based control charts for anomaly intrusion detection. Two types of unsupervised kernel-based control charts are investigated: Kernel PCA control charts and Support Vector Clustering (SVC) based control charts.
The application of SVC-based control charts to computer network audit data is also discussed to demonstrate the effectiveness of the proposed method. Although the methodologies developed in this dissertation are demonstrated in computer network intrusion detection applications, they are also expected to apply to the monitoring of other complex systems whose databases consist of high-dimensional data with non-Gaussian distributions.
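The core idea of a kernel-based control chart, as the abstract describes it, is to monitor observations in an implicit feature space rather than the raw input space. A minimal sketch of that idea, assuming an RBF kernel and a simple control limit taken from the in-control training data (not the thesis's actual SVM/Genetic Algorithm design):

```python
import math

def rbf(x, y, gamma=0.5):
    # Gaussian (RBF) kernel between two vectors
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_distance2(x, train, gamma=0.5):
    # Squared distance from x to the mean of the training data in the
    # implicit feature space; used here as the monitoring statistic.
    n = len(train)
    return (rbf(x, x, gamma)
            - 2.0 / n * sum(rbf(x, t, gamma) for t in train)
            + sum(rbf(a, b, gamma) for a in train for b in train) / n ** 2)

# In-control ("normal traffic") observations, hypothetical 2-D features
normal = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.05, 0.0]]

# Control limit: largest in-control statistic, with a small margin
limit = max(kernel_distance2(t, normal) for t in normal) * 1.1

# A far-away observation exceeds the limit and is flagged as anomalous
anomaly = kernel_distance2([3.0, 3.0], normal) > limit
print(anomaly)  # True
```

An observation whose feature-space distance to the in-control mean exceeds the control limit triggers an alarm; the thesis tunes the kernel parameters with a multi-objective Genetic Algorithm rather than fixing them by hand as done here.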
404

Identifying Deviating Systems with Unsupervised Learning

Panholzer, Georg January 2008 (has links)
We present a technique to identify deviating systems among a group of systems in a self-organized way. A compressed representation of each system is used to compute similarity measures, which are combined in an affinity matrix of all systems. Deviation detection and clustering is then used to identify deviating systems based on this affinity matrix. The compressed representation is computed with Principal Component Analysis and Kernel Principal Component Analysis. The similarity measure between two compressed representations is based on the angle between the spaces spanned by the principal components, but other methods of calculating a similarity measure are suggested as well. The subsequent deviation detection is carried out by computing the probability of each system to be observed given all the other systems. Clustering of the systems is done with hierarchical clustering and spectral clustering. The whole technique is demonstrated on four data sets of mechanical systems, two of a simulated cooling system and two of human gait. The results show its applicability on these mechanical systems.
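The pipeline described above (compressed representation, angle-based similarity, affinity matrix, deviation detection) can be sketched compactly. A minimal illustration, assuming each system is compressed to a single dominant principal direction so the subspace angle reduces to a cosine (the thesis uses full PCA/Kernel PCA subspaces):

```python
import math

def unit(v):
    # Normalize a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def affinity(u, v):
    # Similarity of two 1-D subspaces: |cos| of the angle between them
    return abs(sum(a * b for a, b in zip(unit(u), unit(v))))

# Hypothetical dominant principal directions of four systems; the
# fourth points in a clearly different direction from the others.
systems = [[1.0, 0.1], [1.0, 0.05], [0.9, 0.15], [0.1, 1.0]]

# Affinity matrix of all systems
A = [[affinity(u, v) for v in systems] for u in systems]

# Deviation detection: flag the system least similar to the rest
scores = [sum(row) - 1.0 for row in A]  # drop the self-affinity term
deviating = min(range(len(scores)), key=lambda i: scores[i])
print(deviating)  # 3
```

The same affinity matrix also feeds the clustering step (hierarchical or spectral) described in the abstract; here only the deviation-detection step is shown.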
405

Pattern Discovery in DNA Sequences

Yan, Rui 20 March 2014 (has links)
A pattern is a relatively short sequence that represents a phenomenon in a set of sequences. Not all short sequences are patterns; only those that are statistically significant are referred to as patterns or motifs. Pattern discovery methods analyze sequences and attempt to identify and characterize meaningful patterns. This thesis extends the application of pattern discovery algorithms to a new problem domain: Single Nucleotide Polymorphism (SNP) classification. SNPs are single base-pair (bp) variations in the genome, and are probably the most common form of genetic variation. On average, one in every thousand bps may be an SNP. The function of most SNPs, especially those not associated with protein sequence changes, remains unclear. However, genome-wide linkage analyses have associated many SNPs with disorders ranging from Crohn’s disease, to cancer, to quantitative traits such as height or hair color. As a result, many groups are working to predict the functional effects of individual SNPs. In contrast, very little research has examined the causes of SNPs: Why do SNPs occur where they do? This thesis addresses this problem by using pattern discovery algorithms to study DNA non-coding sequences. The hypothesis is that short DNA patterns can be used to predict SNPs. For example, such patterns found in the SNP sequence might block the DNA repair mechanism for the SNP, thus causing SNP occurrence. In order to test the hypothesis, a model is developed to predict SNPs by using pattern discovery methods. The results show that SNP prediction with pattern discovery methods is weak (50±2%), whereas machine learning classification algorithms can achieve prediction accuracy as high as 68%.
To determine whether the poor performance of pattern discovery is due to data characteristics (such as sequence length or pattern length) or to the specific biological problem (SNP prediction), a survey was conducted by profiling eight representative pattern discovery methods at multiple parameter settings on 6,754 real biological datasets. This is the first systematic review of pattern discovery methods with assessments of prediction accuracy, CPU usage and memory consumption. It was found that current pattern discovery methods do not consider positional information and do not handle short sequences well (<150 bps), including SNP sequences. Therefore, this thesis proposes a new supervised pattern discovery classification algorithm, referred to as Weighted-Position Pattern Discovery and Classification (WPPDC). The WPPDC is able to exploit positional information to identify positionally-enriched motifs, and to select motifs with a high information content for further classification. A tree structure is applied to WPPDC (referred to as T-WPPDC) in order to reduce algorithmic complexity. Compared to pattern discovery methods, T-WPPDC not only showed consistently superior prediction accuracy but also generated patterns with positional information. Machine-learning classification methods (such as Random Forests) showed comparable prediction accuracy; however, unlike T-WPPDC, they are unable to generate SNP-associated patterns.
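The positional information that WPPDC exploits, and that the surveyed methods ignore, amounts to counting motifs per alignment position rather than pooled across the whole sequence. A toy sketch of that distinction, with hypothetical aligned sequences (not the thesis's actual algorithm or data):

```python
from collections import Counter

def positional_kmers(seqs, k=3):
    # Count each k-mer separately at each alignment position, so a
    # motif enriched at one position stands out even if it is rare
    # overall -- the positional signal pooled counting would lose.
    counts = {}
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts.setdefault(i, Counter())[s[i:i + k]] += 1
    return counts

# Hypothetical aligned flanking sequences around a variant site
seqs = ["ACGTAC", "ACGTTT", "ACGAAC", "TTGTAC"]
c = positional_kmers(seqs)
print(c[0].most_common(1))  # [('ACG', 3)]
```

A positionally-enriched motif such as `ACG` at position 0 would then be scored for information content and kept as a classification feature, per the approach the abstract describes.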
407

Assisting bug report triage through recommendation

Anvik, John 05 1900 (has links)
A key collaborative hub for many software development projects is the issue tracking system, or bug repository. The use of a bug repository can improve the software development process in a number of ways, including allowing developers who are geographically distributed to communicate about project development. However, reports added to the repository need to be triaged by a human, called the triager, to determine if reports are meaningful. If a report is meaningful, the triager decides how to organize the report for integration into the project's development process. We call triager decisions about whether a report is meaningful repository-oriented decisions, and triager decisions that organize reports for the development process development-oriented decisions. Triagers can become overwhelmed by the number of reports added to the repository. Time spent triaging also typically diverts valuable resources away from the improvement of the product to the managing of the development process. To assist triagers, this dissertation presents a machine learning approach to create recommenders that assist with a variety of development-oriented decisions. In this way, we strive to reduce human involvement in triage by moving the triager's role from having to gather information to make a decision to that of confirming a suggestion. This dissertation introduces a triage-assisting recommender creation process that can create a variety of different development-oriented decision recommenders for a range of projects. The recommenders created with this approach are accurate: recommenders for which developer to assign a report to have a precision of 70% to 98% over five open source projects, recommenders for which product component a report belongs to have a recall of 72% to 92%, and recommenders for whom to add to the cc: list of a report have a recall of 46% to 72%.
We have evaluated recommenders created with our triage-assisting recommender creation process using both an analytic evaluation and a field study. In addition, we present in this dissertation an approach to assist project members to specify the project-specific values for the triage-assisting recommender creation process, and show that such recommenders can be created with a subset of the repository data.
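The assignment recommender described above is, at its core, a text classifier trained on past reports and their assignees. A toy sketch of that idea using naive Bayes over report words (an illustrative stand-in, not Anvik's actual pipeline; the reports and developer names are hypothetical):

```python
from collections import Counter, defaultdict
import math

class TriageRecommender:
    """Toy recommender suggesting a developer for a new bug report."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # developer -> word counts
        self.dev_counts = Counter()              # developer -> #reports

    def train(self, report, developer):
        self.dev_counts[developer] += 1
        self.word_counts[developer].update(report.lower().split())

    def recommend(self, report):
        words = report.lower().split()
        vocab = len({w for c in self.word_counts.values() for w in c})

        def score(dev):
            # Log naive Bayes score with add-one smoothing
            total = sum(self.word_counts[dev].values())
            s = math.log(self.dev_counts[dev])
            for w in words:
                s += math.log((self.word_counts[dev][w] + 1) / (total + vocab))
            return s

        return max(self.dev_counts, key=score)

r = TriageRecommender()
r.train("crash in render pipeline on resize", "alice")
r.train("render glitch when resizing window", "alice")
r.train("build fails with linker error", "bob")
print(r.recommend("window crash on resize"))  # alice
```

The dissertation's process additionally handles component and cc: recommendations and project-specific configuration; this sketch shows only the shared classify-past-reports core.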
408

Design of a self-paced brain computer interface system using features extracted from three neurological phenomena

Fatourechi, Mehrdad 05 1900 (has links)
Self-paced brain-computer interface (SBCI) systems allow individuals with motor disabilities to use their brain signals to control devices, whenever they wish. These systems are required to identify the user’s “intentional control (IC)” commands and must remain inactive during all periods in which users do not intend control (called “no control (NC)” periods). This dissertation addresses three issues related to the design of SBCI systems: 1) their presently high false positive (FP) rates, 2) the presence of artifacts and 3) the identification of a suitable evaluation metric. To improve the performance of SBCI systems, the following are proposed: 1) a method for the automatic user-customization of a 2-state SBCI system, 2) a two-stage feature reduction method for selecting wavelet coefficients extracted from movement-related potentials (MRPs), 3) an SBCI system that classifies features extracted from three neurological phenomena: MRPs and changes in the power of the Mu and Beta rhythms, 4) a novel method that effectively combines the methods developed in 2) and 3), and 5) a generalization of the system developed in 3) from detecting a right index finger flexion to detecting a right hand extension. Results of these studies using actual movements show an average true positive (TP) rate of 56.2% at an FP rate of 0.14% for the finger flexion study, and an average TP rate of 33.4% at an FP rate of 0.12% for the hand extension study. These FP rates are significantly lower than those achieved in other SBCI systems, where FP rates vary between 1-10%. We also conduct a comprehensive survey of the BCI literature and demonstrate that many BCI papers do not properly deal with artifacts. We show that the proposed BCI achieves a good performance of TP=51.8% and FP=0.4% in the presence of eye movement artifacts. Further tests of the performance of the proposed system in a pseudo-online environment show an average TP rate of 48.8% at an FP rate of 0.8%.
Finally, we propose a framework for choosing a suitable evaluation metric for SBCI systems. This framework shows that the Kappa coefficient is more suitable than other metrics for evaluating performance during the model selection procedure.
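The Kappa coefficient favored above measures agreement beyond chance, which matters for SBCI evaluation because the NC class dominates and raw accuracy is inflated. A short sketch of Cohen's kappa on a hypothetical IC/NC labeling (illustrative data, not results from the thesis):

```python
def cohen_kappa(y_true, y_pred):
    # Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_exp = sum(
        (y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels
    )
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical labels: mostly "no control" periods with a few IC commands
y_true = ["NC"] * 8 + ["IC"] * 2
y_pred = ["NC"] * 7 + ["IC"] + ["IC", "NC"]
print(round(cohen_kappa(y_true, y_pred), 3))  # 0.375
```

Here accuracy is 80% but kappa is only 0.375, illustrating why a chance-corrected metric is the better model-selection criterion when one class dominates.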
409

Data analysis in proteomics: novel computational strategies for modeling and interpreting complex mass spectrometry data

Sniatynski, Matthew John 11 1900 (has links)
Contemporary proteomics studies require computational approaches to deal with both the complexity of the data generated, and with the volume of data produced. The amalgamation of mass spectrometry -- the analytical tool of choice in proteomics -- with the computational and statistical sciences is still recent, and several avenues of exploratory data analysis and statistical methodology remain relatively unexplored. The current study focuses on three broad analytical domains, and develops novel exploratory approaches and practical tools in each. Data transform approaches are the first explored. These methods re-frame data, allowing for the visualization and exploitation of features and trends that are not immediately evident. An exploratory approach making use of the correlation transform is developed, and is used to identify mass-shift signals in mass spectra. This approach is used to identify and map post-translational modifications on individual peptides, and to identify SILAC modification-containing spectra in a full-scale proteomic analysis. Secondly, matrix decomposition and projection approaches are explored; these use an eigen-decomposition to extract general trends from groups of related spectra. A data visualization approach is demonstrated using these techniques, capable of visualizing trends in large numbers of complex spectra, and a data compression and feature extraction technique is developed suitable for use in spectral modeling. Finally, a general machine learning approach is developed based on conditional random fields (CRFs). These models are capable of dealing with arbitrary sequence modeling tasks, similar to hidden Markov models (HMMs), but are far more robust to interdependent observational features, and do not require limiting independence assumptions to remain tractable. 
The theory behind this approach is developed, and a simple machine learning fragmentation model is developed to test the hypothesis that reproducible sequence-specific intensity ratios are present within the distribution of fragment ions originating from a common peptide bond breakage. After training, the model shows very good performance associating peptide sequences and fragment ion intensity information, lending strong support to the hypothesis.
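The correlation-transform idea in the first analytical domain, detecting peak pairs separated by a fixed mass shift, can be illustrated by correlating a discretized spectrum with an offset copy of itself. A minimal sketch on a synthetic spectrum (illustrative only; the thesis works with real MS data and richer transforms):

```python
def shift_correlation(spectrum, shift):
    # Correlate the spectrum with itself offset by `shift` bins; a high
    # value suggests peak pairs separated by that mass difference,
    # e.g. an unmodified/modified peptide pair.
    n = len(spectrum)
    return sum(spectrum[i] * spectrum[i + shift] for i in range(n - shift))

# Synthetic spectrum with peaks at bins 10 and 18 (separation of 8 bins)
spec = [0.0] * 30
spec[10] = 1.0
spec[18] = 1.0

# Scan candidate shifts; the true separation scores highest
scores = {s: shift_correlation(spec, s) for s in range(1, 15)}
best = max(scores, key=scores.get)
print(best)  # 8
```

In the proteomic setting, a recurring high-scoring shift across many spectra would flag a candidate post-translational modification mass, which is then mapped back to individual peptides as the abstract describes.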
410

Machine Learning Methods and Models for Ranking

Volkovs, Maksims 13 August 2013 (has links)
Ranking problems are ubiquitous and occur in a variety of domains that include social choice, information retrieval, computational biology and many others. Recent advancements in information technology have opened new data processing possibilities and significantly increased the complexity of computationally feasible methods. Through these advancements ranking models are now beginning to be applied to many new and diverse problems. Across these problems data, which ranges from gene expressions to images and web-documents, has vastly different properties and is often not human generated. This makes it challenging to apply many of the existing models for ranking, which primarily originate in social choice and are typically designed for human generated preference data. As the field continues to evolve a new trend has recently emerged where machine learning methods are being used to automatically learn the ranking models. While these methods typically lack the theoretical support of the social choice models they often show excellent empirical performance and are able to handle large and diverse data, placing virtually no restrictions on the data type. These models have now been successfully applied to many diverse ranking problems including image retrieval, protein selection, machine translation and many others. Inspired by these promising results, the work presented in this thesis aims to advance machine learning methods for ranking and develop new techniques to allow effective modeling of existing and future problems. The presented work concentrates on three different but related domains: information retrieval, preference aggregation and collaborative filtering. In each domain we develop new models together with learning and inference methods and empirically verify our models on real-life data.
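A common starting point for the learned ranking models described above is pairwise learning to rank: learn a scoring function so that preferred items score above non-preferred ones. A minimal perceptron-style sketch with hypothetical items and features (illustrative only; the thesis develops far richer models across its three domains):

```python
def train_pairwise(pairs, feats, epochs=20, lr=0.1):
    # Learn weights w so that score(better) > score(worse) for each
    # preference pair, via perceptron-style updates on violations.
    dim = len(next(iter(feats.values())))
    w = [0.0] * dim

    def score(item):
        return sum(wi * xi for wi, xi in zip(w, feats[item]))

    for _ in range(epochs):
        for better, worse in pairs:
            if score(better) <= score(worse):
                for j in range(dim):
                    w[j] += lr * (feats[better][j] - feats[worse][j])
    return w, score

# Hypothetical items with 2-D feature vectors and preference pairs
feats = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}
pairs = [("a", "b"), ("b", "c"), ("a", "c")]

w, score = train_pairwise(pairs, feats)
ranking = sorted(feats, key=score, reverse=True)
print(ranking)  # ['a', 'b', 'c']
```

The same pairwise reduction underlies many learning-to-rank methods applied to retrieval and collaborative filtering; preference aggregation then combines several such ranked lists into one.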
