Global ETD Search

261	Pattern Discovery in DNA Sequences Yan, Rui 20 March 2014 (has links) A pattern is a relatively short sequence that represents a phenomenon in a set of sequences. Not all short sequences are patterns; only those that are statistically significant are referred to as patterns or motifs. Pattern discovery methods analyze sequences and attempt to identify and characterize meaningful patterns. This thesis extends the application of pattern discovery algorithms to a new problem domain - Single Nucleotide Polymorphism (SNP) classification. SNPs are single base-pair (bp) variations in the genome, and are probably the most common form of genetic variation. On average, one in every thousand bps may be an SNP. The function of most SNPs, especially those not associated with protein sequence changes, remains unclear. However, genome-wide linkage analyses have associated many SNPs with disorders ranging from Crohn’s disease, to cancer, to quantitative traits such as height or hair color. As a result, many groups are working to predict the functional effects of individual SNPs. In contrast, very little research has examined the causes of SNPs: Why do SNPs occur where they do? This thesis addresses this problem by using pattern discovery algorithms to study DNA non-coding sequences. The hypothesis is that short DNA patterns can be used to predict SNPs. For example, such patterns found in the SNP sequence might block the DNA repair mechanism for the SNP, thus causing SNP occurrence. In order to test the hypothesis, a model is developed to predict SNPs by using pattern discovery methods. The results show that SNP prediction with pattern discovery methods is weak (50 ± 2%), whereas machine learning classification algorithms can achieve prediction accuracy as high as 68%.
To determine whether the poor performance of pattern discovery is due to data characteristics (such as sequence length or pattern length) or to the specific biological problem (SNP prediction), a survey was conducted by profiling eight representative pattern discovery methods at multiple parameter settings on 6,754 real biological datasets. This is the first systematic review of pattern discovery methods with assessments of prediction accuracy, CPU usage and memory consumption. It was found that current pattern discovery methods do not consider positional information and do not handle short sequences well (<150 bps), including SNP sequences. Therefore, this thesis proposes a new supervised pattern discovery classification algorithm, referred to as Weighted-Position Pattern Discovery and Classification (WPPDC). The WPPDC is able to exploit positional information to identify positionally-enriched motifs, and to select motifs with a high information content for further classification. A tree structure is applied to WPPDC (referred to as T-WPPDC) in order to reduce algorithmic complexity. Compared to pattern discovery methods, T-WPPDC not only showed consistently superior prediction accuracy but also generated patterns with positional information. Machine-learning classification methods (such as Random Forests) showed comparable prediction accuracy. However, unlike T-WPPDC, they are classification methods and are unable to generate SNP-associated patterns. Pattern Discovery SNPs Machine Learning Bioinformatics 0984
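The abstract above mentions selecting motifs with a high information content. As a rough illustration only, the sketch below computes the standard Shannon information content of aligned DNA sites. It is not the WPPDC algorithm; the function names and the choice to omit pseudocounts and small-sample correction are assumptions made for this example.

```python
import math
from collections import Counter


def column_information_content(column, alphabet="ACGT"):
    """Information content (in bits) of one aligned motif position.

    Uses IC = log2(|alphabet|) - H, where H is the Shannon entropy of the
    observed base frequencies. Assumption: no pseudocounts and no
    small-sample correction are applied.
    """
    counts = Counter(column)
    total = sum(counts[base] for base in alphabet)
    entropy = 0.0
    for base in alphabet:
        p = counts[base] / total
        if p > 0:
            entropy -= p * math.log2(p)
    return math.log2(len(alphabet)) - entropy


def motif_information_content(sites):
    """Sum the per-position information content over equal-length sites."""
    return sum(column_information_content(col) for col in zip(*sites))


# Hypothetical aligned sites: three conserved positions, one fully variable.
sites = ["ACGT", "ACGA", "ACGC", "ACGG"]
total_bits = motif_information_content(sites)
```

A fully conserved DNA position contributes 2 bits and a uniformly variable one contributes 0, so a ranking by this score favours conserved motifs.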
263	Assisting bug report triage through recommendation Anvik, John 05 1900 (has links) A key collaborative hub for many software development projects is the issue tracking system, or bug repository. The use of a bug repository can improve the software development process in a number of ways, including allowing developers who are geographically distributed to communicate about project development. However, reports added to the repository need to be triaged by a human, called the triager, to determine if reports are meaningful. If a report is meaningful, the triager decides how to organize the report for integration into the project's development process. We call triager decisions with the goal of determining if a report is meaningful, repository-oriented decisions, and triager decisions that organize reports for the development process, development-oriented decisions. Triagers can become overwhelmed by the number of reports added to the repository. Time spent triaging also typically diverts valuable resources away from the improvement of the product to the managing of the development process. To assist triagers, this dissertation presents a machine learning approach to create recommenders that assist with a variety of development-oriented decisions. In this way, we strive to reduce human involvement in triage by moving the triager's role from having to gather information to make a decision to that of confirming a suggestion. This dissertation introduces a triage-assisting recommender creation process that can create a variety of different development-oriented decision recommenders for a range of projects. The recommenders created with this approach are accurate: recommenders for which developer to assign a report have a precision of 70% to 98% over five open source projects, recommenders for which product component the report is for have a recall of 72% to 92%, and recommenders for who to add to the cc: list of a report have a recall of 46% to 72%.
We have evaluated recommenders created with our triage-assisting recommender creation process using both an analytic evaluation and a field study. In addition, we present in this dissertation an approach to assist project members to specify the project-specific values for the triage-assisting recommender creation process, and show that such recommenders can be created with a subset of the repository data. bug report triage machine learning recommender
264	Design of a self-paced brain computer interface system using features extracted from three neurological phenomena Fatourechi, Mehrdad 05 1900 (has links) Self-paced Brain computer interface (SBCI) systems allow individuals with motor disabilities to use their brain signals to control devices, whenever they wish. These systems are required to identify the user’s “intentional control (IC)” commands and they must remain inactive during all periods in which users do not intend control (called “no control (NC)” periods). This dissertation addresses three issues related to the design of SBCI systems: 1) their presently high false positive (FP) rates, 2) the presence of artifacts and 3) the identification of a suitable evaluation metric. To improve the performance of SBCI systems, the following are proposed: 1) a method for the automatic user-customization of a 2-state SBCI system; 2) a two-stage feature reduction method for selecting wavelet coefficients extracted from movement-related potentials (MRP); 3) an SBCI system that classifies features extracted from three neurological phenomena: MRPs, changes in the power of the Mu and Beta rhythms; 4) a novel method that effectively combines methods developed in 2) and 3); and 5) generalizing the system developed in 3) for detecting a right index finger flexion to detecting the right hand extension. Results of these studies using actual movements show an average true positive (TP) rate of 56.2% at the FP rate of 0.14% for the finger flexion study and an average TP rate of 33.4% at the FP rate of 0.12% for the hand extension study. These FP results are significantly lower than those achieved in other SBCI systems, where FP rates vary between 1-10%. We also conduct a comprehensive survey of the BCI literature. We demonstrate that many BCI papers do not properly deal with artifacts. We show that the proposed BCI achieves a good performance of TP=51.8% and FP=0.4% in the presence of eye movement artifacts.
Further tests of the performance of the proposed system in a pseudo-online environment show an average TP rate of 48.8% at the FP rate of 0.8%. Finally, we propose a framework for choosing a suitable evaluation metric for SBCI systems. This framework shows that the Kappa coefficient is more suitable than other metrics in evaluating the performance during the model selection procedure. brain computer interface pattern recognition machine learning
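The abstract above refers to TP and FP rates and to the Kappa coefficient as an evaluation metric. The following is a minimal sketch of Cohen's kappa for a two-class confusion matrix. It is not the dissertation's evaluation framework; the function names and the example counts are hypothetical. It illustrates one reason a chance-corrected metric can be informative when "no control" samples vastly outnumber "intentional control" samples.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary confusion matrix.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the row and column
    marginals. Returns 0.0 in the degenerate case p_e == 1.
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    chance_positive = ((tp + fp) / n) * ((tp + fn) / n)
    chance_negative = ((fn + tn) / n) * ((fp + tn) / n)
    expected = chance_positive + chance_negative
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical counts: a classifier that never activates on data with
# 10 control samples and 990 no-control samples.
always_idle_accuracy = accuracy(tp=0, fp=0, fn=10, tn=990)
always_idle_kappa = cohen_kappa(tp=0, fp=0, fn=10, tn=990)
```

In this made-up case the always-idle classifier scores very high accuracy but a kappa of zero, because all of its agreement with the labels is what chance alone would produce.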
265	Data analysis in proteomics: novel computational strategies for modeling and interpreting complex mass spectrometry data Sniatynski, Matthew John 11 1900 (has links) Contemporary proteomics studies require computational approaches to deal with both the complexity of the data generated, and with the volume of data produced. The amalgamation of mass spectrometry -- the analytical tool of choice in proteomics -- with the computational and statistical sciences is still recent, and several avenues of exploratory data analysis and statistical methodology remain relatively unexplored. The current study focuses on three broad analytical domains, and develops novel exploratory approaches and practical tools in each. Data transform approaches are the first explored. These methods re-frame data, allowing for the visualization and exploitation of features and trends that are not immediately evident. An exploratory approach making use of the correlation transform is developed, and is used to identify mass-shift signals in mass spectra. This approach is used to identify and map post-translational modifications on individual peptides, and to identify SILAC modification-containing spectra in a full-scale proteomic analysis. Secondly, matrix decomposition and projection approaches are explored; these use an eigen-decomposition to extract general trends from groups of related spectra. A data visualization approach is demonstrated using these techniques, capable of visualizing trends in large numbers of complex spectra, and a data compression and feature extraction technique is developed suitable for use in spectral modeling. Finally, a general machine learning approach is developed based on conditional random fields (CRFs). These models are capable of dealing with arbitrary sequence modeling tasks, similar to hidden Markov models (HMMs), but are far more robust to interdependent observational features, and do not require limiting independence assumptions to remain tractable.
The theory behind this approach is developed, and a simple machine learning fragmentation model is developed to test the hypothesis that reproducible sequence-specific intensity ratios are present within the distribution of fragment ions originating from a common peptide bond breakage. After training, the model shows very good performance associating peptide sequences and fragment ion intensity information, lending strong support to the hypothesis. Proteomics Bioinformatics Machine learning Mass spectrometry
266	Machine Learning Methods and Models for Ranking Volkovs, Maksims 13 August 2013 (has links) Ranking problems are ubiquitous and occur in a variety of domains that include social choice, information retrieval, computational biology and many others. Recent advancements in information technology have opened new data processing possibilities and significantly increased the complexity of computationally feasible methods. Through these advancements ranking models are now beginning to be applied to many new and diverse problems. Across these problems data, which ranges from gene expressions to images and web-documents, has vastly different properties and is often not human generated. This makes it challenging to apply many of the existing models for ranking, which primarily originate in social choice and are typically designed for human generated preference data. As the field continues to evolve, a new trend has recently emerged where machine learning methods are being used to automatically learn the ranking models. While these methods typically lack the theoretical support of the social choice models, they often show excellent empirical performance and are able to handle large and diverse data, placing virtually no restrictions on the data type. These models have now been successfully applied to many diverse ranking problems including image retrieval, protein selection, machine translation and many others. Inspired by these promising results, the work presented in this thesis aims to advance machine methods for ranking and develop new techniques to allow effective modeling of existing and future problems. The presented work concentrates on three different but related domains: information retrieval, preference aggregation and collaborative filtering. In each domain we develop new models together with learning and inference methods and empirically verify our models on real-life data. Applied Sciences Artificial Intelligence Machine Learning 0800
268	Document Clustering with Dual Supervision Hu, Yeming 19 June 2012 (has links) Nowadays, academic researchers maintain a personal library of papers, which they would like to organize based on their needs, e.g., research, projects, or courseware. Clustering techniques are often employed to achieve this goal by grouping the document collection into different topics. Unsupervised clustering does not require any user effort but only produces one universal output with which users may not be satisfied. Therefore, document clustering needs user input for guidance to generate personalized clusters for different users. Semi-supervised clustering incorporates prior information and has the potential to produce customized clusters. Traditional semi-supervised clustering is based on user supervision in the form of labeled instances or pairwise instance constraints. However, alternative forms of user supervision exist, such as labeling features. For document clustering, document supervision involves labeling documents while feature supervision involves labeling features. Their joint use has been called dual supervision. In this thesis, we first explore and propose a framework to use feature supervision for interactive feature selection by indicating whether a feature is useful for clustering. Second, we enhance the semi-supervised clustering with feature supervision using feature reweighting. Third, we propose a unified framework to combine document supervision and feature supervision through seeding. The newly proposed algorithms are evaluated using oracles and demonstrated to be more helpful in producing better clusters matching a single user's point of view than document clustering without any supervision and with only document supervision. Finally, we conduct a user study to confirm that different users have different understandings of the same document collection and prefer personalized clusters.
At the same time, we demonstrate that document clustering with dual supervision is able to produce good personalized clusters even with noisy user input. Dual supervision is also demonstrated to be more effective in personalized clustering than no supervision or any single supervision. We also analyze users' behaviors during the user study and present suggestions for the design of document management software. Document Management Text Mining Machine Learning
269	Towards Coevolutionary Genetic Programming with Pareto Archiving Under Streaming Data Atwater, Aaron 13 August 2013 (has links) Classification under streaming data constraints implies that training must be performed continuously, can only access individual exemplars for a short time after they arrive, must adapt to dynamic behaviour over time, and must be able to retrieve a current classifier at any time. A coevolutionary genetic programming framework is adapted to operate in non-stationary streaming data environments. Methods to generate synthetic datasets for benchmarking streaming classification algorithms are introduced, and the proposed framework is evaluated against them. The use of Pareto archiving is evaluated as a mechanism for retaining access to a limited number of useful exemplars throughout training, and several fitness sharing heuristics for archiving are evaluated. Fitness sharing alone is found to be most effective under streams with continuous (incremental) changes, while the addition of an aging heuristic is preferred when the stream has stepwise changes. Tapped delay lines are explored as a method for explicitly incorporating sequence context in cyclical data streams, and their use in combination with the aging heuristic suggests a promising route forward. / Hyperref'd copy available at: https://web.cs.dal.ca/~atwater/ computer science genetic programming machine learning classification
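The abstract above evaluates Pareto archiving as a way to retain a limited number of useful exemplars. As a generic illustration of the underlying idea only, the sketch below maintains a set of mutually non-dominated objective vectors. It is not the thesis's framework: the function names, the maximisation convention, and the absence of a size cap, fitness sharing, or aging heuristic are all assumptions of this example.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (assumption: larger values are better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_archive(archive, candidate):
    """Return a new archive of mutually non-dominated vectors.

    The candidate is rejected if an archived vector dominates or equals it;
    otherwise it is added and every vector it dominates is dropped.
    """
    if any(dominates(kept, candidate) or kept == candidate for kept in archive):
        return list(archive)
    survivors = [kept for kept in archive if not dominates(candidate, kept)]
    return survivors + [candidate]


# Hypothetical two-objective scores arriving one at a time, as in a stream.
archive = []
for scores in [(1, 0), (0, 1), (1, 1), (0, 0)]:
    archive = update_archive(archive, scores)
```

A bounded archive would additionally need a rule for discarding non-dominated entries once a size limit is reached; the abstract names fitness sharing and aging as heuristics for that role.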
270	Learning multi-agent pursuit of a moving target Lu, Jieshan Unknown Date No description available. moving target search features machine learning
