1

Contrasting sequence groups by emerging sequences

Deng, Kang 11 1900 (has links)
Group comparison is a fundamental task in many scientific endeavours and is also the basis of any classifier, and comparing groups of sequence data is a particularly relevant instance of it. To contrast sequence groups, we define Emerging Sequences (ESs) as subsequences that are frequent in sequences of one group and less frequent in another, and that thus distinguish sequences of different classes. Distinguishing sequence classes by ESs raises two challenges: extracting ESs efficiently is not trivial, and only exact matches of subsequences are considered. In our work we address these problems with a suffix tree-based framework and a sliding-window matching mechanism. A classification model based on ESs is also proposed. Evaluated against several other learning algorithms on two datasets, our similarity-based ES classification model outperforms the baseline approaches. With the ESs' high discriminative power, the proposed model achieves satisfactory F-measures when classifying sequences.
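The thesis itself relies on a suffix-tree framework; the following is only a rough Python sketch of the underlying idea of emerging sequences (subsequences frequent in one group and rare in the other, matched within a sliding window), with the window size and the support and growth-rate thresholds chosen arbitrarily for illustration:

```python
def contains_with_window(sequence, pattern, window):
    """Check whether `pattern` occurs as a subsequence of `sequence`
    within a span of at most `window` consecutive items."""
    n, m = len(sequence), len(pattern)
    for start in range(n - m + 1):
        end = min(n, start + window)
        i, j = start, 0
        while i < end and j < m:
            if sequence[i] == pattern[j]:
                j += 1
            i += 1
        if j == m:
            return True
    return False

def support(group, pattern, window):
    """Fraction of sequences in `group` that contain `pattern`."""
    return sum(contains_with_window(s, pattern, window) for s in group) / len(group)

def emerging_sequences(pos, neg, candidates, window=5,
                       min_support=0.3, min_growth=2.0):
    """Return candidate subsequences that are frequent in `pos` and at least
    `min_growth` times less frequent in `neg` (illustrative thresholds)."""
    result = []
    for pattern in candidates:
        sup_pos = support(pos, pattern, window)
        sup_neg = support(neg, pattern, window)
        growth = sup_pos / sup_neg if sup_neg > 0 else float("inf")
        if sup_pos >= min_support and growth >= min_growth:
            result.append((pattern, sup_pos, sup_neg))
    return result
```

A classifier can then score a new sequence by how many ESs of each class it contains, which is the spirit, though not the exact mechanics, of the model evaluated in the thesis.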
2

Contrasting sequence groups by emerging sequences

Deng, Kang Unknown Date
No description available.
3

LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification

Mungre, Surbhi January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / Doina Caragea / Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms is largely dependent on the availability of labeled data and on the approach used to represent DNA sequences as feature vectors. For many organisms, labeled DNA data is scarce while unlabeled data is easily available; for a small number of well-studied model organisms, however, large amounts of labeled data are available. This calls for domain adaptation approaches, which can transfer knowledge from a source domain, for which labeled data is available, to a target domain, for which large amounts of unlabeled data are available. Intuitively, one approach to domain adaptation is to extract and represent the features that the source domain and the target domain sequences share. Latent Dirichlet Allocation (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches for DNA sequences, LDA Words (LDAW) and LDA Distribution (LDAD). LDA is a generative probabilistic model used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a "document" and the k-mers obtained from a sequence are the "document words". We use LDA to model our sequence collection. Given the LDA model, each document can be represented as a distribution over topics, where a topic can be seen as a distribution over k-mers. In the LDAW method, we use the top k-mers in each topic (i.e., the k-mers with the highest probability) as our features, while in the LDAD method we use the topic distribution to represent a document as a feature vector. We study LDA-based dimensionality reduction for both supervised DNA sequence classification and domain adaptation. We apply the proposed approaches to the splice site prediction problem, an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of the LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distance between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used to perform dimensionality reduction and domain adaptation for DNA sequence classification problems.
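A minimal sketch of the two feature-construction schemes described above, using gensim's LDA implementation; the k-mer size, number of topics, and top-k cutoff below are illustrative placeholders, not the parameters used in the thesis:

```python
from gensim import corpora, models

def kmers(sequence, k=8):
    """Tokenize a DNA sequence into overlapping k-mers (the 'document words')."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def lda_features(sequences, k=8, num_topics=10, top_n=20):
    """Fit LDA on k-mer 'documents' and derive LDAW- and LDAD-style features."""
    docs = [kmers(s, k) for s in sequences]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(corpus, id2word=dictionary,
                          num_topics=num_topics, random_state=0)

    # LDAW-style features: the highest-probability k-mers of each topic,
    # to be used as a reduced vocabulary for representing sequences.
    ldaw_vocab = sorted({kmer for t in range(num_topics)
                         for kmer, _ in lda.show_topic(t, topn=top_n)})

    # LDAD-style features: each sequence represented by its topic distribution.
    ldad = [[prob for _, prob in lda.get_document_topics(bow, minimum_probability=0.0)]
            for bow in corpus]
    return ldaw_vocab, ldad
```

In a domain adaptation setting, the same LDA model can be fit on sequences pooled from the source and target organisms so that both domains are represented in a shared feature space.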
4

Unsupervised feature construction approaches for biological sequence classification

Tangirala, Karthik January 1900 (has links)
Doctor of Philosophy / Department of Computing and Information Sciences / Doina Caragea / Recent advancements in the biological sciences have resulted in the availability of large amounts of sequence data (DNA and protein sequences). Biological sequence data can be annotated using machine learning techniques, but most learning algorithms require data to be represented by a vector of features. In the absence of biologically informative features, k-mers generated using a sliding-window approach are commonly used to represent biological sequences. A larger k value typically results in better features; however, the number of k-mer features is exponential in k, and many k-mers are not informative. Feature selection is widely used to reduce the dimensionality of the input feature space. Most feature selection techniques use feature-class dependency scores to rank the features, but when the amount of available labeled data is small, these scores may not be captured accurately. Therefore, instead of working with all k-mers, this dissertation proposes the construction of a reduced set of informative k-mers that can be used to represent biological sequences. This work resulted in three novel unsupervised approaches to feature construction:
1. A Burrows-Wheeler Transform-based approach, which uses the sorted permutations of a given sequence to construct sequential features (subsequences) that occur multiple times in that sequence (see the sketch after this entry).
2. A community detection-based approach, which uses a community detection algorithm to group similar subsequences into communities and refines the communities to form motifs (groups of similar subsequences). Motifs obtained this way satisfy the ZOMOPS constraint (Zero, One or Multiple Occurrences of a Motif Per Sequence). All unique subsequences of the obtained motifs are then used as features to represent the sequences.
3. A hybrid approach, which combines the two approaches above to allow certain mismatches in the features constructed by the Burrows-Wheeler Transform-based approach.
To evaluate the predictive power of the constructed features, experiments were conducted in three learning scenarios (supervised, semi-supervised, and domain adaptation) for both nucleotide and protein sequence classification problems. The performance of classifiers learned using features generated with the proposed approaches was compared with the performance of classifiers learned using k-mers (with feature selection) and feature hashing (another unsupervised dimensionality reduction technique). Experimental results from the three learning scenarios showed that features constructed with the proposed approaches were typically more informative than k-mers and feature hashing.
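As a loose illustration of the first idea only (not the dissertation's actual algorithm), sorting the cyclic rotations of a sequence, as is done when computing the Burrows-Wheeler Transform, places repeated substrings on adjacent rows, where they can be read off the shared prefixes; the minimum-length threshold is arbitrary:

```python
def bwt_rotations(sequence):
    """All cyclic rotations of the sequence (with an end marker), sorted,
    as used when computing the Burrows-Wheeler Transform."""
    s = sequence + "$"
    return sorted(s[i:] + s[:i] for i in range(len(s)))

def repeated_substrings(sequence, min_len=4):
    """Substrings occurring more than once, read off the common prefixes
    of adjacent sorted rotations."""
    rotations = bwt_rotations(sequence)
    repeats = set()
    for a, b in zip(rotations, rotations[1:]):
        # length of the prefix shared by two adjacent rotations,
        # stopping at the end marker to avoid wrapping artifacts
        lcp = 0
        while lcp < min(len(a), len(b)) and a[lcp] == b[lcp] and a[lcp] != "$":
            lcp += 1
        if lcp >= min_len:
            repeats.add(a[:lcp])
    return sorted(repeats)

# Toy example: repeated subsequences of length >= 4 in a short DNA string.
print(repeated_substrings("ACGTACGTTACGTA", min_len=4))
```

The dissertation's hybrid approach would then relax these exact repeats by allowing a limited number of mismatches, which this toy version does not attempt.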
5

Domain adaptation algorithms for biological sequence classification

Herndon, Nic January 1900 (has links)
Doctor of Philosophy / Department of Computing and Information Sciences / Doina Caragea / The large volume of data generated in recent years has created opportunities for discoveries in various fields. In biology, next-generation sequencing technologies determine the exact order of nucleotides within a DNA or RNA fragment faster and more cheaply. This large volume of data requires automated tools to extract information and generate knowledge. Machine learning classification algorithms provide an automated means to annotate data, but require some of these data to be manually labeled by human experts, a process that is costly and time consuming. An alternative to labeling data is to use existing labeled data from a related domain, the source domain, if any such data is available, to train a classifier for the domain of interest, the target domain. However, the classification accuracy for the domain of interest usually decreases as the distance between the source and target domains increases. Another alternative is to label some data and complement it with abundant unlabeled data from the same domain, and train a semi-supervised classifier, although the unlabeled data can mislead such a classifier. In this work another alternative is considered, domain adaptation, in which the goal is to train an accurate classifier for a domain with limited labeled data and abundant unlabeled data, the target domain, by leveraging labeled data from a related domain, the source domain. Several domain adaptation classifiers are proposed, derived from a supervised discriminative classifier (logistic regression) or a supervised generative classifier (naïve Bayes), and some of the factors that influence their accuracy are studied: the features, the data used from the source domain, how to incorporate the unlabeled data, and how to combine all available data. The proposed approaches were evaluated on two biological problems -- protein localization and ab initio splice site prediction. The former is motivated by the fact that predicting where a protein is localized provides an indication of its function, whereas the latter is an essential step in gene prediction.
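One common way to combine the three kinds of data mentioned above (labeled source, labeled target, unlabeled target) is to down-weight source examples and fold in the unlabeled target data through EM-style self-training under a naive Bayes model. The sketch below illustrates that generic scheme only; it is not the thesis's specific classifiers, and the source weight and iteration count are arbitrary:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def domain_adapted_nb(X_src, y_src, X_tgt, y_tgt, X_tgt_unlab,
                      source_weight=0.3, em_iterations=3):
    """Naive Bayes fit on weighted source + target labeled data, then refined
    with unlabeled target data via EM-style self-training (generic sketch)."""
    clf = MultinomialNB()
    X_lab = np.vstack([X_src, X_tgt])
    y_lab = np.concatenate([y_src, y_tgt])
    w_lab = np.concatenate([np.full(len(y_src), source_weight),
                            np.ones(len(y_tgt))])
    clf.fit(X_lab, y_lab, sample_weight=w_lab)

    for _ in range(em_iterations):
        # E-step: soft labels for the unlabeled target data.
        proba = clf.predict_proba(X_tgt_unlab)
        pseudo_y = clf.classes_[proba.argmax(axis=1)]
        pseudo_w = proba.max(axis=1)
        # M-step: refit on labeled data plus confidence-weighted pseudo-labels.
        X_all = np.vstack([X_lab, X_tgt_unlab])
        y_all = np.concatenate([y_lab, pseudo_y])
        w_all = np.concatenate([w_lab, pseudo_w])
        clf.fit(X_all, y_all, sample_weight=w_all)
    return clf
```

The thesis studies exactly these design choices (how much source data to trust, how to use the unlabeled data, how to combine everything), so the weights above should be read as tunable knobs rather than recommendations.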
6

Transport mode inference by multimodal map matching and sequence classification / Inferens i transportläge genom multimodal kartmatchning och sekvensklassificering

Salerno, Bruno January 2020 (has links)
Automation of travel diary collection, an essential input for transport planning, has been a fruitful line of research in recent years, in particular concerning the problem of automatically inferring transport modes. Taking advantage of technological advances, several solutions based on data collected from mobile devices, such as GPS locations and variables related to movement (such as speed) and motion (e.g., accelerometer measurements), have been investigated. The literature shows that many of them rely on an explicit initial segmentation of GPS trajectories into trip legs, followed by a segment-based classification problem. In some cases, GIS-related features are included in the classification instance, but usually in terms of distance to transport networks or to specific points of interest (POIs). The aim of this MSc thesis is to investigate a novel transport mode inference procedure based on the generation of topological features from a multimodal map matching instance. We define topological features as the topological context of each point of a GPS trajectory. Using these features in a sequence classification problem leads to mode prediction and to the implicit definition of the trip legs. In addition to not depending on an explicit segmentation step, the proposed routine also has fewer requirements in terms of the complexity of the required GIS features: there is no need to consider distance features, and the proposed map matching implementation does not require a single unified multimodal network, as other multimodal map matching approaches do. The procedure was tested with a travel diary data set collected in Stockholm, containing 4246 trips from 368 different commuters. The transport modes considered were walk, subway, commuter train, bus and tram. In order to assess the impact of the topological context, different feature set compositions were investigated, including topological and conventional movement and motion features. Three different classifiers (decision tree, support vector machine and conditional random field) were evaluated as well. The results show that the proposed procedure reached high accuracy, with performance similar to that of current approaches, and that the best-performing feature set composition was the one that included both topological and movement/motion features. The best evaluation measures were obtained with the decision tree and conditional random field classifiers, but with some differences: while both presented similar recall, the former yielded better precision and the latter achieved a higher segmentation quality.
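As a rough, hypothetical illustration of the per-point formulation (the feature names and values below are invented, not taken from the thesis), each GPS point can be described by movement/motion features together with topological-context indicators from the map matching step and fed to one of the evaluated classifiers, for instance a decision tree:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Each GPS point: movement/motion features plus (hypothetical) topological-context
# indicators derived from the multimodal map matching candidates.
points = [
    {"speed": 1.2,  "accel_var": 0.05, "near_walk_edge": 1, "near_rail_edge": 0},
    {"speed": 14.8, "accel_var": 0.40, "near_walk_edge": 0, "near_rail_edge": 1},
    {"speed": 7.5,  "accel_var": 0.90, "near_walk_edge": 0, "near_rail_edge": 0},
]
modes = ["walk", "commuter_train", "bus"]   # per-point labels from the travel diary

model = make_pipeline(DictVectorizer(sparse=False),
                      DecisionTreeClassifier(max_depth=5))
model.fit(points, modes)
print(model.predict([{"speed": 1.0, "accel_var": 0.03,
                      "near_walk_edge": 1, "near_rail_edge": 0}]))
```

A sequence model such as the conditional random field evaluated in the thesis would additionally exploit the dependency between consecutive points, which is what yields the implicit trip-leg segmentation.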
7

Common Features in lncRNA Annotation and Classification: A Survey

Klapproth, Christopher, Sen, Rituparno, Stadler, Peter F., Findeiß, Sven, Fallmann, Jörg 05 May 2023 (has links)
Long non-coding RNAs (lncRNAs) are widely recognized as important regulators of gene expression. Their molecular functions range from miRNA sponging to chromatin-associated mechanisms, leading to effects on disease progression and establishing them as diagnostic and therapeutic targets. Still, only a few representatives of this diverse class of RNAs are well studied, while the vast majority are poorly described beyond the existence of their transcripts. In this review we survey common in silico approaches for lncRNA annotation. We focus on the well-established sets of features used for classification and discuss their specific advantages and weaknesses. While the available tools perform very well at distinguishing coding sequences from other RNAs, we find that current methods are not well suited to distinguish lncRNAs, or parts thereof, from other non-protein-coding input sequences. We conclude that the distinction of lncRNAs from intronic sequences and untranslated regions of coding mRNAs remains a pressing research gap.
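For readers unfamiliar with what such classification features typically look like, the sketch below computes a few commonly used coding-potential features (transcript length, GC content, and longest-ORF statistics); these are generic examples of the kind of features many tools rely on, not the specific feature sets compared in the survey:

```python
import re

def gc_content(seq):
    """Fraction of G/C nucleotides, a common coding-potential feature."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def longest_orf_length(seq):
    """Length (in nt) of the longest forward-strand ORF (ATG to first in-frame stop)."""
    orfs = re.finditer(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)", seq.upper())
    return max((len(m.group(0)) for m in orfs), default=0)

def feature_vector(seq):
    """Simple per-transcript features used by many coding/non-coding classifiers."""
    orf = longest_orf_length(seq)
    return {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "orf_length": orf,
        "orf_coverage": orf / len(seq),   # long ORFs relative to length suggest coding potential
    }

print(feature_vector("ATGGCGTAAACGTACGATGCCCGGGTTTAAACGTAG"))
```

Features of this kind separate coding from non-coding transcripts reasonably well, which is consistent with the review's point that the harder open problem is separating lncRNAs from other non-coding sequence classes.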
8

SEQUENCE CLASSIFICATION USING HIDDEN MARKOV MODELS

DESAI, PRANAY A. 13 July 2005 (has links)
No description available.
9

Process pattern mining: identifying sources of assignable error using event logs

Shetty, Bhupesh 01 December 2018 (has links)
This thesis examines the problem of identifying patterns in process event logs that are correlated with binary outcomes that remain undetected until the end of the process. Specifically, we consider the task of identifying patterns in a machine shop manufacturing process that are correlated with product defects. We introduce a pattern mining algorithm based on Apriori to identify frequent patterns, and use binary correlation measures to identify patterns associated with an elevated defect rate. We design a simulation model to generate synthetic datasets to test our algorithm, compare the effectiveness of different correlation measures, target pattern complexities, and sample sizes, with and without knowledge of the underlying process, and show that knowledge of the underlying process helps in identifying the pattern that is associated with defects. We also develop a decision support tool based on p-value simulation to help managers identify sources of error in real-life settings. We then apply our method to real-world data and extract useful information to help plant managers make decisions related to investments and workforce planning. The thesis also explores the problem of predicting the defect probability of a job given its ordered list of events. We develop a supervised learning model using the frequencies of patterns deduced from the event log as the feature set, discuss the challenges faced in this approach, and conclude that the random forest algorithm performs better than the other methods. We apply this approach to a real-world case study and discuss applications in the machine shop. Finally, the thesis explores the order-bidding process in the machine shop industry and proposes an optimization-based model to maximize the profit of the machine shop. Through a case study example, we show the advantages of using the defect probability in the proposed optimization model to determine the machine-worker schedule for executing job orders in a machine shop.
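To make the mining step concrete, a toy version of the general idea (Apriori-style enumeration of frequent event subsequences plus a binary correlation measure, here the phi coefficient as one possible choice) could look as follows; this is a sketch of the technique, not the thesis's algorithm or its correlation measures:

```python
from math import sqrt

def occurs(log, pattern):
    """True if `pattern` appears as an in-order subsequence of the event log."""
    it = iter(log)
    return all(event in it for event in pattern)

def phi_coefficient(pattern, logs, defects):
    """Correlation between pattern occurrence and the binary defect label."""
    n = len(logs)
    a = sum(1 for log, d in zip(logs, defects) if occurs(log, pattern) and d)
    b = sum(1 for log, d in zip(logs, defects) if occurs(log, pattern) and not d)
    c = sum(1 for log, d in zip(logs, defects) if not occurs(log, pattern) and d)
    d_ = n - a - b - c
    denom = sqrt((a + b) * (c + d_) * (a + c) * (b + d_))
    return (a * d_ - b * c) / denom if denom else 0.0

def frequent_patterns(logs, min_support=0.2, max_len=2):
    """Apriori-style enumeration of event subsequences above a support threshold."""
    events = sorted({e for log in logs for e in log})
    candidates = [(e,) for e in events]
    frequent = []
    for _ in range(max_len):
        kept = [p for p in candidates
                if sum(occurs(log, p) for log in logs) / len(logs) >= min_support]
        frequent.extend(kept)
        # candidate generation: extend each frequent pattern by one event
        candidates = [p + (e,) for p in kept for e in events]
    return frequent
```

Patterns with high support and a strongly positive correlation with the defect flag are then the candidate "sources of assignable error" a manager would investigate.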
10

Réseaux de neurones récurrents pour la classification de séquences dans des flux audiovisuels parallèles / Recurrent neural networks for sequence classification in parallel TV streams

Bouaziz, Mohamed 06 December 2017 (has links)
Audiovisual content streams, such as TV channels, can be represented as sequences of successive events (e.g., a series of programs, scenes, etc.) that can exhibit chronological relations. For a given channel, broadcast programming follows the rules defined by the channel itself, but can also be affected by the programming of competing channels. In such conditions, the event sequences of parallel streams can provide additional knowledge about the events of a particular stream. In machine learning, various methods suited to processing sequential data have been proposed; Long Short-Term Memory (LSTM) recurrent neural networks in particular have proven their worth in many applications dealing with this type of data. Nevertheless, these approaches are designed to handle only a single input sequence at a time. The main contribution of this thesis is the development of approaches that jointly process sequential data derived from multiple parallel streams. The application task of our work, carried out in collaboration with the computer science laboratory of Avignon (LIA) and the EDD company, is to predict the genre of a telecast. This prediction can be based on the histories of previous telecast genres in the same channel, but also on those belonging to other, parallel channels. We propose a telecast genre taxonomy adapted to such automatic processing, as well as a dataset containing the parallel history sequences of 4 French TV channels. Two original methods are proposed in this work to take parallel stream sequences into account. The first one, the Parallel LSTM (PLSTM) architecture, is an extension of the LSTM model: PLSTM processes each sequence simultaneously in a separate recurrent layer and sums the outputs of these layers to produce the final output. The second approach, called MSE-SVM, takes advantage of both LSTM and Support Vector Machine (SVM) methods. First, latent feature vectors are generated independently for each input stream, using the output event of the main one. These new representations are then merged and fed to an SVM. The PLSTM and MSE-SVM approaches proved their ability to integrate parallel sequences by outperforming, respectively, the LSTM and SVM models that only take the sequences of the main stream into account. Both proposed approaches exploit the information contained in long sequences well; however, they have difficulty dealing with short ones. Although MSE-SVM generally outperforms the PLSTM approach, the problem with short sequences is more pronounced for MSE-SVM. Finally, we propose to extend this approach by feeding in additional information related to each event in the input sequences (e.g., the weekday of a telecast). This extension, named AMSE-SVM, behaves remarkably better on short sequences without affecting performance on long ones.
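A rough PyTorch sketch of the PLSTM idea described above (one recurrent layer per channel, per-stream outputs summed before the prediction layer) might look like the following; the embedding size, hidden size, number of genres, and history length are invented for illustration and are not the thesis's hyperparameters:

```python
import torch
import torch.nn as nn

class ParallelLSTM(nn.Module):
    """Each parallel stream gets its own recurrent layer; the per-stream
    outputs are summed before the final prediction layer (PLSTM idea)."""
    def __init__(self, num_streams, vocab_size, embed_dim=32,
                 hidden_dim=64, num_genres=10):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(vocab_size, embed_dim) for _ in range(num_streams))
        self.lstms = nn.ModuleList(
            nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            for _ in range(num_streams))
        self.out = nn.Linear(hidden_dim, num_genres)

    def forward(self, streams):
        # streams: list of (batch, seq_len) tensors of genre ids, one per channel
        summed = 0
        for x, emb, lstm in zip(streams, self.embeddings, self.lstms):
            _, (h_n, _) = lstm(emb(x))      # final hidden state of this stream
            summed = summed + h_n[-1]       # sum the per-stream representations
        return self.out(summed)

# Toy usage: 4 channels, histories of 8 past genres each, batch of 2 examples.
model = ParallelLSTM(num_streams=4, vocab_size=20)
streams = [torch.randint(0, 20, (2, 8)) for _ in range(4)]
logits = model(streams)        # shape: (2, num_genres)
```

The MSE-SVM variant would instead train one such recurrent model per stream, extract its latent feature vectors, and feed the concatenated representations to an SVM.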
