1

Contrasting sequence groups by emerging sequences

Deng, Kang 11 1900 (has links)
Group comparison is a fundamental task in many scientific endeavours and is also the basis of any classifier, and comparing groups of sequence data is a particularly relevant instance of it. To contrast sequence groups, we define Emerging Sequences (ESs) as subsequences that are frequent in sequences of one group and less frequent in another, and that thus distinguish sequences of different classes. Distinguishing sequence classes by ESs raises two challenges: extracting ESs efficiently is not trivial, and only exact matches of subsequences are considered. In our work we address these problems with a suffix tree-based framework and a sliding-window matching mechanism. A classification model based on ESs is also proposed. Evaluated against several other learning algorithms on two datasets, our similarity-based ES classification model outperforms the baseline approaches. With the ESs' high discriminative power, the proposed model achieves satisfactory F-measures when classifying sequences.
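The thesis itself relies on a suffix-tree framework; the following is only a rough Python sketch of the underlying idea of emerging sequences (subsequences frequent in one group and rare in the other, matched within a sliding window), with the window size and the support and growth-rate thresholds chosen arbitrarily for illustration:

```python
def contains_with_window(sequence, pattern, window):
    """Check whether `pattern` occurs as a subsequence of `sequence`
    within a span of at most `window` consecutive items."""
    n, m = len(sequence), len(pattern)
    for start in range(n - m + 1):
        end = min(n, start + window)
        i, j = start, 0
        while i < end and j < m:
            if sequence[i] == pattern[j]:
                j += 1
            i += 1
        if j == m:
            return True
    return False

def support(group, pattern, window):
    """Fraction of sequences in `group` that contain `pattern`."""
    return sum(contains_with_window(s, pattern, window) for s in group) / len(group)

def emerging_sequences(pos, neg, candidates, window=5,
                       min_support=0.3, min_growth=2.0):
    """Return candidate subsequences that are frequent in `pos` and at least
    `min_growth` times less frequent in `neg` (illustrative thresholds)."""
    result = []
    for pattern in candidates:
        sup_pos = support(pos, pattern, window)
        sup_neg = support(neg, pattern, window)
        growth = sup_pos / sup_neg if sup_neg > 0 else float("inf")
        if sup_pos >= min_support and growth >= min_growth:
            result.append((pattern, sup_pos, sup_neg))
    return result
```

A classifier can then score a new sequence by how many ESs of each class it contains, which is the spirit, though not the exact mechanics, of the model evaluated in the thesis.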
2

Contrasting sequence groups by emerging sequences

Deng, Kang Unknown Date
No description available.
3

LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification

Mungre, Surbhi January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / Doina Caragea / Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms is largely dependent on the availability of labeled data and on the approach used to represent DNA sequences as feature vectors. For many organisms, labeled DNA data is scarce while unlabeled data is easily available; for a small number of well-studied model organisms, however, large amounts of labeled data are available. This calls for domain adaptation approaches, which can transfer knowledge from a source domain, for which labeled data is available, to a target domain, for which large amounts of unlabeled data are available. Intuitively, one approach to domain adaptation is to extract and represent the features that the source domain and the target domain sequences share. Latent Dirichlet Allocation (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches for DNA sequences, LDA Words (LDAW) and LDA Distribution (LDAD). LDA is a generative probabilistic model used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a "document" and the k-mers obtained from a sequence are the "document words". We use LDA to model our sequence collection. Given the LDA model, each document can be represented as a distribution over topics, where a topic can be seen as a distribution over k-mers. In the LDAW method, we use the top k-mers in each topic (i.e., the k-mers with the highest probability) as our features, while in the LDAD method we use the topic distribution to represent a document as a feature vector. We study LDA-based dimensionality reduction for both supervised DNA sequence classification and domain adaptation. We apply the proposed approaches to the splice site prediction problem, an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of the LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distance between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used to perform dimensionality reduction and domain adaptation for DNA sequence classification problems.
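A minimal sketch of the two feature-construction schemes described above, using gensim's LDA implementation; the k-mer size, number of topics, and top-k cutoff below are illustrative placeholders, not the parameters used in the thesis:

```python
from gensim import corpora, models

def kmers(sequence, k=8):
    """Tokenize a DNA sequence into overlapping k-mers (the 'document words')."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def lda_features(sequences, k=8, num_topics=10, top_n=20):
    """Fit LDA on k-mer 'documents' and derive LDAW- and LDAD-style features."""
    docs = [kmers(s, k) for s in sequences]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(corpus, id2word=dictionary,
                          num_topics=num_topics, random_state=0)

    # LDAW-style features: the highest-probability k-mers of each topic,
    # to be used as a reduced vocabulary for representing sequences.
    ldaw_vocab = sorted({kmer for t in range(num_topics)
                         for kmer, _ in lda.show_topic(t, topn=top_n)})

    # LDAD-style features: each sequence represented by its topic distribution.
    ldad = [[prob for _, prob in lda.get_document_topics(bow, minimum_probability=0.0)]
            for bow in corpus]
    return ldaw_vocab, ldad
```

In a domain adaptation setting, the same LDA model can be fit on sequences pooled from the source and target organisms so that both domains are represented in a shared feature space.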
4

Unsupervised feature construction approaches for biological sequence classification

Tangirala, Karthik January 1900 (has links)
Doctor of Philosophy / Department of Computing and Information Sciences / Doina Caragea / Recent advancements in the biological sciences have resulted in the availability of large amounts of sequence data (DNA and protein sequences). Biological sequence data can be annotated using machine learning techniques, but most learning algorithms require data to be represented by a vector of features. In the absence of biologically informative features, k-mers generated using a sliding-window approach are commonly used to represent biological sequences. A larger k value typically results in better features; however, the number of k-mer features is exponential in k, and many k-mers are not informative. Feature selection is widely used to reduce the dimensionality of the input feature space. Most feature selection techniques use feature-class dependency scores to rank the features, but when the amount of available labeled data is small, these scores may not be captured accurately. Therefore, instead of working with all k-mers, this dissertation proposes the construction of a reduced set of informative k-mers that can be used to represent biological sequences. This work resulted in three novel unsupervised approaches to feature construction:
1. A Burrows-Wheeler Transform-based approach, which uses the sorted permutations of a given sequence to construct sequential features (subsequences) that occur multiple times in that sequence (see the sketch after this entry).
2. A community detection-based approach, which uses a community detection algorithm to group similar subsequences into communities and refines the communities to form motifs (groups of similar subsequences). Motifs obtained this way satisfy the ZOMOPS constraint (Zero, One or Multiple Occurrences of a Motif Per Sequence). All unique subsequences of the obtained motifs are then used as features to represent the sequences.
3. A hybrid approach, which combines the two approaches above to allow certain mismatches in the features constructed by the Burrows-Wheeler Transform-based approach.
To evaluate the predictive power of the constructed features, experiments were conducted in three learning scenarios (supervised, semi-supervised, and domain adaptation) for both nucleotide and protein sequence classification problems. The performance of classifiers learned using features generated with the proposed approaches was compared with the performance of classifiers learned using k-mers (with feature selection) and feature hashing (another unsupervised dimensionality reduction technique). Experimental results from the three learning scenarios showed that features constructed with the proposed approaches were typically more informative than k-mers and feature hashing.
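As a loose illustration of the first idea only (not the dissertation's actual algorithm), sorting the cyclic rotations of a sequence, as is done when computing the Burrows-Wheeler Transform, places repeated substrings on adjacent rows, where they can be read off the shared prefixes; the minimum-length threshold is arbitrary:

```python
def bwt_rotations(sequence):
    """All cyclic rotations of the sequence (with an end marker), sorted,
    as used when computing the Burrows-Wheeler Transform."""
    s = sequence + "$"
    return sorted(s[i:] + s[:i] for i in range(len(s)))

def repeated_substrings(sequence, min_len=4):
    """Substrings occurring more than once, read off the common prefixes
    of adjacent sorted rotations."""
    rotations = bwt_rotations(sequence)
    repeats = set()
    for a, b in zip(rotations, rotations[1:]):
        # length of the prefix shared by two adjacent rotations,
        # stopping at the end marker to avoid wrapping artifacts
        lcp = 0
        while lcp < min(len(a), len(b)) and a[lcp] == b[lcp] and a[lcp] != "$":
            lcp += 1
        if lcp >= min_len:
            repeats.add(a[:lcp])
    return sorted(repeats)

# Toy example: repeated subsequences of length >= 4 in a short DNA string.
print(repeated_substrings("ACGTACGTTACGTA", min_len=4))
```

The dissertation's hybrid approach would then relax these exact repeats by allowing a limited number of mismatches, which this toy version does not attempt.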
5

Domain adaptation algorithms for biological sequence classification

Herndon, Nic January 1900 (has links)
Doctor of Philosophy / Department of Computing and Information Sciences / Doina Caragea / The large volume of data generated in recent years has created opportunities for discoveries in various fields. In biology, next-generation sequencing technologies determine the exact order of nucleotides within a DNA or RNA fragment faster and more cheaply. This large volume of data requires automated tools to extract information and generate knowledge. Machine learning classification algorithms provide an automated means to annotate data, but require some of these data to be manually labeled by human experts, a process that is costly and time consuming. An alternative to labeling data is to use existing labeled data from a related domain, the source domain, if any such data is available, to train a classifier for the domain of interest, the target domain. However, the classification accuracy for the domain of interest usually decreases as the distance between the source and target domains increases. Another alternative is to label some data and complement it with abundant unlabeled data from the same domain, and train a semi-supervised classifier, although the unlabeled data can mislead such a classifier. In this work another alternative is considered, domain adaptation, in which the goal is to train an accurate classifier for a domain with limited labeled data and abundant unlabeled data, the target domain, by leveraging labeled data from a related domain, the source domain. Several domain adaptation classifiers are proposed, derived from a supervised discriminative classifier (logistic regression) or a supervised generative classifier (naïve Bayes), and some of the factors that influence their accuracy are studied: the features, the data used from the source domain, how to incorporate the unlabeled data, and how to combine all available data. The proposed approaches were evaluated on two biological problems -- protein localization and ab initio splice site prediction. The former is motivated by the fact that predicting where a protein is localized provides an indication of its function, whereas the latter is an essential step in gene prediction.
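One common way to combine the three kinds of data mentioned above (labeled source, labeled target, unlabeled target) is to down-weight source examples and fold in the unlabeled target data through EM-style self-training under a naive Bayes model. The sketch below illustrates that generic scheme only; it is not the thesis's specific classifiers, and the source weight and iteration count are arbitrary:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def domain_adapted_nb(X_src, y_src, X_tgt, y_tgt, X_tgt_unlab,
                      source_weight=0.3, em_iterations=3):
    """Naive Bayes fit on weighted source + target labeled data, then refined
    with unlabeled target data via EM-style self-training (generic sketch)."""
    clf = MultinomialNB()
    X_lab = np.vstack([X_src, X_tgt])
    y_lab = np.concatenate([y_src, y_tgt])
    w_lab = np.concatenate([np.full(len(y_src), source_weight),
                            np.ones(len(y_tgt))])
    clf.fit(X_lab, y_lab, sample_weight=w_lab)

    for _ in range(em_iterations):
        # E-step: soft labels for the unlabeled target data.
        proba = clf.predict_proba(X_tgt_unlab)
        pseudo_y = clf.classes_[proba.argmax(axis=1)]
        pseudo_w = proba.max(axis=1)
        # M-step: refit on labeled data plus confidence-weighted pseudo-labels.
        X_all = np.vstack([X_lab, X_tgt_unlab])
        y_all = np.concatenate([y_lab, pseudo_y])
        w_all = np.concatenate([w_lab, pseudo_w])
        clf.fit(X_all, y_all, sample_weight=w_all)
    return clf
```

The thesis studies exactly these design choices (how much source data to trust, how to use the unlabeled data, how to combine everything), so the weights above should be read as tunable knobs rather than recommendations.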
6

Transport mode inference by multimodal map matching and sequence classification / Inferens i transportläge genom multimodal kartmatchning och sekvensklassificering

Salerno, Bruno January 2020 (has links)
Automation of travel diary collection, an essential input for transport planning, has been a fruitful line of research in recent years, in particular concerning the problem of automatically inferring transport modes. Taking advantage of technological advances, several solutions based on data collected from mobile devices, such as GPS locations and variables related to movement (such as speed) and motion (e.g., accelerometer measurements), have been investigated. The literature shows that many of them rely on an explicit initial segmentation of GPS trajectories into trip legs, followed by a segment-based classification problem. In some cases, GIS-related features are included in the classification instance, but usually in terms of distance to transport networks or to specific points of interest (POIs). The aim of this MSc thesis is to investigate a novel transport mode inference procedure based on the generation of topological features from a multimodal map matching instance. We define topological features as the topological context of each point of a GPS trajectory. Using these features in a sequence classification problem leads to mode prediction and to the implicit definition of the trip legs. In addition to not depending on an explicit segmentation step, the proposed routine also has fewer requirements in terms of the complexity of the required GIS features: there is no need to consider distance features, and the proposed map matching implementation does not require a single unified multimodal network, as other multimodal map matching approaches do. The procedure was tested with a travel diary data set collected in Stockholm, containing 4246 trips from 368 different commuters. The transport modes considered were walk, subway, commuter train, bus and tram. In order to assess the impact of the topological context, different feature set compositions were investigated, including topological and conventional movement and motion features. Three different classifiers (decision tree, support vector machine and conditional random field) were evaluated as well. The results show that the proposed procedure reached high accuracy, with performance similar to that of current approaches, and that the best-performing feature set composition was the one that included both topological and movement/motion features. The best evaluation measures were obtained with the decision tree and conditional random field classifiers, but with some differences: while both presented similar recall, the former yielded better precision and the latter achieved a higher segmentation quality.
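As a rough, hypothetical illustration of the per-point formulation (the feature names and values below are invented, not taken from the thesis), each GPS point can be described by movement/motion features together with topological-context indicators from the map matching step and fed to one of the evaluated classifiers, for instance a decision tree:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Each GPS point: movement/motion features plus (hypothetical) topological-context
# indicators derived from the multimodal map matching candidates.
points = [
    {"speed": 1.2,  "accel_var": 0.05, "near_walk_edge": 1, "near_rail_edge": 0},
    {"speed": 14.8, "accel_var": 0.40, "near_walk_edge": 0, "near_rail_edge": 1},
    {"speed": 7.5,  "accel_var": 0.90, "near_walk_edge": 0, "near_rail_edge": 0},
]
modes = ["walk", "commuter_train", "bus"]   # per-point labels from the travel diary

model = make_pipeline(DictVectorizer(sparse=False),
                      DecisionTreeClassifier(max_depth=5))
model.fit(points, modes)
print(model.predict([{"speed": 1.0, "accel_var": 0.03,
                      "near_walk_edge": 1, "near_rail_edge": 0}]))
```

A sequence model such as the conditional random field evaluated in the thesis would additionally exploit the dependency between consecutive points, which is what yields the implicit trip-leg segmentation.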
7

Common Features in lncRNA Annotation and Classification: A Survey

Klapproth, Christopher, Sen, Rituparno, Stadler, Peter F., Findeiß, Sven, Fallmann, Jörg 05 May 2023 (has links)
Long non-coding RNAs (lncRNAs) are widely recognized as important regulators of gene expression. Their molecular functions range from miRNA sponging to chromatin-associated mechanisms, leading to effects on disease progression and establishing them as diagnostic and therapeutic targets. Still, only a few representatives of this diverse class of RNAs are well studied, while the vast majority are poorly described beyond the existence of their transcripts. In this review we survey common in silico approaches for lncRNA annotation. We focus on the well-established sets of features used for classification and discuss their specific advantages and weaknesses. While the available tools perform very well at distinguishing coding sequences from other RNAs, we find that current methods are not well suited to distinguish lncRNAs, or parts thereof, from other non-protein-coding input sequences. We conclude that the distinction of lncRNAs from intronic sequences and untranslated regions of coding mRNAs remains a pressing research gap.
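For readers unfamiliar with what such classification features typically look like, the sketch below computes a few commonly used coding-potential features (transcript length, GC content, and longest-ORF statistics); these are generic examples of the kind of features many tools rely on, not the specific feature sets compared in the survey:

```python
import re

def gc_content(seq):
    """Fraction of G/C nucleotides, a common coding-potential feature."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def longest_orf_length(seq):
    """Length (in nt) of the longest forward-strand ORF (ATG to first in-frame stop)."""
    orfs = re.finditer(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)", seq.upper())
    return max((len(m.group(0)) for m in orfs), default=0)

def feature_vector(seq):
    """Simple per-transcript features used by many coding/non-coding classifiers."""
    orf = longest_orf_length(seq)
    return {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "orf_length": orf,
        "orf_coverage": orf / len(seq),   # long ORFs relative to length suggest coding potential
    }

print(feature_vector("ATGGCGTAAACGTACGATGCCCGGGTTTAAACGTAG"))
```

Features of this kind separate coding from non-coding transcripts reasonably well, which is consistent with the review's point that the harder open problem is separating lncRNAs from other non-coding sequence classes.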
8

SEQUENCE CLASSIFICATION USING HIDDEN MARKOV MODELS

DESAI, PRANAY A. 13 July 2005 (has links)
No description available.
9

Process pattern mining: identifying sources of assignable error using event logs

Shetty, Bhupesh 01 December 2018 (has links)
This thesis examines the problem of identifying patterns in process event logs that are correlated with binary outcomes that remain undetected until the end of the process. Specifically, we consider the task of identifying patterns in a machine shop manufacturing process that are correlated with product defects. We introduce a pattern mining algorithm based on Apriori to identify frequent patterns, and use binary correlation measures to identify patterns associated with an elevated defect rate. We design a simulation model to generate synthetic datasets to test our algorithm, compare the effectiveness of different correlation measures, target pattern complexities, and sample sizes, with and without knowledge of the underlying process, and show that knowledge of the underlying process helps in identifying the pattern that is associated with defects. We also develop a decision support tool based on p-value simulation to help managers identify sources of error in real-life settings. We then apply our method to real-world data and extract useful information to help plant managers make decisions related to investments and workforce planning. The thesis also explores the problem of predicting the defect probability of a job given its ordered list of events. We develop a supervised learning model using the frequencies of patterns deduced from the event log as the feature set, discuss the challenges faced in this approach, and conclude that the random forest algorithm performs better than the other methods. We apply this approach to a real-world case study and discuss applications in the machine shop. Finally, the thesis explores the order-bidding process in the machine shop industry and proposes an optimization-based model to maximize the profit of the machine shop. Through a case study example, we show the advantages of using the defect probability in the proposed optimization model to determine the machine-worker schedule for executing job orders in a machine shop.
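To make the mining step concrete, a toy version of the general idea (Apriori-style enumeration of frequent event subsequences plus a binary correlation measure, here the phi coefficient as one possible choice) could look as follows; this is a sketch of the technique, not the thesis's algorithm or its correlation measures:

```python
from math import sqrt

def occurs(log, pattern):
    """True if `pattern` appears as an in-order subsequence of the event log."""
    it = iter(log)
    return all(event in it for event in pattern)

def phi_coefficient(pattern, logs, defects):
    """Correlation between pattern occurrence and the binary defect label."""
    n = len(logs)
    a = sum(1 for log, d in zip(logs, defects) if occurs(log, pattern) and d)
    b = sum(1 for log, d in zip(logs, defects) if occurs(log, pattern) and not d)
    c = sum(1 for log, d in zip(logs, defects) if not occurs(log, pattern) and d)
    d_ = n - a - b - c
    denom = sqrt((a + b) * (c + d_) * (a + c) * (b + d_))
    return (a * d_ - b * c) / denom if denom else 0.0

def frequent_patterns(logs, min_support=0.2, max_len=2):
    """Apriori-style enumeration of event subsequences above a support threshold."""
    events = sorted({e for log in logs for e in log})
    candidates = [(e,) for e in events]
    frequent = []
    for _ in range(max_len):
        kept = [p for p in candidates
                if sum(occurs(log, p) for log in logs) / len(logs) >= min_support]
        frequent.extend(kept)
        # candidate generation: extend each frequent pattern by one event
        candidates = [p + (e,) for p in kept for e in events]
    return frequent
```

Patterns with high support and a strongly positive correlation with the defect flag are then the candidate "sources of assignable error" a manager would investigate.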
10

Réseaux de neurones récurrents pour la classification de séquences dans des flux audiovisuels parallèles / Recurrent neural networks for sequence classification in parallel TV streams

Bouaziz, Mohamed 06 December 2017 (has links)
Audiovisual content streams, such as TV channels, can be represented as sequences of successive events (e.g., a series of programs, scenes, etc.) that can exhibit chronological relations. For a given channel, broadcast programming follows the rules defined by the channel itself, but can also be affected by the programming of competing channels. In such conditions, the event sequences of parallel streams can provide additional knowledge about the events of a particular stream. In machine learning, various methods suited to processing sequential data have been proposed; Long Short-Term Memory (LSTM) recurrent neural networks in particular have proven their worth in many applications dealing with this type of data. Nevertheless, these approaches are designed to handle only a single input sequence at a time. The main contribution of this thesis is the development of approaches that jointly process sequential data derived from multiple parallel streams. The application task of our work, carried out in collaboration with the computer science laboratory of Avignon (LIA) and the EDD company, is to predict the genre of a telecast. This prediction can be based on the histories of previous telecast genres in the same channel, but also on those belonging to other, parallel channels. We propose a telecast genre taxonomy adapted to such automatic processing, as well as a dataset containing the parallel history sequences of 4 French TV channels. Two original methods are proposed in this work to take parallel stream sequences into account. The first one, the Parallel LSTM (PLSTM) architecture, is an extension of the LSTM model: PLSTM processes each sequence simultaneously in a separate recurrent layer and sums the outputs of these layers to produce the final output. The second approach, called MSE-SVM, takes advantage of both LSTM and Support Vector Machine (SVM) methods. First, latent feature vectors are generated independently for each input stream, using the output event of the main one. These new representations are then merged and fed to an SVM. The PLSTM and MSE-SVM approaches proved their ability to integrate parallel sequences by outperforming, respectively, the LSTM and SVM models that only take the sequences of the main stream into account. Both proposed approaches exploit the information contained in long sequences well; however, they have difficulty dealing with short ones. Although MSE-SVM generally outperforms the PLSTM approach, the problem with short sequences is more pronounced for MSE-SVM. Finally, we propose to extend this approach by feeding in additional information related to each event in the input sequences (e.g., the weekday of a telecast). This extension, named AMSE-SVM, behaves remarkably better on short sequences without affecting performance on long ones.
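A rough PyTorch sketch of the PLSTM idea described above (one recurrent layer per channel, per-stream outputs summed before the prediction layer) might look like the following; the embedding size, hidden size, number of genres, and history length are invented for illustration and are not the thesis's hyperparameters:

```python
import torch
import torch.nn as nn

class ParallelLSTM(nn.Module):
    """Each parallel stream gets its own recurrent layer; the per-stream
    outputs are summed before the final prediction layer (PLSTM idea)."""
    def __init__(self, num_streams, vocab_size, embed_dim=32,
                 hidden_dim=64, num_genres=10):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(vocab_size, embed_dim) for _ in range(num_streams))
        self.lstms = nn.ModuleList(
            nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            for _ in range(num_streams))
        self.out = nn.Linear(hidden_dim, num_genres)

    def forward(self, streams):
        # streams: list of (batch, seq_len) tensors of genre ids, one per channel
        summed = 0
        for x, emb, lstm in zip(streams, self.embeddings, self.lstms):
            _, (h_n, _) = lstm(emb(x))      # final hidden state of this stream
            summed = summed + h_n[-1]       # sum the per-stream representations
        return self.out(summed)

# Toy usage: 4 channels, histories of 8 past genres each, batch of 2 examples.
model = ParallelLSTM(num_streams=4, vocab_size=20)
streams = [torch.randint(0, 20, (2, 8)) for _ in range(4)]
logits = model(streams)        # shape: (2, num_genres)
```

The MSE-SVM variant would instead train one such recurrent model per stream, extract its latent feature vectors, and feed the concatenated representations to an SVM.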
