Global ETD Search

1	Dynamic protein classification: Adaptive models based on incremental learning strategies
Mohamed, Shakir. 18 March 2008.
One of the major problems in computational biology is the inability of existing classification models to incorporate expanding and new domain knowledge. This thesis addresses the problem of static classification models by introducing incremental learning for problems in bioinformatics. The tools developed are applied to the problem of classifying proteins into a number of primary and putative families. This type of classification is particularly relevant because of its role in drug discovery programs, where it saves cost and time. As a secondary problem, multi-class classification is also addressed. The standard approach to protein family classification is based on committees of binary classifiers. This one-vs-all approach is not ideal, and the classification systems presented here consist of classifiers that are able to do all-vs-all classification. Two incremental learning techniques are presented. The first is a novel algorithm based on the fuzzy ARTMAP classifier and an evolutionary strategy. The second applies the incremental learning algorithm Learn++. The two systems are tested on three datasets: data from the Structural Classification of Proteins (SCOP) database, the G-Protein Coupled Receptors (GPCR) database, and enzymes from the Protein Data Bank. The results show that the two techniques perform comparably to each other and to single batch-trained classifiers, with the added ability of incremental learning. Both techniques are shown to be useful for protein family classification, and they are also applicable to problems outside this area, with applications in proteomics, including the prediction of functions and of secondary and tertiary structures, and in genomics, such as promoter and splice site prediction and the classification of gene microarrays.
Keywords: bioinformatics; protein classification; neural networks; fuzzy ARTMAP; incremental learning
2	A Clustering Method For The Problem Of Protein Subcellular Localization
Bezek, Perit. 01 December 2006.
This study focuses on predicting the subcellular localization of a protein, since subcellular localization is helpful in understanding a protein's functions. The function of a protein may be estimated from its sequence. Motifs, or conserved subsequences, are strong indicators of function. In a given sample set of protein sequences known to perform the same function, a certain subsequence or group of subsequences should be common; that is, the occurrence (frequency) of common subsequences should be high. Our idea is to find the common subsequences through clustering and use these common groups (implicit motifs) to classify proteins. To calculate the distance between two subsequences, the traditional string edit distance is modified so that only replacement is allowed and the cost of replacement is related to an amino acid substitution matrix. Based on the modified string edit distance, spectral clustering embeds the subsequences into a transformed space in which the clustering problem is expected to become easier to solve. For a given protein sequence, the distribution of its subsequences over the clusters is the feature vector, which is subsequently fed to a classifier. The most important aspect of this approach is the use of spectral clustering based on the modified string edit distance.
Subject: QA 76.75-76.765 Computer Software
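The replacement-only edit distance and the cluster-distribution feature vector described in the abstract above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the residue groups and their costs are a toy stand-in for a real substitution matrix, the function names are hypothetical, and the spectral clustering step is replaced by a plain nearest-center assignment to keep the example dependency-free.

```python
# Hypothetical toy substitution costs (NOT a real matrix such as BLOSUM62):
# residues in the same group are cheap to swap, all other swaps cost more.
GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ", "CGP"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def replacement_cost(a, b):
    """Cost of replacing residue a with residue b under the toy grouping."""
    if a == b:
        return 0.0
    return 0.5 if GROUP_OF[a] == GROUP_OF[b] else 1.0


def substitution_distance(s, t):
    """Edit distance with replacement as the only operation.

    Without insertions or deletions the two subsequences must have equal
    length, and the distance is the sum of per-position replacement costs.
    """
    if len(s) != len(t):
        raise ValueError("replacement-only distance needs equal-length inputs")
    return sum(replacement_cost(a, b) for a, b in zip(s, t))


def subsequences(seq, k):
    """All contiguous subsequences of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def cluster_histogram(seq, centers, k):
    """Fraction of the sequence's k-mers that fall nearest to each center.

    Assumption: clusters are represented by one center string each; the
    thesis uses spectral clustering instead of this simple assignment.
    """
    counts = [0] * len(centers)
    for sub in subsequences(seq, k):
        nearest = min(range(len(centers)),
                      key=lambda j: substitution_distance(sub, centers[j]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

For example, `cluster_histogram("AVKDE", ["AVK", "VKD"], 3)` assigns one of the three 3-mers to the first center and two to the second. In a fuller pipeline the pairwise distances would first be turned into similarities and passed to a spectral clustering routine, and the resulting histogram would be the input to a classifier.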
3	A Classification System For The Problem Of Protein Subcellular Localization
Alay, Gokcen. 01 September 2007.
The focus of this study is on predicting the subcellular localization of a protein. Subcellular localization information is important for protein function annotation, which is a fundamental problem in computational biology. For this problem, a classification system is built that has two main parts: a predictor based on a feature mapping technique that extracts biologically meaningful information from protein sequences, and a client/server architecture for searching and predicting subcellular localizations. In the first part of the thesis, we describe a feature mapping technique based on frequent patterns. Frequent patterns in a protein sequence dataset are identified using a search technique based on the Apriori property, and the distribution of these patterns over a new sample is used as a feature vector for classification. The effect of a number of feature selection methods on classification performance is investigated and the best one is applied. The method is assessed on the subcellular localization prediction problem with 4 compartments (endoplasmic reticulum (ER) targeted, cytosolic, mitochondrial, and nuclear), using the same dataset as P2SL. Our method improved the overall accuracy to 91.71%, from the 81.96% originally obtained by P2SL. In the second part of the thesis, a client/server architecture is designed and implemented based on Simple Object Access Protocol (SOAP) technology, which provides a user-friendly interface for accessing the protein subcellular localization predictions. The client part is a Cytoscape plug-in that is used for functional enrichment of biological networks. Instead of using subcellular localization information for individual proteins, this plug-in lets biologists analyze a set of genes/proteins from a systems view.
Subject: QA 76.75-76.765 Computer Software
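The frequent-pattern feature mapping in the abstract above can be sketched as a level-wise search that prunes candidates with the Apriori property. This is a minimal illustration under stated assumptions, not the thesis's implementation: patterns are taken to be contiguous substrings, support is the number of sequences containing a pattern, the feature vector uses raw occurrence counts, and the function names and parameters (`min_support`, `max_len`) are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Set of distinct contiguous substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def frequent_patterns(sequences, min_support, max_len):
    """Level-wise search for patterns found in at least min_support sequences.

    Apriori property: a pattern of length k can only be frequent if both of
    its length k-1 sub-patterns (its prefix and its suffix) are frequent,
    so any candidate failing that check is skipped without being counted.
    """
    frequent = []
    previous = None  # frequent patterns of the previous length
    for k in range(1, max_len + 1):
        support = Counter()
        for seq in sequences:
            for p in kmers(seq, k):
                if previous is None or (p[:-1] in previous and p[1:] in previous):
                    support[p] += 1
        current = {p for p, n in support.items() if n >= min_support}
        if not current:
            break
        frequent.extend(sorted(current))
        previous = current
    return frequent


def feature_vector(seq, patterns):
    """Number of (possibly overlapping) occurrences of each pattern in seq."""
    return [sum(seq.startswith(p, i) for i in range(len(seq))) for p in patterns]
```

With the toy input `["MKVL", "MKAL", "GKVL"]`, `min_support=2` and `max_len=3`, the search keeps `K`, `L`, `M`, `V`, then `KV`, `MK`, `VL`, then `KVL`. Feature selection, which the abstract reports as affecting accuracy, would then be applied to these vectors before classification.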
4	Subsequence Feature Maps For Protein Function Annotation
Sarac, Omer Sinan. 01 August 2008.
With the advances in sequencing technologies, the number of protein sequences with unknown function is increasing rapidly. Hence, computational methods for functional annotation of these protein sequences are of the utmost importance. In this thesis, we first defined a feature space mapping of protein primary sequences to fixed-dimensional numerical vectors. This mapping, called the Subsequence Profile Map (SPMap), takes into account the models of the subsequences of protein sequences. The resulting vectors were used as input to support vector machines (SVM) for functional classification of proteins. Second, we defined the protein functional annotation problem as a classification problem and constructed a classification framework defined on Gene Ontology (GO) terms. Different classification methods, as well as their combinations, were assessed on this framework, which is based on 300 GO molecular function terms. The results showed that combination enhances the classification accuracy. The resulting system is made publicly available as an online function annotation tool.
Subject: QA 76.75-76.765 Computer Software
5	A workflow for the modeling and analysis of biomedical data
Marsolo, Keith Allen. 22 June 2007.
No description available.
Keywords: Computer Science; Biomedical Data Modeling; Spatial Modeling; Biomedical Knowledge Discovery; Classification of Structure-based Data; Bioinformatics; Protein Modeling; Protein Classification
6	Cadeias estocásticas parcimoniosas com aplicações à classificação e filogenia das seqüências de proteínas. / Parsimonious stochastic chains with applications to classification and phylogeny of protein sequences.
Leonardi, Florencia Graciela. 19 January 2007.
In this thesis we present some theoretical and practical results concerning symbolic sequence modeling with parsimonious stochastic chains. Parsimonious stochastic chains, which include variable memory stochastic chains, constitute a generalization of fixed-order Markov chains. The symbolic sequences modeled with parsimonious stochastic chains are sequences of amino acids. First, we introduce a new algorithm, called SPST, to select the parsimonious stochastic chain model that best fits a sample of sequences. Then, we use the SPST algorithm to study two important problems in genomics: the classification of proteins into families and the study of the evolution of biological sequences. Finally, we find upper bounds for the rate of convergence of some algorithms related to the estimation of a subclass of parsimonious stochastic chains, namely the variable memory stochastic chains. As a consequence, we generalize a previous result on the exponential rate of convergence of the PST algorithm to the case of unbounded variable memory stochastic chains. In addition, we prove a result on the rate of convergence of a generalized version of the Bayesian Information Criterion (BIC), also known as the Schwarz Criterion.
Keywords: parsimonious stochastic chains; phylogenetic analysis of proteins; protein classification; rate of convergence of algorithms
7	Motif extraction from complex data: case of protein classification / Extraction de motifs des données complexes : cas de la classification des protéines
Saidi, Rabie. 03 October 2012.
The classification of biological data is one of the significant challenges in bioinformatics, for protein as well as nucleic data. The presence of these data in huge masses, their ambiguity, and especially the high cost of in vitro analysis in terms of time and resources make the use of data mining a necessity rather than a rational choice. However, data mining techniques, which often process data in relational format, are confronted with the inappropriate format of biological data. Hence, an inevitable preprocessing step must be established.
This thesis deals with protein data preprocessing as a preparation step before classification. We present motif extraction as a reliable way to address that task. The extracted motifs are used as descriptors to encode proteins into feature vectors. This enables the use of known data mining classifiers, which require this format. However, designing a suitable feature space for a set of proteins is not a trivial task.
We deal with two kinds of protein data, i.e., sequences and three-dimensional structures. In the first axis, i.e., protein sequences, we propose a novel encoding method that uses amino acid substitution matrices to define similarity between motifs during the extraction step. We demonstrate the efficiency of this approach by comparing it with several encoding methods, using some classifiers. We also propose new metrics to study the robustness of some of these methods when the input data are perturbed. These metrics measure the ability of a method to reveal any change occurring in the input data and also its ability to target the interesting motifs. The second axis is dedicated to 3D protein structures, which have recently been viewed as graphs of amino acids. We give a brief survey of the most used graph-based representations and propose a naïve method to help with building protein graphs. We show that some existing and widespread methods present notable weaknesses and do not really reflect the real protein conformation. In addition, we are interested in discovering recurrent substructures in proteins, which can give important functional and structural insights. We propose a novel algorithm to find spatial motifs in proteins. The extracted motifs match a well-defined shape that is proposed on a biological basis. We compare with sequential motifs and spatial motifs from recent related work. For all our contributions, the experimental outcomes confirm the efficiency of our proposed methods for representing both protein sequences and protein 3D structures in classification tasks.
Software programs developed during this research work are available on my home page http://fc.isima.fr/~saidi.
Keywords: Preprocessing; Motif/feature extraction; Protein classification; Protein structures; Sequential motif; Spatial motif
8	Probabilistic Methods for Computational Annotation of Genomic Sequences / Probabilistische Methoden für computergestützte Genom-Annotation
Keller, Oliver. 26 January 2011.
No description available.
Keywords: gene prediction; protein classification; hidden Markov models; semi-Markov chains; genome annotation
10	Topological data analysis: applications in machine learning / Análise topológica de dados: aplicações em aprendizado de máquina
Calcina, Sabrina Graciela Suárez. 05 December 2018.
Computational topology has recently seen important developments in data analysis, giving rise to the field of Topological Data Analysis. Persistent homology is a fundamental tool based on the topology of data that can be represented as points in a metric space. In this work, we apply techniques of Topological Data Analysis; more precisely, we use persistent homology to compute the most persistent topological features in data. The persistence diagrams are then processed into feature vectors to which machine learning algorithms are applied. For classification, we used the following classifiers: Partial Least Squares-Discriminant Analysis, Support Vector Machine, and Naive Bayes. For regression, we used Support Vector Regression and KNeighbors. Finally, we give a statistical analysis of the accuracy of each classifier and regressor.
Keywords: Betti numbers; KNeighbors regressor; Naive Bayes classifier; Persistence diagrams; Persistent homology; PLS-DA classifier; Protein classification; SVM classifier; SVR regressor
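The step from persistent homology to a feature vector, as outlined in the abstract above, can be illustrated in its simplest case. The sketch below is an assumption-laden illustration rather than the thesis's method: it covers only dimension 0 (connected components), takes the filtration scale to be the Euclidean distance between points, and uses a fixed-length list of the longest lifetimes as the feature vector. The function names are hypothetical.

```python
import math


def h0_deaths(points):
    """Death scales of connected components as the distance threshold grows.

    Every point starts as its own component, born at scale 0. Components
    merge when an edge no longer than the current scale connects them, so
    each merge records one death. The last surviving component never dies
    and is omitted.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)
    return deaths


def persistence_features(points, size):
    """Fixed-length vector: the `size` longest lifetimes, zero-padded."""
    lifetimes = sorted(h0_deaths(points), reverse=True)[:size]
    return lifetimes + [0.0] * (size - len(lifetimes))
```

For three collinear points at x = 0, 1 and 5, the components merge at scales 1 and 4, so `persistence_features(points, 3)` yields `[4.0, 1.0, 0.0]`. Higher-dimensional features (loops, voids) need a dedicated persistent homology library, and the resulting vectors would then be passed to classifiers or regressors such as those named in the abstract.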
