Global ETD Search

1	ModuleInducer: Automating the Extraction of Knowledge from Biological Sequences Korol, Oksana 14 October 2011 In the past decade, rapid advances have been made in the sequencing, digitization and collection of biological data. However, the bottleneck remains at the point of analysis and extraction of patterns from the data. We have developed a method aimed at widening this bottleneck by automating knowledge extraction from biological data. Our approach discovers patterns in a set of DNA sequences based on the location of transcription factor binding sites or any other biological markers, with an emphasis on discovering relationships. A variety of statistical and computational methods exist to analyze such data. However, they either require an initial hypothesis, which is later tested, or classify the data based on its attributes. Our approach does not require an initial hypothesis, and the classification it produces is based on the relationships between attributes. The value of such an approach is that it is able to uncover new knowledge about the data by inducing a general theory based on basic known rules. The core of our approach is an inductive logic programming (ILP) engine, which, based on positive and negative examples as well as background knowledge, is able to induce a descriptive, human-readable theory describing the data. An application provides an end-to-end analysis of DNA sequences. A simple-to-use Web interface accepts a set of related sequences to be analyzed, an optional set of negative example sequences to contrast with the main set, and a set of possible genetic markers as position-specific scoring matrices. A Java-based backend formats the sequences, determines the location of the genetic markers inside them and passes the information to the ILP engine, which induces the theory. The model assumed in our background knowledge is a set of basic interactions between biological markers in any DNA sequence. This makes our approach applicable to a wide variety of biological problems, including detection of cis-regulatory modules and analysis of ChIP-Sequencing experiments. We have evaluated our method in the context of such applications on two real-world datasets as well as a number of specially designed synthetic datasets. The approach has been shown to have merit even in situations where no significant classification could be determined. inductive logic programming cis-regulatory modules ChIP-Sequencing bioinformatics
4	ModuleInducer: Automating the Extraction of Knowledge from Biological Sequences Korol, Oksana January 2011 In the past decade, rapid advances have been made in the sequencing, digitization and collection of biological data. However, the bottleneck remains at the point of analysis and extraction of patterns from the data. We have developed a method aimed at widening this bottleneck by automating knowledge extraction from biological data. Our approach discovers patterns in a set of DNA sequences based on the location of transcription factor binding sites or any other biological markers, with an emphasis on discovering relationships. A variety of statistical and computational methods exist to analyze such data. However, they either require an initial hypothesis, which is later tested, or classify the data based on its attributes. Our approach does not require an initial hypothesis, and the classification it produces is based on the relationships between attributes. The value of such an approach is that it is able to uncover new knowledge about the data by inducing a general theory based on basic known rules. The core of our approach is an inductive logic programming (ILP) engine, which, based on positive and negative examples as well as background knowledge, is able to induce a descriptive, human-readable theory describing the data. An application provides an end-to-end analysis of DNA sequences. A simple-to-use Web interface accepts a set of related sequences to be analyzed, an optional set of negative example sequences to contrast with the main set, and a set of possible genetic markers as position-specific scoring matrices. A Java-based backend formats the sequences, determines the location of the genetic markers inside them and passes the information to the ILP engine, which induces the theory. The model assumed in our background knowledge is a set of basic interactions between biological markers in any DNA sequence. This makes our approach applicable to a wide variety of biological problems, including detection of cis-regulatory modules and analysis of ChIP-Sequencing experiments. We have evaluated our method in the context of such applications on two real-world datasets as well as a number of specially designed synthetic datasets. The approach has been shown to have merit even in situations where no significant classification could be determined. inductive logic programming cis-regulatory modules ChIP-Sequencing bioinformatics
5	Developing the Cis-Regulatory Association Model (CRAM) to Identify Combinations of Transcription Factors in ChIP-Seq Data Kennedy, Brian Alexander 17 December 2010 No description available. Bioinformatics Computer Science Genetics transcription factor cis-regulation expression genes cis-regulatory modules regulatory modules position weight matrix chip-seq chromatin immunoprecipitation apriori neural networks item-set mining
6	Développement et évaluation de méthodes bioinformatiques pour la détection de séquences cis-régulatrices impliquées dans le développement de la drosophile Turatsinze, Jean Valery 23 November 2009 The aim of this work is to develop and evaluate methodological approaches for predicting cis-regulatory sequences. These approaches have been integrated into the RSAT software suite (Regulatory Sequences Analysis Tools). Cis-regulatory sequences play an important role in regulating gene expression. This regulation, at the transcriptional level, takes place through specific recognition between transcription factors and their binding sites (TFBS) on the DNA. We developed and evaluated a series of bioinformatics tools that use position-weight matrices to predict TFBS as well as cis-regulatory modules (CRM). Our tools have the advantage of integrating the different approaches already proposed by other authors while offering innovative features. In particular, we propose a new approach to CRM prediction based on detecting regions significantly enriched in TFBS, which we call CRERs (Cis-Regulatory Elements Enriched Regions). Another essential aspect of our approach is that we propose rigorous statistical measures to estimate, theoretically and empirically, the risk associated with the predictions. Methods for predicting cis-regulatory sequences generally yield a high rate of false predictions. We include a calculation of the P-values associated with all predictions, and thereby provide a reliable measure of the probability of false positives. We applied our tools to a systematic evaluation of the effect of the background model on prediction accuracy, using the TRANSFAC database. Our results suggest that the models that optimize prediction accuracy vary widely; the background model must be chosen case by case according to the matrix considered. We then evaluated the quality of the matrices for all Drosophila transcription factors in the ORegAnno database, that is, their power to discriminate between TFBS and genomic sequences. We thus collected good-quality matrices for Drosophila transcription factors. Using the Drosophila matrices we collected, we began a preliminary multi-genome analysis of TFBS and CRM predictions in the region of the dorsocentral enhancer (DCE) of the Drosophila achaete-scute complex. The genes of this complex play an important role in determining the cells of the Drosophila peripheral nervous system. It has been shown experimentally that there is a direct link between the phenotype of the peripheral nervous system and the cis-regulatory sequences of the genes of this complex. The tools developed during this project can be applied to the prediction of regulatory sequences in the genomes of all organisms. position specific scoring matrix cis-regulatory modules matrix-scan pattern matching
7	Computational Methods for Cis-Regulatory Module Discovery Liang, Xiaoyu January 2010 No description available. Computer Science gene regulation network transcription factor non-coding sequence gene expression genomic sequences cis-regulatory modules
8	Predição computacional de sítios de ligação de fatores de transcrição baseada em gramáticas regulares estocásticas / Computational prediction of transcription factor binding sites based on stochastic regular grammars Ferrão Neto, Antonio 27 October 2017 Transcription factors (TF) are proteins that bind to specific, well-conserved nucleotide sequences in the DNA, called transcription factor binding sites (TFBS), located in regions of gene regulation known as cis-regulatory modules (CRM). On recognizing a TFBS, the transcription factor binds to that site and positively or negatively influences gene transcription. There are experimental techniques for identifying the locations of TFBS in a genome, such as footprinting, ChIP-chip or ChIP-seq. However, these techniques are costly and time-consuming. Alternatively, one may take the TFBS sequences already known for a particular transcription factor and apply supervised machine learning techniques to create a computational model of that site, and then perform computational prediction in the genome. However, most existing computational tools for this purpose, such as those based on PWMs (position weight matrices), assume independence between the nucleotide positions of a site, which is not necessarily true. This project aimed to evaluate the use of stochastic regular grammars (SRG) as an alternative to PWMs for this problem, since SRGs are able to characterize dependencies between consecutive positions in the sites. Although the differences in performance were subtle, SRGs appear to be more suitable than PWMs in the presence of higher base dependency values, and PWMs in the other cases. Finally, a computational TFBS prediction tool was created based on both SRGs and PWMs. cis-regulatory modules CRM Enhancer Motifs PWM Regular grammars Transcription factor Transcription factor binding sites
9	Développement et évaluation de méthodes bioinformatiques pour la détection de séquences cis-régulatrices impliquées dans le développement de la drosophile Turatsinze, Jean Valéry 23 November 2009 The aim of this work is to develop and evaluate methodological approaches for predicting cis-regulatory sequences. These approaches have been integrated into the RSAT software suite (Regulatory Sequences Analysis Tools). Cis-regulatory sequences play an important role in regulating gene expression. This regulation, at the transcriptional level, takes place through specific recognition between transcription factors and their binding sites (TFBS) on the DNA. We developed and evaluated a series of bioinformatics tools that use position-weight matrices to predict TFBS as well as cis-regulatory modules (CRM). Our tools have the advantage of integrating the different approaches already proposed by other authors while offering innovative features. In particular, we propose a new approach to CRM prediction based on detecting regions significantly enriched in TFBS, which we call CRERs (Cis-Regulatory Elements Enriched Regions). Another essential aspect of our approach is that we propose rigorous statistical measures to estimate, theoretically and empirically, the risk associated with the predictions. Methods for predicting cis-regulatory sequences generally yield a high rate of false predictions. We include a calculation of the P-values associated with all predictions, and thereby provide a reliable measure of the probability of false positives. We applied our tools to a systematic evaluation of the effect of the background model on prediction accuracy, using the TRANSFAC database. Our results suggest that the models that optimize prediction accuracy vary widely; the background model must be chosen case by case according to the matrix considered. We then evaluated the quality of the matrices for all Drosophila transcription factors in the ORegAnno database, that is, their power to discriminate between TFBS and genomic sequences. We thus collected good-quality matrices for Drosophila transcription factors. Using the Drosophila matrices we collected, we began a preliminary multi-genome analysis of TFBS and CRM predictions in the region of the dorsocentral enhancer (DCE) of the Drosophila achaete-scute complex. The genes of this complex play an important role in determining the cells of the Drosophila peripheral nervous system. It has been shown experimentally that there is a direct link between the phenotype of the peripheral nervous system and the cis-regulatory sequences of the genes of this complex. The tools developed during this project can be applied to the prediction of regulatory sequences in the genomes of all organisms. / Doctorat en Sciences / info:eu-repo/semantics/nonPublished Biology Exact and natural sciences Bioinformatics Drosophila Gene expression Genetic regulation RSAT pattern matching matrix-scan cis-regulatory modules position specific scoring matrix regulatory sequences
10	Predição computacional de sítios de ligação de fatores de transcrição baseada em gramáticas regulares estocásticas / Computational prediction of transcription factor binding sites based on stochastic regular grammars Antonio Ferrão Neto 27 October 2017 Transcription factors (TF) are proteins that bind to specific, well-conserved nucleotide sequences in the DNA, called transcription factor binding sites (TFBS), located in regions of gene regulation known as cis-regulatory modules (CRM). On recognizing a TFBS, the transcription factor binds to that site and positively or negatively influences gene transcription. There are experimental techniques for identifying the locations of TFBS in a genome, such as footprinting, ChIP-chip or ChIP-seq. However, these techniques are costly and time-consuming. Alternatively, one may take the TFBS sequences already known for a particular transcription factor and apply supervised machine learning techniques to create a computational model of that site, and then perform computational prediction in the genome. However, most existing computational tools for this purpose, such as those based on PWMs (position weight matrices), assume independence between the nucleotide positions of a site, which is not necessarily true. This project aimed to evaluate the use of stochastic regular grammars (SRG) as an alternative to PWMs for this problem, since SRGs are able to characterize dependencies between consecutive positions in the sites. Although the differences in performance were subtle, SRGs appear to be more suitable than PWMs in the presence of higher base dependency values, and PWMs in the other cases. Finally, a computational TFBS prediction tool was created based on both SRGs and PWMs. cis-regulatory modules CRM Enhancer Motifs PWM Regular grammars Transcription factor Transcription factor binding sites
