Efficient algorithms for the identification of miRNA motifs in DNA sequences

Unravelling biological processes depends on adequate modelling of the regulatory mechanisms that determine the timing and spatial patterns of gene expression. In the last decade, a novel regulatory mechanism has been discovered and its biological importance has been increasingly recognised. This mechanism is mediated by RNA molecules named miRNAs, which are the product of the maturation of non-coding gene transcripts and act post-transcriptionally, usually to dampen or abolish the expression of protein-coding genes. Although miRNAs eluded detection for a long time, it is now clear that the expression pattern of many genes cannot be elucidated without incorporating the effects of miRNA-mediated regulation. The technical difficulties of detecting these regulators experimentally prompted the development of increasingly sophisticated computational approaches. Gene finding strategies originally developed for coding genes cannot be applied, since these non-coding molecules are subject to very different sequence constraints and are too short to exhibit statistical properties that can be easily distinguished from the background. As a result, computational tools came to rely heavily on the identification of conserved sequences, distant homologs and machine learning techniques. Recent developments in sequencing technology have overcome some of the limitations of earlier experimental approaches, but pose new computational challenges. At present, the identification of new miRNA genes therefore results from the combined use of several approaches, both computational and experimental. In spite of the advances this research field has seen in the last several years, we are still not able to formally and rigorously characterise miRNA genes in order to identify which sequence, structure or contextual requirements are needed to turn a DNA sequence into a functional miRNA.
Computational efforts to enumerate the full set of miRNAs of an organism have been limited by a strong reliance on arguments of precursor conservation and feature similarity. However, miRNA precursors may arise anew or be lost across the evolutionary history of a species, and a newly sequenced genome may be evolutionarily too distant from other genomes for an adequate comparative analysis. In addition, learning intricate classification rules based purely on features shared by currently known miRNA precursors may reflect a self-perpetuating identification bias rather than a sound means of telling true miRNAs from other genomic stem-loops. In this thesis, we present a strategy to sieve through the vast number of stem-loops found in metazoan genomes in search of pre-miRNAs, significantly reducing the set of candidates while retaining most known miRNA precursors. Our approach relies on precursor properties derived from the current knowledge of miRNA biogenesis, analysis of the precursor structure and incorporation of information about the transcription potential of each candidate.
Our approach has been applied to the genomes of Drosophila melanogaster and Anopheles gambiae, which has allowed us to show that there is a strong bias amongst annotated pre-miRNAs towards robust stem-loops in these genomes and to propose a scoring scheme for precursor candidates which combines four robustness measures. Additionally, we have identified several known pre-miRNA homologs in the newly sequenced Anopheles darlingi and shown that most are found amongst the top-scoring precursor candidates for that organism, with respect to the combined score.
The structural analysis of our candidates and the identification of the region of the structural space where known precursors are usually found allowed us to eliminate several candidates, but also showed that a staggering number of genomic stem-loops seem to fulfil the stability, robustness and structural requirements, indicating that additional evidence is needed to identify functional precursors. To this end, we have introduced different strategies to evaluate the transcription potential of the remaining candidates, which vary according to the information available for the dataset under study.

Identifier: oai:union.ndltd.org:CCSD/oai:tel.archives-ouvertes.fr:tel-00750693
Date: 06 June 2011
Creators: Mendes, Nuno D
Source Sets: CCSD theses-EN-ligne, France
Language: English
Detected Language: English
Type: PhD thesis
