Computational Characterization of Long Non-Coding RNAs

In a cell, DNA is transcribed into mature transcripts, some of which are in turn translated into proteins. Although over 85% of the human genome is transcribed, only about 2% of it comprises protein-coding genes; the rest is non-coding. One class of non-coding gene elements, the long non-coding RNAs (lncRNAs), is emerging as a key player in various regulatory roles in the human genome. By the generally accepted definition, lncRNAs are over 200 nucleotides long and can exceed 10 kilobases, which makes them similar to mRNAs. The majority of lncRNAs undergo alternative splicing, are weakly polyadenylated, and form complex secondary structures. Functional roles have so far been detected for only a small fraction of the annotated lncRNAs, while the functions of the vast majority remain to be discovered. The functional roles observed thus far include regulation of gene expression through various mechanisms at the transcriptional and post-transcriptional levels. With the advent of next-generation sequencing (NGS) and advances in RNA sequencing technology (RNA-Seq), it has become easier to reconstruct the transcriptome by extracting information about the splicing machinery. RNA-Seq has helped consortia like GENCODE, ENCODE, and others to curate their annotation catalogues. In this PhD thesis, certain aspects of the human lncRNA transcriptome are explored, such as the challenges in lncRNA annotation. These challenges stem from the lack of signals that are common in mRNAs and make them easier to detect, for instance open reading frames (ORFs) and transcription start sites. At the same time, because the connection between sequence and function is poorly understood, lncRNAs have typically been annotated based on their location relative to mRNAs, and their functions have been predicted through a guilt-by-association approach.
In the first part of the PhD research work, the splice junctions in the lncRNA transcriptome were mapped to explore the isoform diversity of lncRNAs, using sequencing data from B-cell lymphoma. In this phase of the work, splice junctions were found to be represented by multiple junction-spanning reads in sequencing data with a very large read depth. Using GENCODE v19 as a reference, it was found that the human transcriptome harbours a large number of rare exons and introns that have remained unannotated. It can also be inferred that the current human transcriptome annotation is confined to a very well-defined set of splice variants. However, although the isoforms are well defined, the same cannot be said of their biological functions, and it remains to be explored why the processing machinery of lncRNAs is restricted to a set of very few splice sites.
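
The idea of finding unannotated introns from junction-spanning reads can be illustrated with a minimal sketch. It is not the pipeline used in the thesis, which the abstract does not describe. It assumes spliced alignments in SAM-style form, where the CIGAR operation N marks a reference region skipped by the read (an intron), and it assumes the annotated introns are already available as a set of coordinates. The function names, the toy reads, and the min_reads threshold are all hypothetical; strand is ignored for brevity.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def junctions_from_alignment(chrom, pos, cigar):
    """Yield (chrom, intron_start, intron_end) for every N operation.

    pos is the 1-based leftmost aligned reference position (SAM POS field);
    the returned intron coordinates are 1-based and inclusive.
    """
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            yield (chrom, ref, ref + length - 1)
        if op in "MDN=X":  # only these operations consume the reference
            ref += length

def count_junctions(alignments):
    """Count how many reads support each junction."""
    counts = Counter()
    for chrom, pos, cigar in alignments:
        counts.update(junctions_from_alignment(chrom, pos, cigar))
    return counts

def novel_junctions(counts, annotated, min_reads=2):
    """Keep junctions absent from the annotation with enough supporting reads."""
    return {j: n for j, n in counts.items()
            if n >= min_reads and j not in annotated}

# Toy data (hypothetical): two reads share one junction, a third spans another.
reads = [
    ("chr1", 100, "50M200N50M"),
    ("chr1", 120, "30M200N70M"),
    ("chr1", 500, "40M1000N60M"),
]
annotated_introns = {("chr1", 150, 349)}  # stand-in for a reference annotation

counts = count_junctions(reads)
rare_novel = novel_junctions(counts, annotated_introns, min_reads=1)
```

In this toy input the first two reads support the annotated intron, while the third supports a junction that is missing from the annotation and backed by a single read. Raising min_reads trades sensitivity to rare junctions against robustness to alignment artefacts.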

In the human genome, small regulatory RNAs such as miRNAs and small nucleolar RNAs (snoRNAs) overlap with lncRNAs in their genomic loci. To further understand the human transcriptome, the second part of the PhD research work attempted to distinguish miRNA- and snoRNA-hosting lncRNAs from lncRNAs that do not overlap with these smaller RNAs. To this end, machine learning techniques were applied to curated datasets, using features inspired by some of the prevalent features in published lncRNA detection tools; these covered not only sequence information but also secondary structure and conservation information. Classification was attempted through both supervised and unsupervised learning approaches: random forests for the former, PCA and k-means for the latter. In the end, the three RNA classes could not be separated with certainty, especially when the hosted RNA was not supplied to the classifier. This lack of detectable association is nevertheless of biological interest: it suggests that the function of host genes is not closely tied to the function of the hosted genes, at least in this case. Even so, understanding the dynamics of snoRNA and miRNA host genes can improve knowledge of the functional evolution of lncRNAs, because the conservation of the smaller RNA genes makes it comparatively easier to trace the host lncRNAs over much larger evolutionary timescales than most other lncRNAs. As sequencing techniques become increasingly available, expanded investigation into conservation patterns and host gene functions can be expected to become possible in the near future.

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:75200
Date: 23 June 2021
Creators: Sen, Rituparno
Contributors: Universität Leipzig
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: English
Detected Language: English
Type: info:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text
Rights: info:eu-repo/semantics/openAccess
