
Detection of frameshifts and improving genome annotation

We developed a new program called GeneTack for ab initio frameshift detection in intronless protein-coding nucleotide sequences. The GeneTack program uses
a hidden Markov model (HMM) of a genomic sequence with possibly frameshifted
protein-coding regions. The Viterbi algorithm finds the maximum likelihood path
that discriminates between true adjacent genes and a single gene with a frameshift.
We tested GeneTack and two earlier programs, FrameD and FSFind, on 17 prokaryotic
genomes with frameshifts introduced randomly into known genes. The average
frameshift prediction accuracy of GeneTack, measured as (Sn+Sp)/2, was higher
by a significant margin than the accuracy of the other two programs.
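As an illustration of the (Sn+Sp)/2 measure, here is a minimal sketch in Python. The abstract does not define Sn and Sp; the sketch assumes the convention common in gene prediction, where Sn = TP/(TP+FN) and Sp = TP/(TP+FP). The function name and the counts are hypothetical and are not results from the dissertation.

```python
def sn_sp_average(tp, fp, fn):
    """Return (Sn, Sp, (Sn+Sp)/2) from frameshift prediction counts.

    Assumed definitions (not stated in the abstract):
      Sn = TP / (TP + FN)  fraction of true frameshifts that were predicted
      Sp = TP / (TP + FP)  fraction of predicted frameshifts that are true
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, sp, (sn + sp) / 2


# Made-up counts for illustration only.
sn, sp, acc = sn_sp_average(tp=60, fp=20, fn=40)
print(f"Sn={sn:.3f} Sp={sp:.3f} (Sn+Sp)/2={acc:.3f}")
```

Averaging the two values gives a single figure that penalizes both missed frameshifts and false predictions, which is convenient when comparing several programs on the same test set.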
GeneTack was used to screen 1,106 complete prokaryotic genomes, and 206,991
genes with frameshifts (fs-genes) were identified. Our goal was to determine whether a
frameshift transition was due to (i) a sequencing error, (ii) an indel mutation, or (iii)
a recoding event. We grouped 102,731 fs-genes into 19,430
clusters based on sequence similarity between their protein products (fs-proteins),
conservation of predicted frameshift position, and its direction. While fs-genes in
2,810 clusters were classified as conserved pseudogenes and fs-genes in 1,200 clusters
were classified as hypothetical pseudogenes, 5,632 fs-genes from 239 clusters
possessing conserved motifs near frameshifts were predicted to be recoding candidates.
Experiments were performed for sequences derived from 20 out of the 239 clusters;
programmed ribosomal frameshifting with efficiency higher than 10% was observed
for four clusters.
GeneTack was also applied to 1,165,799 mRNAs from 100 eukaryotic species, and 45,295
frameshifts were identified. A clustering approach similar to the one used for
prokaryotic fs-genes allowed us to group 12,103 fs-genes into 4,087 clusters. Known
programmed frameshift genes were among the obtained clusters. Several clusters may
correspond to new examples of dual coding genes.
We developed a web interface to browse a database containing all the fs-genes
predicted by GeneTack in prokaryotic genomes and eukaryotic mRNA sequences.
The fs-genes can be retrieved by similarity search to a given query sequence, by
fs-gene cluster browsing, etc. Clusters of fs-genes are characterized with respect to their
likely origin, such as pseudogenization, phase variation, or programmed frameshifts.
All the tools and the database of fs-genes are available at the GeneTack web site
http://topaz.gatech.edu/GeneTack/

Identifier: oai:union.ndltd.org:GATECH/oai:smartech.gatech.edu:1853/45923
Date: 12 November 2012
Creators: Antonov, Ivan Valentinovich
Publisher: Georgia Institute of Technology
Source Sets: Georgia Tech Electronic Thesis and Dissertation Archive
Detected Language: English
Type: Dissertation
