541 |
Genome-Wide Prediction of Intrinsic Disorder; Sequence Alignment of Intrinsically Disordered Proteins
Midic, Uros January 2012
Intrinsic disorder (ID) is defined as a lack of stable tertiary and/or secondary structure under physiological conditions in vitro. Intrinsically disordered proteins (IDPs) are highly abundant in nature. IDPs possess a number of crucial biological functions, being involved in regulation, recognition, signaling and control; i.e., their functional repertoire complements the functions of ordered proteins. Intrinsically disordered regions (IDRs) of IDPs have a different amino-acid composition than structured regions and proteins. This fact has been exploited for the development of predictors of ID; the best predictors currently achieve around 80% per-residue accuracy. Earlier studies revealed that some IDPs are associated with various human diseases, including cancer, cardiovascular disease, amyloidoses, neurodegenerative diseases, diabetes and others. We developed a methodology for prediction and analysis of the abundance of intrinsic disorder on the genome scale, which combines data from various gene and protein databases, and utilizes several ID prediction tools. We used this methodology to perform a large-scale computational analysis of the abundance of (predicted) ID in transcripts of various classes of disease-related genes. We further analyzed the relationships between ID and the occurrence of alternative splicing and Molecular Recognition Features (MoRFs) in human disease classes. An important, never-before-addressed issue with such genome-wide applications of ID predictors is that - for less-studied organisms - in addition to the experimentally confirmed protein sequences, there is a large number of putative sequences, which have been predicted with automated annotation procedures and lack experimental confirmation. In the human genome, these predicted sequences have significantly higher predicted disorder content.
I investigated a hypothesis that this discrepancy is not correct, and that it is due to incorrectly annotated parts of the putative protein sequences that exhibit some similarities to confirmed IDRs, which lead to high predicted ID content. I developed a procedure to create synthetic nonsense peptide sequences by translation of non-coding regions of genomic sequences and translation of coding regions with incorrect codon alignment. I further trained several classifiers to discriminate between confirmed sequences and synthetic nonsense sequences, and used these predictors to estimate the abundance of incorrectly annotated regions in putative sequences, as well as to explore the link between such regions and intrinsic disorder.
Sequence alignment is an essential tool in modern bioinformatics. Substitution matrices - such as the BLOSUM family - contain 20x20 parameters which are related to the evolutionary rates of amino acid substitutions. I explored various strategies for extension of sequence alignment to utilize the (predicted) disorder/structure information about the sequences being aligned. These strategies employ an extended 40-symbol alphabet which contains 20 symbols for amino acids in ordered regions and 20 symbols for amino acids in IDRs, as well as expanded 40x40 and 40x20 matrices. The new matrices exhibit significant and substantial differences in the substitution scores for IDRs and structured regions. Tests on a reference dataset show that 40x40 matrices perform worse than the standard 20x20 matrices, while 40x20 matrices - used in a scenario where ID is predicted for a query sequence but not for the target sequences - have at least comparable performance. However, I also demonstrate that the variations in performance between 20x20 and 40x20 matrices are insignificant compared to the variation in obtained matrices that occurs when the underlying algorithm for calculation of substitution matrices is changed. / Computer and Information Science
|
542 |
Repeats in Strings and Application in Bioinformatics
Islam, A S M Sohidull 11 1900
A string is a sequence of symbols, usually called letters, drawn from some alphabet.
It is one of the most fundamental and important structures in computing, bioinformatics and mathematics. Computer files, contents of a computer memory, network
and satellite signals are all instances of strings. The genome of every living thing
can be represented by a string drawn from the alphabet {a, c, g, t}. The algorithms
processing strings have a wide range of applications such as information retrieval,
search engines, data compression, cryptography and bioinformatics. In a DNA sequence the indeterminate symbol {a, c} is used when it is unclear whether a given nucleotide is a or c. We could then say that {a, c} matches
another symbol {c, g} which in turn matches {g, t}, but {a, c} certainly does not
match {g, t}. The processing of indeterminate strings is much more difficult because
of this nontransitivity of matching. Thus a combinatorial understanding of indeterminate strings becomes essential to the development of efficient methods for their
processing. With indeterminate strings, as with ordinary ones, the main task is the
recognition/computation of patterns called regularities. We are particularly interested in regularities called repeats, whether tandem such as acgacg or nontandem
(acgtacg). In this thesis we focus on newly-discovered regularities in strings, especially the enhanced cover array and the Lyndon array, with attention paid to extending the
computations to indeterminate strings. Much of this work is necessarily abstract in
nature, because the intention is to produce results that are applicable over a wide
range of application areas. We will focus on finding algorithms to construct different
data structures to represent strings such as cover arrays and Lyndon arrays. The
idea of cover comes from strings which are not truly periodic but "almost" periodic
in nature. For example abaababa is covered by aba but is not periodic. Similarly the
Lyndon array describes the string in another unique way and is used in many fields of
string algorithms. These data structures will help us in the field of string processing.
As one application of these data structures we will work on "Reverse Engineering";
that is, given data structures derived from a string, how can we get the string back? Since DNA, RNA and peptide sequences are effectively "strings" with unique
properties, we will adapt our algorithms for regular or indeterminate strings to these
sequences. Sequence analysis can be used to assign function to genes and proteins
by observing the similarities between the compared sequences. Identifying unusual
repetitive patterns will aid in the identification of intrinsic features of the sequence
such as active sites, gene-structures and regulatory elements. As an application of
periodic strings we investigate microsatellites which are short repetitive DNA patterns where repeated substrings are of length 2 to 5. Microsatellites are used in a
wide range of studies due to their small size and repetitive nature, and they have
played an important role in the identification of numerous important genetic loci. A
deeper understanding of the evolutionary and mutational properties of microsatellites
is needed, not only to understand how the genome is organized, but also to correctly
interpret and use microsatellite data in population genetics studies. / Thesis / Doctor of Philosophy (PhD)
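Two ideas from this abstract, the nontransitive matching of indeterminate symbols and the notion of a cover, can be sketched in a few lines. This is an illustrative sketch only, not the thesis's algorithms: the function names are hypothetical, indeterminate symbols are represented as Python strings of allowed letters, and the cover check is a simple scan over ordinary (not indeterminate) strings rather than an efficient cover-array construction.

```python
def symbols_match(x, y):
    """Two indeterminate symbols match if they share at least one letter."""
    return bool(set(x) & set(y))


def is_cover(candidate, text):
    """Return True if occurrences of `candidate` cover every position of `text`.

    Occurrences may overlap; the scan visits start positions left to right
    and fails as soon as a gap appears between consecutive occurrences.
    """
    m = len(candidate)
    if m == 0 or m > len(text):
        return False
    covered_up_to = 0  # number of leading positions covered so far
    for start in range(len(text) - m + 1):
        if text.startswith(candidate, start):
            if start > covered_up_to:
                return False  # position `covered_up_to` can never be covered
            covered_up_to = start + m
    return covered_up_to == len(text)


# Nontransitivity: {a, c} matches {c, g}, {c, g} matches {g, t},
# but {a, c} does not match {g, t}.
print(symbols_match("ac", "cg"), symbols_match("cg", "gt"), symbols_match("ac", "gt"))

# The abstract's example: abaababa is covered by aba.
print(is_cover("aba", "abaababa"))
```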
|
543 |
Estimating Prevalence of Human Traits Among Populations from Polygenic Risk Scores
Graham, Britney Elizabeth 25 January 2022
No description available.
|
544 |
Architecture and Evolution of Xylem-related Gene Coexpression Networks in Poplars
Suren, Haktan 24 May 2013
With the advent of sequencing technologies, a growing variety of analytical methods has become available. Each of these methods has helped scientists gain a deeper understanding of the biological function of, and evolutionary constraints on, the relevant genes through the use of modern computational approaches. Numerous approaches have been developed to advance these goals, and interaction network mapping is one of them. This method has been employed to study a variety of organisms to illustrate shared (conserved) or individual (unique) properties, and is mainly based on identifying and visualizing modules of co-expressed genes. Because it is a strong candidate among such tools, a gene co-expression network was used in this study to identify the genes involved in wood formation in Populus trichocarpa, with the help of other novel bioinformatics tools such as Gene Ontology and Cytoscape.
To improve the accuracy of the findings, we combined it with an evolutionary approach, the ratio of non-synonymous to synonymous substitutions (dN/dS) of the proteins, to show the selective patterns of the genes in a comparative fashion between woody and non-woody plants.
This thesis is intended to help plant scientists gain insights into the genes that are involved in wood formation. By taking advantage of the computational studies done in this thesis, one can validate experiments while reducing the burden of laboratory trials on wood formation in plants. / Master of Science
|
545 |
Investigation of Pantoea stewartii Quorum-Sensing Controlled Regulators and Genes Important for Infection of Corn
Duong, An Duy 27 February 2018
Bacteria interact with their eukaryotic hosts using a variety of mechanisms that range from being beneficial to detrimental. This dissertation focuses on Pantoea stewartii subspecies stewartii (P. stewartii), an endosymbiont in the corn flea beetle gut that causes Stewart's wilt disease in corn. Gaining insights into the interactions occurring between this bacterial pathogen and its plant host may lead to informed intervention strategies. This phytopathogen uses quorum sensing (QS) to coordinate cell density-dependent gene expression and successfully colonize corn leading to wilt disease. Prior to the research presented in this dissertation, the QS master regulator EsaR was shown to regulate two major virulence factors of P. stewartii, capsule production and surface motility. However, the function and integration of EsaR downstream targets in P. stewartii were still largely undefined. Moreover, only a draft genome of a reference strain of P. stewartii was publicly available for researchers, limiting bioinformatics and genome-scale genetic approaches with the organism. The work described in this dissertation has now addressed these important issues.
The functions of two EsaR direct targets, LrhA and RcsA, were explored (Chapter Two) and the existence of integration in the regulation between them was discovered (Chapters Two and Four). RcsA and LrhA are transcription factors controlling capsule production and surface motility in P. stewartii, respectively. In Chapter Two, the RcsA and LrhA regulons were investigated using RNA-Seq. This led to the discovery of a potential regulatory interaction between them that was confirmed by qRT-PCR and transcriptional gene fusion assays. The involvement of LrhA in surface motility and virulence was also established in this project. A direct interaction between LrhA and the promoter of rcsA was defined in Chapter Four. Additional direct regulatory targets of LrhA were also identified.
A project to generate a complete assembly of the P. stewartii genome (Chapter Three) enabled more thorough genome-wide analysis and revealed the existence of a previously unknown 66-kb region in the P. stewartii genome believed to contain genes important for motility and virulence. In addition, completion of the genome sequence permitted genes for two distinctive Type III secretion systems, used for interactions with corn or the corn flea beetle, to be placed on two mega-plasmids. Furthermore, the complete genome sequence facilitated a Tn-Seq approach (Chapter Five). Tn-Seq is a potent tool used to identify bacterial genes required under particular environmental test conditions. This project is a pioneering use of Tn-Seq analysis in planta to investigate genes important for colonization and survival of P. stewartii within its corn host. It was discovered that OmpC and Lon are important for in planta growth and that OmpA plays a role in plant virulence.
In conclusion, these studies have broadened our understanding about the role of the QS regulon and other genes important for the pathogenesis of this phytopathogen. This knowledge may now be applied toward the development of future disease intervention strategies against P. stewartii and other wilt-disease causing plant pathogens. / PHD
|
546 |
Methods for Analysis of Prokaryotic Genome Architecture
Warren, Andrew S. 19 July 2017
Research in comparative microbial genomics has largely been organized around the concept of reference genomes. Reference genomes provide a useful comparative touchstone for closely related organisms. However, they do not necessarily represent the biological diversity in a group of genomes. Currently there are more than 96,000 bacterial genomes sequenced and this number is rapidly increasing. Some closely related groups have large numbers of genomes sequenced creating interesting comparative challenges: E. coli more than 5,400 isolates, S. aureus almost 9,000. As this sampling through sequencing becomes both deeper and broader, reference genome based methods become less effective at characterizing groups of organisms.
Functional motifs can help explain the organizing principles behind cellular systems in bacteria which have yet to be well understood. Currently there are relatively few bioinformatic tools for analyzing potential patterns at the level of genome organization that do not depend directly on sequence similarity. We present a framework for conducting genomic data mining to look for patterns that currently require human expert designation. We establish new computational methods for identifying patterns in prokaryotic genome construction through a mapping of genomic features, using semantic similarity, independent of a particular corpus to better approximate functional similarity.
We also present an algorithm for creating whole genome multiple sequence comparisons and a model for representing the similarities and differences among sequences as a graph of syntenic gene families. This effort touches on several different research fronts: graph representation of genomes and their alignments, synteny block analysis, whole genome sequence alignment, pan-genome analysis, multiple sequence alignment, and genome rearrangement analysis. Though our approach was originally developed from a pan-genome perspective for prokaryotes, the methods involved have the potential to speed up more expensive computation such as phylogenetic tree construction and SNP analysis. Novel elements include the contextualization of synteny analysis both between and within multi-contig genomes and an analytical framework for detecting genome level evolutionary events such as insertions, inversions, translocations, and fusions. / Ph. D. / Research in comparative microbial genomics has largely been organized around the concept of reference genomes. Reference genomes provide a useful comparative touchstone for closely related organisms. However, they do not necessarily represent the biological diversity in a group of genomes well. As sampling through sequencing becomes both deeper and broader, reference genome based methods become less effective at characterizing groups of organisms. We present an algorithm for creating whole genome multiple sequence comparisons and a model for representing the similarities and differences among sequences as a graph of syntenic gene families called a pan-synteny graph.
As the evolutionary distance between organisms increases, sequence similarity and homology detection tend to break down. However, similarities in the functional characteristics of certain genes and gene modules may persist or have converged over time. Detecting and defining patterns in these functional similarities, in relation to conserved gene order, is a largely unexplored problem. To create a model for representing the architectural similarity of functional modules, using ontologies and semantic similarity, we present a corpus-independent semantic similarity method, and describe a computational framework for using semantic similarity and pan-synteny graphs.
|
547 |
A Bioinformatics Approach to Identifying Radical SAM (S-Adenosyl-L-Methionine) Enzymes
Gagliano, Elisa 03 June 2020
Radical SAM enzymes are ancient, essential enzymes. They perform radical chemical reactions in virtually all living organisms and are involved in producing antibiotics, generating greenhouse gases, human health, and likely many other essential roles that have yet to be established. A wide variety of reactions have been characterized from this group of enzymes, including hydrogen abstractions, the transfer of methylthio groups, complex cyclization and rearrangement reactions, and others. However, many radical SAM enzymes have yet to be identified or characterized. There have been great leaps forward in the number of enzyme sequences that are available in public databases, but experiments to investigate what chemical reactions the enzymes perform take a great deal of time. In our work, we utilize Hidden Markov Models to identify possible radical SAM enzymes and predict their possible functions through BLAST alignments and homology modelling. We also explore their distribution across the tree of life and determine how it is correlated with organism oxygen tolerances, because the core iron-sulfur cluster is oxygen sensitive. Trends in the abundances of radical SAM enzymes depending on oxygen tolerances were more apparent in prokaryotes than in eukaryotes. Although eukaryotes tend to have fewer radical SAM enzymes than prokaryotes, we were able to analyze uncharacterized radical SAM enzymes from both an aerobic eukaryote (Entamoeba histolytica) and a eukaryote capable of oxygenic photosynthesis (Gossypium barbadense), and predict the reactions they catalyze. This work sets the stage for the functional characterization of these essential yet elusive enzymes in future laboratory experiments. / Master of Science in Life Sciences / Radical SAM enzymes are ancient, essential enzymes that perform chemical reactions in virtually all living organisms. We do know that they are involved in producing antibiotics, human health, and generating greenhouse gases.
We also know that there are many radical SAM enzymes whose functions remain a mystery. There have been great leaps forward in the number of enzyme sequences that are available in public databases, but experiments to investigate what chemical reactions enzymes perform take a great deal of time. The experiments are especially difficult for radical SAM enzymes because the oxygen we breathe can break the enzymes down in a laboratory. In our work, we utilize computational techniques to identify possible radical SAM enzymes and predict what reactions they might catalyze. Because these enzymes are vulnerable to oxygen in laboratory environments, we also explore whether organisms that breathe oxygen have fewer of these enzymes than organisms that perform anaerobic respiration instead. We found that this does seem to be the case in microbes like bacteria and archaea, but the results were not as consistent for eukaryotes. We then chose radical SAM enzymes we had identified from both an aerobic eukaryote (Entamoeba histolytica) and a eukaryote capable of producing oxygen (Gossypium barbadense), and predicted the reactions they catalyze. This work sets the stage for the functional characterization of these essential yet elusive enzymes in future laboratory experiments.
|
548 |
Isolation and Bioinformatic Characterization of Four Novel Bacteriophages from Streptomyces toxytricini
Alzaid, Hessah 05 1900
Six initial phage isolates with high-titer lysates were obtained using Streptomyces toxytricini B-5426 as the host bacterium. These isolates were named Goby, Toma, Yosif, Yara, Deema, and Hsoos. However, upon completion of the sequencing, it was found that the Yara and Hsoos isolates were identical, as were Goby and Deema. As a result, final analysis was completed on only the four unique isolates. All of the phages mentioned above were isolated from soil samples from different locations. They also differed in plaque size, which ranged from 0.3 to 0.9 mm. Yosif had the largest plaque size. Yara's head diameter was 79 nm with tail diameter of 94 nm.
|
549 |
Probabilistic models of RNA secondary structure
Anderson, James William Justin January 2013
This thesis develops probabilistic models of RNA secondary structure. The first chapter introduces RNA secondary structure prediction, in particular stochastic context-free grammars (SCFGs), and considers a novel method for automated design of SCFGs. Many SCFGs are found with a predictive quality similar to that of the grammars commonly used for RNA secondary structure prediction. The second chapter discusses the effect that alignment quality, evolutionary distance between sequences, and the number of sequences in an alignment have on RNA secondary structure prediction. By combining statistical alignment and SCFG models we can, in a statistically sound setting, average structure predictions over the space of alignments to decrease the loss created by poor alignments. The third chapter incorporates additional biological information about RNA secondary structure formation into the decoding of the SCFG posterior distribution. Combining iterative helix formation, phylogenetic modelling, and a distance function between alignment columns leads to an improvement in the accuracy of comparative RNA secondary structure prediction. Finally, appendices briefly discuss further work concerning probabilistic models of RNA secondary structure which may be of interest to the reader.
|
550 |
Integrative Computational Genomics Defines the Molecular Origins and Outcomes of Lymphoma
Moffitt, Andrea Barrett January 2016
Lymphomas are a heterogeneous group of hematological malignancies composed of diseases with diverse molecular origins and clinical outcomes. Derived from immune cells of lymphoid origin, lymphoma can arise from lymphoid cells present anywhere in the body, from the spleen and lymph nodes to peripheral sites like the liver and intestines. Current strategies for lymphoma diagnosis involve primarily histopathological examinations of the tumor biopsy, including cytogenetics and immunophenotyping. As more data becomes available, diagnoses may increasingly depend on genomic features that define each disease. Classification of lymphoid neoplasms is generally based on the cell of origin, or the lineage of the normal cell that the cancer is thought to arise from. Lymphomas can be classified into dozens of distinct diagnostic entities, though any two patients with the same diagnosis may have very different outcomes and molecular underpinnings, so we need to understand both the commonalities of patients with the same disease and the unique features that may require personalized treatment strategies. Patient prognosis in lymphoma depends greatly on the type of lymphoma, ranging from nearly curable diseases with over 90% five-year survival rates, to most patients dying in the first year in the worst entities. Greater clarity is needed in the role of the underlying genomics that contribute to these variable treatment responses and clinical outcomes.
Next-generation sequencing approaches allow us to delve into the molecular underpinnings of lymphomas, in order to gain insight about the origin and evolution of these diseases. High-throughput sequencing protocols allow us to examine the whole genome, exome, epigenome, or transcriptome of cancer cells in tens to hundreds of patients for each disease. As the cost of sequencing is reduced, and the ability to generate more data increases, we face increasing computational challenges to both process and interpret the wealth of data available in cancer genomics. Developing efficient and effective bioinformatics tools is necessary to transform billions of sequencing reads into actionable hypotheses on the role of certain genes or biological pathways in a specific cancer type or patient.
In this dissertation, I present several strategies and applications of integrative computational genomics in lymphoma, with contributions throughout the research process, from development of initial assays and quality control strategies for the sequencing data, to joint analysis of clinical and genomic data, and finally through follow-up experimental models for lymphoma.
First, I focus on two rare T cell lymphomas, hepatosplenic T cell lymphoma (HSTL) and enteropathy associated T cell lymphoma (EATL), which are both diseases with very poor clinical outcomes and a previous dearth of knowledge on the genetic basis of the diseases. We define the somatic mutation landscape of HSTL, through application of exome sequencing and find SETD2 to be the most highly mutated gene. We further utilize the exome sequencing data to investigate copy number alterations and show a significant survival difference between cases with and without certain arm-level copy number alterations. Knockdown of SETD2 in an HSTL cell line, followed by RNA sequencing, demonstrates the role of SETD2 loss in proliferation and cell cycle changes, linking the SETD2 mutations to a potential oncogenic mechanism. Furthermore, we investigate the potentially targetable mutations in the JAK-STAT pathway and demonstrate oncogenic downstream molecular phenotypes and potential druggability of these mutations. In the enteropathy associated T cell lymphoma study, we apply exome and RNA sequencing to a large EATL cohort. Our findings show a significant role for loss of function mutations in chromatin modifiers and JAK-STAT signaling genes. EATL can be separated into two subtypes, Type I and Type II, which we show to have convergent genomic features, in the face of divergent gene expression. RNA sequencing data defines a distinct separation between the two subtypes. Delving further into the role of SETD2 in these T cell lymphomas, we generate a mouse model with a conditional knockout of SETD2 in T cells and demonstrate a role for SETD2 in altering the lineage development of T cells.
To understand more about why certain genetic abnormalities are recurrent in some disease entities and not others, we turn to the cell of origin for clues. We pair two different lymphomas, Burkitt lymphoma and mantle cell lymphoma, with their associated cells of origin, germinal center B cells and naive B cells. These closely related cell types have much in common as B cells, but from studies of their transcriptomes, we know that there are many molecular differences that distinguish the two. In this work, after looking more closely at mantle cell lymphoma genomics, we look at the underlying chromatin markers that define the epigenomes of these B cells. We test the association between chromatin markers and mutation rates of genes between these two cell types and lymphomas, and find that genes with more open chromatin may have a higher mutation rate, when comparing closely related cells and lymphomas. Finally, I present my work on developing an RNA sequencing based strategy for defining the complete transcriptome of diffuse large B cell lymphoma (DLBCL). Gene expression profiling with microarrays has shown the existence of two subtypes in DLBCL, activated B cell like (ABC) and germinal center B cell like (GCB). However, the role for non-coding RNAs, alternative splicing, and mutations, in these two subtypes and the larger group was previously not well understood. We develop a strand-specific RNA sequencing strategy that will allow the investigation of the total RNA transcriptome in DLBCL, including microRNAs, lncRNAs, and other important non-coding RNAs. Furthermore, we show that RNA sequencing can be used to distinguish the two subtypes, including through RNA sequencing based mutation calls, as well as through differentially expressed lncRNAs that we define for the first time in DLBCL.
Broadly, this dissertation contributes novel findings in the field of lymphoma genomics, as well as presenting a framework for computational integrative genomics that can guide future studies. The heterogeneity of lymphoma across cases requires us to dive deep into individual diseases, even rare ones, as well as appreciate the similarities and differences across lymphomas. To improve diagnoses, prognoses, and treatment options, we need to understand the molecular origins of lymphoma. Using a range of molecular and computational approaches, we can move closer to true personalized medicine at the genomic level. / Dissertation
|