631

Detection and management of redundancy for information retrieval

Bernstein, Yaniv, ybernstein@gmail.com January 2006 (has links)
The growth of the web, authoring software, and electronic publishing has led to the emergence of a new type of document collection that is decentralised, amorphous, dynamic, and anarchic. In such collections, redundancy is a significant issue. Documents can spread and propagate across such collections without any control or moderation. Redundancy can interfere with the information retrieval process, leading to decreased user amenity in accessing information from these collections, and thus must be effectively managed. The precise definition of redundancy varies with the application. We restrict ourselves to documents that are co-derivative: those that share a common heritage, and hence contain passages of common text. We explore document fingerprinting, a well-known technique for the detection of co-derivative document pairs. Our new lossless fingerprinting algorithm improves the effectiveness of a range of document fingerprinting approaches. We empirically show that our algorithm can be highly effective at discovering co-derivative document pairs in large collections. We study the occurrence and management of redundancy in a range of application domains. On the web, we find that document fingerprinting is able to identify widespread redundancy, and that this redundancy has a significant detrimental effect on the quality of search results. Based on user studies, we suggest that redundancy is most appropriately managed as a postprocessing step on the ranked list and explain how and why this should be done. In the genomic area of sequence homology search, we explain why the existing techniques for redundancy discovery are increasingly inefficient, and present a critique of the current approaches to redundancy management. We show how document fingerprinting with a modified version of our algorithm provides significant efficiency improvements, and propose a new approach to redundancy management based on wildcards. We demonstrate that our scheme provides the benefits of existing techniques but does not have their deficiencies. Redundancy in distributed information retrieval systems - where different parts of the collection are searched by autonomous servers - cannot be effectively managed using traditional fingerprinting techniques. We thus propose a new data structure, the grainy hash vector, for redundancy detection and management in this environment. We show in preliminary tests that the grainy hash vector is able to accurately detect a good proportion of redundant document pairs while maintaining low resource usage.
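For readers unfamiliar with document fingerprinting, the sketch below shows the generic shingle-and-hash idea on which such methods are built: hash every fixed-length word chunk, keep a deterministic subset of the hashes as the document's fingerprint, and compare fingerprints to flag likely co-derivative pairs. The modulo-based selection heuristic and all parameters are illustrative assumptions; this is not the lossless algorithm or the grainy hash vector developed in the thesis.

```python
import hashlib

def fingerprint(text, n=8, mod=16):
    """Hash every n-word chunk (shingle) and keep a deterministic subset.

    Selecting hashes with h % mod == 0 is one common heuristic; the thesis's
    own lossless fingerprinting algorithm differs, so treat this as a sketch.
    """
    words = text.lower().split()
    hashes = set()
    for i in range(len(words) - n + 1):
        chunk = " ".join(words[i:i + n])
        h = int(hashlib.md5(chunk.encode()).hexdigest(), 16)
        if h % mod == 0:          # keep roughly 1/mod of all chunk hashes
            hashes.add(h)
    return hashes

def resemblance(fp_a, fp_b):
    """Fraction of shared fingerprints (a proxy for co-derivation)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

Because the selection is deterministic, two documents that share long passages retain many of the same chunk hashes, which is what makes the fingerprint overlap a useful co-derivation signal.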
632

New Proteomics Methods and Fundamental Aspects of Peptide Fragmentation / Nya Proteomik Metoder och Fundamentala Aspekter av Peptid Fragmentering

Savitski, Mikhail January 2007 (has links)
The combination of collision-activated dissociation (CAD) and electron capture dissociation (ECD) yielded a 125% increase in protein identification. The S-score was developed for measuring the information content in MS/MS spectra. This measure made it possible to single out good-quality spectra that were not identified by a search engine. Poor-quality MS/MS data were filtered out, streamlining the identification process. A proteomics-grade de novo sequencing approach was developed, enabling almost complete sequencing of 19% of all MS/MS data with 95% reliability in a typical proteomics experiment. A new tool, ModifiComb, was developed for identifying all types of modifications in a fast, reliable way. New types of modifications have been discovered, and the extent of modification in gel-based proteomics turned out to be greater than expected. PhosTShunter was developed for sensitive identification of all phosphorylated peptides in an MS/MS dataset. Application of these programs to human milk samples led to the identification of a previously unreported and potentially biologically important phosphorylation site. Peptide fragmentation has also been studied. It was shown emphatically on a dataset of 15,000 MS/MS spectra that CAD and ECD have different cleavage preferences with respect to the amino acid context. Hydrogen rearrangement involving z• species has been investigated, and clear trends have been unveiled; this information elucidated the mechanism of hydrogen transfer. Partial side-chain losses in ECD have been studied, and the potential of these ions for reliably distinguishing Leu/Ile residues was shown. Partial side-chain losses occurring far away from the cleavage site have also been detected. A strong correlation was found between the propensities of amino acids towards peptide bond cleavage under CAD and the propensity of amino acids to accept backbone-backbone hydrogen bonds in solution and form stable motifs. This indicated that the same parameter governs the formation of secondary structures in solution and directs fragmentation of peptide ions by CAD.
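For context on the fragmentation chemistry discussed above, CAD of a protonated peptide predominantly produces b- and y-type fragment ions (ECD instead yields c and z• species). The sketch below computes singly charged b/y m/z values from standard monoisotopic residue masses; it is a textbook illustration, not the S-score, ModifiComb, or any other software described in the thesis. Note that Leu and Ile share the same residue mass, which is why distinguishing them requires side-chain loss ions.

```python
# Standard monoisotopic residue masses (Da); PROTON and WATER are constants.
RESIDUE = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
PROTON, WATER = 1.00728, 18.01056

def by_ions(peptide):
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    b, y = [], []
    for i in range(1, len(peptide)):
        b.append(round(sum(RESIDUE[aa] for aa in peptide[:i]) + PROTON, 4))
        y.append(round(sum(RESIDUE[aa] for aa in peptide[i:]) + WATER + PROTON, 4))
    return b, y

print(by_ions("PEPTIDE"))
```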
633

An object-oriented framework to organize genomic data

Wei, Ning 15 May 2009 (has links)
Bioinformatics resources should provide simple and flexible support for genomics research. A huge amount of gene mapping data, micro-array expression data, expressed sequence tags (EST), BAC sequence data and genome sequence data are already, or will soon be, available for a number of livestock species. These species have different requirements compared to typical biomedical model organisms and need an informatics framework to deal with the data. This study addresses how to organize complex, intertwined genomic data for exploration. We investigated two issues: the design of an independent informatics framework covering both the back end and the front end, and how such a framework can simplify the user interface for exploring data. We have developed a fundamental informatics framework that makes it easy to organize and manipulate the complex relations between genomic data, and allows query results to be presented via a user-friendly web interface. A genome object-oriented framework (GOOF) was proposed using object-oriented Java technology and is independent of any particular database system. This framework seamlessly links the database system and web presentation components. The data models of GOOF capture the data relationships in order to give users access to relations across different types of data, meaning that users avoid constructing queries within the interface layer. Moreover, the module-based interface provided by GOOF allows different users to access data through different interfaces and in different ways. In other words, GOOF not only gives a whole solution to informatics infrastructure, but also simplifies the organization of data modeling and presentation. To speed development, GOOF provides an automatic code engine based on Java meta-programming facilities, allowing users to generate much of the routine program code. Moreover, GOOF's pre-built data layer connecting to Chado simplifies managing genomic data in the Chado schema. In summary, we studied how to model genomic data within an informatics framework as a one-stop approach to organizing the data, and showed how GOOF provides a bioinformatics infrastructure for users to access genomic data.
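As a rough illustration of the kind of object model the abstract describes, the Python sketch below links genes to mapping and expression records so that relations are navigated as attribute access rather than hand-written queries. GOOF itself is a Java framework with database independence and code generation; the classes, fields, and values here are hypothetical and purely illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Marker:
    name: str
    chromosome: str
    position_cm: float

@dataclass
class ExpressionRecord:
    tissue: str
    level: float

@dataclass
class Gene:
    symbol: str
    markers: list = field(default_factory=list)       # linkage-map relations
    expression: list = field(default_factory=list)    # micro-array relations

    def expressed_in(self, tissue):
        """Navigate relations directly instead of composing a query."""
        return [r for r in self.expression if r.tissue == tissue]

# Hypothetical usage: relations are traversed as plain attribute access.
g = Gene("geneA", markers=[Marker("m1", "2", 57.6)],
         expression=[ExpressionRecord("muscle", 8.2)])
print(g.expressed_in("muscle"))
```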
634

Role of the non-catalytic triad in alpha-amylases

Marx, Jean-Claude 28 February 2007 (has links)
The non-catalytic triad is a strictly conserved motif of chloride-dependent alpha-amylases that superimposes perfectly onto the catalytic triad of serine proteases. The aim of this work was to determine the role of this triad. Through mutagenesis experiments, we were able to show that this role is structural in nature. NMR experiments allowed us to demonstrate the presence of an abnormally strong hydrogen bond in these enzymes, which could explain the very marked instability of the triad mutants. Unfortunately, we were unable to attribute this hydrogen bond unambiguously to the non-catalytic triad. The last part of this work describes the search for the non-catalytic triad in proteins other than the amylases.
635

Bioinformatics Approaches to Biomarker and Drug Discovery in Aging and Disease

Fortney, Kristen 11 December 2012 (has links)
Over the past two decades, high-throughput (HTP) technologies such as microarrays and mass spectrometry have fundamentally changed the landscape of aging and disease biology. They have revealed novel molecular markers of aging, disease state, and drug response. Some have been translated into the clinic as tools for early disease diagnosis, prognosis, and individualized treatment and response monitoring. Despite these successes, many challenges remain: HTP platforms are often noisy and suffer from false positives and false negatives; optimal analysis and successful validation require complex workflows; and the underlying biology of aging and disease is heterogeneous and complex. Methods from integrative computational biology can help diminish these challenges by creating new analytical methods and software tools that leverage the large and diverse quantity of publicly available HTP data. In this thesis I report on four projects that develop and apply strategies from integrative computational biology to identify improved biomarkers and therapeutics for aging and disease. In Chapter 2, I proposed a new network analysis method to identify gene expression biomarkers of aging, and applied it to study the pathway-level effects of aging and infer the functions of poorly-characterized longevity genes. In Chapter 4, I adapted gene-level HTP chemogenomic data to study drug response at the systems level; I connected drugs to pathways, phenotypes and networks, and built the NetwoRx web portal to make these data publicly available. And in Chapters 3 and 5, I developed a novel meta-analysis pipeline to identify new drugs that mimic the beneficial gene expression changes seen with calorie restriction (Chapter 3), or that reverse the pathological gene changes associated with lung cancer (Chapter 5). The projects described in this thesis will help provide a systems-level understanding of the causes and consequences of aging and disease, as well as new tools for diagnosis (biomarkers) and treatment (therapeutics).
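As background for the drug-repositioning chapters, one simple and widely used way to ask whether a drug "reverses" a disease signature is to correlate disease-associated expression changes with drug-induced changes and look for strongly negative values. The sketch below is a generic version of that idea with toy fold-change values; it is not the meta-analysis pipeline or the NetwoRx resource developed in the thesis.

```python
from scipy.stats import spearmanr

def reversal_score(disease_fc, drug_fc):
    """Spearman correlation between disease and drug log-fold-changes over
    shared genes; strongly negative values suggest the drug pushes
    expression in the opposite direction to the disease."""
    genes = sorted(set(disease_fc) & set(drug_fc))
    if len(genes) < 3:
        return None
    rho, _ = spearmanr([disease_fc[g] for g in genes],
                       [drug_fc[g] for g in genes])
    return rho

# Toy signatures (gene symbols are real, fold-change values are invented).
disease = {"EGFR": 2.1, "TP53": -1.4, "MYC": 1.8, "CDKN2A": -0.9}
drug    = {"EGFR": -1.7, "TP53": 0.8, "MYC": -1.2, "CDKN2A": 0.5}
print(reversal_score(disease, drug))   # near -1: candidate "reverser"
```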
637

Models for the preprocessing of reverse phase protein arrays

January 2009 (has links)
Reverse-phase protein lysate arrays (RPPA) are becoming important tools for the analysis of proteins in biological systems. RPPAs combine current assays for detecting and measuring proteins with the high-throughput technology of microarrays. Protein-level assays have the ability to address questions about signaling pathways and post-translational modifications that genomic assays alone cannot answer. The importance of preprocessing microarray data has been shown in a variety of contexts over the years, and many of the same issues carry over to RPPAs, including spot-level correction, quantification, and normalization. In this thesis, we develop models and tools to improve upon the standard methods for preprocessing RPPA data. In particular, at the spot level, we suggest alternative methods for estimating background signal when the default estimates are compromised. Further, we introduce a multiplicative adjustment at the spot level, modeled with a smoothed surface of the positive control spots, that removes spatial bias better than additive-only models. When multi-level information is available for the positive controls, a method that builds nested surfaces at the positive control levels further decreases spatial bias. At the quantification level, we outline a newly developed R package called SuperCurve. This package uses a model that borrows strength from all samples on an array to estimate both an overall dose-response curve and individual estimates of relative sample protein expression. SuperCurve is easy to implement and is compatible with the latest version of R. Finally, we introduce a normalization model called Variable Slope (VS) normalization that corrects for sample loading bias, taking into account the fact that expression estimates are computed separately for each array. Previous normalization models fail to account for this feature, potentially adding more variability to the expression measurements. VS normalization is shown to recover true correlation structure better than standard methods. As processing methods for RPPA data improve, this technology helps identify proteomic signatures that are unique to subtypes of disease and can eventually be applied to personalized therapy.
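To make the quantification step concrete, the sketch below illustrates the general idea of borrowing strength across samples on a dilution series: pool all spots to fit one shared dose-response curve, then read each sample's relative expression as the horizontal shift that best aligns its own series with that curve. The logistic parameterization, simulated intensities, and grid search are illustrative assumptions and do not reproduce SuperCurve's actual model.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, lo, hi, x0):
    """Three-parameter logistic response as a function of log concentration."""
    return lo + (hi - lo) / (1.0 + np.exp(-(x - x0)))

# Simulated dilution series (log2 steps) for three samples whose true relative
# expression differs by a horizontal shift along the same response curve.
rng = np.random.default_rng(0)
log_dilution = np.array([0.0, -1.0, -2.0, -3.0, -4.0])
true_shifts = [0.0, -1.5, 1.2]
intensity = np.array([logistic(log_dilution + s, 300.0, 9000.0, -1.0)
                      + rng.normal(0, 100, log_dilution.size)
                      for s in true_shifts])

# Step 1: pool every spot to fit one shared dose-response curve
# (the "borrowing strength" idea behind joint quantification).
x_all = np.tile(log_dilution, intensity.shape[0])
(lo, hi, x0), _ = curve_fit(logistic, x_all, intensity.ravel(),
                            p0=[300.0, 9000.0, 0.0], maxfev=10000)

# Step 2: each sample's relative (log-scale) expression is the shift that
# best aligns its own dilution series with the shared curve.
shifts = np.linspace(-5, 5, 401)
for i, row in enumerate(intensity):
    sse = [np.sum((logistic(log_dilution + s, lo, hi, x0) - row) ** 2)
           for s in shifts]
    best = shifts[int(np.argmin(sse))]
    print(f"sample {i}: estimated shift {best:+.2f} (true {true_shifts[i]:+.2f})")
```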
638

Towards integrated computational models of cellular networks

Berestovsky, Natalie 16 September 2013 (has links)
Whole-cell behavior arises from the interplay among signaling, metabolic, and regulatory processes, which differ not only in their mechanisms but also in the time scale of their execution. Proper modeling of the overall function of the cell requires a new modeling approach that accurately integrates these three types of processes, using the representation that best captures each one of them and the interconnections between them. Traditionally, signaling networks have been modeled with ordinary differential equations (ODEs), regulation with Boolean networks, and metabolic pathways with Petri nets; these approaches are widely accepted and extensively used. Nonetheless, each of these methods, while effective, has known limitations. In particular, ODEs generally require very thorough parameterization, which is difficult to obtain; Boolean networks have been argued to be incapable of capturing complex system dynamics; and the effectiveness of Petri nets compared to other, steady-state methods has been debated. The main goal of this dissertation is to devise an integrated model that captures whole-cell behavior and accurately combines these three components and the interplay between them. I provide a systematic study of particle swarm optimization (PSO) as an effective approach for parameterizing ODEs. I survey different inference methods for Boolean networks on sets of complex dynamic data and demonstrate that Boolean networks are, in fact, capable of capturing a variety of different systems. I review the existing use of Petri nets in the modeling of biochemical systems to show their effectiveness and, in particular, the ease of their integration with other methods. Finally, I propose an integrated hybrid model (IHM) that uses Petri nets to represent the metabolic and signaling components, and Boolean networks to model regulation. The interconnections between these models overcome the time-scale differences between the processes through appropriate delay mechanisms. I validate IHM on two data sets. The significant advantage of IHM over other models is that it captures the dynamics of all three components and can potentially identify novel and important cross-talk within the cell.
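As a concrete illustration of the regulatory layer described above, the sketch below performs synchronous updates of a toy Boolean network. The genes and rules are invented for illustration; the thesis's IHM couples such a Boolean layer to Petri-net representations of metabolism and signaling rather than running it in isolation.

```python
# Synchronous update of a toy Boolean regulatory network: each gene's next
# state is a Boolean function of the current states of its regulators.
rules = {
    "A": lambda s: not s["C"],          # C represses A
    "B": lambda s: s["A"],              # A activates B
    "C": lambda s: s["A"] and s["B"],   # A and B jointly activate C
}

def step(state):
    """Apply every rule to the current state simultaneously."""
    return {gene: bool(rule(state)) for gene, rule in rules.items()}

state = {"A": True, "B": False, "C": False}
for t in range(6):
    print(t, state)
    state = step(state)   # the trajectory settles into a cycle (an attractor)
```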
639

A Genomic Definition of Centromeres in Complex Genomes

Hayden, Karen Elizabeth January 2011 (has links)
Centromeres, or sites of chromosomal spindle attachment during mitosis and meiosis, are non-randomly distributed in complex genomes and are largely associated with expansive, near-identical satellite DNA arrays. While the sequence basis of centromere identity remains a subject of considerable debate, one approach is to examine the genomic organization of satellite DNA arrays and their potential function. Current genome assembly and sequence annotation strategies, however, are dependent on robust sequence variation, and, as a result, these regions of near sequence identity remain absent from current genome reference sequences and thus are detached from explorations of centromere biology. This dissertation is designed as a foundational study for centromere genomics, providing the initial steps to characterize those sequences at endogenous centromeres, while further classifying 'functional' sequences that directly interact with, or are capable of recruiting proteins involved in, centromere function. These studies build on and take advantage of the limited sequence variation in centromeric satellite DNA, providing the necessary genomic scope to promote biologically meaningful characterization of endogenous centromere sequences in both human and non-human genomes. As a result, this thesis demonstrates possible genomic standards for future studies in the emerging field of satellite biology, which is now positioned to address functional centromere sequence variation across evolutionary time.
