Global ETD Search

91	The mapping task and its various applications in next-generation sequencing Otto, Christian 23 March 2015 (has links) (PDF) The aim of this thesis is the development and benchmarking of computational methods for the analysis of high-throughput data from tiling arrays and next-generation sequencing. Tiling arrays have been a mainstay of genome-wide transcriptomics, e.g., in the identification of functional elements in the human genome. Due to limitations of existing methods for the data analysis of this data, a novel statistical approach is presented that identifies expressed segments as significant differences from the background distribution and thus avoids dataset-specific parameters. This method detects differentially expressed segments in biological data with significantly lower false discovery rates and equivalent sensitivities compared to commonly used methods. In addition, it is also clearly superior in the recovery of exon-intron structures. Moreover, the search for local accumulations of expressed segments in tiling array data has led to the identification of very large expressed regions that may constitute a new class of macroRNAs. This thesis proceeds with next-generation sequencing for which various protocols have been devised to study genomic, transcriptomic, and epigenomic features. One of the first crucial steps in most NGS data analyses is the mapping of sequencing reads to a reference genome. This work introduces algorithmic methods to solve the mapping tasks for three major NGS protocols: DNA-seq, RNA-seq, and MethylC-seq. All methods have been thoroughly benchmarked and integrated into the segemehl mapping suite. First, mapping of DNA-seq data is facilitated by the core mapping algorithm of segemehl. Since the initial publication, it has been continuously updated and expanded. Here, extensive and reproducible benchmarks are presented that compare segemehl to state-of-the-art read aligners on various data sets. The results indicate that it is not only more sensitive in finding the optimal alignment with respect to the unit edit distance but also very specific compared to most commonly used alternative read mappers. These advantages are observable for both real and simulated reads, are largely independent of the read length and sequencing technology, but come at the cost of higher running time and memory consumption. Second, the split-read extension of segemehl, presented by Hoffmann, enables the mapping of RNA-seq data, a computationally more difficult form of the mapping task due to the occurrence of splicing. Here, the novel tool lack is presented, which aims to recover missed RNA-seq read alignments using de novo splice junction information. It performs very well in benchmarks and may thus be a beneficial extension to RNA-seq analysis pipelines. Third, a novel method is introduced that facilitates the mapping of bisulfite-treated sequencing data. This protocol is considered the gold standard in genome-wide studies of DNA methylation, one of the major epigenetic modifications in animals and plants. The treatment of DNA with sodium bisulfite selectively converts unmethylated cytosines to uracils, while methylated ones remain unchanged. The bisulfite extension developed here performs seed searches on a collapsed alphabet followed by bisulfite-sensitive dynamic programming alignments. Thus, it is insensitive to bisulfite-related mismatches and does not rely on post-processing, in contrast to other methods. In comparison to state-of-the-art tools, this method achieves significantly higher sensitivities and performs time-competitive in mapping millions of sequencing reads to vertebrate genomes. Remarkably, the increase in sensitivity does not come at the cost of decreased specificity and thus may finally result in a better performance in calling the methylation rate. Lastly, the potential of mapping strategies for de novo genome assemblies is demonstrated with the introduction of a new guided assembly procedure. It incorporates mapping as major component and uses the additional information (e.g., annotation) as guide. With this method, the complete mitochondrial genome of Eulimnogammarus verrucosus has been successfully assembled even though the sequencing library has been heavily dominated by nuclear DNA. In summary, this thesis introduces algorithmic methods that significantly improve the analysis of tiling array, DNA-seq, RNA-seq, and MethylC-seq data, and proposes standards for benchmarking NGS read aligners. Moreover, it presents a new guided assembly procedure that has been successfully applied in the de novo assembly of a crustacean mitogenome. / Diese Arbeit befasst sich mit der Entwicklung und dem Benchmarken von Verfahren zur Analyse von Daten aus Hochdurchsatz-Technologien, wie Tiling Arrays oder Hochdurchsatz-Sequenzierung. Tiling Arrays bildeten lange Zeit die Grundlage für die genomweite Untersuchung des Transkriptoms und kamen beispielsweise bei der Identifizierung funktioneller Elemente im menschlichen Genom zum Einsatz. In dieser Arbeit wird ein neues statistisches Verfahren zur Auswertung von Tiling Array-Daten vorgestellt. Darin werden Segmente als exprimiert klassifiziert, wenn sich deren Signale signifikant von der Hintergrundverteilung unterscheiden. Dadurch werden keine auf den Datensatz abgestimmten Parameterwerte benötigt. Die hier vorgestellte Methode erkennt differentiell exprimierte Segmente in biologischen Daten bei gleicher Sensitivität mit geringerer Falsch-Positiv-Rate im Vergleich zu den derzeit hauptsächlich eingesetzten Verfahren. Zudem ist die Methode bei der Erkennung von Exon-Intron Grenzen präziser. Die Suche nach Anhäufungen exprimierter Segmente hat darüber hinaus zur Entdeckung von sehr langen Regionen geführt, welche möglicherweise eine neue Klasse von macroRNAs darstellen. Nach dem Exkurs zu Tiling Arrays konzentriert sich diese Arbeit nun auf die Hochdurchsatz-Sequenzierung, für die bereits verschiedene Sequenzierungsprotokolle zur Untersuchungen des Genoms, Transkriptoms und Epigenoms etabliert sind. Einer der ersten und entscheidenden Schritte in der Analyse von Sequenzierungsdaten stellt in den meisten Fällen das Mappen dar, bei dem kurze Sequenzen (Reads) auf ein großes Referenzgenom aligniert werden. Die vorliegende Arbeit stellt algorithmische Methoden vor, welche das Mapping-Problem für drei wichtige Sequenzierungsprotokolle (DNA-Seq, RNA-Seq und MethylC-Seq) lösen. Alle Methoden wurden ausführlichen Benchmarks unterzogen und sind in der segemehl-Suite integriert. Als Erstes wird hier der Kern-Algorithmus von segemehl vorgestellt, welcher das Mappen von DNA-Sequenzierungsdaten ermöglicht. Seit der ersten Veröffentlichung wurde dieser kontinuierlich optimiert und erweitert. In dieser Arbeit werden umfangreiche und auf Reproduzierbarkeit bedachte Benchmarks präsentiert, in denen segemehl auf zahlreichen Datensätzen mit bekannten Mapping-Programmen verglichen wird. Die Ergebnisse zeigen, dass segemehl nicht nur sensitiver im Auffinden von optimalen Alignments bezüglich der Editierdistanz sondern auch sehr spezifisch im Vergleich zu anderen Methoden ist. Diese Vorteile sind in realen und simulierten Daten unabhängig von der Sequenzierungstechnologie oder der Länge der Reads erkennbar, gehen aber zu Lasten einer längeren Laufzeit und eines höheren Speicherverbrauchs. Als Zweites wird das Mappen von RNA-Sequenzierungsdaten untersucht, welches bereits von der Split-Read-Erweiterung von segemehl unterstützt wird. Aufgrund von Spleißen ist diese Form des Mapping-Problems rechnerisch aufwendiger. In dieser Arbeit wird das neue Programm lack vorgestellt, welches darauf abzielt, fehlende Read-Alignments mit Hilfe von de novo Spleiß-Information zu finden. Es erzielt hervorragende Ergebnisse und stellt somit eine sinnvolle Ergänzung zu Analyse-Pipelines für RNA-Sequenzierungsdaten dar. Als Drittes wird eine neue Methode zum Mappen von Bisulfit-behandelte Sequenzierungsdaten vorgestellt. Dieses Protokoll gilt als Goldstandard in der genomweiten Untersuchung der DNA-Methylierung, einer der wichtigsten epigenetischen Modifikationen in Tieren und Pflanzen. Dabei wird die DNA vor der Sequenzierung mit Natriumbisulfit behandelt, welches selektiv nicht methylierte Cytosine zu Uracilen konvertiert, während Methylcytosine davon unberührt bleiben. Die hier vorgestellte Bisulfit-Erweiterung führt die Seed-Suche auf einem reduziertem Alphabet durch und verifiziert die erhaltenen Treffer mit einem auf dynamischer Programmierung basierenden Bisulfit-sensitiven Alignment-Algorithmus. Das verwendete Verfahren ist somit unempfindlich gegenüber Bisulfit-Konvertierungen und erfordert im Gegensatz zu anderen Verfahren keine weitere Nachverarbeitung. Im Vergleich zu aktuell eingesetzten Programmen ist die Methode sensitiver und benötigt eine vergleichbare Laufzeit beim Mappen von Millionen von Reads auf große Genome. Bemerkenswerterweise wird die erhöhte Sensitivität bei gleichbleibend guter Spezifizität erreicht. Dadurch könnte diese Methode somit auch bessere Ergebnisse bei der präzisen Bestimmung der Methylierungsraten erreichen. Schließlich wird noch das Potential von Mapping-Strategien für Assemblierungen mit der Einführung eines neuen, Kristallisation-genanntes Verfahren zur unterstützten Assemblierung aufgezeigt. Es enthält Mapping als Hauptbestandteil und nutzt Zusatzinformation (z.B. Annotationen) als Unterstützung. Dieses Verfahren ermöglichte die erfolgreiche Assemblierung des kompletten mitochondrialen Genoms von Eulimnogammarus verrucosus trotz einer vorwiegend aus nukleärer DNA bestehenden genomischen Bibliothek. Zusammenfassend stellt diese Arbeit algorithmische Methoden vor, welche die Analysen von Tiling Array, DNA-Seq, RNA-Seq und MethylC-Seq Daten signifikant verbessern. Es werden zudem Standards für den Vergleich von Programmen zum Mappen von Daten der Hochdurchsatz-Sequenzierung vorgeschlagen. Darüber hinaus wird ein neues Verfahren zur unterstützten Genom-Assemblierung vorgestellt, welches erfolgreich bei der de novo-Assemblierung eines mitochondrialen Krustentier-Genoms eingesetzt wurde. Genominformatik Tiling Arrays Mapping Split-Read Mapping Bisulfite-behandelte Sequenzierungsdaten unterstützte Assemblierung Genome informatics mapping split-read mapping bisulfite-treated sequencing data guided assembly ddc:500
92	Monomino-Domino Tatami Coverings Erickson, Alejandro 03 September 2013 (has links) We present several new results on the combinatorial properties of a locally restricted version of monomino-domino coverings of rectilinear regions. These are monomino-domino tatami coverings, and the restriction is that no four tiles may meet at any point. The global structure that the tatami restriction induces has numerous implications, and provides a powerful tool for solving enumeration problems on tatami coverings. Among these we address the enumeration of coverings of rectangles, with various parameters, and we develop algorithms for exhaustive generation of coverings, in constant amortised time per covering. We also con- sider computational complexity on two fronts; firstly, the structure shows that the space required to store a covering of the rectangle is linear in its longest dimension, and secondly, it is NP-complete to decide whether an arbitrary polyomino can be tatami-covered only with dominoes. / Graduate / 0984 / 0405 / alejandro.erickson@gmail.com combinatorics cstheory math.CO cs.CC computer science theory mathematics tiling covering tatami restriction monomino domino enumeration combinatorial algorithms combinatorial generation computational complexity
93	Cohomology and K-theory of aperiodic tilings Savinien, Jean P.X. 19 May 2008 (has links) We study the K-theory and cohomology of spaces of aperiodic and repetitive tilings with finite local complexity. Given such a tiling, we build a spectral sequence converging to its K-theory and define a new cohomology (PV cohomology) that appears naturally in the second page of this spectral sequence. This spectral sequence can be seen as a generalization of the Leray-Serre spectral sequence and the PV cohomology generalizes the cohomology of the base space of a Serre fibration with local coefficients in the K-theory of its fiber. We prove that the PV cohomology of such a tiling is isomorphic to the Cech cohomology of its hull. We give examples of explicit calculations of PV cohomology for a class of 1-dimensional tilings (obtained by cut-and-projection of a 2-dimensional lattice). We also study the groupoid of the transversal of the hull of such tilings and show that they can be recovered: 1) from inverse limit of simpler groupoids (which are quotients of free categories generated by finite graphs), and 2) from an inverse semi group that arises from PV cohomology. The underslying Delone set of punctures of such tilings modelizes the atomics positions in an aperiodic solid at zero temperature. We also present a study of (classical and harmonic) vibrational waves of low energy on such solids (acoustic phonons). We establish that the energy functional (the "matrix of spring constants" which describes the vibrations of the atoms around their equilibrium positions) behaves like a Laplacian at low energy. K-theory Tilings Cohomology Homology theory K-theory Aperiodic tilings Algebraic topology Tiling (Mathematics) Combinatorial designs and configurations Mathematics Spectral sequences (Mathematics)
94	Optimisation de la localité des données sur architectures manycœurs / Data locality on manycore architectures Amstel, Duco van 18 July 2016 (has links) L'évolution continue des architectures des processeurs a été un moteur important de la recherche en compilation. Une tendance dans cette évolution qui existe depuis l'avènement des ordinateurs modernes est le rapport grandissant entre la puissance de calcul disponible (IPS, FLOPS, ...) et la bande-passante correspondante qui est disponible entre les différents niveaux de la hiérarchie mémoire (registres, cache, mémoire vive). En conséquence la réduction du nombre de communications mémoire requis par un code donnée a constitué un sujet de recherche important. Un principe de base en la matière est l'amélioration de la localité temporelle des données: regrouper dans le temps l'ensemble des accès à une donnée précise pour qu'elle ne soit requise que pendant peu de temps et pour qu'elle puisse ensuite être transféré vers de la mémoire lointaine (mémoire vive) sans communications supplémentaires.Une toute autre évolution architecturale a été l'arrivée de l'ère des multicoeurs et au cours des dernières années les premières générations de processeurs manycoeurs. Ces architectures ont considérablement accru la quantité de parallélisme à la disposition des programmes et algorithmes mais ceci est à nouveau limité par la bande-passante disponible pour les communications entres coeurs. Ceci a amené dans le monde de la compilation et des techniques d'optimisation des problèmes qui étaient jusqu'à là uniquement connus en calcul distribué.Dans ce texte nous présentons les premiers travaux sur une nouvelle technique d'optimisation, le pavage généralisé qui a l'avantage d'utiliser un modèle abstrait pour la réutilisation des données et d'être en même temps utilisable dans un grand nombre de contextes. Cette technique trouve son origine dans le pavage de boucles, une techniques déjà bien connue et qui a été utilisée avec succès pour l'amélioration de la localité des données dans les boucles imbriquées que ce soit pour les registres ou pour le cache. Cette nouvelle variante du pavage suit une vision beaucoup plus large et ne se limite pas au cas des boucles imbriquées. Elle se base sur une nouvelle représentation, le graphe d'utilisation mémoire, qui est étroitement lié à un nouveau modèle de besoins en termes de mémoire et de communications et qui s'applique à toute forme de code exécuté itérativement. Le pavage généralisé exprime la localité des données comme un problème d'optimisation pour lequel plusieurs solutions sont proposées. L'abstraction faite par le graphe d'utilisation mémoire permet la résolution du problème d'optimisation dans différents contextes. Pour l'évaluation expérimentale nous montrons comment utiliser cette nouvelle technique dans le cadre des boucles, imbriquées ou non, ainsi que dans le cas des programmes exprimés dans un langage à flot-de-données. En anticipant le fait d'utiliser le pavage généralisé pour la distribution des calculs entre les cœurs d'une architecture manycoeurs nous donnons aussi des éléments de réponse pour modéliser les communications et leurs caractéristiques sur ce genre d'architectures. En guise de point final, et pour montrer l'étendue de l'expressivité du graphe d'utilisation mémoire et le modèle de besoins en mémoire et communications sous-jacent, nous aborderons le sujet du débogage de performances et l'analyse des traces d'exécution. Notre but est de fournir un retour sur le potentiel d'amélioration en termes de localité des données du code évalué. Ce genre de traces peut contenir des informations au sujet des communications mémoire durant l'exécution et a de grandes similitudes avec le problème d'optimisation précédemment étudié. Ceci nous amène à une brève introduction dans le monde de l'algorithmique des graphes dirigés et la mise-au-point de quelques nouvelles heuristiques pour le problème connu de joignabilité mais aussi pour celui bien moins étudié du partitionnement convexe. / The continuous evolution of computer architectures has been an important driver of research in code optimization and compiler technologies. A trend in this evolution that can be traced back over decades is the growing ratio between the available computational power (IPS, FLOPS, ...) and the corresponding bandwidth between the various levels of the memory hierarchy (registers, cache, DRAM). As a result the reduction of the amount of memory communications that a given code requires has been an important topic in compiler research. A basic principle for such optimizations is the improvement of temporal data locality: grouping all references to a single data-point as close together as possible so that it is only required for a short duration and can be quickly moved to distant memory (DRAM) without any further memory communications.Yet another architectural evolution has been the advent of the multicore era and in the most recent years the first generation of manycore designs. These architectures have considerably raised the bar of the amount of parallelism that is available to programs and algorithms but this is again limited by the available bandwidth for communications between the cores. This brings some issues thatpreviously were the sole preoccupation of distributed computing to the world of compiling and code optimization techniques.In this document we present a first dive into a new optimization technique which has the promise of offering both a high-level model for data reuses and a large field of potential applications, a technique which we refer to as generalized tiling. It finds its source in the already well-known loop tiling technique which has been applied with success to improve data locality for both register and cache-memory in the case of nested loops. This new "flavor" of tiling has a much broader perspective and is not limited to the case of nested loops. It is build on a new representation, the memory-use graph, which is tightly linked to a new model for both memory usage and communication requirements and which can be used for all forms of iterate code.Generalized tiling expresses data locality as an optimization problem for which multiple solutions are proposed. With the abstraction introduced by the memory-use graph it is possible to solve this optimization problem in different environments. For experimental evaluations we show how this new technique can be applied in the contexts of loops, nested or not, as well as for computer programs expressed within a dataflow language. With the anticipation of using generalized tiling also to distributed computations over the cores of a manycore architecture we also provide some insight into the methods that can be used to model communications and their characteristics on such architectures.As a final point, and in order to show the full expressiveness of the memory-use graph and even more the underlying memory usage and communication model, we turn towards the topic of performance debugging and the analysis of execution traces. Our goal is to provide feedback on the evaluated code and its potential for further improvement of data locality. Such traces may contain information about memory communications during an execution and show strong similarities with the previously studied optimization problem. This brings us to a short introduction to the algorithmics of directed graphs and the formulation of some new heuristics for the well-studied topic of reachability and the much less known problem of convex partitioning. Optimisation de code Localité des données Ordonnancement Pavage de boucles Langages dataflow Architectures manycœurs Code optimisation Data locality Scheduling Loop tiling Dataflow languages Manycore architectures 004
95	Polymage : Automatic Optimization for Image Processing Pipelines Mullapudi, Ravi Teja January 2015 (has links) (PDF) Image processing pipelines are ubiquitous. Every image captured by a camera and every image uploaded on social networks like Google+or Facebook is processed by a pipeline. Applications in a wide range of domains like computational photography, computer vision and medical imaging use image processing pipelines. Many of these applications demand high-performance which requires effective utilization of modern architectures. Given the proliferation of camera enabled devices and social networks optimizing these emerging workloads has become important both at the data center and the embedded device scales. An image processing pipeline can be viewed as a graph of interconnected stages which process images successively. Each stage typically performs one of point-wise, stencil, sam-pling, reduction or data-dependent operations on image pixels. Individual stages in a pipeline typically exhibit abundant data parallelism that can be exploited with relative ease. However, the stages also require high memory bandwidth preventing effective uti-lization of parallelism available on modern architectures. The traditional options are using optimized libraries like OpenCV or to optimize manually. While using libraries precludes optimization across library routines, manual optimization accounting for both parallelism and locality is very tedious. Inthisthesis,wepresentthedesignandimplementationofPolyMage,adomain-speciﬁc language and compiler for image processing pipelines. The focus of the system is on au-tomatically generating high-performance implementations of image processing pipelines expressed in a high-level declarative language. We achieve such automation with: • tiling techniques to improve parallelism and locality by introducing redundant computation, v a model-driven fusion heuristic which enables a trade-off between locality and re-dundant computations, and anautotuner whichleveragesthefusionheuristictoexploreasmallsubsetofpipeline implementations and ﬁnd the best performing one. Our optimization approach primarily relies on the transformation and code generation ca-pabilities of the polyhedral compiler framework. To the best of our knowledge, this is the ﬁrst model-driven compiler for image processing pipelines that performs complex fusion, tiling, and storage optimization fully automatically. We evaluate our framework on a modern multicore system using a set of seven benchmarks which vary widely in structure and complexity. Experimental results show that the performance of pipeline implementations generated by our approach is: • up to 1.81× better than pipeline implementations manually tuned using Halide, a state-of-the-art language and compiler for image processing pipelines, • on average 5.39× better than pipeline implementations automatically tuned using Halide and OpenTuner, and • on average 3.3× better than naive pipeline implementations which only exploit par-allelism without optimizing for locality. We also demonstrate that the performance of PolyMage generated code is better or compa-rable to implementations using OpenCV, a state-of-the-art image processing and computer vision library. Polymage Image Processing Polyhydral Optimization Image Processing Pipelines Domain Specific Languages Compiler Optimiation Multicore Processing Parallel Computing Code Generation Tiling Pipeline Graphs Computer Science
96	Polysimplices in euclidean spaces and the enumeration of domino tilings of rectangles Michel, Jean-Luc 15 June 2011 (has links) Nous étudions, dans la première partie de notre thèse, les polysimplexes d’un espace euclidien de dimension quelconque, c’est-à-dire les objets consistant en une juxtaposition de simplexes réguliers (de tétraèdres si la dimension est 3) accolés le long de leurs faces. Nous étudions principalement le groupe des symétries de ces polysimplexes. Nous présentons une façon de représenter un polysimplexe à l’aide d’un diagramme. Ceci fournit une classification complète des polysimplexes à similitude près. De plus, le groupe des symétries se déduit du groupe des automorphismes du diagramme. Il découle en particulier de notre étude qu’en dimension supérieure à 2, une telle structure ne possède jamais deux faces parallèles et ne contient jamais de circuit fermé de simplexes.<p><p>Dans la seconde partie de notre thèse, nous abordons un problème classique de combinatoire :l’énumération des pavages d’un rectangle mxn à l’aide de dominos. Klarner et Pollack ont montré qu’en fixant m la suite obtenue vérifie une relation de récurrence linéaire à coefficients constants. Nous établissons une nouvelle méthode nous permettant d’obtenir la fonction génératrice correspondante et la calculons pour m <= 16, alors qu’elle n’était connue que pour m <= 10.<p> / Doctorat en Sciences / info:eu-repo/semantics/nonPublished Mathématiques Simplexes (Mathematics) Symmetry (Mathematics) Automorphisms Simplexes (Mathématiques) Symétrie (Mathématiques) Automorphismes symmetry group domino tiling perfect matching dimer generating function regular simplex regular tetrahedron polysimplex
97	Identification of Heat Shock Factor Binding Sites in the Drosophila Genome Gonsalves, Sarah E. 12 December 2012 (has links) The heat shock response (HSR) is a highly conserved mechanism that enables organisms to survive environmental and pathophysiological stress. In Drosophila, the HSR is regulated by a single transcription factor, heat shock factor (HSF). During stress, HSF trimerizes and binds to over 200 loci on Drosophila polytene chromosomes with only nine mapping to major heat shock (HS) inducible gene loci. The function of HSF binding to the other sites in the genome is currently unknown. Some of these sites may contain yet unidentified “minor” HS genes. Interestingly, the binding of HSF also coincides with puff regression at some sites. Two such sites contain the major developmentally regulated genes Eip74 and Eip75: key regulators in the response to 20-hydroxyecdysone (20E), the main hormone responsible for the temporal co-ordination of post-embryonic development in Drosophila. Previous work in our and other labs indicates that the regression of non-HS puffs during the HSR is dependent on the presence of functional HSF. Using chromatin immunoprecipitation (ChIP) followed by hybridization to genome tiling arrays (Chip), I have identified 434 regions in the Drosophila Kc cell genome that are bound by HSF during HS, and have determined that 57% of these sites are located within the transcribed regions of genes. By examining the transcriptional response to HS in Kc cells and third instar larvae using expression microarrays, I found that only about 10% of all genes within 1250 bp of an HSF binding site are transcriptionally regulated by HS and many genes whose transcript levels change during HS do not appear to be near an HSF binding site. Furthermore, genes with an HSF binding site within their introns are significantly enriched (modified Fisher Exact p-value between 2.0x10-3 and 1.5x10-6) in gene ontology terms related to developmental processes and reproduction. Using expression microarray technology, I characterized the transcriptional response to 20E and its structural analog ponasterone A. I have identified multiple HSF binding sites within Eip74 and Eip75, and show that induction of the HSR correlates with repression of these genes and all other 20E-inducible genes. Taken together, this work provides a basis for further investigation into the role of HSF binding to sites not associated with HS genes and its possible function as a repressor of gene transcription during conditions of stress and as a regulator of developmental genes under stress and non-stress conditions. HSF Heat Shock Factor Heat Shock Response transcription gene expression genomics ChIP-on-chip ecdysone Eip75B Ponasterone A 20-hydroxyecdysone ecdysone response hsf4 transcription factor models of transcription genome tiling arrays Heat Shock Protein Heat Shock Gene 0307
98	Identification of Heat Shock Factor Binding Sites in the Drosophila Genome Gonsalves, Sarah E. 12 December 2012 (has links) The heat shock response (HSR) is a highly conserved mechanism that enables organisms to survive environmental and pathophysiological stress. In Drosophila, the HSR is regulated by a single transcription factor, heat shock factor (HSF). During stress, HSF trimerizes and binds to over 200 loci on Drosophila polytene chromosomes with only nine mapping to major heat shock (HS) inducible gene loci. The function of HSF binding to the other sites in the genome is currently unknown. Some of these sites may contain yet unidentified “minor” HS genes. Interestingly, the binding of HSF also coincides with puff regression at some sites. Two such sites contain the major developmentally regulated genes Eip74 and Eip75: key regulators in the response to 20-hydroxyecdysone (20E), the main hormone responsible for the temporal co-ordination of post-embryonic development in Drosophila. Previous work in our and other labs indicates that the regression of non-HS puffs during the HSR is dependent on the presence of functional HSF. Using chromatin immunoprecipitation (ChIP) followed by hybridization to genome tiling arrays (Chip), I have identified 434 regions in the Drosophila Kc cell genome that are bound by HSF during HS, and have determined that 57% of these sites are located within the transcribed regions of genes. By examining the transcriptional response to HS in Kc cells and third instar larvae using expression microarrays, I found that only about 10% of all genes within 1250 bp of an HSF binding site are transcriptionally regulated by HS and many genes whose transcript levels change during HS do not appear to be near an HSF binding site. Furthermore, genes with an HSF binding site within their introns are significantly enriched (modified Fisher Exact p-value between 2.0x10-3 and 1.5x10-6) in gene ontology terms related to developmental processes and reproduction. Using expression microarray technology, I characterized the transcriptional response to 20E and its structural analog ponasterone A. I have identified multiple HSF binding sites within Eip74 and Eip75, and show that induction of the HSR correlates with repression of these genes and all other 20E-inducible genes. Taken together, this work provides a basis for further investigation into the role of HSF binding to sites not associated with HS genes and its possible function as a repressor of gene transcription during conditions of stress and as a regulator of developmental genes under stress and non-stress conditions. HSF Heat Shock Factor Heat Shock Response transcription gene expression genomics ChIP-on-chip ecdysone Eip75B Ponasterone A 20-hydroxyecdysone ecdysone response hsf4 transcription factor models of transcription genome tiling arrays Heat Shock Protein Heat Shock Gene 0307
99	An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations Pananilath, Irshad Muhammed January 2014 (has links) (PDF) Lattice-Boltzmann method(LBM), a promising new particle-based simulation technique for complex and multiscale fluid flows, has seen tremendous adoption in recent years in computational fluid dynamics. Even with a state-of-the-art LBM solver such as Palabos, a user still has to manually write his program using the library-supplied primitives. We propose an automated code generator for a class of LBM computations with the objective to achieve high performance on modern architectures. Tiling is a very important loop transformation used to improve the performance of stencil computations by exploiting locality and parallelism. In the first part of the work, we explore diamond tiling, a new tiling technique to exploit the inherent ability of most stencils to allow tile-wise concurrent start. This enables perfect load-balance during execution and reduces the frequency of synchronization required. Few studies have looked at time tiling for LBM codes. We exploit a key similarity between stencils and LBM to enable polyhedral optimizations and in turn time tiling for LBM. Besides polyhedral transformations, we also describe a number of other complementary transformations and post processing necessary to obtain good parallel and SIMD performance on modern architectures. We also characterize the performance of LBM with the Roofline performance model. Experimental results for standard LBM simulations like Lid Driven Cavity, Flow Past Cylinder, and Poiseuille Flow show that our scheme consistently outperforms Palabos–on average by3 x while running on 16 cores of a n Intel Xeon Sandy bridge system. We also obtain a very significant improvement of 2.47 x over the native production compiler on the SPECLBM benchmark. Lattice-Boltzmann Computations Computational Fluid Dynamics Tiling Stencil Computations Single Instruction Multiple Data (SIMD) Parallel Computers Parallel Processing Loop Transformations Lattice-Boltzman Method (LBM) Lattice Boltzman Method Lattice-Boltzmann Equation Computer Science
100	The mapping task and its various applications in next-generation sequencing Otto, Christian 27 February 2015 (has links) The aim of this thesis is the development and benchmarking of computational methods for the analysis of high-throughput data from tiling arrays and next-generation sequencing. Tiling arrays have been a mainstay of genome-wide transcriptomics, e.g., in the identification of functional elements in the human genome. Due to limitations of existing methods for the data analysis of this data, a novel statistical approach is presented that identifies expressed segments as significant differences from the background distribution and thus avoids dataset-specific parameters. This method detects differentially expressed segments in biological data with significantly lower false discovery rates and equivalent sensitivities compared to commonly used methods. In addition, it is also clearly superior in the recovery of exon-intron structures. Moreover, the search for local accumulations of expressed segments in tiling array data has led to the identification of very large expressed regions that may constitute a new class of macroRNAs. This thesis proceeds with next-generation sequencing for which various protocols have been devised to study genomic, transcriptomic, and epigenomic features. One of the first crucial steps in most NGS data analyses is the mapping of sequencing reads to a reference genome. This work introduces algorithmic methods to solve the mapping tasks for three major NGS protocols: DNA-seq, RNA-seq, and MethylC-seq. All methods have been thoroughly benchmarked and integrated into the segemehl mapping suite. First, mapping of DNA-seq data is facilitated by the core mapping algorithm of segemehl. Since the initial publication, it has been continuously updated and expanded. Here, extensive and reproducible benchmarks are presented that compare segemehl to state-of-the-art read aligners on various data sets. The results indicate that it is not only more sensitive in finding the optimal alignment with respect to the unit edit distance but also very specific compared to most commonly used alternative read mappers. These advantages are observable for both real and simulated reads, are largely independent of the read length and sequencing technology, but come at the cost of higher running time and memory consumption. Second, the split-read extension of segemehl, presented by Hoffmann, enables the mapping of RNA-seq data, a computationally more difficult form of the mapping task due to the occurrence of splicing. Here, the novel tool lack is presented, which aims to recover missed RNA-seq read alignments using de novo splice junction information. It performs very well in benchmarks and may thus be a beneficial extension to RNA-seq analysis pipelines. Third, a novel method is introduced that facilitates the mapping of bisulfite-treated sequencing data. This protocol is considered the gold standard in genome-wide studies of DNA methylation, one of the major epigenetic modifications in animals and plants. The treatment of DNA with sodium bisulfite selectively converts unmethylated cytosines to uracils, while methylated ones remain unchanged. The bisulfite extension developed here performs seed searches on a collapsed alphabet followed by bisulfite-sensitive dynamic programming alignments. Thus, it is insensitive to bisulfite-related mismatches and does not rely on post-processing, in contrast to other methods. In comparison to state-of-the-art tools, this method achieves significantly higher sensitivities and performs time-competitive in mapping millions of sequencing reads to vertebrate genomes. Remarkably, the increase in sensitivity does not come at the cost of decreased specificity and thus may finally result in a better performance in calling the methylation rate. Lastly, the potential of mapping strategies for de novo genome assemblies is demonstrated with the introduction of a new guided assembly procedure. It incorporates mapping as major component and uses the additional information (e.g., annotation) as guide. With this method, the complete mitochondrial genome of Eulimnogammarus verrucosus has been successfully assembled even though the sequencing library has been heavily dominated by nuclear DNA. In summary, this thesis introduces algorithmic methods that significantly improve the analysis of tiling array, DNA-seq, RNA-seq, and MethylC-seq data, and proposes standards for benchmarking NGS read aligners. Moreover, it presents a new guided assembly procedure that has been successfully applied in the de novo assembly of a crustacean mitogenome. / Diese Arbeit befasst sich mit der Entwicklung und dem Benchmarken von Verfahren zur Analyse von Daten aus Hochdurchsatz-Technologien, wie Tiling Arrays oder Hochdurchsatz-Sequenzierung. Tiling Arrays bildeten lange Zeit die Grundlage für die genomweite Untersuchung des Transkriptoms und kamen beispielsweise bei der Identifizierung funktioneller Elemente im menschlichen Genom zum Einsatz. In dieser Arbeit wird ein neues statistisches Verfahren zur Auswertung von Tiling Array-Daten vorgestellt. Darin werden Segmente als exprimiert klassifiziert, wenn sich deren Signale signifikant von der Hintergrundverteilung unterscheiden. Dadurch werden keine auf den Datensatz abgestimmten Parameterwerte benötigt. Die hier vorgestellte Methode erkennt differentiell exprimierte Segmente in biologischen Daten bei gleicher Sensitivität mit geringerer Falsch-Positiv-Rate im Vergleich zu den derzeit hauptsächlich eingesetzten Verfahren. Zudem ist die Methode bei der Erkennung von Exon-Intron Grenzen präziser. Die Suche nach Anhäufungen exprimierter Segmente hat darüber hinaus zur Entdeckung von sehr langen Regionen geführt, welche möglicherweise eine neue Klasse von macroRNAs darstellen. Nach dem Exkurs zu Tiling Arrays konzentriert sich diese Arbeit nun auf die Hochdurchsatz-Sequenzierung, für die bereits verschiedene Sequenzierungsprotokolle zur Untersuchungen des Genoms, Transkriptoms und Epigenoms etabliert sind. Einer der ersten und entscheidenden Schritte in der Analyse von Sequenzierungsdaten stellt in den meisten Fällen das Mappen dar, bei dem kurze Sequenzen (Reads) auf ein großes Referenzgenom aligniert werden. Die vorliegende Arbeit stellt algorithmische Methoden vor, welche das Mapping-Problem für drei wichtige Sequenzierungsprotokolle (DNA-Seq, RNA-Seq und MethylC-Seq) lösen. Alle Methoden wurden ausführlichen Benchmarks unterzogen und sind in der segemehl-Suite integriert. Als Erstes wird hier der Kern-Algorithmus von segemehl vorgestellt, welcher das Mappen von DNA-Sequenzierungsdaten ermöglicht. Seit der ersten Veröffentlichung wurde dieser kontinuierlich optimiert und erweitert. In dieser Arbeit werden umfangreiche und auf Reproduzierbarkeit bedachte Benchmarks präsentiert, in denen segemehl auf zahlreichen Datensätzen mit bekannten Mapping-Programmen verglichen wird. Die Ergebnisse zeigen, dass segemehl nicht nur sensitiver im Auffinden von optimalen Alignments bezüglich der Editierdistanz sondern auch sehr spezifisch im Vergleich zu anderen Methoden ist. Diese Vorteile sind in realen und simulierten Daten unabhängig von der Sequenzierungstechnologie oder der Länge der Reads erkennbar, gehen aber zu Lasten einer längeren Laufzeit und eines höheren Speicherverbrauchs. Als Zweites wird das Mappen von RNA-Sequenzierungsdaten untersucht, welches bereits von der Split-Read-Erweiterung von segemehl unterstützt wird. Aufgrund von Spleißen ist diese Form des Mapping-Problems rechnerisch aufwendiger. In dieser Arbeit wird das neue Programm lack vorgestellt, welches darauf abzielt, fehlende Read-Alignments mit Hilfe von de novo Spleiß-Information zu finden. Es erzielt hervorragende Ergebnisse und stellt somit eine sinnvolle Ergänzung zu Analyse-Pipelines für RNA-Sequenzierungsdaten dar. Als Drittes wird eine neue Methode zum Mappen von Bisulfit-behandelte Sequenzierungsdaten vorgestellt. Dieses Protokoll gilt als Goldstandard in der genomweiten Untersuchung der DNA-Methylierung, einer der wichtigsten epigenetischen Modifikationen in Tieren und Pflanzen. Dabei wird die DNA vor der Sequenzierung mit Natriumbisulfit behandelt, welches selektiv nicht methylierte Cytosine zu Uracilen konvertiert, während Methylcytosine davon unberührt bleiben. Die hier vorgestellte Bisulfit-Erweiterung führt die Seed-Suche auf einem reduziertem Alphabet durch und verifiziert die erhaltenen Treffer mit einem auf dynamischer Programmierung basierenden Bisulfit-sensitiven Alignment-Algorithmus. Das verwendete Verfahren ist somit unempfindlich gegenüber Bisulfit-Konvertierungen und erfordert im Gegensatz zu anderen Verfahren keine weitere Nachverarbeitung. Im Vergleich zu aktuell eingesetzten Programmen ist die Methode sensitiver und benötigt eine vergleichbare Laufzeit beim Mappen von Millionen von Reads auf große Genome. Bemerkenswerterweise wird die erhöhte Sensitivität bei gleichbleibend guter Spezifizität erreicht. Dadurch könnte diese Methode somit auch bessere Ergebnisse bei der präzisen Bestimmung der Methylierungsraten erreichen. Schließlich wird noch das Potential von Mapping-Strategien für Assemblierungen mit der Einführung eines neuen, Kristallisation-genanntes Verfahren zur unterstützten Assemblierung aufgezeigt. Es enthält Mapping als Hauptbestandteil und nutzt Zusatzinformation (z.B. Annotationen) als Unterstützung. Dieses Verfahren ermöglichte die erfolgreiche Assemblierung des kompletten mitochondrialen Genoms von Eulimnogammarus verrucosus trotz einer vorwiegend aus nukleärer DNA bestehenden genomischen Bibliothek. Zusammenfassend stellt diese Arbeit algorithmische Methoden vor, welche die Analysen von Tiling Array, DNA-Seq, RNA-Seq und MethylC-Seq Daten signifikant verbessern. Es werden zudem Standards für den Vergleich von Programmen zum Mappen von Daten der Hochdurchsatz-Sequenzierung vorgeschlagen. Darüber hinaus wird ein neues Verfahren zur unterstützten Genom-Assemblierung vorgestellt, welches erfolgreich bei der de novo-Assemblierung eines mitochondrialen Krustentier-Genoms eingesetzt wurde. info:eu-repo/classification/ddc/500 ddc:500

Search results