Algorithmes et structures de données efficaces pour l’indexation de séquences d’ADN / Efficient algorithms and data structures for indexing DNA sequence data
Salikhov, Kamil, 17 November 2017
The amount of data generated by Next Generation Sequencing technologies has been increasing exponentially in recent years. Storing, processing and transferring this data are becoming more and more challenging tasks. To cope with them, data scientists must develop increasingly efficient approaches and techniques.

In this thesis we present efficient data structures and algorithmic methods for the problems of approximate string matching, genome assembly, read compression and taxonomy-based metagenomic classification.

Approximate string matching is an extensively studied problem with a large number of published papers, both theoretical and practical. In bioinformatics, the read mapping problem can be regarded as approximate string matching. Here we study string matching strategies based on bidirectional indices. We define a framework, called search schemes, to work with search strategies of this type, then provide a probabilistic measure for the efficiency of search schemes, prove several combinatorial properties of efficient search schemes and provide experimental computations supporting the superiority of our strategies.

Genome assembly is one of the basic problems of bioinformatics. Here we present the Cascading Bloom filter, a data structure that improves on the standard Bloom filter and can be applied to several problems, including genome assembly. We provide theoretical and experimental results on the properties of the Cascading Bloom filter. We also show how the Cascading Bloom filter can be used to solve another important problem, read compression.

Another problem studied in this thesis is the metagenomic classification of DNA reads. We present a BWT-based approach (built on the Burrows-Wheeler transform) for quick and memory-efficient search of k-mers (words of length k). We mainly focus on data structures that improve on the speed and memory usage of the classical BWT-index for our application.
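As a point of reference for the classical BWT-index mentioned above, the following minimal Python sketch counts exact occurrences of a k-mer by backward search. It is not the improved structure developed in the thesis: the function names (`build_index`, `count_kmer`), the `$` sentinel and the naive suffix sorting are illustrative assumptions, suitable only for short toy texts.

```python
from collections import Counter


def build_index(text):
    """Build a toy BWT index: the C table, prefix counts (Occ) and text length."""
    text += "$"  # sentinel: assumed absent from the input and smaller than A, C, G, T
    # Naive suffix array construction; fine for a toy text, too slow for genomes.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character = character preceding each sorted suffix (index -1 wraps to "$").
    bwt = "".join(text[i - 1] for i in suffix_array)
    # c_table[ch] = number of characters in the text strictly smaller than ch.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return c_table, occ, len(bwt)


def count_kmer(index, kmer):
    """Count occurrences of kmer by backward search over the BWT index."""
    c_table, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(kmer):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_index("ACGTACGA")
print(count_kmer(index, "ACG"), count_kmer(index, "GTT"))
```

The full Occ table used here costs memory proportional to the text length times the alphabet size; practical indexes sample it instead, which is the kind of speed and memory trade-off the abstract refers to.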
Bioinformatics analyses for next-generation sequencing of plasma DNA.
January 2012
The presence of fetal DNA in the cell-free plasma of pregnant women was first described in 1997. The initial clinical applications of this phenomenon focused on the detection of paternally inherited traits such as sex and rhesus D blood group status. The development of massively parallel sequencing technologies has allowed more sophisticated analyses of circulating cell-free DNA in maternal plasma. For example, through the determination of the proportional representation of chromosome 21 sequences in maternal plasma, noninvasive prenatal diagnosis of fetal Down syndrome can be achieved with an accuracy of >98%.

In the first part of my thesis, I have developed bioinformatics algorithms to perform genome-wide construction of the fetal genetic map from the massively parallel sequencing data of the maternal plasma DNA sample of a pregnant woman. Constructing the fetal genetic map from maternal plasma sequencing data is very challenging because fetal DNA constitutes only approximately 10% of the maternal plasma DNA. Moreover, as the fetal DNA in maternal plasma exists as short fragments of less than 200 bp, existing bioinformatics techniques for genome construction are not applicable for this purpose. For the construction of the genome-wide fetal genetic map, I have used the genomes of the father and the mother as scaffolds and calculated the fractional fetal DNA concentration. First, I looked at the paternal-specific sequences in maternal plasma to determine which portions of the father’s genome had been passed on to the fetus. For the determination of the maternal inheritance, I have developed the Relative Haplotype Dosage (RHDO) approach. This method is based on the principle that the portion of the maternal genome inherited by the fetus would be present in slightly higher concentration in the maternal plasma. The use of haplotype information makes more efficient use of the sequencing data, so the maternal inheritance can be determined with a much lower sequencing depth than by looking at individual loci in the genome. This algorithm makes it feasible to use genome-wide scanning to diagnose fetal genetic disorders prenatally in a noninvasive way.

With the emergence of targeted massively parallel sequencing, the sequencing cost per base is falling dramatically. Although the first part of the thesis provides a method to estimate the fractional fetal DNA concentration using parental genotype information, that method cannot deduce the fractional fetal DNA concentration directly from sequencing data without prior knowledge of the genotypes. In the second part of this thesis, I propose FetalQuant, a method based on a binomial mixture model, which uses maximum likelihood to estimate the fractional fetal DNA concentration directly from targeted massively parallel sequencing of maternal plasma DNA. Compared with existing methods, it removes the need for genotype information without loss of accuracy. Furthermore, by using Bayes’ rule, this method can identify the informative SNPs where the mother is homozygous and the fetus is heterozygous, which gives it the potential to detect dominantly inherited disorders.

Besides genetic analysis at the DNA level, epigenetic markers are also valuable for the development of noninvasive diagnosis. In the third part of this thesis, I have developed a bioinformatics algorithm to efficiently analyze genome-wide DNA methylation status based on the massively parallel sequencing of bisulfite-converted DNA. DNA methylation is one of the most important mechanisms for regulating gene expression, and studying the methylation of different genes is important for understanding physiological and pathological processes. Currently, the most popular method for analyzing DNA methylation status is bisulfite sequencing. The method relies on the fact that unmethylated cytosine residues are chemically converted to uracil on bisulfite treatment, whereas methylated cytosines remain unchanged. The converted uracil (read as thymine in the PCR products) and the unconverted cytosine can then be discriminated on sequencing. With the emergence of massively parallel sequencing platforms, it is possible to perform bisulfite sequencing analysis on a genome-wide scale. However, the bioinformatics analysis of genome-wide bisulfite sequencing data is much more complicated than analyzing the data from individual loci. Thus, I have developed Methyl-Pipe, a bioinformatics program for analyzing the genome-wide methylation status of DNA samples based on massively parallel sequencing. In the first step of this algorithm, an in-silico converted reference genome is produced by converting all the cytosine residues to thymine residues, separately for the Watson and Crick strands; the cytosines in the sequenced reads are converted to thymines in the same way. Then, the converted reads of bisulfite-converted DNA are aligned to this modified reference sequence. Finally, post-processing of the alignments removes non-unique and low-quality mappings and characterizes the methylation pattern in a genome-wide manner. With this new program, potential fetal-specific hypomethylated regions that can be used as blood biomarkers can be identified in a genome-wide manner.

Detailed summary in vernacular field only. / Jiang, Peiyong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 100-105). / Abstracts also in Chinese.
Section I: Background
Chapter 1: Circulating nucleic acids and next-generation sequencing
1.1 Circulating nucleic acids
1.2 Next-generation sequencing
1.3 Bioinformatics analyses
1.4 Applications of the NGS
1.5 Aims of this thesis
Section II: Mathematically decoding fetal genome in maternal plasma
Chapter 2: Characterizing the maternal and fetal genome in plasma at single base resolution
2.1 Introduction
2.2 SNP categories and principle
2.3 Clinical cases and SNP genotyping
2.4 Sequencing depth and fractional fetal DNA concentration determination
2.5 Filtering of genotyping errors for maternal genotypes
2.6 Constructing fetal genetic map in maternal plasma
2.7 Sequencing error estimation
2.8 Paternal-inherited alleles
2.9 Maternally-derived alleles by RHDO analysis
2.10 Recombination breakpoint simulation and detection
2.11 Prenatal diagnosis of β-thalassaemia
2.12 Discussion
Section III: Statistical model for fractional fetal DNA concentration estimation
Chapter 3: FetalQuant: deducing the fractional fetal DNA concentration from massively parallel sequencing of maternal plasma DNA
3.1 Introduction
3.2 Methods
3.2.1 Maternal-fetal genotype combinations
3.2.2 Binomial mixture model and likelihood
3.2.3 Fractional fetal DNA concentration fitting
3.3 Results
3.3.1 Datasets
3.3.2 Evaluation of FetalQuant algorithm
3.3.3 Simulation
3.3.4 Sequencing depth and the number of SNPs required by FetalQuant
3.5 Discussion
Section IV: NGS-based data analysis pipeline development
Chapter 4: Methyl-Pipe: Methyl-Seq bioinformatics analysis pipeline
4.1 Introduction
4.2 Methods
4.2.1 Overview of Methyl-Pipe
4.3 Results and discussion
Section V: Concluding remarks
Chapter 5: Conclusion and future perspectives
5.1 Conclusion
5.2 Future perspectives
Reference
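The idea of estimating the fractional fetal DNA concentration by maximizing a binomial mixture likelihood can be illustrated with a simplified Python sketch. It is not FetalQuant itself. It assumes that only SNPs where the mother is homozygous (AA) are used, that the fetus is either AA or AB at each of them, that the mixing weight and sequencing error rate are fixed, and that a plain grid search over the fraction f is sufficient. All function names, parameter values and read counts below are hypothetical.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def log_likelihood(f, sites, het_weight=0.5, error_rate=0.005):
    """Mixture log-likelihood of fetal fraction f.

    sites: list of (b_allele_reads, depth) at SNPs where the mother is AA.
    If the fetus is AB, the expected B-allele fraction is f / 2;
    if the fetus is AA, B reads arise only from errors (assumed rate).
    """
    total = 0.0
    for b_count, depth in sites:
        het = het_weight * math.exp(log_binom_pmf(b_count, depth, f / 2.0))
        hom = (1.0 - het_weight) * math.exp(log_binom_pmf(b_count, depth, error_rate))
        total += math.log(het + hom)
    return total


def estimate_fetal_fraction(sites, steps=500):
    """Grid search for the f in (0, 0.5] that maximizes the likelihood."""
    grid = [i / 1000.0 for i in range(1, steps + 1)]
    return max(grid, key=lambda f: log_likelihood(f, sites))


# Made-up counts: four sites with a B-allele fraction near 0.05 (f near 0.10)
# and four sites where B reads are rare enough to be sequencing errors.
toy_sites = [(9, 200), (10, 200), (11, 200), (10, 200),
             (0, 200), (1, 200), (0, 200), (2, 200)]
estimate = estimate_fetal_fraction(toy_sites)
print(f"estimated fetal fraction: {estimate:.3f}")
```

A fuller model would also cover heterozygous maternal genotypes and estimate the mixing weights, and the per-site posterior of the AB component (via Bayes' rule) is what would flag the informative SNPs.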
STORI: selectable taxon ortholog retrieval iteratively
Stern, Joshua Gallant, 08 June 2015
Speciation and gene duplication are fundamental evolutionary processes that enable biological innovation. For over a decade, biologists have endeavored to distinguish orthology (homology caused by speciation) from paralogy (homology caused by duplication). Disentangling orthology and paralogy is useful to diverse fields such as phylogenetics, protein engineering, and genome content comparison.
A common step in ortholog detection is the computation of Bidirectional Best Hits (BBH). However, we found this computation impractical for more than 24 Eukaryotic proteomes. Attempting to retrieve orthologs in less time than previous methods require, we developed a novel algorithm and implemented it as a suite of Perl scripts. This software, Selectable Taxon Ortholog Retrieval Iteratively (STORI), retrieves orthologous protein sequences for a set of user-defined proteomes and query sequences. While the time complexity of the BBH method is O(#taxa^2), we found that the average CPU time used by STORI may increase linearly with the number of taxa.
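For readers unfamiliar with the BBH step, here is a minimal Python sketch of bidirectional best hits between two proteomes, and of why the all-against-all comparison grows quadratically with the number of taxa. It does not reproduce STORI, which the abstract describes as a suite of Perl scripts. The score dictionaries stand in for alignment scores from a search tool, and all protein identifiers, scores and function names are hypothetical.

```python
from itertools import combinations


def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein.

    scores: dict mapping (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


def taxon_pairs(taxa):
    """Every unordered pair of taxa needs its own comparison: n * (n - 1) / 2."""
    return list(combinations(taxa, 2))


# Made-up scores between proteome A and proteome B.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b1"): 120.0}
b_vs_a = {("b1", "a1"): 245.0, ("b1", "a3"): 110.0,
          ("b2", "a2"): 175.0, ("b2", "a1"): 95.0}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
print(len(taxon_pairs(range(24))))
```

In this toy data `a3` has `b1` as its best hit, but `b1` prefers `a1`, so the pair is not reported; only reciprocal agreements count.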
To demonstrate one aspect of STORI’s usefulness, we used this software to infer the orthologous sequences of 26 ribosomal proteins (rProteins) from the large ribosomal subunit (LSU), for a set of 115 Bacterial and 94 Archaeal proteomes. Next, we used established tree-search methods to seek the most probable evolutionary explanation of these data. The current implementation of STORI runs on Red Hat Enterprise Linux 6.0 with installations of Moab 5.3.7, Perl 5 and several Perl modules. STORI is available at: <http://github.com/jgstern/STORI>.
Medium chain dehydrogenases/reductases : alcohol dehydrogenases of novel types / Norin, Annika, January 1900
Includes 7 papers.
Nuclear receptors studied by molecular dynamics computer simulations / Carlsson, Peter, January 2004
Includes 4 papers.
Molecular mechanisms of local anaesthetic action on voltage-gated ion channels / Nilsson, Johanna, January 2004
Includes 4 papers.
Biological activities of novel platelet-derived growth factors, PDGF-C and PDGF-D / Pontén, Annica, January 2004
Includes 4 papers.
Characterization of a putative tumor suppressor region identified by the elimination test on human 3p21.3 / Kiss, Hajnalka, January 2003
Includes 6 papers.
Genome-plasticity and adaptation in Helicobacter pylori / Nilsson, Christina, January 2005
Includes 4 papers.
Structure and function of the yeast mediator tail domain / Béve, Jenny, January 2006
Includes 5 papers.
