51. Haploid Selection in Animals
Nettelblad, Jessica, January 2018
Haploid selection in animal sperm is a somewhat controversial topic, but recent evidence might shed experimental light on the matter. This thesis investigates the possibility of detecting genetic selection in an artificial setting for zebrafish sperm from a single individual. I analyse pooled data acquired from whole-genome sequencing of two distinct groups of short- and long-lived sperm, trying to identify shifts in allele frequencies. I augment this by designing an accurate computer simulation of selection that manipulates selection strength and takes biological aspects such as linkage and sequence coverage into account. This allows large-scale testing and the generation of null distributions for any test metric. The main conclusion is that selection has to be extremely strong to be detectable unless one explicitly accounts for genetic linkage, as opposed to the straightforward per-marker approaches that formed the initial basis for our analyses.
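A minimal sketch of the kind of per-marker simulation described above, assuming independent markers, a heterozygous donor and binomial sequencing noise; all names, parameters and values are illustrative, not the thesis's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pools(n_markers=10_000, n_sperm=5_000, coverage=30, s=0.0):
    """Simulate pooled allele frequencies for short- and long-lived sperm pools.

    s is the selection coefficient acting on the alternative allele in the
    long-lived pool; s=0 gives the null model. Linkage is ignored here for
    brevity (markers are treated as independent), unlike the full simulation.
    """
    # A single heterozygous individual: expected allele frequency 0.5 everywhere.
    p0 = np.full(n_markers, 0.5)
    # Selection shifts the true frequency in the surviving (long-lived) pool.
    p_long = p0 * (1 + s) / (p0 * (1 + s) + (1 - p0))
    # Finite sperm pools, then binomial read sampling at the given coverage.
    f_short = rng.binomial(coverage, rng.binomial(n_sperm, p0) / n_sperm) / coverage
    f_long = rng.binomial(coverage, rng.binomial(n_sperm, p_long) / n_sperm) / coverage
    return f_short, f_long

# Per-marker test statistic: absolute allele-frequency difference between pools.
null_stats = np.abs(np.subtract(*simulate_pools(s=0.0)))
obs_stats = np.abs(np.subtract(*simulate_pools(s=0.2)))
threshold = np.quantile(null_stats, 0.999)          # empirical null cutoff
print("markers exceeding null threshold:", (obs_stats > threshold).sum())
```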
52. A Rank Score Model of Variants Prioritization for Rare Disease
Liu, Nanxing, January 2023
The diagnosis of genetic illnesses has undergone a revolution with advancements in sequencing technology. Next-generation sequencing (NGS) has become a standard practice in genetic diagnostics, enabling the identification of various genetic variations. However, distinguishing causative variants from a vast number of benign background variants presents a significant challenge. This study focuses on improving the rank score model used in genetic rare-disease diagnostics at a clinical genomics facility in Stockholm. The objective is to develop a more effective and optimized model using exploratory data analysis techniques and machine learning methods, investigating the strengths and weaknesses of various existing annotation scores to identify suitable features and enhance the model's classification performance. The research methodology involved analyzing publicly available ClinVar data, utilizing statistical methods such as principal component analysis (PCA), heatmaps, Welch's t-test, and the chi-square test to evaluate the correlation, patterns, and classification abilities of different variant types. In addition, the study employed a machine learning approach that combines allele frequency filtering and logistic regression trained on both public and in-house datasets to prioritize single nucleotide variants (SNVs) and insertions/deletions (InDels). The resulting model assigns binary class labels (benign or pathogenic) and provides scores for variant classification. Promising performance was observed in both the ClinVar dataset and the unique patient datasets, demonstrating the model's potential for clinical application. The findings of this study hold the potential to enhance genetic rare-disease diagnostics and contribute to advancements in rare disease research.
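As a rough illustration of the two-step idea (allele-frequency filtering followed by a supervised rank score), a hedged scikit-learn sketch follows; the table, column names, feature set and thresholds are synthetic stand-ins, not the facility's actual annotations or model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an annotated variant table; real input would be
# ClinVar / in-house variants with their existing annotation scores.
rng = np.random.default_rng(0)
n = 5000
cadd = rng.normal(15, 8, n)
revel = rng.uniform(0, 1, n)
conservation = rng.normal(0, 1, n)
logit = 0.15 * (cadd - 15) + 3 * (revel - 0.5) + 0.5 * conservation
variants = pd.DataFrame({
    "gnomad_af": rng.beta(0.2, 5, n),                 # population frequency
    "cadd": cadd, "revel": revel, "conservation": conservation,
    "pathogenic": (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int),
})

# Step 1: allele-frequency filtering - common variants are treated as
# benign background and removed before scoring.
rare = variants[variants["gnomad_af"] < 0.01].copy()

# Step 2: logistic regression on the remaining annotation scores.
features = ["cadd", "revel", "conservation"]
X_tr, X_te, y_tr, y_te = train_test_split(
    rare[features], rare["pathogenic"], stratify=rare["pathogenic"], random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# The predicted probability serves as a continuous rank score; a 0.5 cutoff
# gives the binary benign/pathogenic label.
rank_score = model.predict_proba(X_te)[:, 1]
print("held-out ROC AUC:", round(roc_auc_score(y_te, rank_score), 3))
```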
53. Identifying Graph Characteristics in Growing Vascular Networks
Plummer, Christopher Finn, January 2024
One of the ways that a vascular network grows is through the process of angiogenesis, whereby a new blood vessel forms as a branch from an existing vessel towards an area which is stimulating vascular growth. Due to the demands for nutrients and waste transport, growing tumour cells will access the surrounding vascular network by inducing angiogenesis. Once the tumour is connected with the vascular system it can grow further and colonize distant organs. Given the critical nature of this step in tumour development, there is a demand for mathematical and computational models to provide an understanding of the process for treatment in predictive medicine. These models allow us to generate vascular networks that demonstrate similar behaviour to that of the observed networks; however, there is a lack of quantifiable measures of similarity between generated networks, or between a generated and a real network. Furthermore, there is no established way to determine which measures hold the most relevance for distinguishing similarity. To construct such a measure we transform our generated vascular networks into an abstract graph representation which allows exploration of the plethora of graph centralities. We propose to determine the relevance of a centrality by finding one that acts as a synthetic likelihood function for estimating the model's parameters with minimal error. By evaluating the relevance of many centralities, it is then possible to suggest which centralities should be used to quantitatively determine similarity. This provides a way to measure how realistic a model's growth is and, given sufficient data, to distinguish between regular and tumour-induced angiogenesis for use in cancer screening.
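A small sketch of how a graph-centrality summary could be computed and compared between a generated and a reference network, here with NetworkX and toy random-geometric graphs standing in for actual vascular networks; the choice of centralities and summary statistics is illustrative, not the thesis's:

```python
import numpy as np
import networkx as nx

def centrality_summary(G):
    """Summarize candidate centralities of a vascular-network graph as a
    feature vector, e.g. for use inside a synthetic-likelihood comparison."""
    measures = {
        "degree": nx.degree_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
        "closeness": nx.closeness_centrality(G),
    }
    # Mean and standard deviation of each centrality over all nodes.
    return np.array([f(list(m.values()))
                     for m in measures.values()
                     for f in (np.mean, np.std)])

# Toy stand-ins for a generated and a reference network; a real analysis
# would build these graphs from simulated and observed vessel segments.
generated = nx.random_geometric_graph(200, 0.12, seed=1)
reference = nx.random_geometric_graph(200, 0.12, seed=2)

# Distance between summary vectors: the quantity a synthetic-likelihood
# parameter search would try to minimize.
print(np.linalg.norm(centrality_summary(generated) - centrality_summary(reference)))
```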
54. From Sequence to Structure: Using predicted residue contacts to facilitate template-free protein structure prediction
Michel, Mirco, January 2017
Despite the fundamental role of experimental protein structure determination, computational methods are of essential importance to bridge the ever-growing gap between available protein sequence and structure data. Common structure prediction methods rely on experimental data, which is not available for about half of the known protein families. Recent advancements in amino acid contact prediction have revolutionized the field of protein structure prediction. Contacts can be used to guide template-free structure predictions that do not rely on experimentally solved structures of homologous proteins. Such methods are now able to produce accurate models for a wide range of protein families. We developed PconsC2, an approach that improved existing contact prediction methods by recognizing intra-molecular contact patterns and reducing noise. An inherent problem of contact prediction based on maximum entropy models is that large alignments with over 1000 effective sequences are needed to infer contacts accurately. These are, however, not available for more than 80% of all protein families that do not have a representative structure in the PDB. With PconsC3, we could extend the applicability of contact prediction to families as small as 100 effective sequences by combining global inference methods with machine learning based on local pairwise measures. By introducing PconsFold, a pipeline for contact-based structure prediction, we could show that improvements in contact prediction accuracy translate into more accurate models. Finally, we applied a similar technique to Pfam, a comprehensive database of known protein families. In addition to using a faster folding protocol we employed model quality assessment methods, crucial for estimating the confidence in the accuracy of predicted models. We propose models to be accurate for 558 families that do not have a representative known structure. Out of those, over 75% have not been reported before.
At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 2: Submitted. Paper 4: In press.
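For orientation, the "effective sequences" referred to above is commonly computed by down-weighting near-identical alignment members; a simple sketch of one such convention (the exact weighting and identity cutoff used in the papers may differ) is:

```python
import numpy as np

def n_effective(msa, identity_cutoff=0.8):
    """Effective number of sequences in a multiple sequence alignment.

    Each sequence is down-weighted by the number of alignment members that
    share at least `identity_cutoff` fractional identity with it (a common
    convention; the thesis's exact definition may differ).
    msa: (n_sequences, n_columns) array-like of residue characters.
    """
    msa = np.asarray(msa)
    n, _ = msa.shape
    # Pairwise fractional identity over alignment columns.
    identity = np.array([[np.mean(msa[i] == msa[j]) for j in range(n)]
                         for i in range(n)])
    cluster_sizes = (identity >= identity_cutoff).sum(axis=1)
    return float(np.sum(1.0 / cluster_sizes))

toy_msa = [list("MKVLA"), list("MKVLA"), list("MRVLA"), list("QRWTS")]
print(n_effective(toy_msa))   # three near-identical sequences count roughly once
```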
55. Computational discovery of DNA methylation patterns as biomarkers of ageing, cancer, and mental disorders: Algorithms and Tools
Torabi Moghadam, Behrooz, January 2017
Epigenetics refers to mitotically heritable modifications in gene expression without a change in the genetic code. A combination of molecular, chemical and environmental factors constituting the epigenome is involved, together with the genome, in setting up the unique functionality of each cell type. DNA methylation is the most studied epigenetic mark in mammals, where a methyl group is added to the cytosine in a cytosine-phosphate-guanine dinucleotide, or CpG site. It has been shown to play a major role in various biological phenomena such as chromosome X inactivation, regulation of gene expression, cell differentiation, and genomic imprinting. Furthermore, aberrant patterns of DNA methylation have been observed in various diseases, including cancer. In this thesis, we have utilized machine learning methods and developed new methods and tools to analyze DNA methylation patterns as biomarkers of ageing, cancer subtypes and mental disorders. In Paper I, we introduced a pipeline of Monte Carlo Feature Selection and rule-based modeling using ROSETTA in order to identify combinations of CpG sites that classify samples into different age intervals based on their DNA methylation levels. The combinations of genes that appeared to act together motivated us to develop an interactive pathway browser, named PiiL, to check the methylation status of multiple genes in a pathway. The tool enhances the detection of differential patterns of DNA methylation and/or gene expression by quickly assessing large data sets. In Paper III, we developed a novel unsupervised clustering method, methylSaguaro, for analyzing various types of cancers to detect cancer subtypes based on their DNA methylation patterns. Using this method we confirmed previously reported findings that challenge the histological grouping of the patients, and proposed new subtypes based on DNA methylation patterns. In Paper IV, we investigated the DNA methylation patterns in a cohort of schizophrenic and healthy samples, using all the methods that were introduced and developed in the first three papers.
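A toy sketch of the Monte Carlo feature selection idea, repeatedly training small classifiers on random CpG subsets and accumulating importances; it is not the Paper I pipeline (which uses MCFS together with ROSETTA rule models), only an illustration of the resampling principle, and all parameters are made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def monte_carlo_feature_selection(X, y, n_rounds=500, subset_size=50):
    """Repeatedly train small trees on random CpG subsets and accumulate how
    strongly each CpG contributes to classification across the rounds."""
    n_features = X.shape[1]
    importance = np.zeros(n_features)
    for _ in range(n_rounds):
        cols = rng.choice(n_features, size=subset_size, replace=False)
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(X[:, cols], y)
        importance[cols] += tree.feature_importances_
    return importance / n_rounds

# Simulated methylation beta values (samples x CpG sites) and age classes.
X = rng.uniform(0, 1, size=(120, 2000))
y = rng.integers(0, 3, size=120)          # e.g. three age intervals
ranked_cpgs = np.argsort(monte_carlo_feature_selection(X, y))[::-1]
print("top CpG indices:", ranked_cpgs[:10])
```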
56. Detection of artefacts in FFPE-sample sequence data
Swenson, Hugo, January 2019
Next generation sequencing is increasingly used as a diagnostic tool in the clinical setting. This is driven by the vast increase in molecular targeted therapy, which requires detailed information on what genetic variants are present in patient samples. In the hospital setting, most cancer diagnostics are based on Formalin Fixed Paraffin Embedded (FFPE) samples. The FFPE routine is very beneficial for logistical purposes and for some histopathological analyses, but creates problems for molecular diagnostics based on DNA. These problems derive from sample immersion in formalin, which results in DNA fragmentation, interstrand DNA crosslinking and sequence artefacts due to hydrolytic deamination. Distinguishing such artefacts from true somatic variants can be challenging, thus affecting both research and clinical analyses. In order to distinguish FFPE artefacts from true variants in next generation sequencing data from FFPE samples, I developed the novel program FUSAC (FFPE tissue UMI based Sequence Artefact Classifier) for the facility Clinical Genomics in Uppsala. FUSAC utilizes Unique Molecular Identifiers (UMIs) to identify and group sequencing reads based on their molecule of origin. By using UMIs to collapse duplicate paired reads into consensus reads, FFPE artefacts are classified through comparative analysis of the positive and negative strand sequences. My findings indicate that FUSAC can successfully distinguish UMI-tagged next generation sequencing reads with FFPE artefacts from sequencing reads with true variants. FUSAC thus presents a novel approach for studying FFPE artefacts in bioinformatic pipelines.
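A heavily simplified illustration of the strand-comparison idea behind UMI-based artefact classification is given below; it is not FUSAC's implementation, and the data structures, read tuples and decision rule are invented for the example:

```python
from collections import Counter, defaultdict

def consensus(bases):
    """Majority base of a UMI read family (ties resolved arbitrarily)."""
    return Counter(bases).most_common(1)[0][0] if bases else None

def classify_site(reads, ref_base):
    """Reads covering one site are grouped by UMI and strand; a substitution
    supported by only one strand's consensus (typically C>T or G>A from
    hydrolytic deamination) is flagged as a putative FFPE artefact.

    reads: iterable of (umi, strand, base) with strand in {'+', '-'}.
    """
    families = defaultdict(list)
    for umi, strand, base in reads:
        families[(umi, strand)].append(base)
    by_strand = defaultdict(set)
    for (umi, strand), bases in families.items():
        by_strand[strand].add(consensus(bases))
    plus_alt = by_strand["+"] - {ref_base, None}
    minus_alt = by_strand["-"] - {ref_base, None}
    if plus_alt and minus_alt:
        return "true variant candidate"
    if plus_alt or minus_alt:
        return "possible FFPE artefact"
    return "reference"

reads = [("AACG", "+", "T"), ("AACG", "+", "T"), ("GTTA", "-", "C"), ("CCAT", "-", "C")]
print(classify_site(reads, ref_base="C"))   # alt allele only on '+': possible artefact
```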
57. High performance reconfigurable architectures for biological sequence alignment
Isa, Mohammad Nazrin, January 2013
Bioinformatics and computational biology (BCB) is a rapidly developing multidisciplinary field which encompasses a wide range of domains, including genomic sequence alignment, a fundamental tool in molecular biology for searching for homology between sequences. Sequence alignments are currently gaining close attention due to their great impact on quality-of-life aspects such as facilitating early disease diagnosis, identifying the characteristics of a newly discovered sequence, and drug engineering. With the vast growth of genomic data, searching for sequence homology over huge databases (often measured in gigabytes) cannot produce results within a realistic time, hence the need for acceleration. Since the exponential increase of biological databases as a result of the Human Genome Project (HGP), supercomputers and other parallel architectures such as special purpose Very Large Scale Integration (VLSI) chips, Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) have become popular acceleration platforms. Nevertheless, there are always trade-offs between area, speed, power, cost, development time and reusability when selecting an acceleration platform. FPGAs generally offer more flexibility, higher performance and lower overheads. However, they suffer from a relatively low-level programming model compared with off-the-shelf microprocessors such as standard microprocessors and GPUs. Due to the aforementioned limitations, the need has arisen for optimized FPGA core implementations, which are crucial for this technology to become viable in high performance computing (HPC). This research proposes the use of state-of-the-art reprogrammable system-on-chip technology on FPGAs to accelerate three widely used sequence alignment algorithms: the Smith-Waterman algorithm with affine gap penalty, the profile hidden Markov model (HMM) algorithm and the Basic Local Alignment Search Tool (BLAST) algorithm. The three novel aspects of this research are, firstly, that the algorithms are designed and implemented in hardware, with each core achieving the highest performance compared to the state of the art. Secondly, an efficient scheduling strategy based on the double buffering technique is adopted in the hardware architectures: when the alignment matrix computation task is overlapped with the PE configuration in a folded systolic array, the overall throughput of the core is significantly increased. This is due to the bounded PE configuration time and the parallel PE configuration approach, irrespective of the number of PEs in a systolic array. In addition, the use of only two configuration elements in the PE optimizes hardware resources and enables the scalability of PE systolic arrays without relying on restricted on-board memory resources. Finally, a new performance metric is devised which facilitates the effective comparison of design performance between different FPGA devices and families. The normalized performance indicator (speed-up per area per process technology) factors out the area and lithography advantages of any FPGA, resulting in fairer comparisons. The cores have been designed using Verilog HDL and prototyped on the Alpha Data ADM-XRC-5LX card with the Virtex-5 XC5VLX110-3FF1153 FPGA.
The implementation results show that the proposed architectures achieved giga cell updates per second (GCUPS) performances of 26.8, 29.5 and 24.2 for the acceleration of the Smith-Waterman with affine gap penalty algorithm, the profile HMM algorithm and the BLAST algorithm respectively. In terms of speed-up improvements, the designed cores were compared against their corresponding software implementations and against reported FPGA implementations. In the case of comparison with equivalent software execution, acceleration of the optimal alignment algorithm in hardware yielded an average speed-up of 269x compared to the SSEARCH 35 software. For the profile HMM-based sequence alignment, the designed core achieved speed-ups of 103x and 8.3x against HMMER 2.0 and the latest version of HMMER (version 3.0) respectively. In addition, the implementation of the gapped BLAST with the two-hit method in hardware achieved a greater than tenfold speed-up compared to the latest NCBI BLAST software. For comparison against other reported FPGA implementations, the proposed normalized performance indicator was used to evaluate the designed architectures fairly. The results showed that the first architecture achieved more than 50 percent improvement, while acceleration of the profile HMM sequence alignment in hardware gained a normalized speed-up of 1.34. In the case of the gapped BLAST with the two-hit method, the designed core achieved an 11x speed-up after factoring out the advantages of the Virtex-5 FPGA. In addition, further analysis was conducted in terms of cost and power performance; the core achieved 0.46 MCUPS per dollar spent and 958.1 MCUPS per watt. This shows that FPGAs can be an attractive platform for high performance computation, offering a smaller area footprint and representing an economical ‘green’ solution compared to other acceleration platforms. Higher throughput can be achieved by redeploying the cores on newer, bigger and faster FPGAs with minimal design effort.
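As a software point of reference for what the first core accelerates, a plain-Python sketch of Smith-Waterman local alignment with affine gap penalties (the Gotoh recurrences) follows; each inner-loop iteration corresponds to one cell update counted by the (G)CUPS metric, and the scoring parameters are illustrative:

```python
import numpy as np

def smith_waterman_affine(a, b, match=2, mismatch=-3, gap_open=-5, gap_extend=-1):
    """Reference (non-accelerated) local alignment score with affine gaps.

    H holds the best local score ending at (i, j); E and F hold scores for
    alignments ending in a gap in sequence a or b respectively.
    """
    n, m = len(a), len(b)
    NEG = -1e9
    H = np.zeros((n + 1, m + 1))
    E = np.full((n + 1, m + 1), NEG)
    F = np.full((n + 1, m + 1), NEG)
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i, j] = max(E[i, j - 1] + gap_extend, H[i, j - 1] + gap_open)
            F[i, j] = max(F[i - 1, j] + gap_extend, H[i - 1, j] + gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i, j] = max(0.0, H[i - 1, j - 1] + s, E[i, j], F[i, j])
            best = max(best, H[i, j])
    return best

print(smith_waterman_affine("HEAGAWGHEE", "PAWHEAE"))
```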
58. Evaluation and visualization of complexity in parameter setting in automotive industry
Lunev, Alexey, January 2018
Parameter setting is a process primarily used to specify in what kind of vehicle an electronic control unit of each type is used. This thesis investigates whether the current strategy for measuring complexity gives users satisfactory results. The strategy consists of structure-based algorithms that are an essential part of the Complexity Analyzer, a prototype application used to evaluate the complexity. The results described in this work suggest that the currently implemented algorithms have to be properly defined and adapted before they can be used for parameter setting. Moreover, the measurements that the algorithms output have been analyzed in more detail, making the results easier to interpret. It has been shown that a typical parameter setting file can be regarded as a tree structure. To measure variation in this structure a new concept, called Path entropy, has been formulated, tested and implemented. The main disadvantage of the original version of the Complexity Analyzer application is its lack of user-friendliness. Therefore, a web version of the application based on the Model-View-Controller pattern has been developed. Unlike the original version, it includes a user interface, and it takes just a couple of seconds to see the visualization of the data, compared to the several minutes needed to run the original application.
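The abstract does not give the exact definition of Path entropy; one plausible reading, sketched below, is the Shannon entropy of the distribution of distinct root-to-leaf paths in the parameter-setting tree (the node names and tree layout are invented for illustration, and the thesis's actual formulation may differ):

```python
import math
from collections import Counter

def path_entropy(root):
    """Shannon entropy of the distribution of distinct root-to-leaf paths.

    A file where every unit follows the same path has entropy 0; n equally
    frequent distinct paths give log2(n)."""
    paths = []

    def walk(node, prefix):
        children = node.get("children", [])
        if not children:
            paths.append(tuple(prefix + [node["name"]]))
            return
        for child in children:
            walk(child, prefix + [node["name"]])

    walk(root, [])
    counts = Counter(paths)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

tree = {"name": "ecu", "children": [
    {"name": "variantA", "children": [{"name": "param1"}, {"name": "param2"}]},
    {"name": "variantB", "children": [{"name": "param1"}]},
]}
print(path_entropy(tree))   # three distinct leaf paths -> log2(3) ≈ 1.585
```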
59. Development of a hierarchical k-selecting clustering algorithm – application to allergy
Malm, Patrik, January 2007
The objective of this Master's thesis was to develop, implement and evaluate an iterative procedure for hierarchical clustering with good overall performance which also merges features of certain already described algorithms into a single integrated package. The resulting tool was then applied to an allergen IgE-reactivity data set. The implemented algorithm uses a hierarchical approach which illustrates the emergence of patterns in the data. At each level of the hierarchical tree a partitional clustering method is used to divide the data into k groups, where the number k is decided through the application of cluster validation techniques. The cross-reactivity analysis, by means of the new algorithm, largely arrives at the anticipated cluster formations in the allergen data, which strengthen results obtained through previous studies on the subject. Notably, though, certain unexpected findings presented in the former analysis were aggregated differently, and more in line with phylogenetic and protein family relationships, by the novel clustering package.
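A compact sketch of the hierarchical k-selecting idea, choosing k at each level with a cluster validation index (silhouette here, though the thesis may combine several) and splitting with k-means before recursing; the data is a synthetic stand-in for an IgE-reactivity matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def k_selecting_cluster(X, indices=None, depth=0, max_depth=3, k_range=range(2, 6)):
    """At each level, pick the k that maximizes the validation index, split
    the current group with k-means, and recurse into each subgroup."""
    if indices is None:
        indices = np.arange(len(X))
    if depth >= max_depth or len(indices) < 2 * max(k_range):
        return {"members": indices.tolist()}
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[indices])
        scores[k] = silhouette_score(X[indices], labels)
    best_k = max(scores, key=scores.get)
    labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X[indices])
    return {"k": best_k,
            "children": [k_selecting_cluster(X, indices[labels == c], depth + 1,
                                             max_depth, k_range)
                         for c in range(best_k)]}

# Toy stand-in for an allergen IgE-reactivity matrix (allergens x patients).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(30, 8)) for loc in (0, 3, 6)])
print(k_selecting_cluster(X)["k"])   # expected to recover the 3 top-level groups
```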
60. Evaluation and Development of Methods for Identification of Biochemical Networks / Evaluering och utveckling av metoder för identifiering av biokemiska nätverk
Jauhiainen, Alexandra, January 2005
Systems biology is an area concerned with understanding biology on a systems level, where the structure and dynamics of the system are in focus. Knowledge about the structure and dynamics of biological systems is fundamental information about cells and the interactions within them, and it also plays an increasingly important role in medical applications. System identification deals with the problem of constructing a model of a system from data, and an extensive theory exists, particularly for the identification of linear systems. This is a master's thesis in systems biology treating the identification of biochemical systems. Methods based on both local parameter perturbation data and time series data have been tested and evaluated in silico. The advantage of the local parameter perturbation data methods proved to be that they demand less complex data, but the drawbacks are the reduced information content of this data and sensitivity to noise. Methods employing time series data are generally more robust to noise, but the lack of available data limits their use. The work was conducted at the Fraunhofer-Chalmers Research Centre for Industrial Mathematics in Göteborg, and at the Division of Computational Biology at the Department of Physics and Measurement Technology, Biology, and Chemistry at Linköping University during the autumn of 2004.
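As a minimal illustration of time-series-based identification, the sketch below assumes approximately linear dynamics and recovers the interaction matrix by least squares on finite-difference derivatives; real biochemical networks are nonlinear, so this only captures the kind of local structure such methods aim to identify, and the example network is invented:

```python
import numpy as np

def identify_linear_network(X, dt):
    """Assume dx/dt ≈ A x near a reference state and estimate the interaction
    matrix A by least squares on finite-difference derivatives of the series."""
    dXdt = (X[1:] - X[:-1]) / dt          # finite-difference derivatives
    X_mid = (X[1:] + X[:-1]) / 2.0        # states at the interval midpoints
    A, *_ = np.linalg.lstsq(X_mid, dXdt, rcond=None)
    return A.T                            # A[i, j]: effect of species j on i

# Simulate a small 3-species linear network and try to recover it.
A_true = np.array([[-1.0, 0.5, 0.0],
                   [0.0, -0.8, 0.4],
                   [0.3, 0.0, -1.2]])
dt, steps = 0.01, 2000
X = np.zeros((steps, 3))
X[0] = [1.0, 0.5, 0.2]
for t in range(steps - 1):
    X[t + 1] = X[t] + dt * A_true @ X[t]   # Euler-integrated time series

print(np.round(identify_linear_network(X, dt), 2))   # approximately A_true
```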