  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Neighborhood-Oriented feature selection and classification of Duke’s stages on colorectal Cancer using high density genomic data.

Peng, Liang January 1900
Master of Science / Department of Statistics / Haiyan Wang / The selection of relevant genes for classifying disease phenotypes from gene expression data has been studied extensively. Previously, most relevant-gene selection was conducted on individual genes with limited sample sizes. Modern technology makes it possible to obtain microarray data with higher resolution along the chromosomes. Considering gene sets over an entire block of a chromosome, rather than individual genes, could help reveal important connections between relevant genes and disease phenotypes. In this report, we consider feature selection and classification that take into account the spatial location of probe sets when classifying Duke’s stages B and C using DNA copy number data or gene expression data from colorectal cancers. A novel method for feature selection was presented in this report. A chromosome was first partitioned into blocks after the probe sets were aligned along their chromosome locations. Then a test of interaction between Duke’s stage and probe sets was conducted on each block of probe sets to select significant blocks. For each significant block, a new multiple comparison procedure was carried out to identify truly relevant probe sets while preserving the neighborhood location information of the probe sets. Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) classification using the final selected probe sets was conducted for all samples. The Leave-One-Out Cross Validation (LOOCV) estimate of accuracy is reported as an evaluation of the selected features. We applied the method to two large data sets, each containing more than 50,000 features. The proposed procedure, combined with SVM or KNN, achieved excellent classification accuracy on both data sets, even though classifying prognosis stages (Duke’s stages B and C) is much more difficult than distinguishing normal from tumor samples.
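The evaluation step mentioned in this abstract (a KNN classifier scored by leave-one-out cross validation) can be sketched roughly as follows. This is a minimal illustration only, not the report's implementation: the helper names `knn_predict` and `loocv_accuracy`, the choice of k = 3, the Euclidean distance, and the toy two-feature data are all assumptions made for the example, and the block-wise feature selection is not reproduced.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Rank training samples by Euclidean distance to the query point.
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def loocv_accuracy(xs, ys, k=3):
    """Leave-one-out cross validation: hold out each sample once, train on the rest."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical toy data: two well-separated groups standing in for stages "B" and "C".
xs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
      (2.0, 2.1), (2.2, 1.9), (1.9, 2.3), (2.1, 2.0)]
ys = ["B"] * 4 + ["C"] * 4

accuracy = loocv_accuracy(xs, ys, k=3)
print(f"LOOCV accuracy: {accuracy:.2f}")
```

With tens of thousands of features and few samples, as in the data sets described, the LOOCV estimate is only as trustworthy as the feature-selection step that precedes it, which is why the abstract reports it as an evaluation of the selected features.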
12

Protecting Privacy: Automatic Compression and Encryption of Next-Generation Sequencing Alignment Data

Gustafsson, Wiktor January 2019
As the field of next-generation sequencing (NGS) matures and the technology grows more advanced, it is becoming an increasingly strong tool for solving various biological problems. Harvesting and analysing the full genomic sequence of an individual and comparing it to a reference genome can reveal information about detrimental mutations, in particular ones that give rise to diseases such as cancer. At the Rudbeck Laboratory, Uppsala University, a fully automatic software pipeline for somatic mutational analysis of cancer patient sequence data is in development. This will increase the efficiency and accuracy of a process which today consists of several discrete computation steps. In turn, this will reduce the time to result and facilitate making a diagnosis and assigning the optimal treatment for the patient. However, the genomic data of an individual is very sensitive and private, so strong security precautions are required. Moreover, as more and more data are produced, storage space is becoming increasingly valuable, which requires that data are handled and stored as efficiently as possible. In this project, I developed a Python pipeline for automatic compression and encryption of NGS alignment data, which aims to ensure full privacy protection of patient data while maintaining high computational and storage efficiency. The pipeline uses a state-of-the-art real-time compression algorithm combined with an Advanced Encryption Standard cipher. It offers security that meets rigorous modern standards, and performance which at least matches that of existing solutions. The system is made to be easily integrated into the somatic mutation analysis pipeline. This way, the data generated during the analysis, which are too large to be kept in operational memory, can safely be stored to disk.
13

Efficient analysis of complex, multimodal genomic data

Acharya, Chaitanya Ramanuj January 2016
Our primary goal is to better understand complex diseases using statistically disciplined approaches. As multi-modal data streams out of consortium projects like the Genotype-Tissue Expression (GTEx) project, which aims at collecting samples from various tissue sites in order to understand tissue-specific gene regulation, new approaches are needed that can efficiently model groups of data with minimal loss of power. For example, the GTEx project delivers RNA-Seq, microarray gene expression and genotype data (SNP arrays) from a vast number of tissues in a given individual subject. In order to analyze this type of multi-level (hierarchical) multi-modal data, we proposed a series of efficient-score based tests, or score tests, and leveraged groups of tissues or gene isoforms in order to map genomic biomarkers. We model group-specific variability as a random effect within a mixed effects model framework. In one instance, we proposed a score-test based approach to map expression quantitative trait loci (eQTL) across multiple tissues. To do that, we jointly model all the tissues and make use of all the information available to maximize the power of eQTL mapping and investigate an overall shift in gene expression combined with tissue-specific effects due to genetic variants. In the second instance, we showed the flexibility of our model framework by expanding it to include tissue-specific epigenetic data (DNA methylation) and map eQTL by leveraging both tissues and methylation. Finally, we also showed that our methods are applicable to different data types such as whole transcriptome expression data, which is designed to analyze genomic events such as alternative gene splicing.
In order to accomplish this, we proposed two different models that exploit gene expression data of all available gene isoforms within a gene to map biomarkers of interest (either genes or gene sets) in paired early-stage breast tumor samples before and after treatment with external beam radiation. Our efficient score-based approaches have very distinct advantages. They have a computational edge over existing methods because they do not need parameter estimation under the alternative hypothesis. As a result, model parameters only have to be estimated once per genome, significantly decreasing computation time. Also, the efficient score is the locally most powerful test and is guaranteed a theoretical optimality over all other approaches in a neighborhood of the null hypothesis. This theoretical performance is borne out in extensive simulation studies which show that our approaches consistently outperform existing methods both in statistical power and computational speed. We applied our methods to publicly available datasets. It is important to note that all of our methods also accommodate the analysis of next-generation sequencing data. / Dissertation
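For context, the computational advantage claimed in this abstract follows from the general form of a score (Rao) test. The statement below is the textbook version, not the specific mixed-model statistic derived in the dissertation; the symbols $\ell$, $U$, $I$, $\hat\theta_0$ and $r$ are notation assumed here for illustration.

```latex
% Generic score test of H_0 against a larger model with log-likelihood \ell(\theta).
% \hat\theta_0 is the maximum likelihood estimate under H_0 only.
\[
  U(\theta) = \frac{\partial \ell(\theta)}{\partial \theta},
  \qquad
  I(\theta) = -\,\mathbb{E}\!\left[\frac{\partial^2 \ell(\theta)}{\partial \theta\,\partial \theta^{\top}}\right],
\]
\[
  S = U(\hat\theta_0)^{\top}\, I(\hat\theta_0)^{-1}\, U(\hat\theta_0)
  \;\xrightarrow{d}\; \chi^2_{r} \quad \text{under } H_0,
\]
% where r is the number of constraints imposed by H_0.
```

Every quantity in $S$ is evaluated at the null estimate $\hat\theta_0$, so the model never has to be refitted under the alternative for each candidate marker.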
14

Pronostic moléculaire basé sur l'ordre des gènes et découverte de biomarqueurs guidé par des réseaux pour le cancer du sein / Rank-based Molecular Prognosis and Network-guided Biomarker Discovery for Breast Cancer

Jiao, Yunlong 11 September 2017
Breast cancer is the second most common cancer worldwide and the leading cause of cancer death among women. Improving cancer prognosis has been one of the problems of primary interest towards better clinical management and treatment decision making for cancer patients. With the rapid advancement of genomic profiling technologies in the past decades, the easy availability of a substantial amount of genomic data for medical research has motivated the currently popular trend of using computational tools, especially machine learning in the era of data science, to discover molecular biomarkers that improve prognosis. This thesis follows two lines of approach intended to address two major challenges arising in genomic data analysis for breast cancer prognosis from a methodological standpoint of machine learning: rank-based approaches for improved molecular prognosis and network-guided approaches for enhanced biomarker discovery. Furthermore, the methodologies developed and investigated in this thesis, pertaining respectively to learning with rank data and learning on graphs, contribute significantly to several branches of machine learning, with applications including but not limited to cancer biology and social choice theory.
