1 |
A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data
Zuber, Verena, 17 December 2012
In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine, which depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease or the molecular changes related to disease. From a statistical point of view, the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under investigation, e.g. genes, genetic variation, or metabolites; and we refer to observations as the specific samples whose attributes we investigate, e.g. patients and controls. Variable selection in high-dimensional omics data is challenging because of the characteristic structure of such data. For one, omics data is high-dimensional, comprising cellular information in unprecedented detail. Moreover, there is an intricate correlation structure among the variables due to, e.g., internal cellular regulation or external latent factors. Variable selection for uncorrelated data is well established. In contrast, there is no consensus on how to approach variable selection under correlation.
Here, we introduce a multivariate framework for variable selection that explicitly accounts for the correlation among markers. In particular, we present two novel quantities for variable importance: the correlation-adjusted t (CAT) score for classification, and the correlation-adjusted (marginal) correlation (CAR) score for regression. The CAT score is defined as the Mahalanobis-decorrelated t-score vector, and the CAR score as the Mahalanobis-decorrelated correlation between the predictor variables and the outcome. We derive the CAT and CAR scores from a predictive point of view in linear discriminant analysis and regression; both quantities assess the weight of a decorrelated and standardized variable on the prediction rule. Furthermore, we discuss properties of both scores and their relations to established quantities. Above all, the CAT score decomposes Hotelling’s T², and the CAR score decomposes the proportion of variance explained. Notably, the decomposition of total variance into explained and unexplained variance in the linear model can be rewritten in terms of CAR scores.
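The definition of the CAR score and its link to the proportion of variance explained can be illustrated with a minimal sketch. It assumes plain empirical estimates (no shrinkage) and more observations than variables; the function names and the simulated data are made up for illustration and are not taken from the thesis.

```python
import numpy as np

def inv_sqrt_sym(mat):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigval, eigvec = np.linalg.eigh(mat)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

def car_scores(X, y):
    """CAR scores: marginal correlations decorrelated by P^(-1/2).

    P is the empirical correlation matrix among the predictors.
    """
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized predictors
    ys = (y - y.mean()) / y.std()               # standardized outcome
    P = Xs.T @ Xs / n                           # correlation among predictors
    p_xy = Xs.T @ ys / n                        # marginal correlations with y
    return inv_sqrt_sym(P) @ p_xy

def r_squared(X, y):
    """Proportion of variance explained by least squares with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

# Simulated, correlated predictors (illustrative data only).
rng = np.random.default_rng(0)
n, d = 200, 5
Z = rng.normal(size=(n, d))
X = Z + 0.8 * Z[:, [0]]                         # shared component induces correlation
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)

omega = car_scores(X, y)
print("squared CAR scores:", np.round(omega ** 2, 3))
print("sum of squared CAR scores:", float(np.sum(omega ** 2)))
print("R^2 from least squares:   ", float(r_squared(X, y)))
```

In this sketch the squared CAR scores add up to the R² of the least-squares fit, so each squared score can be read as the share of explained variance attributed to one variable.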
To render our approach applicable to high-dimensional omics data, we devise an efficient algorithm for shrinkage estimates of the CAT and CAR scores. Subsequently, we conduct extensive simulation studies to investigate the performance of our novel approaches in ranking and prediction under correlation. Here, CAT and CAR scores consistently improve over marginal approaches, selecting more true positives and achieving a lower model error. Finally, we illustrate the application of CAT and CAR scores to real omics data. In particular, we analyze genomics, transcriptomics, and metabolomics data. We ascertain that CAT and CAR scores are competitive with or outperform state-of-the-art techniques in terms of true positives detected and prediction error.
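Why shrinkage is needed in the high-dimensional setting can be sketched as follows. The abstract does not specify the estimator, so this example assumes the simplest form, linear shrinkage of the correlation matrix toward the identity with a hand-picked intensity `lam`; the function name and the data are hypothetical.

```python
import numpy as np

def shrink_correlation(R, lam):
    """Linear shrinkage of a correlation matrix toward the identity.

    lam is a hand-picked intensity in [0, 1]. This is an assumption:
    the abstract does not say how the intensity is chosen.
    """
    d = R.shape[0]
    return lam * np.eye(d) + (1.0 - lam) * R

# More variables than observations, as is typical for omics data.
rng = np.random.default_rng(1)
n, d = 10, 50
X = rng.normal(size=(n, d))

R = np.corrcoef(X, rowvar=False)            # empirical correlation, d x d
R_shrunk = shrink_correlation(R, lam=0.2)

rank_R = np.linalg.matrix_rank(R)
min_eig_shrunk = np.linalg.eigvalsh(R_shrunk).min()
print("rank of empirical correlation matrix:", rank_R, "of", d)
print("smallest eigenvalue after shrinkage:", float(min_eig_shrunk))
```

With fewer observations than variables the empirical correlation matrix is rank-deficient and cannot be inverted, whereas the shrunken matrix has all eigenvalues bounded away from zero, so the decorrelation step remains well defined.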
|
2 |
TOWARDS BIOMARKER DISCOVERY IN CONGENITAL URINARY TRACT OBSTRUCTION
Orton, Dennis, 09 May 2014
Proteome analysis technologies are commonly employed for discovery-based biomarker identification studies. This thesis aims to help bridge the gap between analytical technology development and clinical application by improving and applying a proteomics workflow for biomarker discovery in congenital urinary tract obstruction (UTO). By accentuating the importance of experimental design and evaluating the biological relevance of quantitative proteome analyses, the results of this research provide confidence in a number of identified candidate biomarkers of UTO.
A sensitive method for quantification of proteome samples was developed using temperature controlled reversed-phase liquid chromatography (TPLC). The TPLC system provides high recovery (> 90 %), as well as high accuracy and precision in estimating the concentration across a number of protein sample types (CV < 10 %).
The need for extensive fractionation strategies coupled with LC-MS analysis challenges the throughput of the overall experiment. Development of a dual-column LC-MS interface reduced the total analysis time by a factor of 2 compared with conventional single-column LC-MS systems. The system was applied to a quantitative proteome analysis of proximal tubule cells exposed to mechanical stretch, mimicking the conditions they experience during UTO, and to a urinary exosomal proteome analysis for candidate biomarker identification of this disease.
A total of 1636 proteins were identified in the whole cell proteome analysis, of which 317 were found to be significantly altered in abundance. Analysis of the urinary exosomal proteome yielded 318 proteins, of which 189 were found to be altered in abundance due to obstruction. Western blot confirmation of a few select proteins provided backing to the quantitative proteome analysis, while gene ontology and KEGG pathway analysis yielded functional information.
The results from the quantitative analyses of the urinary exosomes and proximal tubule cells identified candidates for both diagnosis and prognosis of UTO. In addition, activation of a novel pathway was identified, presenting a potential drug target which could be exploited to improve recovery of children following relief of UTO. This thesis therefore contributes useful technological and methodological advancements towards routine proteome analysis, as well as providing candidate biomarker identification for the leading cause of renal functional loss in children.
|
3 |
Rule-based Risk Monitoring Systems for Complex Datasets
Haghighi, Mona, 28 June 2016
In this dissertation, we present rule-based machine learning methods for solving problems with high-dimensional or complex datasets. We apply decision tree methods to blood-based biomarkers and neuropsychological tests to predict Alzheimer’s disease in its early stages. We also use tree-based methods to identify disparities in dementia-related biomarkers among three female ethnic groups. In another part of this research, we use rule-based methods to identify homogeneous subgroups of subjects who share the same risk patterns within a heterogeneous population. Finally, we apply a network-based method to reduce the dimensionality of a clinical dataset while capturing the interactions among variables. The results show that the proposed methods are efficient and easy to use in comparison with current machine learning methods.
|
4 |
Selecting Biomarkers for Pluripotency and Alzheimer's Disease: The Real Strength of the GA/SVM
Scheubert, Lena, 16 October 2012
Pluripotency and Alzheimer's disease are two very different biological states. Even so, they are alike in that little is known about their underlying molecular mechanisms. Identifying important genes that are well suited as biomarkers for these two states improves our understanding. We use different feature selection methods to identify important genes usable as potential biomarkers.
Besides identifying biomarkers for these two specific states, we are also interested in general algorithms that perform well in biomarker detection. For this reason, we compare three feature selection methods with each other. A rarely noticed wrapper approach that combines a genetic algorithm with a support vector machine (GA/SVM) shows particularly good results. More detailed investigations of the results show the strength of the small gene sets selected by our GA/SVM.
In our work we identify a number of promising biomarker candidates for pluripotency as well as for Alzheimer's disease. We also show that the GA/SVM is well suited for feature selection even if its potential is not yet exhausted.
|
5 |
Biomarker discovery and clinical outcome prediction using knowledge based-bioinformatics
Phan, John H., 02 April 2009
Advances in high-throughput genomic and proteomic technology have led to a growing interest in cancer biomarkers. These biomarkers can potentially improve the accuracy of cancer subtype prediction and, subsequently, the success of therapy. However, identification of statistically and biologically relevant biomarkers from high-throughput data can be unreliable due to the nature of the data, e.g., high technical variability, small sample size, and high dimensionality. Because few training samples are available, data-driven machine learning methods are often insufficient without the support of knowledge-based algorithms. We investigate the benefits of using knowledge-based algorithms to solve clinical prediction problems. Because we are interested in identifying biomarkers that are also feasible in clinical prediction models, we focus on two analytical components: feature selection and predictive model selection. In addition to data variance, we must also consider the variance of analytical methods. There are many existing feature selection algorithms, each of which may produce different results. Moreover, it is not trivial to identify model parameters that maximize the sensitivity and specificity of clinical prediction. Thus, we introduce a method that uses independently validated biological knowledge to reduce the space of relevant feature selection algorithms and to improve the reliability of clinical predictors. Finally, we implement several functions of this knowledge-based method as a web-based, user-friendly, and standards-compatible software application.
|
6 |
Discovery of Novel Serum Biomarkers for Diagnosing and Staging Alzheimer's Disease
Shah, Dipti Jigar, 01 June 2014
Department of Chemistry and Biochemistry, BYU. Doctor of Philosophy.
Alzheimer’s disease (AD) is an untreatable neurologic disease affecting more than 5 million Americans, most over 60 years of age. Protein plaques and neurofibrillary tangles typify AD brain pathology and are thought to cause the progressive dementia and brain shrinkage observed in AD. Currently there are no methods to diagnose the disease before damage becomes irreversible. Biochemical tests for AD using cerebrospinal fluid analysis or neuroimaging are not yet sufficiently sensitive and specific, and they are invasive. This points to a need for a more easily applied and more sensitive diagnostic test. Although the gross anatomical changes are localized to the brain, AD is likely to involve changes throughout the body. As a result, changes in the abundance of certain biomolecules present in the circulatory system are likely to occur. Consequently, a serum proteomics approach able to measure such changes, when applied to AD, would likely find quantitative changes in relevant molecules that can help diagnose the disease correctly, ideally early in the disease process. The goal of this work was to discover and validate novel diagnostic serum biomarkers for AD. For biomarker discovery and validation, we used a novel serum proteomics approach involving reversed-phase capillary liquid chromatography-electrospray ionization-quadrupole-time-of-flight mass spectrometry. Our samples were protein depleted, which helped us survey low-molecular-weight species in the serum without ion suppression from larger proteins like albumin. We were able to observe more than 8000 molecular species in a single run.
The overall project comprised four studies: (i) discovery of novel potential serum AD markers, (ii) blinded validation of diagnostically promising biomarkers found in the initial study, with their further chemical identification, (iii) exploration of gender-based serum AD biomarkers, and (iv) discovery of biomarkers that distinguish early versus moderate stage AD. In the first study, the approach found 38 significant (p < 0.05) biomarkers and 21 near-significant (p = 0.05 to 0.099) biomarkers. Using a forward selection approach, we built multi-marker panels with specificities and sensitivities higher than 80%. The second study reports on a blinded validation that was performed on a new set of serum samples. We focused on the 13 most promising AD biomarkers found in the initial study. We successfully validated 4 of these biomarkers, which showed highly significant p-values. As part of this study, these 4 biomarkers were chemically identified using tandem mass spectrometry with fragmentation experiments. The third study used data from the initial study but looked at gender-specific biomarkers. We found 31 significant and near-significant serum AD biomarkers for women, 16 for men, and 25 that were gender independent. Multi-marker panels of AD biomarkers for women or men had sensitivities of >60% and specificities of >85%. In the fourth study, cases with moderate AD were compared to cases with very mild or mild AD to find novel biomarkers that could be used for staging. We found 44 significant and near-significant biomarkers that were quantitatively different between mild and severe AD. In conclusion, we accomplished the goal of this work: finding, validating, and identifying novel serum biomarkers that diagnose AD.
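The panel performance figures above are sensitivities and specificities. As a reminder of how these two quantities are computed, here is a minimal sketch on made-up labels; the function name and the toy data are illustrative assumptions and do not reproduce any result from the thesis.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Label 1 marks an AD case, label 0 a control.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases called positive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls called negative
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls called positive
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 5 cases followed by 5 controls (made-up data).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Sensitivity is the fraction of true cases that the panel flags, and specificity is the fraction of controls that it correctly leaves unflagged, so the two must always be reported together.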
|
7 |
Integrative Modeling and Analysis of High-throughput Biological Data
Chen, Li, 21 January 2011
Computational biology is an interdisciplinary field that focuses on developing mathematical models and algorithms to interpret biological data so as to understand biological problems. With the development of current high-throughput technology, different types of biological data can be measured on a large scale, which calls for more sophisticated computational methods to analyze and interpret the data. In this dissertation, we propose novel methods to integrate, model, and analyze multiple types of biological data, including microarray gene expression data, protein-DNA interaction data, and protein-protein interaction data. These methods will help improve our understanding of biological systems.
First, we propose a knowledge-guided multi-scale independent component analysis (ICA) method for biomarker identification on time course microarray data. Guided by a knowledge gene pool related to a specific disease under study, the method can determine disease relevant biological components from ICA modes and then identify biologically meaningful markers related to the specific disease. We have applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperform several baseline methods for biomarker identification.
Second, we propose a novel method for transcriptional regulatory network identification by integrating gene expression data and protein-DNA binding data. The approach is built upon a multi-level analysis strategy designed for suppressing false positive predictions. With this strategy, a regulatory module becomes increasingly significant as more relevant gene sets are formed at finer levels. At each level, a two-stage support vector regression (SVR) method is utilized to reduce false positive predictions by integrating binding motif information and gene expression data; a significance analysis procedure is followed to assess the significance of each regulatory module. The resulting performance on simulation data and yeast cell cycle data shows that the multi-level SVR approach outperforms other existing methods in the identification of both regulators and their target genes. We have further applied the proposed method to breast cancer cell line data to identify condition-specific regulatory modules associated with estrogen treatment. Experimental results show that our method can identify biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer.
Third, we propose a bootstrapping Markov Random Field (MRF)-based method for subnetwork identification on microarray data by incorporating protein-protein interaction data. Methodologically, an MRF-based network score is first derived by considering the dependency among genes to increase the chance of selecting hub genes. A modified simulated annealing search algorithm is then used to find the optimal or suboptimal subnetworks with maximal network score. A bootstrapping scheme is finally implemented to generate confident subnetworks. Experimentally, we have compared the proposed method with other existing methods, and the resulting performance on simulation data shows that the bootstrapping MRF-based method outperforms other methods in identifying ground-truth subnetworks and hub genes. We have then applied our method to breast cancer data to identify significant subnetworks associated with drug resistance. The identified subnetworks not only show good reproducibility across different data sets, but also indicate several pathways and biological functions potentially associated with the development of breast cancer and drug resistance. In addition, we propose network-constrained support vector machines (SVM) for cancer classification and prediction, which take the network structure into account when constructing classification hyperplanes. The simulation study demonstrates the effectiveness of our proposed method. The study on real microarray data sets shows that our network-constrained SVM, together with the bootstrapping MRF-based subnetwork identification approach, can achieve better classification performance than conventional biomarker selection approaches and SVMs.
We believe that the research presented in this dissertation provides novel and effective methods to model and analyze different types of biological data. The extensive experiments on several real microarray data sets also show the potential to improve the understanding of biological mechanisms related to cancers by generating novel hypotheses for further study. / Ph. D.
|
8 |
A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data
Zuber, Verena, 27 June 2012
In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine, which depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease or the molecular changes related to disease. From a statistical point of view, the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under investigation, e.g. genes, genetic variation, or metabolites; and we refer to observations as the specific samples whose attributes we investigate, e.g. patients and controls. Variable selection in high-dimensional omics data is challenging because of the characteristic structure of such data. For one, omics data is high-dimensional, comprising cellular information in unprecedented detail. Moreover, there is an intricate correlation structure among the variables due to, e.g., internal cellular regulation or external latent factors. Variable selection for uncorrelated data is well established. In contrast, there is no consensus on how to approach variable selection under correlation.
Here, we introduce a multivariate framework for variable selection that explicitly accounts for the correlation among markers. In particular, we present two novel quantities for variable importance: the correlation-adjusted t (CAT) score for classification, and the correlation-adjusted (marginal) correlation (CAR) score for regression. The CAT score is defined as the Mahalanobis-decorrelated t-score vector, and the CAR score as the Mahalanobis-decorrelated correlation between the predictor variables and the outcome. We derive the CAT and CAR scores from a predictive point of view in linear discriminant analysis and regression; both quantities assess the weight of a decorrelated and standardized variable on the prediction rule. Furthermore, we discuss properties of both scores and their relations to established quantities. Above all, the CAT score decomposes Hotelling’s T², and the CAR score decomposes the proportion of variance explained. Notably, the decomposition of total variance into explained and unexplained variance in the linear model can be rewritten in terms of CAR scores.
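The statement that the CAT score decomposes Hotelling’s T² can be checked numerically. The sketch below assumes the two-class case with a pooled within-class covariance and plain empirical estimates (no shrinkage); the function names and the simulated data are hypothetical and not taken from the thesis.

```python
import numpy as np

def inv_sqrt_sym(mat):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigval, eigvec = np.linalg.eigh(mat)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

def pooled_covariance(X1, X2):
    """Pooled within-class covariance of two groups."""
    resid = np.vstack([X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)])
    return resid.T @ resid / (len(X1) + len(X2) - 2)

def cat_scores(X1, X2):
    """CAT scores: two-sample t-scores decorrelated by P^(-1/2)."""
    n1, n2 = len(X1), len(X2)
    S = pooled_covariance(X1, X2)
    v = np.diag(S)                                   # pooled variances
    P = S / np.sqrt(np.outer(v, v))                  # pooled correlation matrix
    t = (X1.mean(axis=0) - X2.mean(axis=0)) / np.sqrt((1 / n1 + 1 / n2) * v)
    return inv_sqrt_sym(P) @ t

def hotelling_t2(X1, X2):
    """Two-sample Hotelling T^2 statistic with pooled covariance."""
    n1, n2 = len(X1), len(X2)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S = pooled_covariance(X1, X2)
    return (n1 * n2 / (n1 + n2)) * (diff @ np.linalg.solve(S, diff))

# Simulated correlated data; only the first variable differs in mean.
rng = np.random.default_rng(2)
mix = np.eye(4) + 0.5                                # mixing induces correlation
X1 = rng.normal(size=(30, 4)) @ mix + np.array([1.0, 0.0, 0.0, 0.0])
X2 = rng.normal(size=(40, 4)) @ mix

tau = cat_scores(X1, X2)
print("squared CAT scores:", np.round(tau ** 2, 3))
print("sum of squared CAT scores:", float(np.sum(tau ** 2)))
print("Hotelling T^2:            ", float(hotelling_t2(X1, X2)))
```

Under these assumptions the squared CAT scores sum to T², which is what allows each variable's contribution to the group separation to be ranked individually.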
To render our approach applicable to high-dimensional omics data, we devise an efficient algorithm for shrinkage estimates of the CAT and CAR scores. Subsequently, we conduct extensive simulation studies to investigate the performance of our novel approaches in ranking and prediction under correlation. Here, CAT and CAR scores consistently improve over marginal approaches, selecting more true positives and achieving a lower model error. Finally, we illustrate the application of CAT and CAR scores to real omics data. In particular, we analyze genomics, transcriptomics, and metabolomics data. We ascertain that CAT and CAR scores are competitive with or outperform state-of-the-art techniques in terms of true positives detected and prediction error.
|
9 |
Indexation de spectres HSQC et d’images IRMf appliquée à la détection de bio-marqueurs / Indexing of HSQC spectra and FMRI images for biomarker identification
Belghith, Akram, 30 March 2012
Medical signal acquisition techniques are constantly evolving and provide a growing amount of heterogeneous data that the physician must analyze. In this context, automatic signal processing methods are regularly proposed to assist the expert in qualitative and quantitative analysis and to facilitate interpretation. These methods should take into account the physics of signal acquisition, the a priori knowledge we have about the signals, and the amount of data to analyze, so that the interpretation is more accurate and reliable. In this thesis, we focus on two-dimensional (2D) Heteronuclear Single Quantum Coherence (HSQC) spectra obtained by High-Resolution Magic Angle Spinning (HR-MAS) NMR for biological tissue analysis, and on functional Magnetic Resonance Imaging (fMRI) images for the analysis of functional brain activity and connectivity, in the search for new biomarkers. Each piece of medical information is characterized by a set of objects that we seek to extract, align, and encode. Clustering these objects by measuring their similarity allows their classification and then the identification of biomarkers. It is this global content-based object indexing and retrieval scheme for biomarker detection that we propose. To this end, we model and integrate the a priori knowledge we have about these biological signals, which allows us to propose appropriate methods for each indexing step and each type of signal.
|