1 |
New support vector machine formulations and algorithms with application to biomedical data analysis
Guan, Wei, 13 June 2011
The Support Vector Machine (SVM) classifier seeks the separating hyperplane w·x = r that maximizes the margin, which is equivalent to maximizing 1/‖w‖₂². It can be formalized as an optimization problem that minimizes the hinge loss Σᵢ (1 − yᵢ f(xᵢ))₊ plus a penalty on the L₂-norm of the weight vector. SVM is now a mainstay method of machine learning. The goal of this dissertation is to solve different biomedical data analysis problems efficiently using extensions of SVM, in which we augment the standard SVM formulation according to the application requirements. The biomedical applications we explore in this thesis include cancer diagnosis, biomarker discovery, and energy function learning for protein structure prediction.
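As a rough illustration of the hinge-loss objective described above, the following sketch evaluates it for a linear classifier f(x) = w·x − r. The trade-off constant `C`, the factor of 1/2 on the squared norm, and the toy data are assumptions made for this example; the abstract does not specify them.

```python
def svm_objective(w, r, X, y, C=1.0):
    """Regularized hinge-loss objective for f(x) = w.x - r.

    C and the 0.5 scaling are illustrative assumptions, not taken
    from the dissertation.
    """
    # Regularization term: half the squared L2 norm of the weights.
    penalty = 0.5 * sum(wj * wj for wj in w)
    # Hinge loss: sum over samples of (1 - y_i * f(x_i))_+.
    hinge = 0.0
    for xi, yi in zip(X, y):
        f = sum(wj * xj for wj, xj in zip(w, xi)) - r
        hinge += max(0.0, 1.0 - yi * f)
    return penalty + C * hinge


# Hypothetical toy data: two points outside the margin, one inside it.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
value = svm_objective([1.0, 0.0], 0.0, X, y)
print(value)
```

Only the third point contributes to the hinge term, because it is the only one that lies inside the margin.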
Ovarian cancer diagnosis is problematic because the disease is typically asymptomatic, especially at early stages of progression and/or recurrence. We investigate a sample set consisting of 44 women diagnosed with serous papillary ovarian cancer and 50 healthy women or women with benign conditions. We profile the relative metabolite levels in the patient sera using a high-throughput ambient ionization mass spectrometry technique, Direct Analysis in Real Time (DART). We then reduce the diagnostic classification on these metabolic profiles to a functional classification problem and solve it with the functional Support Vector Machine (fSVM) method. The assay distinguished between the cancer and control groups with an unprecedented 99% accuracy (100% sensitivity, 98% specificity) under leave-one-out cross-validation. This approach has significant clinical potential as a cancer diagnostic tool.
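The reported rates can be related to the cohort sizes with a short calculation. The confusion counts below are an assumption: they are one set of counts consistent with 44 cancer samples, 50 controls, and the stated percentages, and are not given in the abstract.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)            # fraction of cancer cases detected
    specificity = tn / (tn + fp)            # fraction of controls correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts: all 44 cancer samples correct, 1 of 50 controls misclassified.
sens, spec, acc = diagnostic_metrics(tp=44, fn=0, tn=49, fp=1)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
```

Under these assumed counts, a single misclassified control out of 94 samples rounds to the 99% accuracy quoted above.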
High-throughput technologies provide simultaneous evaluation of thousands of potential biomarkers to distinguish different patient groups. To assist biomarker discovery from these low-sample-size, high-dimensional cancer data, we first explore a convex relaxation of the L₀-SVM problem and solve it using mixed-integer programming techniques. We further propose a more efficient L₀-SVM approximation, the fractional norm SVM, which replaces the L₂-penalty with an L_q-penalty (q ∈ (0,1)) in the optimization formulation. We solve it with a Difference of Convex functions (DC) programming technique. Empirical studies on synthetic data sets as well as real-world biomedical data sets support the effectiveness of our proposed L₀-SVM approximation methods over other commonly used sparse SVM methods such as the L₁-SVM method.
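To see why a fractional-norm penalty favors sparse weight vectors more strongly than the L₁-penalty, consider the minimal sketch below. It only evaluates the penalty term Σⱼ |wⱼ|^q; it does not implement the DC programming solver, and the weight vectors and the choice q = 0.5 are hypothetical.

```python
def lq_penalty(w, q):
    """Sum of |w_j|**q; q = 1 gives the L1 penalty, 0 < q < 1 a fractional one."""
    return sum(abs(wj) ** q for wj in w)


# Two hypothetical weight vectors with the same L1 norm.
w_sparse = [2.0, 0.0, 0.0, 0.0]   # one active feature
w_dense = [0.5, 0.5, 0.5, 0.5]    # four active features

l1_sparse, l1_dense = lq_penalty(w_sparse, 1.0), lq_penalty(w_dense, 1.0)
lq_sparse, lq_dense = lq_penalty(w_sparse, 0.5), lq_penalty(w_dense, 0.5)
print(l1_sparse, l1_dense)   # the L1 penalty cannot tell the two apart
print(lq_sparse, lq_dense)   # the fractional penalty is smaller for the sparse vector
```

As q decreases toward 0, each nonzero term approaches 1, so the penalty approaches the count of nonzero weights that L₀-SVM penalizes.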
A critical open problem in ab initio protein folding is protein energy function design. We reduce the problem of learning an energy function for ab initio folding to a standard machine learning problem, learning-to-rank. Based on the application requirements, we constrain the reduced ranking problem to non-negative weights and develop two efficient algorithms for non-negativity constrained SVM optimization. We conduct the empirical study on an energy data set for random conformations of 171 proteins that fall into the ab initio folding class. We compare our approach with the optimization approach used in the protein structure prediction tool TASSER. Numerical results indicate that our approach was able to learn energy functions with improved rank statistics (evaluated by pairwise agreement) as well as improved correlation between the total energy and structural dissimilarity.
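One common way to measure pairwise agreement is the fraction of conformation pairs whose energy ordering matches their ordering by structural dissimilarity. The sketch below uses that definition as an assumption, since the abstract does not define the statistic; the energies and dissimilarity values are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(energies, dissimilarities):
    """Fraction of non-tied pairs ranked the same way by both quantities."""
    agree = total = 0
    for i, j in combinations(range(len(energies)), 2):
        de = energies[i] - energies[j]
        dd = dissimilarities[i] - dissimilarities[j]
        if de == 0 or dd == 0:
            continue  # ties carry no ordering information, so skip them
        total += 1
        if (de > 0) == (dd > 0):
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical conformations: lower energy should mean closer to the native structure.
energies = [-10.0, -8.0, -5.0, -9.0]
dissimilarities = [1.0, 2.0, 4.0, 3.0]
score = pairwise_agreement(energies, dissimilarities)
print(score)
```

A score of 1.0 would mean the energy function orders every pair of conformations the same way as the structural dissimilarity measure does.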
|
2 |
Applications and challenges in mass spectrometry-based untargeted metabolomics
Jones, Christina Michele, 27 May 2016
Metabolomics is the methodical scientific study of biochemical processes associated with the metabolome—which comprises the entire collection of metabolites in any biological entity. Metabolome changes occur as a result of modifications in the genome and proteome, and are, therefore, directly related to cellular phenotype. Thus, metabolomic analysis is capable of providing a snapshot of cellular physiology. Untargeted metabolomics is an impartial, all-inclusive approach for detecting as many metabolites as possible without a priori knowledge of their identity. Hence, it is a valuable exploratory tool capable of providing extensive chemical information for discovery and hypothesis-generation regarding biochemical processes. A history of metabolomics and advances in the field corresponding to improved analytical technologies are described in Chapter 1 of this dissertation. Additionally, Chapter 1 introduces the analytical workflows involved in untargeted metabolomics research to provide a foundation for Chapters 2 – 5.
Part I of this dissertation, which encompasses Chapters 2 – 3, describes the use of mass spectrometry (MS)-based untargeted metabolomic analysis to acquire new insight into cancer detection. There is a knowledge deficit regarding the biochemical processes behind the origin and proliferative molecular mechanisms of many types of cancer, which has also led to a shortage of sensitive and specific biomarkers. Chapter 2 describes the development of an in vitro diagnostic multivariate index assay (IVDMIA) for prostate cancer (PCa) prediction based on ultra performance liquid chromatography-mass spectrometry (UPLC-MS) metabolic profiling of blood serum samples from 64 PCa patients and 50 healthy individuals. A panel of 40 metabolic spectral features was found to be differential with 92.1% sensitivity, 94.3% specificity, and 93.0% accuracy. The performance of the IVDMIA was higher than that of the prevalent prostate-specific antigen blood test, thus highlighting that a combination of multiple discriminant features yields higher predictive power for PCa detection than the univariate analysis of a single marker. Chapter 3 describes two approaches that were taken to investigate metabolic patterns for early detection of ovarian cancer (OC). First, Dicer-Pten double knockout (DKO) mice that phenocopy many of the features of metastatic high-grade serous carcinoma (HGSC) observed in women were studied. Using UPLC-MS, serum samples from 14 early-stage tumor DKO mice and 11 controls were analyzed. Iterative multivariate classification selected 18 metabolites that, when considered as a panel, yielded 100% accuracy, sensitivity, and specificity for early-stage HGSC detection. In the second approach, serum metabolic phenotypes of an early-stage OC pilot patient cohort were characterized. Serum samples were collected from 24 early-stage OC patients and 40 healthy women, and subsequently analyzed using UPLC-MS. Multivariate statistical analysis employing support vector machine learning methods and recursive feature elimination selected a panel of metabolites that differentiated between age-matched samples with 100% cross-validated accuracy, sensitivity, and specificity. This small pilot study demonstrated that metabolic phenotypes may be useful for detecting early-stage OC and thus supports conducting larger, more comprehensive studies.
Many challenges exist in the field of untargeted metabolomics. Part II of this dissertation, which encompasses Chapters 4 – 5, focuses on two specific challenges. While metabolomic data may be used to generate hypotheses concerning biological processes, determining causal relationships within metabolic networks with only metabolomic data is impractical. Proteins play major roles in these networks; therefore, pairing metabolomic information with that acquired from proteomics gives a more comprehensive snapshot of perturbations to metabolic pathways. Chapter 4 describes the integration of MS- and NMR-based metabolomics with proteomics analyses to investigate the role of chemically mediated ecological interactions between Karenia brevis and two diatom competitors, Asterionellopsis glacialis and Thalassiosira pseudonana. This integrated systems biology approach showed that K. brevis allelopathy distinctively perturbed the metabolisms of these two competitors. A. glacialis had a more robust metabolic response to K. brevis allelopathy, which may be a result of its repeated exposure to K. brevis blooms in the Gulf of Mexico. In T. pseudonana, however, K. brevis allelopathy disrupted energy metabolism and obstructed cellular protection mechanisms, including altering cell membrane components, inhibiting osmoregulation, and increasing oxidative stress. This work represents the first instance of metabolites and proteins measured simultaneously to understand the effects of allelopathy or, in fact, any form of competition.
Chromatography is traditionally coupled to MS for untargeted metabolomics studies. While coupling chromatography to MS greatly enhances metabolome analysis due to the orthogonality of the techniques, the lengthy analysis times pose challenges for large metabolomics studies. Consequently, there is still a need for higher-throughput MS approaches. A rapid metabolic fingerprinting method that uses a new transmission mode direct analysis in real time (TM-DART) ambient sampling technique is presented in Chapter 5. Optimizing the TM-DART parameters that directly affect metabolite desorption and ionization, such as sample position and ionizing gas desorption temperature, was critical to achieving high sensitivity and detecting a broad mass range of metabolites. In terms of reproducibility, TM-DART compared favorably with traditional probe mode DART analysis, with coefficients of variation as low as 16%. TM-DART MS proved to be a powerful analytical technique for rapid metabolome analysis of human blood sera and was adapted for exhaled breath condensate (EBC) analysis. To determine the feasibility of using TM-DART for metabolomics investigations, TM-DART was interfaced with traveling wave ion mobility spectrometry (TWIMS) time-of-flight (TOF) MS for the analysis of EBC samples from cystic fibrosis patients and healthy controls. TM-DART-TWIMS-TOF MS successfully detected cystic fibrosis in this small sample cohort, thereby demonstrating that it can be employed for probing metabolome changes.
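The coefficient of variation used above as a reproducibility measure is the standard deviation of replicate measurements expressed as a percentage of their mean. The sketch below assumes the sample standard deviation is used, which the abstract does not state; the replicate intensities are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical replicate peak intensities for one metabolite.
intensities = [90.0, 100.0, 110.0]
cv = coefficient_of_variation(intensities)
print(f"CV = {cv:.1f}%")
```

Lower values indicate more reproducible signal across replicate analyses of the same sample.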
Finally, in Chapter 6, a perspective on the presented work is provided along with goals on which future studies may focus.
|