231

Topic Model-based Mass Spectrometric Data Analysis in Cancer Biomarker Discovery Studies

Wang, Minkun 14 June 2017 (has links)
Identification of disease-related alterations in molecular and cellular mechanisms may reveal useful biomarkers for human diseases, including cancers. High-throughput omic technologies for identifying and quantifying multi-level biological molecules (e.g., proteins, glycans, and metabolites) have facilitated advances in biological research in recent years. Liquid (or gas) chromatography coupled with mass spectrometry (LC/GC-MS) has become an essential tool in such large-scale omic studies. Appropriate LC/GC-MS data preprocessing pipelines are needed to detect true differences between biological groups. Challenges exist in several aspects of MS data analysis. Specifically for biomarker discovery, one fundamental challenge in the quantitation of biomolecules stems from the heterogeneous nature of human biospecimens. Although this issue has been a subject of discussion in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry based omic studies. Purification of mass spectrometric data is therefore highly desirable prior to subsequent differential analysis. In this dissertation, we mainly address the purification problem through probabilistic modeling. We propose an intensity-level purification model (IPM) to computationally purify LC/GC-MS based cancerous data in biomarker discovery studies. We further extend IPM to a scan-level purification model (SPM) by considering information from the extracted ion chromatogram (EIC, a scan-level feature). Both IPM and SPM belong to the category of topic modeling approaches, which aim to identify the underlying "topics" (sources) and their mixture proportions in composing the heterogeneous data. Additionally, a denoising deconvolution model (DDM) is proposed to capture the noise signals in samples based on purified profiles. Variational expectation-maximization (VEM) and Markov chain Monte Carlo (MCMC) methods are used to draw inference on the latent variables and estimate the model parameters. Beyond purification, other research topics related to mass spectrometric data analysis for cancer biomarker discovery are also investigated in this dissertation. Chapter 3 discusses the methods developed for differential analysis of LC/GC-MS based omic data, specifically the preprocessing of LC-MS profiled glycan data. Chapter 4 presents the assumptions and inference details of IPM, SPM, and DDM. A latent Dirichlet allocation (LDA) core is used to model the heterogeneous cancerous data as mixtures of topics consisting of a sample-specific pure cancerous source and non-cancerous contaminants. We evaluated the capability of the proposed models in capturing mixture proportions of contaminants and cancer profiles on LC-MS based serum and tissue proteomic and GC-MS based tissue metabolomic datasets acquired from patients with hepatocellular carcinoma (HCC) and liver cirrhosis. Chapter 5 elaborates these applications in cancer biomarker discovery, covering typical single-omic and integrative multi-omic analyses. / Ph. D. / This dissertation documents the methodology and results for computational deconvolution of heterogeneous omic data generated from biospecimens of interest. These omic data convey qualitative and quantitative information on biomolecules (e.g., glycans, proteins, and metabolites) profiled by liquid (or gas) chromatography coupled with mass spectrometry (LC/GC-MS) instruments.
In biomarker discovery, we aim to identify significant differences in biomolecule intensities between two phenotype groups, so that the resulting biomarkers can serve as clinical indicators for early-stage diagnosis. However, the purity of collected samples poses a fundamental challenge to differential analysis. Instead of experimental purification methods, which are costly and time-consuming, we treat purification as a topic modeling problem, assuming each observed biomolecular profile is a mixture of a hidden pure source and unwanted contaminants. The developed models output the estimated mixture proportions as well as the underlying "topics". With purification applied at different levels, improved discrimination power of candidate biomarkers and more biologically meaningful pathways were discovered in LC/GC-MS based multi-omic studies of liver cancer. This work originates from the broader scope of probabilistic generative modeling, in which rational assumptions are made to characterize the generation process of the observations. The developed models therefore have great potential in applications beyond the heterogeneous data purification discussed in this dissertation. A good example is uncovering the relationship of the human gut microbiome with host phenotypes of interest (e.g., diseases such as type-II diabetes), where similar challenges exist in inferring the underlying intestinal flora distribution and estimating mixture proportions. This dissertation also covers related topics in data preprocessing and integration, with the consistent goal of improving biomarker discovery performance. In summary, this research helps address the sample heterogeneity issue observed in LC/GC-MS based cancer biomarker discovery studies and sheds light on computational deconvolution of the mixtures, which can be generalized to other domains of interest.
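To make the topic-model view of purification concrete, here is a minimal sketch (not the dissertation's IPM or SPM): a two-topic multinomial mixture in which each sample's intensity profile blends a shared pure-source profile with a shared contaminant profile, and EM estimates the per-sample mixture proportions. The real models add sample-specific sources, scan-level features, and VEM/MCMC inference; all names and defaults here are illustrative.

```python
import numpy as np

def em_purify(counts, n_iter=200, seed=0):
    """Estimate per-sample mixture weights for a two-topic model.

    counts : (n_samples, n_features) nonnegative intensity matrix,
             treated as pseudo-counts over MS features.
    Returns (weights, topics): weights[i] is the estimated proportion
    of topic 0 (e.g., the pure cancerous source) in sample i.
    """
    rng = np.random.default_rng(seed)
    n, d = counts.shape
    topics = rng.dirichlet(np.ones(d), size=2)   # (2, d) feature profiles
    weights = np.full(n, 0.5)                    # per-sample proportion of topic 0
    for _ in range(n_iter):
        # E-step: responsibility of topic 0 for each (sample, feature) cell
        p0 = weights[:, None] * topics[0]
        p1 = (1 - weights)[:, None] * topics[1]
        r0 = p0 / (p0 + p1 + 1e-12)
        # M-step: update weights and topic profiles from expected counts
        c0 = counts * r0
        c1 = counts * (1 - r0)
        weights = c0.sum(axis=1) / (counts.sum(axis=1) + 1e-12)
        topics[0] = c0.sum(axis=0) / (c0.sum() + 1e-12)
        topics[1] = c1.sum(axis=0) / (c1.sum() + 1e-12)
    return weights, topics
```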
232

Bayesian Modeling for Isoform Identification and Phenotype-specific Transcript Assembly

Shi, Xu 24 October 2017 (has links)
The rapid development of biotechnology has enabled researchers to collect high-throughput data for studying various biological processes at the genomic, transcriptomic, and proteomic levels. Due to the large noise in the data and the high complexity of diseases (such as cancer), it is a challenging task for researchers to extract biologically meaningful information that can help reveal the underlying molecular mechanisms. These challenges call for more effort in developing efficient and effective computational methods to analyze the data at different levels so as to understand biological systems in different aspects. In this dissertation research, we have developed novel Bayesian approaches to infer alternative splicing mechanisms in biological systems using RNA sequencing data. Specifically, we focus on two research topics: isoform identification and phenotype-specific transcript assembly. For isoform identification, we develop a computational approach, SparseIso, to jointly model the existence and abundance of isoforms in a Bayesian framework. A spike-and-slab prior is incorporated into the model to enforce the sparsity of expressed isoforms. A Gibbs sampler is developed to sample the existence and abundance of isoforms iteratively. For transcript assembly, we develop a Bayesian approach, IntAPT, to assemble phenotype-specific transcripts from multiple RNA sequencing profiles. A two-layer Bayesian framework is used to model the existence of phenotype-specific transcripts and the transcript abundance in individual samples. Based on the hierarchical Bayesian model, a Gibbs sampling algorithm is developed to estimate the joint posterior distribution for phenotype-specific transcript assembly. The performance of our proposed methods is evaluated with simulation data, compared with existing methods, and benchmarked with real cell line data. We then apply our methods to breast cancer data to identify biologically meaningful splicing mechanisms associated with breast cancer. In future work, we will extend our methods to de novo transcript assembly to identify novel isoforms in biological systems, and we will incorporate isoform-specific networks into our methods to better understand splicing mechanisms in biological systems. / Ph. D. / Next-generation sequencing technology has significantly improved the resolution of biomedical research at the genomic and transcriptomic levels. Due to the large noise in the data and the high complexity of diseases (such as cancer), it is a challenging task for researchers to extract biologically meaningful information that can help reveal the underlying molecular mechanisms. In this dissertation, we have developed two novel Bayesian approaches to infer alternative splicing mechanisms in biological systems using RNA sequencing data. We have demonstrated the advantages of our proposed approaches over existing methods on both simulation data and real cell line data. Furthermore, the application of our methods to real breast cancer data and glioblastoma tissue data has further shown their efficacy in real biological applications.
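As a hedged illustration of the spike-and-slab idea behind SparseIso (not the published algorithm, which works on read counts over candidate isoforms), the sketch below runs a Gibbs sampler for a sparse linear model y = Xb + noise, where X would collect the expected read signatures of candidate isoforms and b their abundances; inclusion indicators are sampled with the slab coefficient integrated out.

```python
import numpy as np

def spike_slab_gibbs(X, y, n_iter=2000, pi=0.1, tau2=1.0, s2=1.0, seed=0):
    """Gibbs sampler for y = X b + noise with a spike-and-slab prior on b.

    Each candidate's coefficient is b_j = z_j * beta_j with
    z_j ~ Bernoulli(pi) and beta_j | z_j = 1 ~ N(0, tau2).
    Returns posterior inclusion probabilities and posterior mean abundances.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b = np.zeros(p)
    z_sum = np.zeros(p)
    b_sum = np.zeros(p)
    for it in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # residual excluding j
            prec = X[:, j] @ X[:, j] / s2 + 1.0 / tau2
            m = (X[:, j] @ r / s2) / prec
            # log Bayes factor for inclusion, slab coefficient integrated out
            log_bf = 0.5 * prec * m * m - 0.5 * np.log(tau2 * prec)
            q = 1.0 / (1.0 + (1 - pi) / pi * np.exp(-log_bf))
            if rng.random() < q:
                b[j] = rng.normal(m, 1.0 / np.sqrt(prec))
            else:
                b[j] = 0.0
        if it >= n_iter // 2:                        # discard burn-in
            z_sum += (b != 0)
            b_sum += b
    keep = n_iter - n_iter // 2
    return z_sum / keep, b_sum / keep
```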
233

Enhanced Air Transportation Modeling Techniques for Capacity Problems

Spencer, Thomas Louis 02 September 2016 (has links)
Effective and efficient air transportation systems are crucial to a nation's economy and connectedness. These systems involve capital-intensive facilities and equipment and move millions of people and tonnes of freight every day. As air traffic continues to increase, the systems necessary to ensure safe and efficient operation grow ever more complex. Hence, it is imperative that air transport analysts be equipped with the best tools to properly predict and respond to expected air transportation operations. This dissertation aims to improve on the tools currently available to air transportation analysts, while offering new ones. Specifically, this thesis offers the following: 1) a model for predicting arrival runway occupancy times (AROT); 2) a model for predicting departure runway occupancy times (DROT); and 3) a flight planning model. It also explores the use of unmanned aerial vehicles for providing wireless communications services. For the predictive models of AROT and DROT, we fit hierarchical Bayesian regression models to the data, grouped by aircraft type, using airport physical and aircraft operational parameters as the regressors. Recognizing that many existing air transportation models require distributions of AROT and DROT, Bayesian methods are preferred since their outputs are distributions that can be input directly into air transportation modeling programs. Additionally, we show how analysts can decouple AROT and DROT predictions from the traditional four or five groupings of aircraft currently in use. Lastly, for the flight planning model, we present a 2-D model using presently available wind data that provides wind-optimal flight routings. We improve over current models by allowing free flight unconnected to pre-existing airways and by offering finer resolution than the current 2.5-degree norm. / Ph. D.
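A minimal sketch of the kind of partially pooled regression described above, written with PyMC (an assumption; the dissertation does not name its software). Aircraft types share hyperpriors, so the posterior yields a full AROT distribution per type rather than one per coarse aircraft group; variable names, covariates, and prior scales are illustrative.

```python
import numpy as np
import pymc as pm

# Illustrative inputs: X holds operational/airport covariates (e.g., approach
# speed, exit geometry), g codes each landing's aircraft type as an integer,
# arot holds observed arrival runway occupancy times in seconds.

def build_arot_model(X, g, arot, n_types):
    """Partial-pooling regression of AROT: type-specific intercepts and
    slopes are drawn from shared hyperpriors, so sparsely observed
    aircraft types borrow strength from common ones."""
    with pm.Model() as model:
        mu_a = pm.Normal("mu_a", 50.0, 20.0)            # grand mean AROT (s)
        sd_a = pm.HalfNormal("sd_a", 10.0)
        a = pm.Normal("a", mu_a, sd_a, shape=n_types)   # per-type intercepts
        mu_b = pm.Normal("mu_b", 0.0, 5.0, shape=X.shape[1])
        sd_b = pm.HalfNormal("sd_b", 2.0, shape=X.shape[1])
        b = pm.Normal("b", mu_b, sd_b, shape=(n_types, X.shape[1]))
        sigma = pm.HalfNormal("sigma", 5.0)             # residual spread
        mu = a[g] + (b[g] * X).sum(axis=1)
        pm.Normal("obs", mu, sigma, observed=arot)
    return model

# with build_arot_model(X, g, arot, n_types):
#     idata = pm.sample(1000, tune=1000)  # posterior AROT distributions
```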
234

Proactive Decision Support Tools for National Park and Non-Traditional Agencies in Solving Traffic-Related Problems

Fuentes, Antonio 26 March 2019 (has links)
Transportation engineers have recently begun to incorporate statistical and machine learning approaches to solve difficult problems, mainly due to the vast quantities of stochastic data collected (from sensors, video, and human observers). In transportation engineering, a transportation system is often delineated by jurisdictional boundaries and evaluated as such; however, it is ultimately defined by what the analyst considers in answering the question of interest. In this dissertation, a transportation system located in Jackson, Wyoming, under the jurisdiction of Grand Teton National Park and known as the Moose-Wilson Corridor, is evaluated to identify transportation-related factors that influence its operational performance. The evaluation considers the corridor's unique prevailing conditions and takes into account future management strategies. The dissertation accomplishes this through four distinct studies in individual chapters; each chapter is a standalone manuscript with a detailed introduction, purpose, literature review, findings, and conclusion. Chapter 1 provides a general introduction and summarizes Chapters 2-6. Chapter 2 evaluates the operational performance of the Moose-Wilson Corridor's entrance station, determining queueing performance and probability mass functions of the vehicle arrival rates. Chapter 3 evaluates a parking system within the Moose-Wilson Corridor at a popular attraction, the Laurance S. Rockefeller Preserve, assessing the system's operational performance and providing probability mass functions under different arrival and service rates. Chapter 4 takes a data science approach to predicting the probability of vehicles stopping along the Moose-Wilson Corridor, using a machine learning classification methodology known as a "decision tree." In this study, probabilities of stopping at attractions are predicted from GPS tracking data that include entrance location, time of day, and stopping at attractions. Chapter 5 builds on many of the previous findings and presents a tool that uses a Bayesian methodology to determine the posterior distributions of observed arrival and service rates, which serve as bounds and inputs to an agent-based model. The agent-based model represents the Moose-Wilson Corridor under prevailing conditions and considers some of the primary operational changes in Grand Teton National Park's comprehensive management plan for the corridor. An agent-based model provides a flexible platform for modeling multiple aspects unique to a national park, including visitor behavior and its interaction with wildlife. Lastly, Chapter 6 summarizes and concludes the dissertation. / Doctor of Philosophy / In this dissertation, a transportation system located in Jackson, Wyoming, under the jurisdiction of Grand Teton National Park and known as the Moose-Wilson Corridor, is evaluated to identify transportation-related factors that influence its operational performance. The evaluation considers the corridor's unique prevailing conditions and takes into account future management strategies. Furthermore, emerging analytical strategies are implemented to identify and address transportation system operational concerns. Thus, decision support tools for the evaluation of a unique system in a national park are presented in four distinct manuscripts.
The manuscripts cover traditional approaches that break down and evaluate traffic operations and identify mitigation strategies. Additionally, emerging strategies for evaluating data with machine learning approaches are applied to GPS tracks to determine which vehicles stop at park attractions. Lastly, an agent-based model is developed in a flexible platform to draw on the previous findings and evaluate the Moose-Wilson Corridor while considering future policy constraints and the unique interactions between visitors and the corridor's ecology and wildlife.
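The decision-tree study in Chapter 4 could look roughly like the following scikit-learn sketch; the feature encoding (entrance, time of day, day type) and the hyperparameters are assumptions for illustration, not the dissertation's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative feature construction from GPS tracks: entrance used,
# hour of entry, and day type; the label is whether the vehicle stopped
# at a given attraction.
# X = np.column_stack([entrance_id, entry_hour, is_weekend]); y = stopped

def fit_stop_model(X, y):
    """Fit a shallow decision tree and return held-out stop probabilities."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)
    tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=30,
                                  random_state=0)
    tree.fit(X_tr, y_tr)
    p_stop = tree.predict_proba(X_te)[:, 1]  # P(stop) per held-out track
    return tree, p_stop
```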
235

Bayesian Alignment Model for Analysis of LC-MS-based Omic Data

Tsai, Tsung-Heng 22 May 2014 (has links)
Liquid chromatography coupled with mass spectrometry (LC-MS) has been widely used in various omic studies for biomarker discovery. Appropriate LC-MS data preprocessing steps are needed to detect true differences between biological groups. Retention time alignment is one of the most important yet challenging preprocessing steps, needed to ensure that ion intensity measurements among multiple LC-MS runs are comparable. In this dissertation, we propose a Bayesian alignment model (BAM) for analysis of LC-MS data. BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters and provides estimates of the retention time variability along with uncertainty measures, enabling a natural framework to integrate information from various sources. From methodology development to practical application, we investigate the alignment problem through three research topics: 1) development of a single-profile Bayesian alignment model, 2) development of a multi-profile Bayesian alignment model, and 3) application to biomarker discovery research. Chapter 2 introduces the profile-based Bayesian alignment using a single chromatogram, e.g., the base peak chromatogram from each LC-MS run. The single-profile alignment model improves on existing MCMC-based alignment methods through 1) the implementation of an efficient MCMC sampler using a block Metropolis-Hastings algorithm, and 2) an adaptive mechanism for knot specification using stochastic search variable selection (SSVS). Chapter 3 extends the model to integrate complementary information that better captures the variability in chromatographic separation. We use Gaussian process regression on the internal standards to derive a prior distribution for the mapping functions. In addition, a clustering approach is proposed to identify multiple representative chromatograms for each LC-MS run. With the Gaussian process prior, these chromatograms are simultaneously considered in the profile-based alignment, which greatly improves the model estimation and facilitates the subsequent peak matching process. Chapter 4 demonstrates the applicability of the proposed Bayesian alignment model to biomarker discovery research. We integrate the proposed Bayesian alignment model into a rigorous preprocessing pipeline for LC-MS data analysis. Through the developed analysis pipeline, candidate biomarkers for hepatocellular carcinoma (HCC) are identified and confirmed on a complementary platform. / Ph. D.
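To illustrate the flavor of MCMC-based profile alignment (a drastically simplified stand-in for BAM, which uses flexible spline warping functions, a block Metropolis-Hastings sampler, and SSVS knot selection), the sketch below infers a single retention-time shift between an observed chromatogram and a reference by random-walk Metropolis.

```python
import numpy as np

def align_shift_mh(t, ref, obs, n_iter=5000, prior_sd=30.0,
                   noise_sd=0.1, step=2.0, seed=0):
    """Random-walk Metropolis for a scalar retention-time shift delta,
    modeling obs(t) ~ Normal(ref(t - delta), noise_sd). A real
    profile-based aligner warps time with a spline; the scalar shift
    keeps this sketch short."""
    rng = np.random.default_rng(seed)

    def log_post(delta):
        warped = np.interp(t - delta, t, ref)        # shifted reference
        ll = -0.5 * np.sum((obs - warped) ** 2) / noise_sd ** 2
        lp = -0.5 * delta ** 2 / prior_sd ** 2       # Gaussian prior on shift
        return ll + lp

    delta, lp_cur = 0.0, log_post(0.0)
    samples = []
    for _ in range(n_iter):
        prop = delta + step * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp_cur:  # MH acceptance
            delta, lp_cur = prop, lp_prop
        samples.append(delta)
    return np.array(samples)                          # posterior draws of shift
```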
236

Predictive Turbulence Modeling with Bayesian Inference and Physics-Informed Machine Learning

Wu, Jinlong 25 September 2018 (has links)
Reynolds-Averaged Navier-Stokes (RANS) simulations are widely used for engineering design and analysis involving turbulent flows. In RANS simulations, the Reynolds stress needs closure models and the existing models have large model-form uncertainties. Therefore, the RANS simulations are known to be unreliable in many flows of engineering relevance, including flows with three-dimensional structures, swirl, pressure gradients, or curvature. This lack of accuracy in complex flows has diminished the utility of RANS simulations as a predictive tool for engineering design, analysis, optimization, and reliability assessments. Recently, data-driven methods have emerged as a promising alternative to develop the model of Reynolds stress for RANS simulations. In this dissertation I explore two physics-informed, data-driven frameworks to improve RANS modeled Reynolds stresses. First, a Bayesian inference framework is proposed to quantify and reduce the model-form uncertainty of RANS modeled Reynolds stress by leveraging online sparse measurement data with empirical prior knowledge. Second, a machine-learning-assisted framework is proposed to utilize offline high-fidelity simulation databases. Numerical results show that the data-driven RANS models have better prediction of Reynolds stress and other quantities of interest for several canonical flows. Two metrics are also presented for an a priori assessment of the prediction confidence for the machine-learning-assisted RANS model. The proposed data-driven methods are also applicable to the computational study of other physical systems whose governing equations have some unresolved physics to be modeled. / Ph. D.
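A toy version of the first framework's core step, updating a low-dimensional Reynolds-stress discrepancy parameter from sparse measurements, might look like the grid-based Bayesian update below. The forward model stands in for a RANS solve, and the whole construction is an editorial illustration rather than the dissertation's actual method.

```python
import numpy as np

def posterior_over_discrepancy(c_grid, forward, d_obs, obs_sd, prior_sd=1.0):
    """Grid-based Bayesian update for a scalar discrepancy parameter c
    that perturbs the RANS-modeled Reynolds stress.

    forward(c) maps c to the predicted observables (a stand-in for a
    CFD solve); d_obs are sparse velocity/pressure measurements.
    Returns the normalized posterior density over c_grid.
    """
    log_prior = -0.5 * (c_grid / prior_sd) ** 2        # empirical prior
    log_like = np.array([
        -0.5 * np.sum((d_obs - forward(c)) ** 2) / obs_sd ** 2
        for c in c_grid])
    log_post = log_prior + log_like
    log_post -= log_post.max()                         # stabilize exponentiation
    post = np.exp(log_post)
    return post / np.trapz(post, c_grid)               # normalized density
```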
237

Bayesian Parameter Estimation on Three Models of Influenza

Torrence, Robert Billington 11 May 2017 (has links)
Mathematical models of viral infections have been informing virology research for years. Estimating parameter values for these models can lead to understanding of underlying biological quantities. This has been successful in HIV modeling for the estimation of values such as the lifetime of infected CD8 T-cells. However, estimating these values is notoriously difficult, especially for highly complex models. We use Bayesian inference and Markov chain Monte Carlo (MCMC) methods to estimate the underlying densities of the parameters (assumed to be continuous random variables) for three models of influenza. We discuss the advantages and limitations of parameter estimation using these methods. The data and influenza models used for this project are from the lab of Dr. Amber Smith in Memphis, Tennessee. / Master of Science / Mathematical models of viral infections have been informing virology research for years. Estimating parameter values for these models can lead to understanding of underlying biological quantities. This has been successful in HIV modeling for the estimation of values such as the lifetime of infected CD8 T-cells. However, estimating these values is notoriously difficult, especially for highly complex models. We use Bayesian inference and Markov chain Monte Carlo (MCMC) methods to perform parameter estimation for three models of influenza. We discuss the advantages and limitations of these methods. The data and influenza models used for this project are from the lab of Dr. Amber Smith in Memphis, Tennessee.
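For concreteness, a minimal sketch of this kind of workflow: a target cell-limited influenza model integrated with SciPy, plus a random-walk Metropolis sampler over two log-parameters. The model structure, priors, and fixed constants are illustrative assumptions, not the specific models or data from the Smith lab.

```python
import numpy as np
from scipy.integrate import odeint

def tiv(y, t, beta, delta, p, c):
    """Target cell-limited influenza model: susceptible target cells T,
    infected cells I, and free virus V."""
    T, I, V = y
    return [-beta * T * V, beta * T * V - delta * I, p * I - c * V]

def log_post(theta, t_obs, v_obs, p=1e-2, c=10.0, y0=(4e8, 0.0, 1.0)):
    """Log-posterior for theta = (log beta, log delta): wide normal priors
    and a Gaussian likelihood on log10 viral titer."""
    beta, delta = np.exp(theta)
    sol = odeint(tiv, y0, np.concatenate(([0.0], t_obs)),
                 args=(beta, delta, p, c))
    v_pred = np.log10(np.clip(sol[1:, 2], 1e-6, None))
    ll = -0.5 * np.sum((v_obs - v_pred) ** 2) / 0.5 ** 2
    lp = -0.5 * np.sum(((theta - np.log([1e-6, 2.0])) / 5.0) ** 2)
    return ll + lp

def metropolis(t_obs, v_obs, n_iter=20000, step=0.05, seed=0):
    """Random-walk Metropolis over the log-parameters."""
    rng = np.random.default_rng(seed)
    theta = np.log([1e-6, 2.0])              # initial (beta, delta)
    lp_cur, chain = log_post(theta, t_obs, v_obs), []
    for _ in range(n_iter):
        prop = theta + step * rng.normal(size=2)
        lp_prop = log_post(prop, t_obs, v_obs)
        if np.log(rng.random()) < lp_prop - lp_cur:
            theta, lp_cur = prop, lp_prop
        chain.append(np.exp(theta))          # store (beta, delta) draws
    return np.array(chain)
```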
238

Continuous Continuous Probabilistic Genotyping: A differentiable model and modern Bayesian inference techniques for forensic DNA mixtures

Susik, Mateusz 19 June 2024 (has links)
DNA samples are a part of the physical evidence collected during contemporary crime scene investigation procedures. After processing the samples, a laboratory obtains short tandem repeat electropherograms. In the case of mixed DNA profiles, i.e., profiles that contain DNA material from more than one contributor, the laboratory needs to estimate a test statistic (the likelihood ratio) that can provide evidence, either inculpatory or exculpatory, against the person of interest. This is automated with probabilistic genotyping (PG) software based on (fully-)continuous models: those that consider the heights of the observed peaks. In this thesis, we provide an understanding of modern PG methods. We then show how to improve measurable indicators of algorithm performance, such as precision and inference runtime, that directly correspond to the efficiency and efficacy of work performed in a lab. With quicker algorithms, forensics laboratories can process more samples and provide more comprehensive results by reanalysing mixtures under different hypotheses and hyperparameterisations. With more precise algorithms, there will be greater confidence in their results; improved precision would strengthen the admissibility of the provided evidence and the reliability of the results. We achieve improvements over the state of the art by utilising probabilistic programming and modern Bayesian inference methods. We describe a differentiable (and hence continuous) continuous model that can be used with different estimators from both the sampling and variational families of techniques. Finally, as different PG products output different likelihood ratios, we provide an explanation of some of the factors causing this behaviour. This is of high importance because if two solutions are used for the same crime case, the difference must be understood; otherwise, for lack of consensus, the results would cause confusion or, in the worst case, would not be admitted by the court.
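A heavily simplified sketch of a fully-continuous likelihood-ratio computation at a single locus follows: unknown contributor genotypes are marginalized under Hardy-Weinberg priors, and peak heights enter through a Gaussian peak-height model. Real PG software additionally models stutter, degradation, dropout, drop-in, and many loci; everything below is illustrative.

```python
import numpy as np
from itertools import combinations_with_replacement

def peak_likelihood(heights, genotypes, weights, mu=1000.0, cv=0.3):
    """Likelihood of observed peak heights at one locus given the
    genotypes of all contributors and their mixture weights. The
    expected height of an allele is proportional to its total dose."""
    dose = {}
    for gt, w in zip(genotypes, weights):
        for allele in gt:
            dose[allele] = dose.get(allele, 0.0) + 0.5 * w
    ll = 0.0
    for allele in set(heights) | set(dose):
        h = heights.get(allele, 0.0)
        exp_h = mu * dose.get(allele, 0.0)
        sd = max(cv * exp_h, 50.0)      # floor crudely covers dropout/drop-in
        ll += -0.5 * ((h - exp_h) / sd) ** 2 - np.log(sd)
    return np.exp(ll)

def likelihood_ratio(heights, poi, alleles, freqs, w=(0.7, 0.3)):
    """LR for 'POI + one unknown' (Hp) vs 'two unknowns' (Hd),
    marginalizing unknown genotypes over allele frequencies."""
    gts = list(combinations_with_replacement(alleles, 2))

    def prior(gt):
        a, b = gt
        return freqs[a] * freqs[b] * (1 if a == b else 2)  # Hardy-Weinberg

    num = sum(prior(g) * peak_likelihood(heights, [poi, g], w) for g in gts)
    den = sum(prior(g1) * prior(g2) * peak_likelihood(heights, [g1, g2], w)
              for g1 in gts for g2 in gts)
    return num / den
```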
239

Hierarchical Initial Condition Generator for Cosmic Structure Using Normalizing Flows / Hierarkisk generator av begynnelsetillstånd till kosmisk struktur med användning av normaliserade flöden

Holma, Pontus January 2024 (has links)
In this report, we present a novel Bayesian inference framework to reconstruct the three-dimensional initial conditions of cosmic structure formation from data. To achieve this goal, we leverage deep learning technologies to create a generative model of cosmic initial conditions, paired with a fast machine learning surrogate model emulating the complex gravitational structure formation. According to the cosmological paradigm, all observable structures were formed from tiny primordial quantum fluctuations generated during the early stages of the Universe. As time passed, these seed fluctuations grew via gravitational aggregation to form the presently observed cosmic web traced by galaxies. For this reason, the specific shape of a configuration of the observed galaxy distribution retains a memory of its initial conditions and the physical processes that shaped it. To recover this information, we develop a novel machine learning approach that leverages the hierarchical nature of structure formation. We demonstrate our method in a mock analysis and find that we can recover the initial conditions with high accuracy, showing the potential of our model. / In this thesis, a framework based on Bayesian inference is presented for reconstructing the three-dimensional initial conditions of cosmic structure from data. To achieve this, deep learning techniques are used to create a generative model of cosmic initial conditions, paired with a fast machine learning model that emulates the complex gravitational structure formation. According to modern theories in cosmology, all observable structures in the Universe were created from small quantum fluctuations in the early Universe. As time passed, these fluctuations grew via gravitational forces to form the galaxy-traced cosmic web observed today. Because of this, the specific shape of a configuration of the observed galaxy distribution retains a memory of its initial conditions and of the physical processes that shaped it. To recover this information, a machine learning method is presented that exploits the hierarchical nature of structure formation. The method is demonstrated in a mock test showing that the initial conditions can be recovered with high accuracy, indicating the model's potential.
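As a sketch of one building block such a generative model might use (assuming a RealNVP-style normalizing flow, which the abstract does not specify), here is an affine coupling layer with its exact inverse and Jacobian log-determinant: the ingredients that make flow densities tractable.

```python
import numpy as np

def coupling_forward(x, W_s, b_s, W_t, b_t):
    """One RealNVP-style affine coupling layer. The first half of the
    dimensions passes through unchanged and conditions an affine map of
    the second half; the Jacobian log-determinant is just sum(s).
    W_s, b_s, W_t, b_t parameterize the scale/translation nets (here a
    single linear layer with tanh, purely illustrative)."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s = np.tanh(x1 @ W_s + b_s)        # scale, bounded for stability
    t = x1 @ W_t + b_t                 # translation
    y2 = x2 * np.exp(s) + t
    y = np.concatenate([x1, y2], axis=-1)
    log_det = s.sum(axis=-1)           # log|det Jacobian| of the layer
    return y, log_det

def coupling_inverse(y, W_s, b_s, W_t, b_t):
    """Exact inverse, needed to evaluate densities of generated fields."""
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    s = np.tanh(y1 @ W_s + b_s)
    t = y1 @ W_t + b_t
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=-1)
```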
240

Essays on Bayesian Inference for Social Networks

Koskinen, Johan January 2004 (has links)
This thesis presents Bayesian solutions to inference problems for three types of social network data structures: a single observation of a social network, repeated observations on the same social network, and repeated observations on a social network developing through time.

A social network is conceived as a structure consisting of actors and their social interaction with each other. A common conceptualisation of social networks is to let the actors be represented by nodes in a graph, with edges between pairs of nodes that are relationally tied to each other according to some definition. Statistical analysis of social networks is to a large extent concerned with modelling these relational ties, which lends itself to empirical evaluation.

The first paper deals with a family of statistical models for social networks called exponential random graphs, which takes various structural features of the network into account. In general, the likelihood functions of exponential random graphs are only known up to a constant of proportionality. A procedure for performing Bayesian inference using Markov chain Monte Carlo (MCMC) methods is presented. The algorithm consists of two basic steps: one in which an ordinary Metropolis-Hastings updating step is used, and another in which an importance sampling scheme is used to calculate the acceptance probability of the Metropolis-Hastings step.

In the second paper, a method for modelling reports given by actors (or other informants) on their social interaction with others is investigated in a Bayesian framework. The model contains two basic ingredients: the unknown network structure and functions that link this unknown network structure to the reports given by the actors. These functions take the form of probit link functions. An intrinsic problem is that the model is not identified, meaning that there are combinations of values of the unknown structure and the parameters in the probit link functions that are observationally equivalent. Instead of using restrictions to achieve identification, it is proposed that the different observationally equivalent combinations of parameters and unknown structure be investigated a posteriori. Estimation of parameters is carried out using Gibbs sampling with a switching device that enables transitions between posterior modal regions. The main goal of the procedures is to provide tools for comparisons of different model specifications.

Papers 3 and 4 propose Bayesian methods for longitudinal social networks. The premise of the models investigated is that overall change in social networks occurs as a consequence of sequences of incremental changes. Models for the evolution of social networks using continuous-time Markov chains are meant to capture these dynamics. Paper 3 presents an MCMC algorithm for exploring the posteriors of parameters for such Markov chains. More specifically, the unobserved evolution of the network in between observations is explicitly modelled, thereby avoiding the need to deal with explicit formulas for the transition probabilities. This enables likelihood-based parameter inference in a wider class of network evolution models than has been available before. Paper 4 builds on the proposed inference procedure of Paper 3 and demonstrates how to perform model selection for a class of network evolution models.
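To illustrate the premise of Papers 3 and 4 (network change as a sequence of incremental tie toggles in continuous time), the sketch below simulates such a chain Gillespie-style, with a user-supplied toggle intensity; the example rate with a reciprocity effect is purely illustrative.

```python
import numpy as np

def simulate_network_ctmc(A0, t_end, rate, seed=0):
    """Gillespie-style simulation of a directed network evolving by single
    tie toggles in continuous time. rate(A, i, j) returns the toggle
    intensity for the (i, j) tie given the current adjacency matrix A;
    overall change accumulates as a sequence of these incremental events."""
    rng = np.random.default_rng(seed)
    A = A0.copy()
    n = A.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    t, history = 0.0, [(0.0, A0.copy())]
    while True:
        rates = np.array([rate(A, i, j) for i, j in pairs])
        total = rates.sum()
        t += rng.exponential(1.0 / total)     # waiting time to next event
        if t > t_end:
            break
        i, j = pairs[rng.choice(len(pairs), p=rates / total)]
        A[i, j] = 1 - A[i, j]                 # toggle one directed tie
        history.append((t, A.copy()))
    return history

# Example rate: baseline plus a reciprocity effect (illustrative values).
# rate = lambda A, i, j: np.exp(-2.0 + 1.5 * A[j, i])
```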
