921 |
Statistical methods for Mendelian randomization using GWAS summary data
Hu, Xianghong, 23 August 2019
Mendelian Randomization (MR) is a powerful tool for assessing the causal effect of an exposure on an outcome using genetic variants as instrumental variables. Much of the recent development has been propelled by the increasing availability of GWAS summary data. However, the accuracy of MR causal effect estimates can be compromised when the MR assumptions are violated. Sources of bias include weak effects arising from polygenicity, the presence of horizontal pleiotropy, and other biases such as selection bias. In this thesis, we propose two methods to address these issues.

In the first part, we propose a method named Bayesian Weighted Mendelian Randomization (BWMR) for causal inference using summary statistics from GWAS. BWMR not only accounts for the uncertainty of weak effects owing to the polygenicity of the human genome but also models weak horizontal pleiotropic effects. Moreover, BWMR adopts a Bayesian reweighting strategy to detect large pleiotropic outliers. An efficient algorithm based on variational inference was developed to make BWMR computationally efficient and stable. Because variational inference tends to underestimate the variance, we further derived a closed-form variance estimator inspired by a linear response method. We conducted several simulations to evaluate the performance of BWMR, demonstrating its advantage over other methods. We then applied BWMR to assess causality between 126 metabolites and 90 complex traits, revealing novel causal relationships.

In the second part, we further developed BWMR-C: a statistical correction of selection bias for Mendelian Randomization based on a Bayesian weighted method. Building on the framework of BWMR, the probability model in BWMR-C is conditioned on the IV selection criteria. In this way, BWMR-C reduces the influence of the selection process on the causal effect estimates while preserving the good properties of BWMR. To make the causal inference computationally stable and efficient, we developed a variational EM algorithm. We conducted comprehensive simulations to evaluate the performance of BWMR-C in correcting selection bias, and then applied BWMR-C to seven body-fat-distribution-related traits and 140 UK Biobank traits. Our results show that BWMR-C achieves satisfactory performance in correcting selection bias.

Keywords: Mendelian Randomization, polygenicity, horizontal pleiotropy, selection bias, variational inference.
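For orientation, the two-sample summary-data MR setting that BWMR builds on regresses SNP-outcome effects on SNP-exposure effects. Below is a minimal sketch of the classical inverse-variance weighted (IVW) estimator, shown only as the baseline such methods extend; it is not the thesis's Bayesian model, and the function and variable names are illustrative.

```python
import numpy as np

def ivw_estimate(beta_exp, beta_out, se_out):
    """Inverse-variance weighted causal effect from GWAS summary statistics:
    weighted regression of outcome effects on exposure effects through the origin."""
    beta_exp, beta_out, se_out = map(np.asarray, (beta_exp, beta_out, se_out))
    w = 1.0 / se_out**2                        # precision weights from outcome SEs
    den = np.sum(w * beta_exp**2)
    beta_causal = np.sum(w * beta_exp * beta_out) / den
    se_causal = np.sqrt(1.0 / den)             # first-order standard error
    return beta_causal, se_causal
```

Methods like BWMR replace these fixed weights with a full probability model, so weak instruments and pleiotropic outliers are down-weighted rather than taken at face value.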
|
922 |
Clustering Algorithm for Zero-Inflated Data
January 2020
Zero-inflated data are common in biomedical research. In cluster analysis, heuristic approaches fail to provide inferential properties for the outcome, while the existing model-based approach only works for mixtures of multivariate normal distributions. In this dissertation, I developed two new model-based clustering algorithms: the multivariate zero-inflated log-normal and the multivariate zero-inflated Poisson clustering algorithms. I then applied these methods to questionnaire data and compared the resulting clusters to those derived under a multivariate normal assumption. Associations between clustering results and clinical outcomes were also investigated.
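The per-observation building block of such a model is the zero-inflated density itself. A minimal sketch of the zero-inflated Poisson log-density follows, under the usual convention that a zero can be structural (probability pi) or drawn from Poisson(lam); a clustering algorithm would evaluate this per cluster inside an EM loop. The parameter names pi and lam are illustrative.

```python
import numpy as np
from scipy.stats import poisson

def zip_logpmf(y, pi, lam):
    """Zero-inflated Poisson log-density: mixture of a point mass at zero
    (weight pi) and a Poisson(lam) component (weight 1 - pi)."""
    y = np.asarray(y)
    return np.where(
        y == 0,
        np.log(pi + (1.0 - pi) * np.exp(-lam)),     # zeros from either component
        np.log(1.0 - pi) + poisson.logpmf(y, lam),  # positive counts: Poisson only
    )
```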
|
923 |
Spectral methods for the detection and characterization of Topologically Associated Domains
Cresswell, Kellen Garrison, 01 January 2019
The three-dimensional (3D) structure of the genome plays a crucial role in gene expression regulation. Chromatin conformation capture technologies (Hi-C) have revealed that the genome is organized in a hierarchy of topologically associated domains (TADs), sub-TADs, and chromatin loops, which is relatively stable across cell lines and even across species. These TADs dynamically reorganize during the development of disease and exhibit cell- and condition-specific differences. Identifying such hierarchical structures and how they change between conditions is a critical step in understanding genome regulation and disease development. Despite their importance, there are relatively few tools for the identification of TADs and even fewer for the identification of hierarchies. Additionally, there are no publicly available tools for the comparison of TADs across datasets. Such tools are necessary for large-scale genome-wide analysis and comparison of 3D structure. To address the challenge of TAD identification, we developed a novel sliding-window-based spectral clustering framework that uses gaps between consecutive eigenvectors for TAD boundary identification. Our method, implemented in an R package, SpectralTAD, has automatic parameter selection; is robust to sequencing depth, resolution, and sparsity of Hi-C data; and detects hierarchical, biologically relevant TADs. SpectralTAD outperforms four state-of-the-art TAD callers in simulated and experimental settings. We demonstrate that TAD boundaries shared among multiple levels of the TAD hierarchy are more enriched in classical boundary marks and more conserved across cell lines and tissues. SpectralTAD is available at http://bioconductor.org/packages/SpectralTAD/.
To address the problem of TAD comparison, we developed TADCompare. TADCompare is based on a spectral-clustering-derived measure called the eigenvector gap, which enables a locus-by-locus comparison of TAD boundary differences between datasets. Using this measure, we introduce methods for identifying differential and consensus TAD boundaries and for tracking TAD boundary changes over time. We further propose a novel framework for the systematic classification of TAD boundary changes. Colocalization and gene enrichment analyses of different types of TAD boundary changes revealed distinct biological functionality associated with them. TADCompare is available at https://github.com/dozmorovlab/TADCompare.
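To give a sense of the eigenvector-gap idea (an illustrative toy, not the SpectralTAD or TADCompare implementation), one can form a normalized graph Laplacian from a Hi-C contact sub-matrix and look for large jumps between consecutive entries of a leading eigenvector: loci in the same TAD tend to receive similar spectral coordinates, so large gaps suggest boundaries.

```python
import numpy as np

def eigenvector_gaps(contact_window):
    """Return gaps between consecutive loci in the second eigenvector of the
    normalized Laplacian of a symmetric Hi-C contact sub-matrix."""
    A = np.asarray(contact_window, dtype=float)
    d = np.maximum(A.sum(axis=1), 1e-12)          # guard against empty rows
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    fiedler = vecs[:, 1]                          # Fiedler-type eigenvector
    return np.abs(np.diff(fiedler))               # large gaps ~ candidate boundaries
```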
|
924 |
Test Validity and Statistical Analysis
Sargsyan, Alex, 17 September 2018
No description available.
|
925 |
STATISTICAL MODELING OF SHIP AIRWAKES INCLUDING THE FEASIBILITY OF APPLYING MACHINE LEARNING
Unknown Date
Airwakes are shed behind a ship's superstructure and represent a highly turbulent, rapidly distorting flow field. This flow field severely affects pilot workload and, consequently, helicopter shipboard operations. A relatively complete description requires both the one-point statistics of the autospectrum and the two-point statistics of coherence (normalized cross-spectrum). Recent advances primarily involve generating databases of flow velocity points through experimental and computational fluid dynamics (CFD) investigations, numerically computing autospectra along with a few cases of cross-spectra and coherences, and developing a framework for extracting interpretive closed-form models of autospectra from a database, together with an application of this framework to study downwash effects. By comparison, relatively little is known about coherences. In fact, even the basic expressions of the cross-spectra and coherences for the three components of homogeneous isotropic turbulence (HIT) vary from one study to another, and the related literature is scattered and piecemeal. Accordingly, this dissertation begins with a unified account of all the cross-spectra and coherences of HIT from first principles. It then presents a framework, based on perturbation theory, for constructing interpretive coherence models of an airwake from a database. For each velocity component, the coherence is represented by a separate perturbation series whose basis function, the first term of the series, is the corresponding coherence for HIT. The perturbation series coefficients are evaluated by satisfying the theoretical constraints and fitting a curve, in a least-squares sense, to a set of numerically generated coherence points from a database. Although not tested against a specific database, the framework has a mathematical basis. Moreover, for assumed values of the perturbation series constants, coherence results are presented to demonstrate how the coherences of airwakes and similar flow fields compare to those of HIT.

Dissertation (Ph.D.), Florida Atlantic University, 2020. Includes bibliography. FAU Electronic Theses and Dissertations Collection.
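For reference, the coherence discussed above is the magnitude-squared normalized cross-spectrum, gamma^2(f) = |S_xy(f)|^2 / (S_xx(f) S_yy(f)). A quick empirical sketch of estimating it between two velocity records with SciPy's Welch-based routine; the synthetic signals and sampling rate are illustrative, not the dissertation's data.

```python
import numpy as np
from scipy.signal import coherence

fs = 100.0                                        # sampling rate, Hz (illustrative)
t = np.arange(0.0, 60.0, 1.0 / fs)
rng = np.random.default_rng(0)
shared = np.sin(2 * np.pi * 2.0 * t)              # common 2 Hz component
u1 = shared + 0.5 * rng.standard_normal(t.size)   # velocity record at point 1
u2 = shared + 0.5 * rng.standard_normal(t.size)   # velocity record at point 2

f, gamma2 = coherence(u1, u2, fs=fs, nperseg=1024)
print(f[np.argmax(gamma2)])                       # peak coherence near 2 Hz
```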
|
926 |
Discrete Optimization Problems in Popular Matchings and Scheduling
Powers, Vladlena, January 2020
This thesis focuses on two central classes of problems in discrete optimization: matching and scheduling. Matching problems lie at the intersection of different areas of mathematics, computer science, and economics. In two-sided markets, Gale and Shapley's model has been widely used and generalized to assign, e.g., students to schools and interns to hospitals. The goal is to find a matching that respects a certain concept of fairness called stability. This model has been generalized in many ways. Relaxing the stability condition to popularity makes it possible to overcome one of the main drawbacks of stable matchings: the fact that two individuals (a blocking pair) can prevent the matching from being much larger. The first part of this thesis is devoted to understanding the complexity of various problems around popular matchings. We first investigate maximum-weight popular matching problems. In particular, we show various NP-hardness results, while on the other hand proving that a popular matching of maximum weight (if any exists) can be found in polynomial time if the input graph has bounded treewidth. We also investigate algorithmic questions on the relationship between popular, stable, and Pareto optimal matchings. The last part of the thesis deals with a combinatorial scheduling problem arising in cyber-security. Moving-target defense strategies help mitigate cyber attacks. We analyze a strategic game, PLADD, which is an abstract model of these strategies.
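As background for the stability concept, here is a compact sketch of the classical Gale-Shapley deferred-acceptance algorithm (illustrative toy code, not the thesis's algorithms; the agent names are made up):

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Deferred acceptance: returns a stable matching {proposer: receiver}.
    Each prefs dict maps an agent to its full preference list, best first."""
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    free = list(proposer_prefs)              # proposers without a partner
    next_idx = {p: 0 for p in proposer_prefs}
    engaged = {}                             # receiver -> current proposer
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_idx[p]]   # p's best receiver not yet tried
        next_idx[p] += 1
        if r not in engaged:
            engaged[r] = p
        elif rank[r][p] < rank[r][engaged[r]]:   # r prefers the newcomer
            free.append(engaged[r])
            engaged[r] = p
        else:
            free.append(p)                   # rejected; p tries the next choice
    return {p: r for r, p in engaged.items()}

print(gale_shapley({'a': ['x', 'y'], 'b': ['x', 'y']},
                   {'x': ['b', 'a'], 'y': ['a', 'b']}))  # {'b': 'x', 'a': 'y'}
```

A popular matching relaxes stability: it only needs to beat every alternative matching in a head-to-head majority vote, which permits matchings larger than any stable one.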
|
927 |
Selecting the best model for predicting a term deposit product take-up in banking
Hlongwane, Rivalani Willie, 19 February 2019
In this study, we use data mining techniques to build predictive models on data collected by a Portuguese bank through a term savings product campaign conducted between May 2008 and November 2010. The data are imbalanced, with an observed take-up rate of 11.27%. Ling et al. (1998) indicated that predictive models built on imbalanced data tend to yield low sensitivity and high specificity, i.e., low true positive and high true negative rates; our study confirms this finding. We therefore use three sampling techniques, namely under-sampling, over-sampling, and the Synthetic Minority Over-sampling Technique (SMOTE), to balance the data, resulting in three additional datasets for modelling. On these datasets we build the following predictive models: random forest, multivariate adaptive regression splines, neural network, and support vector machine, and we compare the models against each other for their ability to identify customers who are likely to take up a term savings product. As part of the model-building process, we investigate parameter permutations related to each modelling technique to tune the models, and we find that this assists in building robust models. We assess our models for predictive performance using the receiver operating characteristic curve, confusion matrix, Gini coefficient, kappa, sensitivity, specificity, and lift and gains charts. A multivariate adaptive regression splines model built on over-sampled data is found to be the best model for predicting term savings product take-up.
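SMOTE balances classes by interpolating new minority-class points between nearest neighbours rather than duplicating rows. A minimal sketch with the imbalanced-learn package on synthetic data shaped like the campaign's roughly 11% take-up rate; the dataset and seeds are illustrative, not the bank's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in: roughly 11% positives, like the observed take-up rate
X, y = make_classification(n_samples=5000, weights=[0.89], random_state=1)
print(np.bincount(y))                    # imbalanced class counts

X_bal, y_bal = SMOTE(random_state=1).fit_resample(X, y)
print(np.bincount(y_bal))                # balanced after over-sampling
```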
|
928 |
Evolutionary Dynamics of Large Systems
Nikhil Nayanar (10702254), 06 May 2021
Several socially and economically important real-world systems comprise large numbers of interacting constituent entities. Examples include the World Wide Web and Online Social Networks (OSNs). Developing the capability to forecast the macroscopic behavior of such systems based on the microscopic interactions of the constituent parts is of considerable economic importance.

Previous researchers have investigated phenomenological forecasting models in such contexts as the spread of diseases in the real world and the diffusion of innovations in OSNs. These forecasting models work well in predicting future states of a system that are at or near equilibrium. However, forecasting non-equilibrium states – such as the transient emergence of hotspots in web traffic – remains a challenging problem. In this thesis we investigate a hypothesis, rooted in Ludwig Boltzmann's celebrated H-theorem, that the evolutionary dynamics of a large system – such as the World Wide Web – is driven by the system's innate tendency to evolve towards a state of maximum entropy.

Whereas closed systems may be expected to evolve towards a state of maximum entropy, most real-world systems are not closed. However, the stipulation that if a system is closed then it should asymptotically approach a state of maximum entropy provides a strong constraint on the inverse problem of formulating the microscopic interaction rules that give rise to the observed macroscopic behavior. We make the constraint stronger by insisting that, if closed, a system should evolve monotonically towards a state of maximum entropy, and we formulate microscopic interaction rules consistent with this stronger constraint.

We test the microscopic interaction rules that we formulate by applying them to two real-world phenomena: the flow of web traffic in the gaming forums on Reddit and the spread of the Covid-19 virus. We show that our hypothesis leads to a statistically significant improvement over existing models in predicting traffic flow in gaming forums on Reddit. Our interaction rules are also able to qualitatively reproduce the heterogeneity in the number of COVID-19 cases across cities around the globe. These experiments provide supporting evidence for our hypothesis, suggesting that our approach is worthy of further investigation.

In addition to the above stochastic model, we also study a deterministic model of attention flow over a network and establish sufficient conditions that, when met, signal imminent parabolic accretion of attention at a node.
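The H-theorem intuition invoked above can be illustrated in a few lines: when a closed system's state distribution is updated by a doubly stochastic (probability-preserving, uniform-fixing) transition matrix, Shannon entropy is non-decreasing and the distribution drifts towards uniform. The matrix and starting distribution below are illustrative only, not the thesis's interaction rules.

```python
import numpy as np

def shannon_entropy(p):
    p = p[p > 0]                          # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p))

# Doubly stochastic mixing: rows and columns each sum to 1
M = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
p = np.array([0.9, 0.05, 0.05])           # low-entropy initial state
for step in range(5):
    print(step, round(shannon_entropy(p), 4))   # monotonically increasing
    p = M @ p                             # closed-system update
```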
|
929 |
Advances in Machine Learning for Compositional Data
Gordon Rodriguez, Elliott, January 2022
Compositional data refers to simplex-valued data, or equivalently, nonnegative vectors whose totals are uninformative. This data modality is of relevance across several scientific domains. A classical example of compositional data is the chemical composition of geological samples, e.g., major-oxide concentrations. A more modern example arises from the microbial populations recorded using high-throughput genetic sequencing technologies, e.g., the gut microbiome. This dissertation presents a set of methodological and theoretical contributions that advance the state of the art in the analysis of compositional data.
Our work can be divided into two categories: problems in which compositional data represents the input to a predictive model, and problems in which it represents the output of the model. For the first class of problems, we build on the popular log-ratio framework to develop an efficient learning algorithm for high-dimensional compositional data. Our algorithm runs orders of magnitude faster than competing alternatives, without sacrificing model quality. For the second class of problems, we define a novel exponential family of probability distributions supported on the simplex. This distribution enjoys attractive mathematical properties and provides a performant probability model for simplex-valued outcomes. Taken together, our results constitute a broad contribution to the toolkit of researchers and practitioners studying compositional data.
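The log-ratio framework referenced above maps simplex-valued data to unconstrained Euclidean space before modeling; the centered log-ratio (CLR) transform is the canonical example. A minimal sketch follows; the eps pseudocount for zeros is a common practical workaround, not necessarily the thesis's choice.

```python
import numpy as np

def clr(x, eps=1e-9):
    """Centered log-ratio: log of each part divided by the geometric mean.
    Maps a composition to a zero-sum vector in Euclidean space."""
    x = np.asarray(x, dtype=float) + eps    # guard against exact zeros
    return np.log(x) - np.mean(np.log(x))   # equals log(x / geometric_mean(x))

composition = np.array([0.7, 0.2, 0.1])     # e.g., relative abundances
print(clr(composition), clr(composition).sum())  # components sum to ~0
```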
|
930 |
A decision support system for sugarcane irrigation supply and demand management
Patel, Zubair, January 2017
Commercial sugarcane farming requires large quantities of water to be delivered to the fields. Ideal irrigation schedules are produced indicating how much water should be supplied to each field, considering multiple objectives in the farming process. Existing software packages do not fully account for the fact that the ideal irrigation schedule may not be met due to limitations in the water distribution network. This dissertation proposes the use of mathematical modelling to better understand water supply and demand management on a commercial sugarcane farm. Due to the complex nature of water stress on sugarcane, non-linearities occur in the model. A piecewise linear approximation is used to handle the non-linearity in the water allocation model, which is then solved in a commercial optimisation software package. A test data set is first used to exercise and evaluate model performance; then, to illustrate the practical applicability of the model, a commercial-sized data set is used and analysed.
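For intuition, a concave response (such as diminishing crop response to additional water) can be handled inside a linear program by splitting the decision into bounded segment variables with decreasing slopes; the LP then fills the high-value segment first. A toy sketch with SciPy follows; the two-segment yield curve and numbers are illustrative, not the dissertation's model.

```python
from scipy.optimize import linprog

# Maximize piecewise linear yield: the first 10 units of water earn 3/unit,
# the next 10 earn 1/unit. Total available water is 15 units.
c = [-3.0, -1.0]                 # linprog minimizes, so negate the slopes
A_ub = [[1.0, 1.0]]              # w1 + w2 <= 15 (water supply limit)
b_ub = [15.0]
bounds = [(0, 10), (0, 10)]      # per-segment capacities
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x, -res.fun)           # [10. 5.] -> total yield 35.0
```

Because the slopes decrease, no integer variables are needed: the optimizer never uses the low-value segment before exhausting the valuable one.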
|