Big data refers to large-volume, complex, and growing data sets, and it presents both opportunities and challenges. This thesis focuses on statistical methods for several specific large, complex data problems, each involving the representation of data in a complex format, the use of complicated information, and/or intensive computational cost.
The first problem we address is hypothesis testing for multilayer network data, motivated by an example in computational biology. We show how to represent the complex structure of a multilayer network as a single data point in the space of supra-Laplacians, and we then develop a central limit theorem and hypothesis testing theory for multilayer networks in that space. We develop both global and local testing strategies for mean comparison and investigate sample size requirements. We apply the methods to the motivating computational biology example and compare them with classic Gene Set Enrichment Analysis (GSEA); the comparison yields additional biological insight.
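To make the representation concrete, the following minimal sketch builds the supra-Laplacian of a node-aligned multiplex network from its per-layer adjacency matrices, with a uniform interlayer coupling strength omega. The function name, the coupling scheme (each node coupled to all of its copies in other layers), and the toy layers are illustrative assumptions, not code or definitions taken from the thesis.

```python
import numpy as np

def supra_laplacian(layers, omega=1.0):
    """Supra-Laplacian of a node-aligned multiplex network (illustrative sketch).

    layers : list of (n, n) adjacency matrices, one per layer
    omega  : interlayer coupling strength (hypothetical choice)
    """
    n = layers[0].shape[0]
    m = len(layers)
    # Intralayer part: block-diagonal matrix of the single-layer Laplacians.
    L_intra = np.zeros((n * m, n * m))
    for k, A in enumerate(layers):
        D = np.diag(A.sum(axis=1))
        L_intra[k * n:(k + 1) * n, k * n:(k + 1) * n] = D - A
    # Interlayer part: couple each node to its copies in all other layers.
    C = np.full((m, m), -1.0)
    np.fill_diagonal(C, m - 1)          # Laplacian of the complete layer graph
    L_inter = omega * np.kron(C, np.eye(n))
    return L_intra + L_inter

# Example: two 3-node layers coupled with strength 0.5.
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A2 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
L = supra_laplacian([A1, A2], omega=0.5)
print(L.shape)  # (6, 6): one data point in the space of 6 x 6 supra-Laplacians
```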
The second problem is source detection in epidemiology, one of the most important issues in the control of epidemics. Ideally, we would locate the sources using the full outbreak history, but this is often infeasible because the history is complex, high-dimensional, and only partially observed. Epidemiologists have recognized human mobility as a crucial proxy for the complete history, yet little work in the literature to date uses this information for source detection. We recast the source detection problem as identifying a relevant mixture component in a multivariate Gaussian mixture model, and we use human mobility, incorporated through a stochastic PDE model, to calibrate the parameters. We demonstrate the capability of our method in the context of the 2000-2002 cholera outbreak in the KwaZulu-Natal province of South Africa.
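As a rough illustration of recasting source detection as component identification in a Gaussian mixture, the sketch below scores each candidate source by the posterior probability of its mixture component given an observed summary vector. The feature vector, the calibrated means and covariances, and the uniform prior are hypothetical placeholders and do not reproduce the stochastic-PDE calibration described above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def source_posterior(y, means, covs, priors=None):
    """Posterior probability that each candidate source generated observation y.

    y      : (d,) observed epidemic summary (hypothetical feature vector)
    means  : list of (d,) component means, one per candidate source
    covs   : list of (d, d) component covariances, one per candidate source
    priors : optional prior weights over sources (uniform if None)
    """
    k = len(means)
    priors = np.full(k, 1.0 / k) if priors is None else np.asarray(priors)
    # Log-likelihood of y under each mixture component.
    loglik = np.array([multivariate_normal.logpdf(y, mean=m, cov=c)
                       for m, c in zip(means, covs)])
    logpost = np.log(priors) + loglik
    logpost -= logpost.max()            # stabilize before exponentiating
    post = np.exp(logpost)
    return post / post.sum()

# Toy example with three candidate sources in two dimensions.
rng = np.random.default_rng(0)
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 3.0])]
covs = [np.eye(2) * s for s in (1.0, 0.5, 2.0)]
y = rng.multivariate_normal(means[1], covs[1])
print(source_posterior(y, means, covs))  # weight typically concentrates on source 2
```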
The third problem is multivariate time series imputation, a classic problem in statistics. To address the low signal-to-noise ratio common in high-dimensional multivariate time series, we propose state-space models that provide more precise inference of missing values by clustering the components of the multivariate time series in a nonparametric way. The models scale to large time series because their parameters can be estimated efficiently.
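For intuition on state-space imputation, the sketch below fills missing values in a single series with a local-level model, using the Kalman filter and Rauch-Tung-Striebel smoother. The fixed noise variances and the univariate, non-clustered setting are simplifying assumptions; the sketch does not reproduce the nonparametric clustering of components proposed in the thesis.

```python
import numpy as np

def impute_local_level(y, sigma_obs=1.0, sigma_state=0.1):
    """Impute NaN entries of a univariate series with a local-level state-space
    model via the Kalman filter and RTS smoother (minimal sketch with
    hypothetical, fixed noise variances)."""
    n = len(y)
    a = np.zeros(n)       # filtered state means
    P = np.zeros(n)       # filtered state variances
    a_pred = np.zeros(n)  # predicted state means
    P_pred = np.zeros(n)  # predicted state variances

    a_prev, P_prev = 0.0, 1e6   # near-diffuse initialization
    for t in range(n):
        # Prediction step for the random-walk state.
        a_pred[t] = a_prev
        P_pred[t] = P_prev + sigma_state ** 2
        if np.isnan(y[t]):
            # Missing observation: skip the measurement update.
            a[t], P[t] = a_pred[t], P_pred[t]
        else:
            K = P_pred[t] / (P_pred[t] + sigma_obs ** 2)   # Kalman gain
            a[t] = a_pred[t] + K * (y[t] - a_pred[t])
            P[t] = (1.0 - K) * P_pred[t]
        a_prev, P_prev = a[t], P[t]

    # Rauch-Tung-Striebel backward smoothing pass.
    a_smooth = a.copy()
    for t in range(n - 2, -1, -1):
        J = P[t] / P_pred[t + 1]
        a_smooth[t] = a[t] + J * (a_smooth[t + 1] - a_pred[t + 1])

    out = y.copy()
    out[np.isnan(y)] = a_smooth[np.isnan(y)]
    return out

# Example: a slowly varying trend observed with noise and 20% missing entries.
rng = np.random.default_rng(1)
truth = np.cumsum(rng.normal(scale=0.1, size=200))
y = truth + rng.normal(scale=1.0, size=200)
y[rng.random(200) < 0.2] = np.nan
miss = np.isnan(y)
print(np.abs(impute_local_level(y)[miss] - truth[miss]).mean())  # mean imputation error
```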
Identifier | oai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/33134
Date | 15 November 2018 |
Creators | Li, Jun |
Contributors | Kolaczyk, Eric D. |
Source Sets | Boston University |
Language | en_US |
Detected Language | English |
Type | Thesis/Dissertation |