At the onset of the "Big Data" age, we are faced with ubiquitous data in various forms and with various characteristics, such as noise, high dimensionality, and autocorrelation. The question of how to obtain accurate and computationally efficient estimates from such data has stoked the interest of many researchers. This dissertation concentrates on two general problem areas: inference for high-dimensional and noisy data, and estimation of the steady-state mean of univariate data generated by computer simulation experiments. We develop and evaluate three sequential algorithms that address these two topics. One major
advantage of sequential algorithms is that they allow for careful experimental adjustments as sampling proceeds. Unlike one-step sampling plans, sequential algorithms adapt to the different situations that arise as sampling progresses; this makes these procedures especially effective as problems become more complicated and more delicate requirements must be satisfied. We elaborate on each research topic below.

Concerning the first topic, our goal is to develop a robust graphical model for noisy data in a high-dimensional setting. Under a Gaussian distributional assumption, the estimation of undirected Gaussian graphs is equivalent to the estimation of inverse covariance matrices. Particular interest has focused on estimating a sparse inverse covariance matrix to reveal insight into the data, as suggested by the principle of parsimony. For
estimation with high-dimensional data, the influence of anomalous observations becomes severe as the dimensionality increases. To address this problem, we propose a robust estimation procedure for the Gaussian graphical model based on the Integrated Squared Error (ISE) criterion. The
robustness result is obtained by using ISE as a nonparametric criterion for seeking the largest portion of the data that "matches" the model. Moreover, an l₁-type regularization is applied to encourage sparse
estimation. To address the non-convexity of the objective function, we develop a sequential algorithm in the spirit of a
majorization-minimization scheme. We summarize the results of Monte Carlo
experiments supporting the conclusion that our estimator of the inverse covariance matrix converges weakly (i.e., in probability) to the true inverse covariance matrix as the sample size grows large. The performance of the proposed
method is compared with that of several existing approaches through numerical simulations. We further demonstrate the strength of our method with applications in genetic network inference and financial portfolio optimization.
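To make the criterion concrete, one plausible form of the l₁-penalized ISE (also known as L2E) objective for a p-variate Gaussian with mean vector µ and precision matrix Θ is sketched below, using the standard closed form of the integrated squared Gaussian density; the exact formulation in the dissertation (e.g., the penalty weighting and the surrogates used in the majorization-minimization steps) may differ.

```latex
% A plausible l1-penalized ISE (L2E) objective for observations x_1, ..., x_n in R^p;
% lambda >= 0 is a tuning parameter and only off-diagonal entries of Theta are penalized.
\widehat{L}(\mu,\Theta)
  \;=\; \underbrace{(4\pi)^{-p/2}\,|\Theta|^{1/2}}_{\int f_{\mu,\Theta}(x)^2\,dx}
  \;-\; \frac{2}{n}\sum_{i=1}^{n} (2\pi)^{-p/2}\,|\Theta|^{1/2}
        \exp\!\Big\{-\tfrac12\,(x_i-\mu)^{\top}\Theta\,(x_i-\mu)\Big\}
  \;+\; \lambda\sum_{j\neq k}\lvert\theta_{jk}\rvert .
```

Because each data-dependent summand is bounded, any single anomalous observation has only a limited effect on the objective, which is the source of the robustness; a majorization-minimization scheme handles the non-convexity by repeatedly minimizing a convex surrogate of the objective constructed at the current iterate.
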
The second topic consists of two parts, and both concern the computation of point and confidence interval (CI) estimators for the mean µ of a stationary discrete-time univariate stochastic process $X \equiv \{X_i : i = 1, 2, \ldots\}$ generated by a simulation experiment. The point estimation
is relatively easy when the underlying system starts in steady state; however, traditional methods for constructing CIs usually fail because simulation output data are typically serially correlated. We
propose two distinct sequential procedures that each yield a CI for µ with user-specified reliability and absolute or relative precision. The first sequential procedure is based on variance estimators computed from standardized time series applied to nonoverlapping batches of
observations, and it is characterized by its simplicity relative to methods based on batch means and its ability to deliver CIs for the
variance parameter of the output process (i.e., the sum of covariances of the process at all lags). The second procedure is the first sequential algorithm that uses overlapping standardized-time-series variance estimators to construct asymptotically valid CI estimators for the steady-state mean. Compared with other popular procedures for steady-state simulation analysis, the second procedure yields significant reductions both in the variability of its CI estimator and in the sample size needed to satisfy the precision requirement.
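To illustrate the main ingredients of such procedures, the following is a minimal Python sketch, not the dissertation's actual algorithms: a constant-weight standardized-time-series (STS) area estimator of the variance parameter computed from nonoverlapping batches, the resulting CI for the mean, an overlapping-batch counterpart, and a naive relative-precision stopping rule. The batch count `b`, initial batch size `m0`, the sample-size doubling rule, and the stateful `sampler(k)` (assumed to return the next k observations of a single sample path) are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def sts_area_estimator(batch):
    """Constant-weight (sqrt(12)) STS area estimator of the variance parameter
    sigma^2 (the sum of covariances of the process at all lags), computed from
    one batch of consecutive observations."""
    m = len(batch)
    k = np.arange(1, m + 1)
    prefix_means = np.cumsum(batch) / k            # running means Xbar_1, ..., Xbar_m
    xbar_m = prefix_means[-1]
    # sigma * T(k/m) = k * (Xbar_m - Xbar_k) / sqrt(m); sigma cancels after squaring
    weighted_area = np.sqrt(12.0) * np.sum(k * (xbar_m - prefix_means)) / (m * np.sqrt(m))
    return weighted_area ** 2                      # approx. sigma^2 * chi^2_1 for large m

def nonoverlapping_area_ci(x, b, alpha=0.05):
    """Point estimate and CI half-length for the mean, using b nonoverlapping batches."""
    n = len(x) - len(x) % b                        # truncate so the batches fit exactly
    m = n // b
    batches = x[:n].reshape(b, m)
    v_hat = np.mean([sts_area_estimator(batch) for batch in batches])
    xbar = x[:n].mean()
    half = stats.t.ppf(1 - alpha / 2, df=b) * np.sqrt(v_hat / n)
    return xbar, half

def overlapping_area_estimator(x, m):
    """Average of area estimators over all n - m + 1 overlapping batches of size m.
    (A direct O(n*m) illustration; practical procedures use faster recursions.)"""
    return np.mean([sts_area_estimator(x[i:i + m]) for i in range(len(x) - m + 1)])

def sequential_relative_precision(sampler, b=32, m0=256, r=0.05, alpha=0.05):
    """Grow the sample until the CI half-length is at most 100*r percent of the
    point estimate; sampler(k) must return the next k observations of one path."""
    x = sampler(b * m0)
    while True:
        xbar, half = nonoverlapping_area_ci(x, b, alpha)
        if half <= r * abs(xbar):
            return xbar, half
        x = np.concatenate([x, sampler(len(x))])   # naive rule: double the sample size
```

In practice the procedures adjust batch sizes and counts adaptively and reuse partial computations as the sample grows; the doubling rule above is only the simplest possible growth schedule.
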
The effectiveness of both procedures is evaluated via comparisons with state-of-the-art methods based on batch means under a series of experimental settings: the M/M/1 waiting-time process with 90% traffic intensity; the M/H_2/1 waiting-time process with 80% traffic intensity; the M/M/1/LIFO waiting-time process with 80% traffic intensity; and an AR(1)-to-Pareto (ARTOP) process. We find that the new procedures perform comparatively well in terms of their average
required sample sizes as well as the coverage and average half-length of
their delivered CIs.
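As a concrete example of these test processes, the M/M/1 waiting-time sequence can be generated with Lindley's recursion and started in steady state; with service rate 1 and traffic intensity ρ = 0.9, the steady-state mean waiting time in queue is ρ/(1 − ρ) = 9, which makes empirical checks of CI coverage straightforward. The sketch below is illustrative only.

```python
import numpy as np

def mm1_waiting_times(n, rho=0.9, mu=1.0, rng=None):
    """n successive M/M/1 waiting times from Lindley's recursion
         W_{i+1} = max(0, W_i + S_i - A_{i+1}),
    with service rate mu, arrival rate lam = rho * mu, and W_1 drawn from the
    stationary distribution so that the process starts in steady state."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rho * mu
    w = np.empty(n)
    # Stationary waiting time: 0 with probability 1 - rho, else Exponential(mu - lam).
    w[0] = 0.0 if rng.random() < 1.0 - rho else rng.exponential(1.0 / (mu - lam))
    service = rng.exponential(1.0 / mu, size=n - 1)
    interarrival = rng.exponential(1.0 / lam, size=n - 1)
    for i in range(n - 1):
        w[i + 1] = max(0.0, w[i] + service[i] - interarrival[i])
    return w
```

For instance, feeding such a sequence to a CI routine like the nonoverlapping-batch sketch above and comparing the resulting interval against the known mean of 9 gives a quick coverage check.
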
Identifier | oai:union.ndltd.org:GATECH/oai:smartech.gatech.edu:1853/51857
Date | 22 May 2014 |
Creators | Tang, Peng |
Contributors | Alexopoulos, Christos, Goldsman, David M. |
Publisher | Georgia Institute of Technology |
Source Sets | Georgia Tech Electronic Thesis and Dissertation Archive |
Language | en_US |
Type | Dissertation |
Format | application/pdf |