Global ETD Search

1	Embarrassingly Parallel Statistics and its Applications: Divide & Recombine Methods for Parallel Computation of Quantiles and Construction of K-D Trees for Big-Data
Aritra Chakravorty (5929565), 16 January 2019

In Divide & Recombine (D&R), data are divided into subsets, analytic methods are applied to each subset independently, with no communication between processes; then the subset outputs for each method are recombined. For big data, this provides almost all of the analytic tasking needed when data are analyzed. It also provides high computational performance because typically most of the computation is embarrassingly parallel, the simplest parallel computation.

Another kind of tasking must address computational performance and numeric accuracy: the computing of functions of all of the data, or "statistics". For data big and small, it is often important to compute such statistics for all of the data, which can be summaries of the data, such as sample quantiles of continuous variables, or can process the data into a form that helps analysis, such as dividing the data into representative subsets. Development of computational methods to compute these statistics can be challenging.

D&R can be a very effective framework for computing statistics. To support this, we introduce the concept of embarrassingly parallel (EP) statistics, both weak and strong. The concept of EP statistics is not entirely new, but has had little development. The existing methodology is mainly sums of sums. For example, this is done when computing the necessary statistics for least squares, where sums of products and cross products are carried out on subsets and then summed across subsets. Our treatment of EP statistics has taken the concept much further. The outcome is the ability to use EP statistics in conjunction with a Fourier series to approximate an optimization criterion. The series terms, which are strongly EP statistics, are summed across subsets, and the result is optimized. These are EP-F computational methods.

We have so far developed two EP-F computational methods for two widely used statistic computations. EP-F-Quantile is for quantiles of big data, and EP-F-KDtree is for KD-trees. Speed and accuracy of EP-F-Quantile are compared with that of the well-known binning method, which also can be formulated in terms of EP statistics. EP-F-KDtree is the first parallel KD-tree computational method of which we are aware. EP and EP-F computational methods have potentially many other applications to computing statistics.

Statistics; Divide and Recombine; Map-Reduce; Parallel algorithms; KD-tree; quantiles
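
The "sums of sums" approach that the first abstract cites for least squares can be illustrated with a minimal sketch. This is not the dissertation's code: the names `Sums`, `subset_sums`, and `fit_line` are hypothetical, and the example is restricted to a single predictor so that it needs only the Python standard library. Each subset contributes five sums, the sums are added across subsets, and the fit is computed from the totals alone.

```python
from dataclasses import dataclass


@dataclass
class Sums:
    """Per-subset sums needed for simple least squares (hypothetical helper)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def __add__(self, other):
        # Recombine step: the statistics of the union are the sums of the sums.
        return Sums(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                    self.sxx + other.sxx, self.sxy + other.sxy)


def subset_sums(xs, ys):
    """Compute the sums on one subset, with no reference to any other subset."""
    return Sums(len(xs), sum(xs), sum(ys),
                sum(x * x for x in xs),
                sum(x * y for x, y in zip(xs, ys)))


def fit_line(s):
    """Solve the normal equations for intercept and slope from the totals."""
    slope = (s.n * s.sxy - s.sx * s.sy) / (s.n * s.sxx - s.sx ** 2)
    intercept = (s.sy - slope * s.sx) / s.n
    return intercept, slope


# Toy data lying exactly on y = 2 + 3x, divided into three subsets of four rows.
xs = [float(i) for i in range(12)]
ys = [2.0 + 3.0 * x for x in xs]
subsets = [(xs[i:i + 4], ys[i:i + 4]) for i in range(0, 12, 4)]

total = Sums()
for sub_x, sub_y in subsets:
    total = total + subset_sums(sub_x, sub_y)

intercept, slope = fit_line(total)
full_intercept, full_slope = fit_line(subset_sums(xs, ys))
print(intercept, slope)
```

Because the per-subset sums never depend on other subsets, the loop over `subsets` could run on separate processes; only the five numbers per subset need to be communicated for the recombination.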
2	Statistical Predictions Based on Accelerated Degradation Data and Spatial Count Data
Duan, Yuanyuan, 04 March 2014

This dissertation aims to develop methods for statistical predictions based on various types of data from different areas. We focus on applications from reliability and spatial epidemiology. Chapter 1 gives a general introduction to statistical predictions. Chapters 2 and 3 investigate the photodegradation of an organic coating, which is mainly caused by ultraviolet (UV) radiation but is also affected by environmental factors, including temperature and humidity. In Chapter 2, we identify a physically motivated nonlinear mixed-effects model, including the effects of environmental variables, to describe the degradation path. Unit-to-unit variabilities are modeled as random effects. The maximum likelihood approach is used to estimate parameters based on accelerated test data from the laboratory. The developed model is then extended to allow for time-varying covariates and is used to predict outdoor degradation, where the explanatory variables are time-varying. Chapter 3 introduces a class of models for analyzing degradation data with dynamic covariate information. We use a general path model with random effects to describe the degradation paths and a vector time series model to describe the covariate process. Shape-restricted splines are used to estimate the effects of dynamic covariates on the degradation process. The unknown parameters of these models are estimated using the maximum likelihood method. Algorithms for computing the estimated lifetime distribution are also described. The proposed methods are applied to predict the photodegradation path of an organic coating in a complicated dynamic environment. Chapter 4 investigates the Lyme disease emergency in Virginia at the census tract level. Based on areal (census tract level) count data of Lyme disease cases in Virginia from 1998 to 2011, we analyze the spatial patterns of the disease using statistical smoothing techniques. We also use the space and space-time scan statistics to reveal the presence of clusters in the spatial and spatial/temporal distribution of Lyme disease. Chapter 5 builds a predictive model for Lyme disease based on historical data and environmental/demographical information for each census tract. We propose a Divide-Recombine method to take advantage of parallel computing. We compare prediction results through simulation studies, which show that our method provides comparable fitting and predictive accuracy while achieving much greater computational efficiency. We also apply the proposed method to analyze Virginia Lyme disease spatio-temporal data. Our method makes large-scale spatio-temporal predictions possible. Chapter 6 gives a general review of the contributions of this dissertation and discusses directions for future research. / Ph. D.

Coatings; Covariate process; Clusters; Divide-Recombine; Environmental conditions; Lifetime prediction; Lyme disease; Kernel smoothing; Photodegradation; Usage history; UV exposure; Random effects; Reliability; Spatio-temporal
