• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • No language data
  • Tagged with
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Embarrassingly Parallel Statistics and its Applications: Divide & Recombine Methods for Parallel Computation of Quantiles and Construction of K-D Trees for Big-Data

Aritra Chakravorty (5929565) 16 January 2019 (has links)
<div>In Divide & Recombine (D&R), data are divided into subsets, analytic methodsare applied to each subset independently, with no communication between processes;then the subset outputs for each method are recombined. For big data, this providesalmost all of the analytic tasking needed when data are analyzed. It also provideshigh computational performance because typically most of the computation is em-barrassingly parallel, the simplest parallel computation.</div><div><br></div><div>Another kind of tasking must address computational performance and numericaccuracy: the computing of functions of all of the data, or “statistics”. For data bigand small, it is often important to compute such statistics for all of the data, whichcan be summaries of the data, such as sample quantiles of continuous variables, orcan process the data into a form that helps analysis, such as dividing the data intorepresentative subsets. Development of computational methods to compute thesestatistics can be challenging.</div><div><br></div><div>D&R can be a very effective framework for computing statistics. To supportthis, we introduce the concept of embarrassingly parallel (EP) statistics, both weakand strong. The concept of EP statistics is not entirely new, but has had littledevelopment. The existing methodology is mainly sums of sums. For example, this isdone when computing the necessary statistics for least squares where sums of productsand cross productions are carried out on subsets then summed across subsets. Ourtreatment of EP statistics has taken the concept much further. The outcome is abilityto use EP statistics in conjunction with the use a Fourier series to approximate an optimization criteria. The series terms, which are strongly EP statistics, are summedacross subsets, and the result is optimized. These are EP-F computational methods.</div><div><br></div><div>We have so far developed two EP-F computational methods for two widely usedstatistic computations. EP-F-Quantile is for quantiles of big data, and EP-F-KDtreeis for KD-trees. Speed and accuracy of EPF-Quantile are compared with that of thewell-known binning method, which also can be formulated in terms of EP statistics. EPF-KDtree is the first parallel KD-tree computational method of which we areaware. EP and EPF computational methods have potentially many other applicationsto computing statistics.</div>

Page generated in 0.0861 seconds