1

Sample Size Determination in Multivariate Parameters With Applications to Nonuniform Subsampling in Big Data High Dimensional Linear Regression

Yu Wang (11821553) 20 December 2021 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Subsampling is an important method in the analysis of Big Data. Subsample size determination (SSSD) plays a crucial part in extracting information from data and in overcoming the challenges that result from huge data sizes. In this thesis, (1) sample size determination (SSD) is investigated for multivariate parameters, and sample size formulas are obtained for the multivariate normal distribution; (2) sample size formulas are obtained based on concentration inequalities; (3) improved bounds for McDiarmid’s inequalities are obtained; (4) the obtained results are applied to nonuniform subsampling in Big Data high dimensional linear regression; and (5) numerical studies are conducted. The sample size formula for the univariate normal distribution is a staple of elementary statistics. To the best of our knowledge, its generalization to the multivariate normal (or, more generally, to multivariate parameters) has not received much attention. In this thesis, we introduce a definition of SSD and obtain explicit formulas for the multivariate normal distribution, in gratifying analogy to the univariate normal sample size formula. Commonly used concentration inequalities provide exponential rates, and sample sizes based on these inequalities are often loose. Talagrand (1995) provided the missing factor to sharpen these inequalities. We obtained numeric values for the constants in the missing factor and slightly improved his results. Furthermore, we provided the missing factor in McDiarmid’s inequality. These improved bounds are used to give shrunken sample sizes.
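For reference, the univariate formula that the abstract calls a staple of elementary statistics, and the standard form of McDiarmid’s inequality being sharpened, are recalled below; these are textbook statements added for orientation, not the thesis’s own multivariate formulas or improved constants. To estimate the mean of a N(μ, σ²) population with known σ to within a margin d at confidence level 1 − α, the classical requirement is

\[
n \;\ge\; \left(\frac{z_{1-\alpha/2}\,\sigma}{d}\right)^{2},
\]

where z_{1−α/2} is the standard normal quantile. McDiarmid’s inequality, in its usual form, says that for independent X_1, …, X_m and a function f satisfying the bounded-differences condition with constants c_1, …, c_m,

\[
P\bigl(f(X_1,\dots,X_m) - \mathbb{E}\,f \ge t\bigr) \;\le\; \exp\!\left(-\frac{2t^{2}}{\sum_{i=1}^{m} c_i^{2}}\right),
\]

which is the kind of exponential-rate bound whose "missing factor", a multiplicative correction, the thesis quantifies and improves.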
2

Estimates of Statistical Power and Accuracy for Latent Trajectory Class Enumeration in the Growth Mixture Model

Brown, Eric C 09 June 2003 (has links)
This study employed Monte Carlo simulation to investigate the ability of the growth mixture model (GMM) to correctly identify models based on a "true" two-class pseudo-population from alternative models consisting of "false" one- and three-latent trajectory classes. This ability was assessed in terms of statistical power, defined as the proportion of replications that correctly identified the two-class model as having optimal fit to the data compared to the one-class model, and accuracy, which was defined as the proportion of replications that correctly identified the two-class model over both one- and three-class models. Estimates of power and accuracy were adjusted by empirically derived critical values to reflect nominal Type I error rates of α = .05. Six experimental conditions were examined: (a) standardized between-class differences in growth parameters, (b) percentage of total variance explained by growth parameters, (c) correlation between intercepts and slopes, (d) sample size, (e) number of repeated measures, and (f) planned missingness. Estimates of statistical power and accuracy were related to a measure of the degree of separation and distinction between latent trajectory classes (λ2), which approximated a chi-square based noncentrality parameter. Model selection relied on four criteria: (a) the Bayesian information criterion (BIC), (b) the sample-size adjusted BIC (ABIC), (c) the Akaike information criterion (AIC), and (d) the likelihood ratio test (LRT). Results showed that power and accuracy of the GMM to correctly enumerate latent trajectory classes were positively related to greater between-class separation, greater proportion of total variance explained by growth parameters, larger sample sizes, greater numbers of repeated measures, and larger negative correlations between intercepts and slopes; and inversely related to greater proportions of missing data. Results of the Monte Carlo simulations were field-tested using specific design and population characteristics from an evaluation of a longitudinal demonstration project. This test compared estimates of power and accuracy generated via Monte Carlo simulation to estimates predicted from a regression on derived λ2 values. Results of this motivating example indicated that knowledge of λ2 can be useful in the two-class case for predicting power and accuracy without extensive Monte Carlo simulations.
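To make the simulation logic concrete, the sketch below illustrates the power/accuracy calculation in Python, using a cross-sectional two-component Gaussian mixture as a simplified stand-in for the growth mixture model; the class separation, sample size, replication count, reliance on BIC alone, and use of scikit-learn are assumptions made for the example, not the study's actual design.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_reps, n_obs, sep = 200, 500, 1.5   # illustrative values, not the study's conditions

def simulate_two_class(n, sep, rng):
    """Draw a 50/50 mixture of two unit-variance normal classes whose means differ by `sep` SDs."""
    z = rng.integers(0, 2, size=n)                        # latent class labels
    return rng.normal(loc=z * sep, scale=1.0, size=n).reshape(-1, 1)

power_hits = accuracy_hits = 0
for _ in range(n_reps):
    X = simulate_two_class(n_obs, sep, rng)
    # Fit 1-, 2- and 3-component mixtures; lower BIC indicates a better penalised fit.
    bic = {k: GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X).bic(X)
           for k in (1, 2, 3)}
    power_hits += bic[2] < bic[1]                  # two-class model preferred over one-class
    accuracy_hits += min(bic, key=bic.get) == 2    # two-class preferred over both alternatives

print(f"estimated power ~ {power_hits / n_reps:.2f}, accuracy ~ {accuracy_hits / n_reps:.2f}")
```

Power here is the proportion of replications in which the two-class model beats the one-class model, and accuracy the proportion in which it beats both alternatives, mirroring the definitions in the abstract.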
3

Aspectos estatísticos da amostragem de água de lastro / Statistical aspects of ballast water sampling

Costa, Eliardo Guimarães da 01 March 2013 (has links)
Ballast water is one of the leading agents in the dispersal of organisms harmful to human health and to the environment, and international standards require that the concentration of these organisms in the tank be less than a prespecified value. Because of time and cost limitations, this inspection requires the use of sampling. Under the assumption of a homogeneous organism concentration in the tank, several authors have used the Poisson distribution for decision making based on hypothesis testing. Since this assumption is unrealistic, we extend the results to cases in which the organism concentration in the tank is heterogeneous, using stratification, nonhomogeneous Poisson processes, or assuming that it follows a Gamma distribution, which induces a Negative Binomial distribution for the number of sampled organisms. Furthermore, we propose a novel approach to the problem through estimation techniques based on the Negative Binomial distribution. For practical applications, we implemented computational routines in the R software.
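As a quick illustration of the Gamma-Poisson argument mentioned above (a standard result stated here for orientation, not a derivation taken from the thesis): if the count N of organisms in a sampled volume v is Poisson with mean λv given the local concentration λ, and λ varies across the tank according to a Gamma(α, β) distribution, then marginally

\[
P(N = n) \;=\; \int_0^{\infty} \frac{(\lambda v)^{n} e^{-\lambda v}}{n!}\,
\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1} e^{-\beta\lambda}\, d\lambda
\;=\; \binom{n+\alpha-1}{n}\left(\frac{\beta}{\beta+v}\right)^{\!\alpha}\left(\frac{v}{\beta+v}\right)^{\!n},
\qquad n = 0, 1, 2, \dots,
\]

i.e. N follows the Negative Binomial distribution referred to in the abstract, on which hypothesis tests and estimators for the mean concentration can then be based instead of the Poisson.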
4

Analyse de connectivité et techniques de partitionnement de données appliquées à la caractérisation et la modélisation d'écoulement au sein des réservoirs très hétérogènes / Connectivity analysis and clustering techniques applied for the characterisation and modelling of flow in highly heterogeneous reservoirs

Darishchev, Alexander 10 December 2015 (has links)
Computer-based workflows have gained a paramount role in the development and exploitation of natural hydrocarbon resources and in other subsurface operations. One of the crucial problems of reservoir modelling and production forecasting lies in pre-selecting appropriate models for quantifying uncertainty and robustly matching flow-simulation results to real field measurements and observations. This thesis addresses these and other related issues.
We explored a strategy to facilitate and speed up the adjustment of such numerical models to available field production data. Originally, the focus of this research was on conceptualising, developing and implementing fast proxy models based on the analysis of connectivity, a physically meaningful property of the reservoir, combined with advanced cluster analysis techniques. The developed methodology also includes several original probability-oriented approaches to the problems of sampling uncertainty and of determining the sample size and the expected value of sample information. To target and prioritise relevant reservoir models, we aggregated geostatistical realisations into distinct classes with a generalised distance measure. Then, to improve the classification, we extended the silhouette-based graphical technique, hereafter called the "entire sequence of multiple silhouettes", in cluster analysis. This approach provided clear and comprehensive information about intra- and inter-cluster dissimilarities, which is especially helpful in the case of weak, or even artificial, structures. Finally, the spatial separation and difference in form of the clusters were visualised graphically and quantified with a scale-invariant probabilistic distance measure. The relationships obtained justify and validate the applicability of the proposed approaches to enhancing the characterisation and modelling of flow. Reliable correlations were found between the shortest "injector-producer" pathways and water breakthrough times for different well-placement configurations, heterogeneity levels and fluid mobility ratios. The proposed graph-based connectivity proxies provided sufficiently accurate results and competitive performance at the meta-level, and their use as precursors and ad hoc predictors is beneficial at the pre-processing stage of the workflow. Prior to history matching, a suitable and manageable number of appropriate reservoir models can be identified by comparing the available production data with the selected centrotype-models regarded as class representatives, for which alone the full fluid-flow simulation is required. The findings of this research can readily be generalised and considered in a wider scope, and extensions, further improvements and implementations may be expected in other fields of science and technology.
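As a rough sketch of what such a graph-based connectivity proxy can look like in code (an illustrative reconstruction, not the thesis's implementation: the grid size, the synthetic log-permeability field, the edge-weight definition and the use of networkx are all assumptions made for the example):

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
n_x, n_y = 50, 50                           # illustrative 2D grid size
log_perm = rng.normal(size=(n_x, n_y))      # stand-in for one geostatistical realisation

# Lattice graph over the grid cells; edge weights penalise low-permeability cells,
# so a weighted shortest path approximates the path of least resistance to flow.
G = nx.grid_2d_graph(n_x, n_y)
for u, v in G.edges():
    G[u][v]["weight"] = np.exp(-log_perm[u]) + np.exp(-log_perm[v])

injector, producer = (0, 0), (n_x - 1, n_y - 1)   # assumed well locations
proxy = nx.shortest_path_length(G, source=injector, target=producer, weight="weight")
print(f"connectivity proxy (weighted shortest-path cost): {proxy:.2f}")
```

The weighted shortest-path cost between the assumed injector and producer locations plays the role of the "shortest injector-producer pathway" that the abstract correlates with water breakthrough times; in an actual workflow the weights would be derived from each geostatistical realisation and the proxy evaluated across realisations before any full flow simulation.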
