  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Robust A-optimal Subsampling for Massive Data Robust Linear Regression

Ziting Tang (8081000) 05 December 2019
This thesis is concerned with massive data analysis via robust, A-optimally efficient non-uniform subsampling. Motivated by the fact that massive data often contain outliers and that uniform sampling is not efficient, we give numerous sampling distributions obtained by minimizing the sum of the component variances of the subsampling estimate; these sampling distributions are robust against outliers. Massive data pose two computational bottlenecks: the data exceed a computer's storage space, and computation takes too long. Both bottlenecks can be addressed simultaneously by selecting a subsample as a surrogate for the full sample and completing the data analysis on it. We develop our theory in a typical setting for robust linear regression in which the estimating functions are not differentiable. For an arbitrary sampling distribution, we establish consistency of the subsampling estimate for both fixed and growing dimension (as high dimensionality is common in massive data). We prove asymptotic normality for fixed dimension. We discuss the A-optimal scoring method for fast computing. We conduct large simulations to evaluate the numerical performance of our proposed A-optimal sampling distribution. Real data applications are also performed.
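The criterion named in this abstract, minimizing the sum of the component variances, can be written compactly as a trace. The notation below is an assumption made for illustration and is not taken from the thesis: \pi is a sampling distribution over the data points, \hat\beta is the p-dimensional subsampling estimate, and V(\pi) is its (asymptotic) covariance matrix under \pi.

```latex
% A-optimality: choose the sampling distribution that minimizes the trace
% of the covariance matrix of the subsampling estimate. The trace equals
% the sum of the variances of the p components (the diagonal of V).
\pi^{*} = \arg\min_{\pi} \operatorname{tr}\bigl(V(\pi)\bigr),
\qquad
\operatorname{tr}\bigl(V(\pi)\bigr) = \sum_{j=1}^{p} \operatorname{Var}\bigl(\hat\beta_{j}\bigr)
```

The explicit form of V(\pi) depends on the model and the estimating functions, so it is left unspecified here.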
12

A-Optimal Subsampling For Big Data General Estimating Equations

Cheung, Chung Ching 08 1900
Indiana University-Purdue University Indianapolis (IUPUI) / A significant hurdle in analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with large sample sizes is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for linear models. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and of the estimator with resampling. We also give the asymptotic representation of the bias of both estimators. We show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method, called A-optimal, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on A-optimal subsampling. We conduct extensive simulations on large, high-dimensional samples to evaluate the performance of our proposed methods, using MSE as the criterion. High-dimensional data are investigated further, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE, because the bias is not negligible. We apply our proposed subsampling method to a real data set, gas sensor data with more than four million data points. In both simulations and real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.
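To make the contrast between uniform and non-uniform subsampling concrete, here is a minimal sketch for the simplest possible target, a full-data mean. It is not the A-optimal distribution derived in the dissertation: the function name `weighted_subsample_mean`, the simulated data, and the choice of probabilities proportional to the values are all assumptions made for illustration. The sketch draws with replacement and reweights each draw by the inverse of its sampling probability, which keeps the estimate unbiased for any strictly positive sampling distribution.

```python
import random


def weighted_subsample_mean(values, probabilities, r, rng):
    """Estimate the full-data mean from r draws taken with replacement.

    probabilities must be strictly positive and sum to 1.
    """
    n = len(values)
    indices = rng.choices(range(n), weights=probabilities, k=r)
    # Inverse-probability weighting: each draw is divided by n * pi_i,
    # so the expected value of every term is the full-data mean.
    return sum(values[i] / (n * probabilities[i]) for i in indices) / r


rng = random.Random(0)
# Hypothetical "full" data set of positive values.
values = [rng.uniform(1.0, 100.0) for _ in range(10_000)]
full_mean = sum(values) / len(values)

# Uniform subsampling: every point has the same probability.
uniform = [1.0 / len(values)] * len(values)

# Non-uniform subsampling: probability proportional to the value itself.
total = sum(values)
proportional = [v / total for v in values]

uniform_estimate = weighted_subsample_mean(values, uniform, 200, rng)
proportional_estimate = weighted_subsample_mean(values, proportional, 200, rng)
```

In this toy case the proportional distribution makes every reweighted draw equal to the full-data mean, so the estimate has no sampling variance, while the uniform estimate fluctuates from draw to draw. Regression estimators do not admit such a perfect distribution, which is why an optimality criterion such as the trace is needed.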
13

Testing and characterization of high-speed signals using incoherent undersampling driven signal reconstruction algorithms

Moon, Thomas 07 January 2016
The objective of the proposed research is to develop a framework for signal reconstruction algorithms at sub-Nyquist sampling rates together with low-cost hardware design at the system level. A further objective is to monitor the device under test (DUT) and to adapt its behavior. The key contribution of this research is that high-speed signal acquisition is done by direct subsampling. Because the signal is sampled directly, without front-end radio-frequency (RF) components such as mixers or filters, the cost of the hardware is reduced. Furthermore, the distortion and nonlinearity introduced by the RF components are avoided. The first proposed work is wideband signal reconstruction using dual-rate time-interleaved subsampling hardware and multi-coset signal reconstruction. By combining the dual-rate hardware with the multi-coset algorithm, the number of sampling channels is significantly reduced compared to conventional multi-coset approaches. The second proposed work is jitter tracking by accurate period estimation with incoherent subsampling. In this work, the long-term jitter in a pseudorandom binary sequence (PRBS) is tracked without hardware synchronization or clock-data-recovery (CDR) circuits. The third proposed work is eye monitoring and time-domain reflectometry (TDR) by monobit-receiver signal reconstruction. Using a monobit receiver based on incoherent subsampling and a time-variant threshold signal, high resolution of the reconstructed signal is achieved in both amplitude and time. Compared to a multibit receiver, the scalability of the test system is significantly increased.
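The idea behind reconstructing a periodic waveform from incoherent subsampling can be sketched as follows, assuming the signal period is already known or has been estimated. This is a generic equivalent-time folding illustration, not the algorithm developed in the thesis; the function name `fold_samples` and the numeric values are hypothetical.

```python
import math


def fold_samples(samples, sample_period, signal_period):
    """Map each sample time onto one signal period and sort by phase.

    Returns (phase, value) pairs with phase in [0, 1).
    """
    folded = []
    for n, value in enumerate(samples):
        # Time of the n-th sample, reduced modulo the signal period.
        phase = math.fmod(n * sample_period, signal_period) / signal_period
        folded.append((phase, value))
    folded.sort(key=lambda pair: pair[0])
    return folded


# Hypothetical setup: a unit-period sine sampled once every 1.03 periods,
# far below the Nyquist rate for this signal.
signal_period = 1.0
sample_period = 1.03
samples = [
    math.sin(2.0 * math.pi * n * sample_period / signal_period)
    for n in range(100)
]

waveform = fold_samples(samples, sample_period, signal_period)
```

Because the sampling period is not a multiple of the signal period, successive samples land at slowly advancing phases, and sorting by phase yields one finely resolved period of the waveform. In practice the accuracy of the period estimate limits the quality of the result, which is why the abstract stresses accurate period estimation.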
14

Benthic Macroinvertebrate Subsampling Effort and Taxonomic Resolution for Bioassessments of Streams in the James River Watershed of Virginia

Williams, Laurel 01 May 2014
Benthic macroinvertebrate diversity influences stream food web dynamics, nutrient cycling and material exchange between the benthos and the water column. In recent years, stream bioassessment based on benthic macroinvertebrate diversity has moved to the forefront of water quality monitoring. The objectives of this study were to determine the optimum subsample size and level of taxonomic resolution needed to describe macroinvertebrate diversity accurately and precisely in streams of the Piedmont Physiographic Province of the James River watershed in Virginia. Forty-nine sampling sites were selected from streams within this province. Ten sites were randomly selected to have all macroinvertebrates in the sample identified to the genus level whenever possible. Optimum subsampling intensities and Virginia Stream Condition Index (VSCI) metrics and scores were determined. For samples with fewer than 500 individuals in total, the genus level of taxonomy gave lower overall optimum subsampling intensities. However, for samples with more than 1000 individuals, optimum subsampling intensities at the genus level were higher than at the family level for more than 50% of the metrics. For both family and genus levels of taxonomy, the majority of optimum subsampling intensities were well over 50% of the total individuals in the sample, with some as high as 100%. While optimum subsampling intensities were valuable for comparing family- and genus-level taxonomy, they are not reasonable for stream bioassessment protocols; the cost-benefit ratio would be highly unbalanced. A minimum subsample size of 200 individuals is optimum for determining VSCI scores, while the optimum taxonomic resolution depends on several factors. Thus, the level of taxonomic resolution for a particular study should be determined by the study objectives, the level of site impairment and the sample size.
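A fixed-count subsample of the kind discussed in this abstract can be sketched in a few lines. The family labels, the simulated sample and the helper names `fixed_count_subsample` and `taxa_richness` are hypothetical, and taxa richness is used only as a simple stand-in metric; the actual VSCI metrics are not reproduced here.

```python
import random


def fixed_count_subsample(individuals, count, rng):
    """Draw `count` individuals without replacement.

    If the sample holds no more than `count` individuals, keep them all.
    """
    if len(individuals) <= count:
        return list(individuals)
    return rng.sample(individuals, count)


def taxa_richness(individuals):
    """Number of distinct taxa among the identified individuals."""
    return len(set(individuals))


rng = random.Random(42)
# Hypothetical sorted sample: 1,000 individuals with made-up family labels.
families = [f"family_{k}" for k in range(15)]
sample = [rng.choice(families) for _ in range(1000)]

subsample = fixed_count_subsample(sample, 200, rng)
richness_full = taxa_richness(sample)
richness_sub = taxa_richness(subsample)
```

Comparing `richness_sub` with `richness_full` over repeated draws and different counts is one simple way to see how much information a given subsampling effort retains.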
15

A-OPTIMAL SUBSAMPLING FOR BIG DATA GENERAL ESTIMATING EQUATIONS

Chung Ching Cheung (7027808) 13 August 2019
A significant hurdle in analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with large sample sizes is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for linear models. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and of the estimator with resampling. We also give the asymptotic representation of the bias of both estimators. We show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method, called A-optimal, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on A-optimal subsampling. We conduct extensive simulations on large, high-dimensional samples to evaluate the performance of our proposed methods, using MSE as the criterion. High-dimensional data are investigated further, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE, because the bias is not negligible. We apply our proposed subsampling method to a real data set, gas sensor data with more than four million data points. In both simulations and real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.
16

Spatial subsemble estimator for massive data in geostatistics (Estimador subsemble espacial para dados massivos em geoestatística)

Barbian, Márcia Helena January 2016
A problem that is becoming common in geostatistical analysis is the growing number of observations. In such cases, the usual estimators often cannot be used because of numerical difficulties. This thesis proposes a new estimator for massive geostatistical data sets: the spatial subsemble estimator. The estimator selects several spatially structured subsamples from the full data set. The model parameters are easily estimated from each subsample, and the resulting estimates are weighted using a validation subset. In simulation studies, we compare the spatial subsemble with competing methods and show that it is fast and accurate. An application to a real data set with 11,000 observations confirms these properties.
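The general shape of such an estimator (fit on several subsamples, then weight the estimates using a validation subset) can be sketched as below. Everything specific here is an assumption made for illustration: the parameter being estimated is a plain mean rather than a geostatistical model, the subsamples are drawn at random rather than with spatial structure, and the inverse-validation-error weights are one plausible choice, not necessarily the one used in the thesis.

```python
import random


def subsample_estimates(data, n_subsamples, size, rng):
    """Estimate the parameter (here: a mean) on each random subsample."""
    estimates = []
    for _ in range(n_subsamples):
        subset = rng.sample(data, size)
        estimates.append(sum(subset) / len(subset))
    return estimates


def validation_weights(estimates, validation):
    """Weight each estimate by the inverse of its validation error.

    Assumes every mean squared error is strictly positive.
    """
    errors = [
        sum((v - est) ** 2 for v in validation) / len(validation)
        for est in estimates
    ]
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [w / total for w in inverse]


rng = random.Random(0)
# Hypothetical data set, split into a fitting part and a validation part.
data = [rng.gauss(5.0, 2.0) for _ in range(2000)]
validation, training = data[:200], data[200:]

estimates = subsample_estimates(training, 10, 50, rng)
weights = validation_weights(estimates, validation)
combined = sum(w * e for w, e in zip(weights, estimates))
```

Each subsample is small enough to fit cheaply, and the combined estimate is a convex combination of the individual ones, so it always lies between the smallest and largest subsample estimate.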
19

Subsampling methods for robust inference in regression models

Ling, Xiao 31 August 2009
This thesis is a pilot study of subsampling methods for robust estimation in regression models when the data may contain outliers. Two basic versions of the subsampling method are investigated. The main idea is to identify good data points by fitting the model to randomly chosen subsamples. Subsamples containing no outliers are identified by their good fit and are combined to form a subset of good data points. Once a combined set containing only good data points has been identified, classical estimation methods such as least squares and maximum likelihood can be applied to it for regression analysis. Numerical evidence suggests that the subsampling method is robust against outliers, with a high breakdown point, and that it is competitive with other robust methods in terms of both robustness and efficiency. It applies to a wide variety of regression models, including linear, non-linear and generalized linear regression models. Research is ongoing with the aim of making this an effective and practical method for robust inference in regression models.
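The main idea described above can be sketched for simple linear regression. The selection rule used here (keep every subsample whose residual sum of squares falls below a fixed tolerance), the noise-free toy data and the function names are assumptions made for illustration; the thesis investigates its own two proposals, which are not reproduced here.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sse(points, slope, intercept):
    """Residual sum of squares of the points around the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


def good_point_indices(points, n_subsamples, size, tolerance, rng):
    """Union of all random subsamples that the model fits well."""
    good = set()
    for _ in range(n_subsamples):
        idx = rng.sample(range(len(points)), size)
        subset = [points[i] for i in idx]
        slope, intercept = fit_line(subset)
        if sse(subset, slope, intercept) <= tolerance:
            good.update(idx)
    return sorted(good)


# Hypothetical data: 20 points exactly on y = 2x + 1 plus 3 gross outliers.
clean = [(float(x), 2.0 * x + 1.0) for x in range(20)]
outliers = [(3.0, 40.0), (10.0, -30.0), (17.0, 90.0)]
points = clean + outliers

rng = random.Random(1)
good = good_point_indices(points, 200, 4, 1e-6, rng)
# Classical least squares applied to the combined set of good points.
slope, intercept = fit_line([points[i] for i in good])
```

With noisy data the tolerance would have to be tied to the noise level, and choosing it (or ranking subsamples instead) is where the practical difficulty lies.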
20

Evaluation of performance of MSI detection tools using targeted sequencing data

Kolluri, Satya Krishna Prasanna January 2021
In recent years, digitalization and computer-based technologies have greatly advanced the field of bioinformatics. Research and development of computer-based programs have enhanced various DNA sequencing technologies. This progress has significantly broadened our understanding of genomic evolution and has contributed widely to the application of clinical genomics. Cancer is one of the major causes of death across the world. It is mainly caused by damage or changes to DNA that affect the function of genes, which contain the instructions that control various functions of cells. Damage to genes that maintain the DNA repair mechanism may lead to genome instability, allowing rapid growth of cancer. Microsatellite instability (MSI) is one such condition, characterized by genomic alterations that result from failure of the DNA repair mechanism in cancerous cells. MSI is found in various types of cancer but most often in colorectal, gastric and endometrial cancer. Hence, detection of MSI can contribute greatly to cancer therapy and help in planning the best treatment. This study focuses on evaluating the performance of MSI calling algorithms using targeted sequencing data. The literature review gives a detailed outline of various topics related to MSI detection. The computational methods used in this study to detect MSI in the selected samples (MSIsensor, MSIsensor-ct, MSIsensor-pro, MSings, MiMSI and MSIsensor2) are discussed thoroughly in the methodology section. Finally, the findings of this study indicate that the MSI calling algorithms mentioned above detect MSI accurately in the chosen samples and enable the MSI status of those samples to be determined more precisely.
