  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
201

ESSAYS ON SCALABLE BAYESIAN NONPARAMETRIC AND SEMIPARAMETRIC MODELS

Chenzhong Wu (18275839) 29 March 2024 (has links)
In this thesis, we explore several nonparametric and semiparametric econometric models within the Bayesian framework, highlighting their applicability across a broad spectrum of microeconomic and macroeconomic issues. Positioned in the big data era, where data collection and storage expand at an unprecedented rate, the complexity of economic questions we aim to address is similarly escalating. This dual challenge necessitates leveraging increasingly large datasets, thereby underscoring the critical need for designing flexible Bayesian priors and developing scalable, efficient algorithms tailored for high-dimensional datasets.
The initial two chapters, Chapters 2 and 3, are dedicated to crafting Bayesian priors suited for environments laden with a vast array of variables. These priors, alongside their corresponding algorithms, are optimized for computational efficiency, scalability to extensive datasets, and, ideally, distributability. We aim for these priors to accommodate varying levels of dataset sparsity. Chapter 2 assesses nonparametric additive models, employing a smoothing prior alongside a band matrix for each additive component. Utilizing the Bayesian backfitting algorithm significantly alleviates the computational load. In Chapter 3, we address multiple linear regression settings by adopting a flexible scale mixture of normal priors for coefficient parameters, thus allowing data-driven determination of the necessary amount of shrinkage. The use of a conjugate prior enables a closed-form solution for the posterior, markedly enhancing computational speed.
The subsequent chapters, Chapters 4 and 5, pivot towards time series dataset modeling and Bayesian algorithms. A semiparametric modeling approach dissects the stochastic volatility in macro time series into persistent and transitory components, the latter being an additional component that addresses outliers. Utilizing a Dirichlet process mixture prior for the transitory part and a collapsed Gibbs sampling algorithm, we devise a method capable of efficiently processing over 10,000 observations and 200 variables. Chapter 4 introduces a simple univariate model, while Chapter 5 presents comprehensive Bayesian VARs. Our algorithms, more efficient and effective in managing outliers than existing ones, are adept at handling extensive macro datasets with hundreds of variables.
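To make the remark about conjugacy and closed-form posteriors concrete, here is a minimal sketch. It is not the thesis's scale-mixture prior: it assumes a single coefficient, a known noise variance, and a fixed normal prior variance, and the function name `conjugate_posterior` is hypothetical. Under those assumptions the posterior is normal and needs no sampling.

```python
def conjugate_posterior(x, y, noise_var, prior_var):
    """Posterior mean and variance of beta in y_i = beta * x_i + e_i.

    Assumptions (illustrative only): e_i ~ N(0, noise_var) with noise_var
    known, and a conjugate prior beta ~ N(0, prior_var).
    """
    # Posterior precision = data precision + prior precision.
    precision = sum(xi * xi for xi in x) / noise_var + 1.0 / prior_var
    post_var = 1.0 / precision
    post_mean = post_var * sum(xi * yi for xi, yi in zip(x, y)) / noise_var
    return post_mean, post_var


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # generated exactly as y = 2x

# A small prior variance shrinks the estimate strongly toward zero;
# a large prior variance leaves it close to the least-squares value of 2.
tight_mean, tight_var = conjugate_posterior(x, y, noise_var=1.0, prior_var=0.01)
loose_mean, loose_var = conjugate_posterior(x, y, noise_var=1.0, prior_var=100.0)
print(tight_mean, loose_mean)
```

In a scale mixture of normals the prior variance is itself given a distribution, so an update of this form would typically appear as one conditional step; the amount of shrinkage is then driven by the data rather than fixed in advance.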
202

Adaptive Mixture Estimation and Subsampling PCA

Liu, Peng January 2009 (has links)
No description available.
203

Bayesian Analysis of Partitioned Demand Models

Smith, Adam Nicholas 26 October 2017 (has links)
No description available.
204

Partial EM Procedure for Big-Data Linear Mixed Effects Model, and Generalized PPE for High-Dimensional Data in Julia

Cho, Jang Ik 31 August 2018 (has links)
No description available.
205

Sparse Principal Component Analysis for High-Dimensional Data: A Comparative Study

Bonner, Ashley J. 10 1900 (has links)
Background: Through unprecedented advances in technology, high-dimensional datasets have exploded into many fields of observational research. For example, it is now common to expect thousands or millions of genetic variables (p) with only a limited number of study participants (n). Determining the important features proves statistically difficult, as multivariate analysis techniques become flooded and mathematically insufficient when n < p. Principal Component Analysis (PCA) is a commonly used multivariate method for dimension reduction and data visualization but suffers from these issues. A collection of Sparse PCA methods has been proposed to counter these flaws, but they have not been tested in comparative detail.
Methods: The performance of three Sparse PCA methods was evaluated through simulations. Data were generated for 56 different data structures, varying p, the number of underlying groups, and the variance structure within them. Estimation and interpretability of the principal components (PCs) were rigorously tested. Sparse PCA methods were also applied to a real gene expression dataset.
Results: All Sparse PCA methods showed improvements upon classical PCA. Some methods were best at obtaining an accurate leading PC only, whereas others were better for subsequent PCs. The optimal choice of Sparse PCA method differs as within-group correlation and across-group variances vary; thankfully, one method repeatedly worked well under the most difficult scenarios. When applying the methods to real data, concise groups of gene expressions were detected with the sparsest methods.
Conclusions: Sparse PCA methods provide a new, insightful way to detect important features amidst complex high-dimensional data. / Master of Science (MSc)
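As a rough illustration of what "sparse" means for a principal component, the sketch below runs a power iteration with a soft-thresholding step on a small covariance matrix. This is one simple generic approach and is not claimed to be any of the three methods compared in the thesis; the matrix, the penalty `lam`, and the function names are all made up for the example.

```python
import math


def soft_threshold(v, lam):
    """Shrink each entry toward zero by lam; entries below lam become 0."""
    return [math.copysign(max(abs(a) - lam, 0.0), a) for a in v]


def sparse_leading_pc(cov, lam, iters=50):
    """Leading loading vector via power iteration with soft-thresholding.

    With lam = 0 this is ordinary power iteration (classical PCA).
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        w = soft_threshold(w, lam)
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break  # penalty removed every loading; keep the last iterate
        v = [a / norm for a in w]
    return v


# Hypothetical covariance: variables 0 and 1 form a correlated group,
# variable 2 is nearly unrelated to them.
cov = [
    [2.0, 1.8, 0.1],
    [1.8, 2.0, 0.1],
    [0.1, 0.1, 1.0],
]

dense = sparse_leading_pc(cov, lam=0.0)   # small nonzero loading on variable 2
sparse = sparse_leading_pc(cov, lam=0.3)  # loading on variable 2 is exactly 0
print(dense)
print(sparse)
```

The dense vector loads on every variable, which is what makes classical PCs hard to interpret when p is large; the thresholded version keeps only the correlated group.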
206

The Growth Curve Model for High Dimensional Data and its Application in Genomics

Jana, Sayantee 04 1900 (has links)
Recent advances in technology have allowed researchers to collect high-dimensional biological data simultaneously. In genomic studies, for instance, measurements from tens of thousands of genes are taken from individuals across several experimental groups. In time course microarray experiments, gene expression is measured at several time points for each individual across the whole genome, resulting in a massive amount of data. In such experiments, researchers are faced with two types of high-dimensionality. The first is global high-dimensionality, which is common to all genomic experiments. The global high-dimensionality arises because inference is being done on tens of thousands of genes, resulting in multiplicity. This challenge is often dealt with using statistical methods for multiple comparison, such as the Bonferroni correction or false discovery rate (FDR). We refer to the second type of high-dimensionality as gene-specific high-dimensionality, which arises in time course microarray experiments due to the fact that, in such experiments, sample size is often smaller than the number of time points ($n
In this thesis, we use the growth curve model (GCM), which is a generalized multivariate analysis of variance (GMANOVA) model, and propose a moderated test statistic for testing a special case of the general linear hypothesis, which is especially useful for identifying genes that are expressed. We use the trace test for the GCM and modify it so that it can be used in high-dimensional situations. We consider two types of moderation: the Moore-Penrose generalized inverse and Stein's shrinkage estimator of $S$. We performed extensive simulations to show the performance of the moderated test, and compared the results with the original trace test. We calculated the empirical level and power of the test under many scenarios. Although the focus is on hypothesis testing, we also provided a moderated maximum likelihood estimator for the parameter matrix and assessed its performance by investigating the bias and mean squared error of the estimator, and compared the results with those of the maximum likelihood estimators. Since the parameters are matrices, we consider distance measures in both power and level comparisons as well as when investigating bias and mean squared error. We also illustrated our approach using time course microarray data taken from a study on lung cancer. We were able to filter out 1053 genes as non-noise genes from a pool of 22,277 genes, which is approximately 5% of the total number of genes. This is in line with results from most biological experiments, where around 5% of genes are found to be differentially expressed. / Master of Science (MSc)
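The two multiplicity corrections named in the abstract, Bonferroni and the false discovery rate, can be sketched in a few lines. The FDR procedure shown is the standard Benjamini-Hochberg step-up rule, which is an assumption here since the abstract does not say which FDR procedure is meant; the p-values are invented for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (controls family-wise error)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0  # largest rank k such that p_(k) <= k * alpha / m
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical p-values for eight genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

n_bonf = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(n_bonf, n_bh)
```

On these eight values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, reflecting the usual trade-off: FDR control is less conservative when many tests are run at once.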
207

Canonical Correlation and Clustering for High Dimensional Data

Ouyang, Qing January 2019 (has links)
Multi-view datasets arise naturally in statistical genetics when the genetic and trait profile of an individual is portrayed by two feature vectors. A motivating problem concerning the Skin Intrinsic Fluorescence (SIF) study on the Diabetes Control and Complications Trial (DCCT) subjects is presented. A widely applied quantitative method to explore the correlation structure between two domains of a multi-view dataset is Canonical Correlation Analysis (CCA), which seeks the canonical loading vectors such that the transformed canonical covariates are maximally correlated. In the high-dimensional case, regularization of the dataset is required before CCA can be applied. Furthermore, the nature of genetic research suggests that sparse output is more desirable. In this thesis, two regularized CCA (rCCA) methods and a sparse CCA (sCCA) method are presented. When correlation sub-structure exists, a stand-alone CCA method will not perform well. To tackle this limitation, a mixture of local CCA models can be employed. In this thesis, I review a correlation clustering algorithm proposed by Fern, Brodley and Friedl (2005), which seeks to group subjects into clusters such that features are identically correlated within each cluster. An evaluation study is performed to assess the effectiveness of CCA and correlation clustering algorithms using artificial multi-view datasets. Both sCCA and sCCA-based correlation clustering exhibited superior performance compared to rCCA and rCCA-based correlation clustering. The sCCA and the sCCA-based clustering are applied to the multi-view dataset consisting of PrediXcan imputed gene expression and SIF measurements of DCCT subjects. The stand-alone sparse CCA method identified 193 of 11538 genes as correlated with SIF#7. Further investigation of these 193 genes with simple linear regression and t-tests revealed that only two genes, ENSG00000100281.9 and ENSG00000112787.8, were significantly associated with SIF#7. No plausible clustering scheme was detected by the sCCA-based correlation clustering method. / Thesis / Master of Science (MSc)
208

Probing Human Category Structures with Synthetic Photorealistic Stimuli

Chang Cheng, Jorge 08 September 2022 (has links)
No description available.
209

Contributions to Data Reduction and Statistical Model of Data with Complex Structures

Wei, Yanran 30 August 2022 (has links)
With advanced technology and the information explosion, data of interest often have complex structures, with large size and dimension, in the form of continuous or discrete features. There is an emerging need for data reduction, efficient modeling, and model inference. For example, data can contain millions of observations with thousands of features. Traditional methods, such as linear regression or LASSO regression, cannot effectively deal with such a large dataset directly. This dissertation aims to develop several techniques to effectively analyze large datasets with complex structures in observational, experimental, and time series data. In Chapter 2, I focus on data reduction for model estimation in sparse regression. Commonly used subdata selection methods often consider sampling or feature screening. For data with both a large number of observations and a large number of predictors, we propose a filtering approach for model estimation (FAME) to reduce both the number of data points and the number of features. The proposed algorithm can be easily extended to data with a discrete response or discrete predictors. Through simulations and case studies, the proposed method provides good performance for parameter estimation with efficient computation. In Chapter 3, I focus on modeling experimental data with a quantitative-sequence (QS) factor. Here the QS factor concerns both the quantities and the sequence order of several components in the experiment. Existing methods usually can focus only on either the sequence order or the quantities of the multiple components. To fill this gap, we propose a QS transformation to transform the QS factor to a generalized permutation matrix, and consequently develop a simple Gaussian process approach to model the experimental data with QS factors. In Chapter 4, I focus on forecasting multivariate time series data by leveraging autoregression and clustering.
Existing time series forecasting methods treat each series independently and ignore their inherent correlation. To fill this gap, I propose a clustering method based on autoregression and control the sparsity of the transition matrix estimate with the adaptive lasso and a clustering coefficient. The clustering-based cross prediction can outperform conventional time series forecasting methods. Moreover, the clustering result can also enhance the forecasting accuracy of other forecasting methods. The proposed method can be applied to practical data, for tasks such as stock forecasting and topic trend detection. / Doctor of Philosophy / This dissertation focuses on three projects that are related to data reduction and statistical modeling of data with complex structures. In chapter 2, we propose a data filtering approach for parameter estimation in sparse regression. Given data with thousands of observations and predictors or even more, large storage and computation resources are needed to handle these data. This strains computational power and takes a long time. So we come up with an algorithm (FAME) that can reduce both the number of observations and the number of predictors. After data reduction, the subdata selected by FAME keeps most of the information of the original dataset in terms of parameter estimation. Compared with existing methods, the dimension of the subdata generated by the proposed algorithm is smaller while the computational time does not increase. In chapter 3, we use a quantitative-sequence (QS) factor to describe experimental data. One simple example of experimental data is milk tea. Adding 1 cup of milk first or adding 2 cups of tea first will influence the flavor. This case can be extended to cases where thousands of ingredients need to be input into the experiment. Then the order and amount of ingredients will generate different experimental results. We use the QS factor to describe this kind of order and amount.
Then, by transforming the QS factor to a matrix containing continuous values and setting this matrix as the input, we model the experimental results with a simple Gaussian process. In chapter 4, we propose an autoregression-based clustering and forecasting method for multivariate time series data. Existing research often treats each time series independently. Our approach incorporates the inherent correlation of the data and clusters related series into one group. The forecasting is built on each cluster, and data within one cluster can cross predict each other. One application of this method is topic trend detection. With thousands of topics, it is infeasible to apply one model for forecasting all time series. Considering the similarity of trends among related topics, the proposed method can cluster topics based on their similarity, and then perform forecasting with an autoregression model based on historical data within each cluster.
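The idea of forecasting from an autoregression fitted on a whole cluster can be sketched as follows. This is a deliberately reduced illustration, not the dissertation's method: it assumes the cluster is already given, uses a single shared AR(1) coefficient with no intercept and no adaptive lasso, and the names `fit_pooled_ar1` and `forecast` are hypothetical.

```python
def fit_pooled_ar1(cluster):
    """Least-squares estimate of phi in x_t = phi * x_{t-1} + noise,
    pooling the lagged pairs of every series in the cluster."""
    num = sum(cur * prev for s in cluster for cur, prev in zip(s[1:], s[:-1]))
    den = sum(prev * prev for s in cluster for prev in s[:-1])
    return num / den


def forecast(series, phi, steps):
    """Iterate the fitted AR(1) recursion forward from the last observation."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = phi * last
        out.append(last)
    return out


# Two made-up series that share the same dynamics (each value halves).
cluster = [
    [8.0, 4.0, 2.0, 1.0, 0.5],
    [16.0, 8.0, 4.0, 2.0],
]

phi = fit_pooled_ar1(cluster)
next_values = forecast(cluster[0], phi, steps=2)
print(phi, next_values)
```

Pooling lets a short series borrow information from related ones, which is the intuition behind letting series in one cluster cross predict each other; how the clusters are found is the part this sketch leaves out.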
210

Semiparametric and Nonparametric Methods for Complex Data

Kim, Byung-Jun 26 June 2020 (has links)
A variety of complex data has emerged in many research fields such as epidemiology, genomics, and analytical chemistry with the development of science, technologies, and design schemes over the past few decades. For example, in epidemiology, the matched case-crossover study design is used to investigate the association between the clustered binary outcomes of disease and a covariate subject to measurement error within a certain period by stratifying subjects' conditions. In genomics, highly correlated and high-dimensional (HCHD) data are required to identify important genes and their interaction effects on diseases. In analytical chemistry, multiple time series data are generated to recognize complex patterns among multiple classes. Due to this great diversity, we encounter three problems in analyzing such complex data in this dissertation. We provide several contributions to semiparametric and nonparametric methods for dealing with the following problems: the first is to propose a method for testing the significance of a functional association under the matched study; the second is to develop a method to simultaneously identify important variables and build a network in HCHD data; the third is to propose a multi-class dynamic model for recognizing a pattern in time-trend analysis. For the first topic, we propose a semiparametric omnibus test for testing the significance of a functional association between the clustered binary outcomes and covariates with measurement error by taking into account the effect modification of matching covariates. We develop a flexible omnibus test that does not require a specific alternative form of the hypothesis. The advantages of our omnibus test are demonstrated through simulation studies and 1-4 bidirectional matched data analyses from an epidemiology study.
For the second topic, we propose a joint semiparametric kernel machine network approach to provide a connection between variable selection and network estimation. Our approach is a unified and integrated method that can simultaneously identify important variables and build a network among them. We develop our approach under a semiparametric kernel machine regression framework, which allows for the possibility that each variable might be nonlinear and that the variables are likely to interact with each other in a complicated way. We demonstrate our approach using simulation studies and a real application to genetic pathway analysis. Lastly, for the third project, we propose a Bayesian focal-area detection method for a multi-class dynamic model under a Bayesian hierarchical framework. Two-step Bayesian sequential procedures are developed to estimate patterns and detect focal intervals, which can be used for gas chromatography. We demonstrate the performance of our proposed method using a simulation study and a real application to gas chromatography on the Fast Odor Chromatographic Sniffer (FOX) system. / Doctor of Philosophy / A variety of complex data has emerged in many research fields such as epidemiology, genomics, and analytical chemistry with the development of science, technologies, and design schemes over the past few decades. For example, in epidemiology, the matched case-crossover study design is used to investigate the association between the clustered binary outcomes of disease and a covariate subject to measurement error within a certain period by stratifying subjects' conditions. In genomics, highly correlated and high-dimensional (HCHD) data are required to identify important genes and their interaction effects on diseases. In analytical chemistry, multiple time series data are generated to recognize complex patterns among multiple classes.
Due to this great diversity, we encounter three problems in analyzing the following three types of data: (1) matched case-crossover data, (2) HCHD data, and (3) time series data. We contribute to the development of statistical methods to deal with such complex data. First, under the matched study, we discuss a hypothesis testing approach to effectively determine the association between observed factors and the risk of the disease of interest. Because, in practice, we do not know the specific form of the association, it might be challenging to set a specific alternative hypothesis. To reflect reality, we also consider the possibility that some observations are measured with error. By considering these measurement errors, we develop a testing procedure under the matched case-crossover framework. This testing procedure has the flexibility to make inferences under various hypothesis settings. Second, we consider data where the number of variables is very large compared to the sample size, and the variables are correlated with each other. In this case, our goal is to identify the variables that are important for the outcome among a large number of variables and to build their network. For example, identifying a few genes associated with diabetes across the whole genome can be used to develop biomarkers. With the approach proposed in the second project, we can identify differentially expressed and important genes and their network structure with consideration for the outcome. Lastly, we consider the scenario of patterns of interest that change over time, with application to gas chromatography. We propose an efficient detection method to effectively distinguish the patterns of multi-level subjects in time-trend analysis. Our proposed method can give valuable information for an efficient search for distinguishable patterns, reducing the burden of examining all observations in the data.
