31

An Efficient Algorithm for Clustering Genomic Data

Zhou, Xuan January 2014 (has links)
No description available.
32

ON TWO NEW ESTIMATORS FOR THE CMS THROUGH EXTENSIONS OF OLS

Zhang, Yongxu January 2017 (has links)
As a useful tool for multivariate analysis, sufficient dimension reduction (SDR) aims to reduce the predictor dimensionality while simultaneously keeping the full regression information, or some specific aspects of the regression information, between the response and the predictor. When the goal is to retain the information about the regression mean, the target of inference is known as the central mean space (CMS). Ordinary least squares (OLS) is a popular estimator of the CMS, but it has the limitation that it can recover at most one direction in the CMS. In this dissertation, we introduce two new estimators of the CMS: the sliced OLS and the hybrid OLS. Both estimators can estimate multiple directions in the CMS. The dissertation is organized as follows. Chapter 1 provides a literature review of basic concepts and some traditional methods in SDR. Motivated by the popular SDR method known as sliced inverse regression, sliced OLS is proposed as the first extension of OLS in Chapter 2, which also studies its asymptotic properties, order determination, and testing of predictor contributions through sliced OLS. It is well known that slicing methods such as sliced inverse regression may lead to different results with different numbers of slices. Chapter 3 proposes hybrid OLS as the second extension. Hybrid OLS shares the benefit of sliced OLS and recovers multiple directions in the CMS; at the same time, it improves over sliced OLS by avoiding slicing. Extensive numerical results are provided to demonstrate the desirable performance of the proposed estimators. We conclude the dissertation with a discussion of future work in Chapter 4. / Statistics
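To make the contrast between plain OLS and a slicing-based extension concrete, here is a minimal Python sketch. The aggregation step (averaging outer products of within-slice OLS directions and taking the leading eigenvectors) is an illustrative assumption and is not taken from the dissertation; the actual sliced OLS estimator may be constructed differently.

# Illustrative sketch only: slicing the response lets a method recover more than
# the single direction that plain OLS provides for the central mean space (CMS).
import numpy as np

def ols_direction(X, y):
    """Plain OLS direction Cov(X)^{-1} Cov(X, y); spans at most one CMS direction."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(np.cov(Xc, rowvar=False), Xc.T @ yc / len(y))

def sliced_ols_directions(X, y, n_slices=5, n_dirs=2):
    """Slice the response, compute an OLS direction per slice, and take the leading
    eigenvectors of the averaged outer products (hypothetical aggregation)."""
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.zeros((X.shape[1], X.shape[1]))
    for idx in slices:
        b = ols_direction(X[idx], y[idx])
        M += np.outer(b, b) / n_slices
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, ::-1][:, :n_dirs]   # leading directions

# Toy example: a regression mean driven by two directions, which OLS alone cannot recover.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.normal(size=2000)
print(sliced_ols_directions(X, y))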
33

Bayesian Factor Models for Clustering and Spatiotemporal Analysis

Shin, Hwasoo 28 May 2024 (has links)
Multivariate data is prevalent in modern applications, yet it often presents significant analytical challenges. Factor models can offer an effective tool to address issues associated with large-scale datasets. In this dissertation, we propose two novel Bayesian factor models. These models are designed to effectively reduce the dimensionality of the data, as the number of latent factors is typically much smaller than the dimension of the observation vectors; therefore, our proposed models can achieve substantial dimension reduction. Our first model is for spatiotemporal areal data. In this case, the region of interest is divided into subregions, and at each time point there is one univariate observation per subregion. Our model writes the vector of observations at each time point in factor-model form as the product of a matrix of factor loadings and a vector of common factors, plus a vector of errors. Our model assumes that the common factors evolve through time according to a dynamic linear model. To represent the spatial relationships among subregions, each column of the factor loadings matrix is assigned an intrinsic conditional autoregressive (ICAR) prior. Therefore, we call our approach the Dynamic ICAR Spatiotemporal Factor Model (DIFM). Our second model, the Bayesian Clustering Factor Model (BCFM), assumes latent factors and clusters are present in the data. We apply Gaussian mixture models to the common factors to discover clusters. For both models, we develop MCMC algorithms to explore the posterior distribution of the parameters. To select the number of factors and, in the case of the clustering method, the number of clusters, we develop model selection criteria that utilize the Laplace-Metropolis estimator of the predictive density and BIC with integrated likelihood. / Doctor of Philosophy / Understanding large-scale datasets has emerged as one of the most significant challenges for researchers in recent years. This is particularly true for datasets that are inherently complex and nontrivial to analyze. In this dissertation, we present two novel classes of Bayesian factor models for two classes of complex datasets. Frequently, the number of factors is much smaller than the number of variables, and therefore factor models can be an effective approach for handling multivariate datasets. First, we develop the Dynamic ICAR Spatiotemporal Factor Model (DIFM) for datasets collected on a partition of a spatial domain of interest over time. The DIFM accounts for spatiotemporal correlation and provides predictions of future trends. Second, we develop the Bayesian Clustering Factor Model (BCFM) for multivariate data that cluster in a space of dimension lower than that of the observation vectors. BCFM enables researchers to identify different characteristics of the subgroups, offering valuable insights into their underlying structure.
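The generative form described above can be sketched in a few lines of Python. This is only a simulation of the factor-model structure (loadings times dynamic factors plus error); the ICAR priors on the loading columns, the MCMC sampler, and the model selection criteria from the dissertation are omitted, and all dimensions are hypothetical.

# Minimal generative sketch of a DIFM-style model: y_t = Lambda f_t + eps_t,
# with the common factors following a simple random-walk dynamic model.
import numpy as np

rng = np.random.default_rng(1)
n_subregions, n_factors, n_times = 20, 3, 50

Lambda = rng.normal(size=(n_subregions, n_factors))      # factor loadings matrix
factors = np.zeros((n_times, n_factors))
obs = np.zeros((n_times, n_subregions))

for t in range(1, n_times):
    factors[t] = factors[t - 1] + rng.normal(scale=0.3, size=n_factors)  # dynamic evolution
for t in range(n_times):
    obs[t] = Lambda @ factors[t] + rng.normal(scale=0.5, size=n_subregions)  # observation equation

print(obs.shape)  # (50, 20): one univariate observation per subregion per time point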
34

Gradient-Based Sensitivity Analysis with Kernels

Wycoff, Nathan Benjamin 20 August 2021 (has links)
Emulation of computer experiments via surrogate models can be difficult when the number of input parameters determining the simulation grows beyond a few dozen. In this dissertation, we explore dimension reduction in the context of computer experiments. The active subspace method is a linear dimension reduction technique that uses the gradients of a function to determine important input directions. Unfortunately, we cannot expect to always have access to the gradients of our black-box functions. We thus begin by developing an estimator for the active subspace of a function that uses kernel methods to indirectly estimate the gradient. We then demonstrate how to deploy the learned input directions to improve the predictive performance of local regression models by "undoing" the active subspace. Finally, we develop notions of sensitivity that are local to certain parts of the input space and use them to build a Bayesian optimization algorithm that can exploit locally important directions. / Doctor of Philosophy / Increasingly, scientists and engineers developing new understanding or products rely on computers to simulate complex phenomena. Sometimes these computer programs are so detailed that the amount of time they take to run becomes a serious issue. Surrogate modeling is the problem of trying to predict a computer experiment's result without actually running it, on the basis of having observed the behavior of similar simulations. Typically, computer experiments have different settings that induce different behavior. When there are many settings to tweak, typical surrogate modeling approaches can struggle. In this dissertation, we develop a technique for deciding which input settings, or even which combinations of input settings, we should focus our attention on when trying to predict the output of the computer experiment. We then deploy this technique both for predicting computer experiment outputs and for finding which input settings yield a particular desired result.
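The basic active subspace construction mentioned above can be sketched as follows: average outer products of gradients over the input space and take the leading eigenvectors. The dissertation estimates the gradients indirectly with kernel methods; in this sketch, finite differences stand in for that step purely for illustration, and the test function is made up.

# Sketch of active subspace estimation from gradient outer products.
import numpy as np

def f(x):
    # toy black-box function that varies mostly along the direction (1, 2, 0, 0, 0)
    return np.sin(x[0] + 2.0 * x[1]) + 0.01 * x[2]

def finite_diff_grad(f, x, h=1e-5):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(2)
samples = rng.uniform(-1, 1, size=(500, 5))
C = np.zeros((5, 5))
for x in samples:
    g = finite_diff_grad(f, x)
    C += np.outer(g, g) / len(samples)   # Monte Carlo estimate of E[grad f grad f^T]

eigvals, eigvecs = np.linalg.eigh(C)
active_dirs = eigvecs[:, ::-1][:, :1]    # dominant input direction(s)
print(active_dirs.ravel())               # roughly proportional to (1, 2, 0, 0, 0)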
35

Bridging Cognitive Gaps Between User and Model in Interactive Dimension Reduction

Wang, Ming 05 May 2020 (has links)
High-dimensional data is prevalent in all domains but is challenging to explore. Analysis and exploration of high-dimensional data are important for people in numerous fields. To help people explore and understand high-dimensional data, Andromeda, an interactive visual analytics tool, has been developed. However, our analysis uncovered several cognitive gaps relating to the Andromeda system: users do not realize the necessity of explicitly highlighting all the relevant data points; users are not clear about the dimensional information in the Andromeda visualization; and the Andromeda model cannot capture user intentions when constructing and deconstructing clusters. In this study, we designed and implemented solutions to address these gaps. Specifically, for the gap in highlighting all the relevant data points, we introduced a foreground-and-background view and distance lines. Our user study with a group of undergraduate students revealed that the foreground-and-background view and distance lines could significantly alleviate the highlighting issue. For the gap in understanding visualization dimensions, we implemented a dimension-assist feature. The results of a second user study with students from various backgrounds suggested that the dimension-assist feature could make it easier for users to find the extremum in one dimension and to describe correlations among multiple dimensions; however, the feature had only a small impact on characterizing the data distribution and on helping users understand the meanings of the weighted multidimensional scaling (WMDS) plot axes. Regarding the gap in creating and deconstructing clusters, we implemented a solution based on random sampling. A quantitative analysis of the random sampling strategy was performed, and the results demonstrated that the strategy improved Andromeda's capabilities in constructing and deconstructing clusters. We also applied the random sampling to two-point manipulations, making the Andromeda system more flexible and adaptable to differing data exploration tasks. Limitations are discussed, and potential future research directions are identified. / Master of Science / High-dimensional data are datasets with hundreds or thousands of features. The animal dataset used in this study is an example of a high-dimensional dataset, since animals can be characterized by many features, such as size, fur, and behavior. High-dimensional data is prevalent but difficult for people to analyze. For example, it is hard to assess the similarity among dozens of animals, or to find the relationships between different characterizations of animals. To help people with no statistical knowledge analyze high-dimensional datasets, our group developed a web-based visualization tool called Andromeda, which displays data as points (such as animal data points) on a screen and allows people to interact with these points to express their similarity by dragging points on the screen (e.g., dragging "Lion," "Wolf," and "Killer Whale" together because all three are hunters, forming a cluster of three animals). It therefore enables people to interactively explore the hidden patterns in high-dimensional data. However, we identified several cognitive gaps that have limited Andromeda's effectiveness in helping people understand high-dimensional data.
Therefore, in this work, we made improvements to the original Andromeda system to bridge these gaps, including designing new visual features to help people better understand how Andromeda processes and interacts with high-dimensional data, and improving the underlying algorithm so that the Andromeda system can better understand people's intentions during the data exploration process. We extensively evaluated our designs through both qualitative and quantitative analysis (e.g., user studies with both undergraduate and graduate students, and statistical testing) on our animal dataset, and the results confirmed that the improved Andromeda system significantly outperformed the original version in a series of high-dimensional data understanding tasks. Finally, limitations and potential future research directions are discussed.
36

Designing and Evaluating Object-Level Interaction to Support Human-Model Communication in Data Analysis

Self, Jessica Zeitz 09 May 2016 (has links)
High-dimensional data appear in all domains and are challenging to explore. As the number of dimensions in a dataset increases, it becomes harder to discover patterns and develop insights. Data analysis and exploration is an important skill given the amount of data collected in every field of work. However, learning this skill without an understanding of high-dimensional data is challenging. Users naturally tend to characterize data in simplistic one-dimensional terms using metrics such as the mean, median, and mode. Real-world data is more complex. To gain the most insight from data, users need to recognize and create high-dimensional arguments. Data exploration methods can encourage thinking beyond traditional one-dimensional insights. Dimension reduction algorithms, such as multidimensional scaling, support data exploration by reducing datasets to two dimensions for visualization. Because these algorithms rely on underlying parameterizations, they may be manipulated to assess the data from multiple perspectives. Such manipulation can be difficult for users without strong knowledge of the underlying algorithms. Visual analytics tools that afford object-level interaction (OLI) allow for the generation of more complex insights, despite inexperience with multivariate data or the underlying algorithm. The goal of this research is to develop and test variations on types of interaction for interactive visual analytic systems that enable users to tweak model parameters directly or indirectly so that they may explore high-dimensional data. To study interactive data analysis, we present an interface, Andromeda, that enables non-experts in statistical models to explore domain-specific, high-dimensional data. This application implements interactive weighted multidimensional scaling (WMDS) and allows for both parametric and observation-level interaction to provide in-depth data exploration. We performed multiple user studies to answer how parametric and object-level interaction aid in data analysis. With each study, we found usability issues and then designed solutions for the next study. With each critique we uncovered design principles for effective, interactive visual analytic tools. The final part of this research presents these principles, supported by the results of our multiple informal and formal usability studies. The established design principles focus on human-centered usability for developing interactive visual analytic systems that enable users to analyze high-dimensional data through object-level interaction. / Ph. D.
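For readers unfamiliar with weighted multidimensional scaling, here is a minimal sketch of the idea referenced in the two Andromeda abstracts above: each dimension carries a weight, and the 2-D layout is chosen to preserve the weighted high-dimensional distances, so changing a weight (a parametric interaction) changes the layout. The exact objective and solver used by Andromeda may differ; this is only illustrative, with made-up data.

# Sketch of a WMDS-style layout: minimize the mismatch between weighted
# high-dimensional distances and 2-D layout distances.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.optimize import minimize

def weighted_dists(X, w):
    return pdist(X * np.sqrt(w))          # weighted Euclidean distances

def stress(flat_layout, X, w):
    layout = flat_layout.reshape(len(X), 2)
    return np.sum((weighted_dists(X, w) - pdist(layout)) ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 8))              # 15 observations, 8 dimensions
w = np.full(8, 1.0 / 8)                   # equal dimension weights to start
init = rng.normal(size=15 * 2)
result = minimize(stress, init, args=(X, w), method="L-BFGS-B")
layout = result.x.reshape(15, 2)          # 2-D positions shown to the user
print(result.fun)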
37

Another Slice of Multivariate Dimension Reduction

Ekblad, Carl January 2022 (has links)
This thesis presents some methods of multivariate dimension reduction, with emphasis on methods guided by the work of R.A. Fisher. Some of the methods presented can be traced back to the early 20th century, while others are much more recent. For the more recent methods, additional attention will be paid to their foundational underpinnings. The presentation of each method contains a brief introduction to its general philosophy, accompanied by some theorems, and ends with its connection to the work of Fisher. / This bachelor's thesis presents a number of methods for dimension reduction, with emphasis on methods that follow theory developed by R.A. Fisher. Some of the methods presented were developed as early as the beginning of the 20th century, while others were developed recently. For the recently developed methods, greater weight is placed on the underlying theory of each method. The presentation of each method consists of a shorter description, followed by theorems, and finally a description of its connection to Fisher's theories.
38

Application of Influence Function in Sufficient Dimension Reduction Models

Shrestha, Prabha 28 September 2020 (has links)
No description available.
39

Predicting reliability in multidisciplinary engineering systems under uncertainty

Hwang, Sungkun 27 May 2016 (has links)
The proposed study develops a framework that can accurately capture and model input and output variables for multidisciplinary systems to mitigate the computational cost when uncertainties are involved. The dimension of the random input variables is reduced depending on the degree of correlation, calculated by relative entropy. Feature extraction methods, namely Principal Component Analysis (PCA) and the Auto-Encoder (AE) algorithm, are developed for when the input variables are highly correlated. The Independent Features Test (IndFeaT) is implemented as the feature selection method to select a critical subset of model features when the correlation is low. Moreover, Artificial Neural Networks (ANNs), including the Probabilistic Neural Network (PNN), are integrated into the framework to correctly capture the complex response behavior of the multidisciplinary system at low computational cost. The efficacy of the proposed method is demonstrated with electro-mechanical engineering examples, including a solder joint and a stretchable patch antenna.
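As a point of reference for the feature extraction step described above, here is a minimal PCA sketch that reduces a set of highly correlated inputs before feeding them to a surrogate. The relative-entropy correlation screen, the auto-encoder alternative, the IndFeaT selection path, and the ANN/PNN surrogates from the framework are not shown, and the data and the 99% variance threshold are illustrative assumptions.

# Sketch of PCA-based input reduction for correlated variables.
import numpy as np

rng = np.random.default_rng(4)
latent = rng.normal(size=(300, 2))
noise = 0.05 * rng.normal(size=(300, 10))
X = latent @ rng.normal(size=(2, 10)) + noise      # 10 correlated inputs driven by ~2 factors

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)   # components covering 99% of variance
Z = Xc @ Vt[:k].T                                          # reduced inputs for the surrogate model
print(k, Z.shape)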
40

New Advancements of Scalable Statistical Methods for Learning Latent Structures in Big Data

Zhao, Shiwen January 2016 (has links)
Constant technology advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This phenomenon is particularly true for analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantitative technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data has become unprecedentedly rich. Therefore, efficient statistical approaches are crucial in this big data era.
Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, in a factor analysis model a latent Gaussian distributed multivariate vector is assumed; with this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics, represented by a Dirichlet distributed variable, are assumed. This dissertation proposes several novel extensions to these previous statistical methods, developed to address challenges in big data. The novel methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimating shared topics among newsgroups, analysis of promoter sequences, analysis of political-economic risk data, and estimating population structure from genotype data. / Dissertation
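The low-rank covariance structure mentioned in the abstract above can be illustrated numerically: with k latent factors, the p x p covariance decomposes as Sigma = Lambda Lambda^T + Psi, so only p*k loadings plus p noise variances are needed instead of p*(p+1)/2 free covariance entries. This is a small standalone illustration of the standard factor-analysis identity, not code from the dissertation; dimensions are arbitrary.

# Numeric illustration of the low-rank covariance implied by a factor model.
import numpy as np

p, k = 100, 5
rng = np.random.default_rng(5)
Lambda = rng.normal(size=(p, k))              # factor loadings
Psi = np.diag(rng.uniform(0.5, 1.0, size=p))  # idiosyncratic noise variances

Sigma = Lambda @ Lambda.T + Psi               # implied covariance of the observations
print(np.linalg.matrix_rank(Lambda @ Lambda.T))   # 5: the low-rank part
print(p * k + p, p * (p + 1) // 2)                # 600 parameters vs 5050 free covariance entries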
