  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Adapting Component Analysis

Dorri, Fatemeh January 2012 (has links)
A main problem in machine learning is to predict the response variables of a test set given the training data and its corresponding response variables. A predictive model can perform satisfactorily only if the training data is an appropriate representative of the test data. This intuition is reflected in the assumption that the training data and the test data are drawn from the same underlying distribution. However, the assumption may not hold in many applications for various reasons. For example, gathering training data from the test population might not be feasible because it is expensive or the data is rare, or factors such as time, place, and weather can cause the distributions to differ. I propose a method based on kernel distribution embedding and the Hilbert-Schmidt Independence Criterion (HSIC) to address this problem. The proposed method explores a new representation of the data in a new feature space with two properties: (i) the distributions of the training and the test data sets are as close as possible in the new feature space, and (ii) the important structural information of the data is preserved. The algorithm can reduce the dimensionality of the data while preserving the aforementioned properties, and therefore it can also be seen as a dimensionality reduction method. Our method has a closed-form solution, and experimental results on various data sets show that it works well in practice.
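As an illustration of the dependence measure this abstract builds on, the following is a minimal sketch (not the thesis code) of the biased empirical HSIC estimator, trace(KHLH)/(n-1)^2, with RBF kernels; the bandwidth heuristic and the toy data are assumptions for demonstration only.

```python
# Minimal sketch of the biased empirical HSIC estimator between paired samples X and Y.
import numpy as np

def rbf_kernel(A, gamma=None):
    """Gram matrix of an RBF kernel; gamma defaults to 1 / median squared distance."""
    sq = np.sum(A**2, axis=1, keepdims=True)
    d2 = sq + sq.T - 2 * A @ A.T
    if gamma is None:
        gamma = 1.0 / np.median(d2[d2 > 0])
    return np.exp(-gamma * d2)

def hsic(X, Y):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)**2."""
    n = X.shape[0]
    K, L = rbf_kernel(X), rbf_kernel(Y)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# A dependent pair scores higher than an independent one.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
print(hsic(X, X[:, :2] + 0.1 * rng.normal(size=(200, 2))))  # dependent
print(hsic(X, rng.normal(size=(200, 2))))                    # independent
```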
22

Knowledge support for parallel performance data mining

Huck, Kevin A., 1972- 03 1900 (has links)
xvi, 231 p. : ill. / Parallel applications running on high-end computer systems manifest a complex combination of performance phenomena, such as communication patterns, work distributions, and computational inefficiencies. Current performance tools compute results that help to describe performance behavior, as well as to understand performance problems and how they came about. Unfortunately, parallel performance tool research has been limited in its contributions to large-scale performance data management and analysis, automated performance investigation, and knowledge-based performance problem reasoning. This dissertation discusses the design of a performance analysis methodology and framework which integrates scalable data management, dimension reduction, clustering, classification and correlation analysis of individual trials of large dimensions, and comparative analysis between multiple application executions. Analysis process workflows can be captured, automating what would otherwise be time-consuming and possibly error-prone tasks. More importantly, process automation provides an extensible interface to the analysis process. The methods also integrate context metadata and a rule-based system in order to capture expert performance analysis knowledge about known anomalous behavior patterns. Applying this knowledge to performance analysis results and associated metadata provides a mechanism for diagnosing the causes of performance problems, rather than just summarizing results. Our prototype implementations of our data mining framework, PerfExplorer, and our data management framework, PerfDMF, are applied in large-scale performance studies to demonstrate each thesis contribution. The dissertation concludes with a discussion of future research directions. / Adviser: Allen D. Malony
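To make the dimension-reduction-plus-clustering step concrete, here is a hedged sketch of clustering synthetic per-thread performance profiles with PCA and k-means; it illustrates the general idea only, not PerfExplorer's or PerfDMF's actual implementation or APIs, and the profile data is invented.

```python
# Rows are per-thread performance profiles, columns are timer/counter metrics (synthetic).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
profiles = np.vstack([
    rng.normal(10, 1, size=(64, 20)),   # "compute-bound" threads
    rng.normal(25, 1, size=(64, 20)),   # "communication-bound" threads
])

X = StandardScaler().fit_transform(profiles)        # normalize metrics
Z = PCA(n_components=2).fit_transform(X)            # reduce to 2 components
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(np.bincount(labels))                          # sizes of discovered behavior groups
```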
23

Variable selection and dimension reduction in high-dimensional regression

Wang, Tao 01 January 2013 (has links)
No description available.
24

Person Re-identification Based on Kernel Local Fisher Discriminant Analysis and Mahalanobis Distance Learning

He, Qiangsen January 2017 (has links)
Person re-identification (Re-ID) has become an intense research area in recent years. The main goal of this topic is to recognize and match individuals over time at the same or different locations. This task is challenging due to variation in illumination, viewpoint, pedestrians' appearance, and partial occlusion. Previous works mainly focus on finding robust features and metric learning. Many metric learning methods convert the Re-ID problem to a matrix decomposition problem via Fisher discriminant analysis (FDA). Mahalanobis distance metric learning is a popular method to measure similarity; however, since directly extracted descriptors usually have high dimensionality, it is intractable to learn a high-dimensional semi-positive definite (SPD) matrix. Dimensionality reduction is used to project high-dimensional descriptors to a lower-dimensional space while preserving their discriminative information. In this paper, kernel local Fisher discriminant analysis (KLFDA) [38] is used to reduce dimensionality, given that kernelization can greatly improve Re-ID performance in the presence of nonlinearity. Inspired by [47], an SPD matrix is then learned on the lower-dimensional descriptors under the constraint that the maximum intraclass distance is at least one unit smaller than the minimum interclass distance. This method is shown to have excellent performance compared with other advanced metric learning methods.
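The following minimal sketch illustrates the Mahalanobis matching step in general terms: features (assumed already reduced by a KLFDA-like step) are compared with d(x, y) = (x - y)^T M (x - y), where M is kept positive semidefinite by eigenvalue clipping. The metric here is random for demonstration; the constraint-based learning described in the abstract is not implemented.

```python
# Illustrative Mahalanobis scoring of person-image pairs with a PSD matrix M.
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by zeroing negative eigenvalues."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0, None)) @ V.T

def mahalanobis(x, y, M):
    d = x - y
    return float(d @ M @ d)

rng = np.random.default_rng(2)
dim = 8                                          # dimensionality after KLFDA-style reduction
M = project_psd(rng.normal(size=(dim, dim)))     # stand-in for a learned metric
probe, gallery = rng.normal(size=dim), rng.normal(size=(10, dim))
scores = [mahalanobis(probe, g, M) for g in gallery]
print(int(np.argmin(scores)))                    # index of the best-matching gallery image
```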
25

DEEP LEARNING FOR STATISTICAL DATA ANALYSIS: DIMENSION REDUCTION AND CAUSAL STRUCTURE INFERENCE

Siqi Liang (11799653) 19 December 2021 (has links)
During the past decades, deep learning has been proven to be an important tool for statistical data analysis. Motivated by the promise of deep learning in tackling the curse of dimensionality, we propose three innovative methods which apply deep learning techniques to high-dimensional data analysis in this dissertation.

Firstly, we propose a nonlinear sufficient dimension reduction method, the so-called split-and-merge deep neural networks (SM-DNN), which employs the split-and-merge technique on deep neural networks to obtain a nonlinear sufficient dimension reduction of the input data and then learns a deep neural network on the dimension-reduced data. We show that the DNN-based dimension reduction is sufficient for data drawn from the exponential family, which retains all information on the response contained in the explanatory data. Our numerical experiments indicate that the SM-DNN method can lead to significant improvement in phenotype prediction for a variety of real data examples. In particular, with only rare variants, we achieved a remarkable prediction accuracy of over 74% for the Early-Onset Myocardial Infarction (EOMI) exome sequence data.

Secondly, we propose another nonlinear SDR method based on a new type of stochastic neural network under a rigorous probabilistic framework and show that it can be used for sufficient dimension reduction of high-dimensional data. The proposed stochastic neural network can be trained using an adaptive stochastic gradient Markov chain Monte Carlo algorithm. Through extensive experiments on real-world classification and regression problems, we show that the proposed method compares favorably with existing state-of-the-art sufficient dimension reduction methods and is computationally more efficient for large-scale data.

Finally, we propose a structure learning method for learning the causal structure hidden in high-dimensional data, which consists of two stages: we first conduct Bayesian sparse learning for variable screening to build a primary graph, and then we perform conditional independence tests to refine the primary graph. Extensive numerical experiments and quantitative tests confirm the generality, effectiveness and power of the proposed methods.
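As a rough illustration of DNN-based sufficient dimension reduction (not the SM-DNN architecture or the SGMCMC-trained stochastic network described above), the sketch below trains a network whose narrow first layer serves as the learned low-dimensional representation of the input for predicting the response; the synthetic regression data and the layer sizes are assumptions.

```python
# Train a network with a narrow first layer; that layer's output is the reduced representation.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, p, d = 500, 50, 3                                   # samples, input dim, reduced dim
X = torch.randn(n, p)
y = torch.sin(X[:, :1]) + 0.5 * X[:, 1:2] ** 2 + 0.1 * torch.randn(n, 1)

reducer = nn.Linear(p, d)                              # learned low-dimensional projection
head = nn.Sequential(nn.ReLU(), nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
model = nn.Sequential(reducer, head)

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

Z = reducer(X).detach()                                # reduced 3-dimensional features
print(Z.shape, float(loss))
```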
26

Numerical algorithms for data clustering

Liu, Ye 30 July 2019 (has links)
Data clustering is the process of grouping unlabeled objects based on the information describing their relationships, and it has attracted a lot of attention in data mining because of its wide applications. For example, in marketing, companies are interested in finding groups of customers with similar purchase behavior, which helps them make suitable plans to gain more profit. In biology, data clustering can be used to distinguish plants and animals given their features. In earthquake analysis, clustering observed earthquake epicenters can identify dangerous areas, helping people take protective measures in advance. In general, there is no single clustering algorithm that can solve all problems; algorithms are designed specifically to analyze different data categories. In this thesis, we study several novel numerical algorithms for data clustering, mainly applied to multi-view data and tensor data. More accurate clustering results can be achieved on multi-view data by integrating information from multiple graphs. However, most existing multi-view clustering methods assume the degree of association among all the graphs is the same. In reality, some graphs may be strongly or weakly associated with other graphs, so determining the degree of association between graphs is a key issue when clustering multi-view data. In Chapters 2, 3 and 4, we propose three different models to solve this problem. In Chapter 2, a block signed matrix is constructed to integrate the information in each graph together with the association among graphs. We then apply spectral clustering to it to seek a separate cluster structure for each graph and, at the same time, determine the degree of association among graphs using their cluster structures. Numerical experiments including simulations, neuron activity data and gene expression data are conducted to illustrate the state-of-the-art performance of the algorithm in clustering and graph association. In Chapter 3, we further consider multiple-graph clustering with graph association solved by a self-consistent field iterative algorithm. Using the block graph clustering framework, graph association is considered to enhance the clustering result, and the better clustering result is then used to calculate a more accurate association. A self-consistent field iterative method is employed to solve this problem, and its convergence analysis is also presented. Simulations are carried out to demonstrate that our method outperforms alternatives, and two gene expression data sets are used to evaluate the effectiveness of the proposed model. In Chapter 4, we formulate the multiple-graph clustering problem with graph association as an objective function in which the graph association appears as a term. The proposed model can be solved efficiently by a gradient flow method, and we also present its convergence analysis. Experiments on synthetic data sets and two gene expression data sets show its efficiency in clustering and its capability in graph association. In these three chapters, we use multiple graphs to represent multi-view data; a key challenge is high dimensionality when the number of graphs or objects is large. Moreover, a tensor is another common way to describe multi-view data. Thus a tensor decomposition method can be used to first learn a low-dimensional representation of high-dimensional data and then perform clustering efficiently, which has attracted worldwide attention from researchers. In Chapter 5, we propose an orthogonal nonnegative Tucker decomposition method to decompose a high-dimensional nonnegative tensor into a tensor of smaller size for dimension reduction, and then perform clustering analysis. A convex relaxation algorithm for the augmented Lagrangian function is developed to solve the optimization problem, and the convergence of the algorithm is discussed. We apply the proposed method to several real image data sets from different real-world applications, including face recognition, image representation and hyperspectral unmixing, to illustrate the effectiveness of the proposed algorithm.
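A toy sketch of the multi-view clustering setting discussed above follows: several noisy similarity graphs over the same objects are combined and clustered through a spectral embedding. Unlike the thesis's models, the per-graph association weights here are simply uniform rather than learned, and the graphs are synthetic.

```python
# Combine multiple similarity graphs and cluster with a spectral embedding.
import numpy as np
from sklearn.cluster import KMeans

def spectral_embed(W, k):
    """Bottom-k eigenvectors of the normalized Laplacian of affinity matrix W."""
    d = W.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(W)) - Dinv @ W @ Dinv
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :k]

rng = np.random.default_rng(3)
labels_true = np.repeat([0, 1], 30)
views = []
for _ in range(3):                                   # three noisy views of the same grouping
    W = 0.2 * rng.random((60, 60))
    W += 0.8 * (labels_true[:, None] == labels_true[None, :])
    views.append((W + W.T) / 2)

W_combined = np.mean(views, axis=0)                  # uniform graph association
Z = spectral_embed(W_combined, k=2)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z))
```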
27

Bayesian Visual Analytics: Interactive Visualization for High Dimensional Data

Han, Chao 07 December 2012 (has links)
In light of advancements made in data collection techniques over the past two decades, data mining has become common practice for summarizing large, high-dimensional datasets in the hope of discovering noteworthy data structures. However, one concern is that most data mining approaches rely upon strict criteria that may mask information that analysts would find useful. We propose a new approach called Bayesian Visual Analytics (BaVA), which merges Bayesian statistics with visual analytics to address this concern. The BaVA framework enables experts to interact with the data and the feature discovery tools by modeling the "sense-making" process using Bayesian sequential updating. In this work, we use the BaVA idea to enhance high-dimensional visualization techniques such as Probabilistic PCA (PPCA). However, for real-world datasets, important structures can be arbitrarily complex, and a single data projection such as PPCA may fail to provide useful insights. One way to visualize such a dataset is to characterize it by a mixture of local models. For example, Tipping and Bishop [Tipping and Bishop, 1999] developed an algorithm called Mixture Probabilistic PCA (MPPCA) that extends PCA to visualize data via a mixture of projectors. Based on MPPCA, we developed a new visualization algorithm called Covariance-Guided MPPCA, which groups clusters with similar covariance structure together to provide more meaningful and cleaner visualizations. Another way to visualize a very complex dataset is to use nonlinear projection methods such as the Generative Topographic Mapping (GTM) algorithm. We developed an interactive version of GTM to discover interesting local data structures. We demonstrate the performance of our approaches using both synthetic and real datasets and compare our algorithms with existing ones. / Ph. D.
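For reference, the sketch below implements plain Probabilistic PCA via Tipping and Bishop's closed-form maximum likelihood solution, the single-projector base case that MPPCA and Covariance-Guided MPPCA extend; it is an illustrative reconstruction from the published equations, not the BaVA code.

```python
# Closed-form PPCA: W = U_q (Lambda_q - sigma^2 I)^(1/2), sigma^2 = mean of discarded eigenvalues.
import numpy as np

def ppca_fit(X, q):
    """Return (mu, W, sigma2) for a q-dimensional PPCA model of data X (n x p)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[q:].mean()                              # average discarded variance
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0))
    return mu, W, sigma2

def ppca_project(X, mu, W, sigma2):
    """Posterior mean of the latent variables: M^{-1} W^T (x - mu), with M = W^T W + sigma^2 I."""
    M = W.T @ W + sigma2 * np.eye(W.shape[1])
    return (X - mu) @ W @ np.linalg.inv(M).T

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(300, 10))
mu, W, sigma2 = ppca_fit(X, q=2)
print(ppca_project(X, mu, W, sigma2).shape)                # (300, 2) latent coordinates
```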
28

On Applications of Semiparametric Methods

Li, Zhijian 01 October 2018 (has links)
No description available.
29

Unsupervised Dimension Reduction Techniques for Lung Diagnosis using Radiomics

Kireta, Janet 01 May 2023 (has links) (PDF)
Over the years, cancer has increasingly become a global health problem [12]. For successful treatment, early detection and diagnosis are critical. Radiomics is the use of CT, PET, MRI or ultrasound imaging as input data, extracting features from the image-based data, and then using machine learning for quantitative analysis and disease prediction [23, 14, 19, 1]. Feature reduction is critical because most quantitative features have redundant characteristics. The objective of this research is to use machine learning techniques to reduce the number of dimensions, thereby rendering the data manageable. The radiomics steps include imaging, segmentation, feature extraction, and analysis. For this research, a large-scale CT dataset for lung cancer diagnosis collected by scholars from a medical university in China is used to illustrate the dimension reduction techniques via R, SAS, and Python software. The proposed reduction and analysis techniques were PCA, clustering, and manifold-based algorithms. The results indicated the texture-based features
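A hedged sketch of the PCA-plus-clustering stage on a radiomics-style feature table is shown below; the feature matrix is a synthetic stand-in, not the CT dataset used in the thesis, and the cluster count is an assumption.

```python
# Rows = lung lesions, columns = texture/shape/intensity features (synthetic stand-in).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
features = rng.normal(size=(120, 90))            # 120 lesions x 90 radiomic features
features[:60, :10] += 3.0                        # one subgroup differs on a few features

X = StandardScaler().fit_transform(features)
pca = PCA(n_components=0.95, svd_solver="full")  # keep components explaining 95% of variance
Z = pca.fit_transform(X)
print(Z.shape[1], "components retained")

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(np.bincount(clusters))                     # cluster sizes
```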
30

A Study Of Factors Contributing To Self-reported Anomalies In Civil Aviation

Andrzejczak, Chris 01 January 2010 (has links)
A study was conducted to investigate the factors that lead pilots to submit voluntary anomaly reports regarding their flight performance. The study employed statistical methods, text mining, clustering, and dimensionality reduction techniques in an effort to determine relationships between factors and anomalies. A review of the literature was conducted to determine what factors contribute to these anomalous incidents, as well as what research exists on human error, its causes, and its management. Data from the NASA Aviation Safety Reporting System (ASRS) were analyzed using traditional statistical methods such as frequencies and multinomial logistic regression. Recently formalized approaches in text mining such as Knowledge Based Discovery (KBD) and Literature Based Discovery (LBD) were employed to create associations between factors and anomalies. These methods were also used to generate predictive models. Finally, advances in dimensionality reduction techniques identified concepts or keywords within records, thus creating a framework for an unsupervised document classification system. Findings from this study reinforced established views on contributing factors to civil aviation anomalies. New associations between previously unrelated factors and conditions were also found. Dimensionality reduction also demonstrated the possibility of identifying salient factors from unstructured text records and was able to classify these records using the identified features.
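The following small sketch illustrates the kind of unsupervised text pipeline described above: TF-IDF features, truncated SVD (latent semantic analysis) for dimensionality reduction, then clustering of report narratives. The example narratives are invented and are not ASRS records, and this is not the study's KBD/LBD implementation.

```python
# TF-IDF -> truncated SVD (LSA) -> k-means clustering of toy incident narratives.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

reports = [
    "runway incursion during taxi in low visibility",
    "altitude deviation after autopilot disengaged in turbulence",
    "taxiway confusion led to wrong runway lineup",
    "unexpected turbulence caused altitude bust on descent",
]

X = TfidfVectorizer(stop_words="english").fit_transform(reports)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)   # latent "concepts"
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z))
```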
