41 |
COPS: Cluster optimized proximity scaling. Rusch, Thomas; Mair, Patrick; Hornik, Kurt. January 2015 (has links) (PDF)
Proximity scaling methods (e.g., multidimensional scaling) represent objects in a low-dimensional configuration so that fitted distances between objects optimally approximate multivariate proximities. In addition to finding the optimal configuration, the goal is often also to identify groups of objects from the configuration. This can be difficult if the optimal configuration lacks clusteredness (which we coin c-clusteredness). We present Cluster Optimized Proximity Scaling (COPS), which attempts to solve this problem by finding a configuration that exhibits c-clusteredness. In COPS, a flexible scaling loss function (p-stress) is combined with an index that quantifies c-clusteredness in the solution, the OPTICS Cordillera. We present two variants of combining p-stress and the Cordillera, one for finding the configuration directly and one for metaparameter selection for p-stress. The first variant is illustrated by scaling Californian counties with respect to climate-change-related natural hazards. We identify groups of counties with similar risk profiles and find that counties at high risk of drought are also socially vulnerable. The second variant is illustrated by finding a clustered nonlinear representation of countries according to their history of banking crises from 1800 to 2010. (authors' abstract) / Series: Discussion Paper Series / Center for Empirical Research Methods
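A minimal sketch of the COPS idea of trading off scaling fit against clusteredness in a single objective. The function names, the toy clusteredness proxy, and the weight v are illustrative assumptions; the authors' actual p-stress and OPTICS Cordillera definitions are given in the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def stress(X, delta):
    """Squared difference between configuration distances and proximities delta."""
    d = squareform(pdist(X))
    return np.sum((d - delta) ** 2)

def clusteredness(X, k=2):
    """Toy stand-in for the OPTICS Cordillera: larger when the configuration
    forms tight, well-separated groups (ratio of between- to within-cluster spread)."""
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    within = np.sum((X - km.cluster_centers_[km.labels_]) ** 2)
    between = np.sum((km.cluster_centers_ - X.mean(axis=0)) ** 2)
    return between / (within + 1e-12)

def cops_objective(x_flat, delta, n, ndim=2, v=0.5):
    """Weighted combination: penalize scaling stress, reward c-clusteredness."""
    X = x_flat.reshape(n, ndim)
    return (1 - v) * stress(X, delta) - v * clusteredness(X)
```

A configuration could then be sought by passing cops_objective to a general-purpose optimizer such as scipy.optimize.minimize; this only mimics the structure of the combined loss, not the COPS fitting procedure itself.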
|
42 |
Rigorous justification of Taylor Dispersion via Center Manifold theory. Chaudhary, Osman. 10 August 2017 (has links)
Imagine fluid moving through a long pipe or channel, and we inject dye or solute into this pipe. What happens to the dye concentration after a long time? Initially, the dye just moves downstream with the fluid. However, it is also slowly diffusing down the pipe and towards the edges. It turns out that after a long time, the combined effect of transport by the fluid and this slow diffusion results in what is effectively a much more rapid diffusion process, lengthwise down the stream. If 0 < nu << 1 is the small diffusion coefficient, then the effective longitudinal diffusion coefficient is proportional to 1/nu, i.e., much larger. This phenomenon is called Taylor Dispersion, first studied by G. I. Taylor in the 1950s and subsequently by many authors, such as Aris, Chatwin, Smith, and Roberts. However, none of the approaches used in the past seem to have been mathematically rigorous. I'll propose a dynamical systems explanation of this phenomenon: specifically, I'll explain how one can use a Center Manifold reduction to obtain Taylor Dispersion as the dominant term in the long-time limit, and also explain how this Center Manifold can be used to provide any finite number of correction terms to Taylor Dispersion as well.
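For orientation, the classical Taylor-Aris result (standard background, not restated in the abstract) for pressure-driven flow through a circular pipe of radius a with mean speed U and molecular diffusivity nu gives an effective longitudinal diffusivity

```latex
D_{\mathrm{eff}} = \nu + \frac{a^{2}U^{2}}{48\,\nu}
```

so for 0 < nu << 1 the dispersive term a^2 U^2 / (48 nu) dominates, consistent with the 1/nu scaling described above.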
|
43 |
Visual Hierarchical Dimension Reduction. Yang, Jing. 09 January 2002 (has links)
Traditional visualization techniques for multidimensional data sets, such as parallel coordinates, star glyphs, and scatterplot matrices, do not scale well to high dimensional data sets. A common approach to solve this problem is dimensionality reduction. Existing dimensionality reduction techniques, such as Principal Component Analysis, Multidimensional Scaling, and Self Organizing Maps, have serious drawbacks in that the generated low dimensional subspace has no intuitive meaning to users. In addition, little user interaction is allowed in those highly automatic processes. In this thesis, we propose a new methodology for dimensionality reduction that combines automation and user interaction for the generation of meaningful subspaces, called the visual hierarchical dimension reduction (VHDR) framework. First, VHDR groups all dimensions of a data set into a dimension hierarchy. This hierarchy is then visualized using a radial space-filling hierarchy visualization tool called Sunburst. Users are then able to interactively explore and modify the dimension hierarchy, and select clusters at different levels of detail for the data display. VHDR then assigns a representative dimension to each dimension cluster selected by the users. Finally, VHDR maps the high-dimensional data set into the subspace composed of these representative dimensions and displays the projected subspace. To accomplish the latter, we have designed several extensions to existing popular multidimensional display techniques, such as parallel coordinates, star glyphs, and scatterplot matrices. These displays have been enhanced to express the semantics of the selected subspace, such as the context of the dimensions and dissimilarity among the individual dimensions in a cluster. We have implemented all these features and incorporated them into the XmdvTool software package, which will be released as XmdvTool Version 6.0. Lastly, we developed two case studies to show how we apply VHDR to visualize and interactively explore a high dimensional data set.
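A schematic sketch of the non-interactive core of this pipeline: group correlated dimensions into a hierarchy, cut it into clusters, and keep one representative dimension per cluster. The dissimilarity measure, cut level, and representative rule are illustrative assumptions, not the XmdvTool implementation, and the interactive Sunburst exploration step is omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def vhdr_subspace(data, n_clusters=5):
    # Dissimilarity between dimensions: 1 - |correlation|.
    corr = np.corrcoef(data, rowvar=False)
    dissim = 1.0 - np.abs(corr)
    # Build the dimension hierarchy from the condensed upper triangle.
    iu = np.triu_indices_from(dissim, k=1)
    Z = linkage(dissim[iu], method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    # Representative dimension per cluster: the member closest to all others.
    reps = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        centrality = dissim[np.ix_(members, members)].sum(axis=1)
        reps.append(members[np.argmin(centrality)])
    # Project onto the representative dimensions (the "meaningful subspace").
    return data[:, reps], reps
```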
|
44 |
Statistický model tvaru obličeje / Statistical model of the face shape. Boková, Kateřina. January 2019 (has links)
The goal of this thesis is to apply machine learning methods to datasets of scanned faces and to create a program that allows users to explore and edit faces represented as triangle meshes through a number of controls. First, we reduce the dimension of the triangle meshes by PCA; then we predict the shape of the meshes from physical properties such as weight, height, age, and BMI. The modeled faces can be used in animation or games.
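A minimal sketch of this two-step approach, assuming meshes share vertex correspondence and are flattened to coordinate vectors; the number of components and the linear regression model are illustrative choices, not necessarily those of the thesis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def fit_face_model(meshes, attributes, n_components=20):
    # meshes: (n_faces, n_vertices * 3) flattened vertex coordinates
    # attributes: (n_faces, 4) columns weight, height, age, BMI
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(meshes)              # reduce mesh dimension
    reg = LinearRegression().fit(attributes, scores)  # predict scores from attributes
    return pca, reg

def predict_face(pca, reg, weight, height, age, bmi):
    scores = reg.predict([[weight, height, age, bmi]])
    return pca.inverse_transform(scores)            # flattened predicted mesh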
|
45 |
Improving Feature Selection Techniques for Machine Learning. Tan, Feng. 27 November 2007 (has links)
As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant, or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy, and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria. However, it has been discovered that no single criterion is best for all applications. We propose a hybrid feature selection framework based on genetic algorithms (GAs) that employs a target learning algorithm to evaluate features (a wrapper method); we call it the hybrid genetic feature selection (HGFS) framework. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. Experiments on genomic data demonstrate that ours is a robust and effective approach that can find subsets of features with higher classification accuracy and/or smaller size than each individual feature selection algorithm. A common characteristic of text categorization tasks is multi-label classification with a great number of features, which makes wrapper methods time-consuming and impractical. We therefore propose a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and distributed most differently among all classes. The approach is compared with two well-known feature selection methods in experiments on two standard text corpora. The experiments show that RSFV generates performance equal to or better than the others in many cases.
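A schematic GA-based wrapper in the spirit of HGFS: individuals are binary masks over features, and fitness is the cross-validated accuracy of the target learner on the masked feature subset. Population size, genetic operators, and CV settings here are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def ga_feature_selection(X, y, learner, pop_size=30, generations=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))   # binary feature masks

    def fitness(mask):
        if mask.sum() == 0:
            return 0.0
        return cross_val_score(learner, X[:, mask.astype(bool)], y, cv=3).mean()

    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        # Truncation selection (keep the better half), one-point crossover, bit-flip mutation.
        parents = pop[np.argsort(scores)[-pop_size // 2:]]
        children = []
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < 0.01
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.array(children)
    best = pop[np.argmax([fitness(m) for m in pop])]
    return best.astype(bool)
```

Any scikit-learn style estimator can serve as the target learner; the returned boolean mask selects the feature subset for that learner.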
|
46 |
Directional Control of Generating Brownian Path under Quasi Monte Carlo. Liu, Kai. January 2012 (has links)
Quasi-Monte Carlo (QMC) methods are playing an increasingly important role in computational finance. This is attributed to the increased complexity of derivative securities and the sophistication of financial models. Simple closed-form solutions for these finance applications typically do not exist, and hence numerical methods are needed to approximate their solutions. QMC methods have been proposed as an alternative to Monte Carlo (MC) methods to accomplish this objective. Unlike MC methods, the efficiency of QMC-based methods is highly dependent on the dimensionality of the problem. In particular, numerous studies have documented, under the Black-Scholes models, the critical role of the generating matrix for simulating the Brownian paths. Numerical results support the notion that a generating matrix that reduces the effective dimension of the underlying problem is able to increase the efficiency of QMC. Consequently, dimension reduction methods such as principal component analysis, Brownian bridge, Linear Transformation, and Orthogonal Transformation have been proposed to further enhance QMC. Motivated by these results, we first propose a new measure to quantify the effective dimension. We then propose a new dimension reduction method, which we refer to as the directional control (DC) method. The proposed DC method has the advantage that it depends explicitly on the given function of interest. Furthermore, by assigning appropriately the direction of importance of the given function, the proposed method optimally determines the generating matrix used to simulate the Brownian paths. Because of the flexibility of our proposed method, it can be shown that many of the existing dimension reduction methods are special cases of the proposed DC method. Finally, many numerical examples are provided to support the competitive efficiency of the proposed method.
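A sketch of how a generating matrix shapes QMC Brownian paths: the discretized path is A z, where z are standard normals obtained from quasi-random uniforms and A factors the covariance matrix min(t_i, t_j). The standard (Cholesky) and PCA constructions below are textbook choices used as baselines in this literature, not the proposed DC method.

```python
import numpy as np
from scipy.stats import norm, qmc

def brownian_paths(n_paths, n_steps, T=1.0, construction="pca", seed=0):
    t = np.linspace(T / n_steps, T, n_steps)
    cov = np.minimum.outer(t, t)                 # Cov(W_ti, W_tj) = min(ti, tj)
    if construction == "cholesky":
        A = np.linalg.cholesky(cov)              # standard sequential construction
    else:                                        # PCA construction: variance is
        vals, vecs = np.linalg.eigh(cov)         # concentrated in the leading
        order = np.argsort(vals)[::-1]           # coordinates (lower effective dim)
        A = vecs[:, order] * np.sqrt(vals[order])
    # Scrambled Sobol points mapped to normals (powers of two are preferred sizes).
    u = qmc.Sobol(d=n_steps, scramble=True, seed=seed).random(n_paths)
    z = norm.ppf(u)
    return z @ A.T                               # rows are discretized Brownian paths
```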
|
47 |
Computational Methods For Functional Motif Identification and Approximate Dimension Reduction in Genomic Data. Georgiev, Stoyan. January 2011 (has links)
Uncovering the DNA regulatory logic in complex organisms has been one of the important goals of modern biology in the post-genomic era. The sequencing of multiple genomes, in combination with the advent of DNA microarrays and, more recently, of massively parallel high-throughput sequencing technologies, has made possible the adoption of a global perspective on the inference of the regulatory rules governing the context-specific interpretation of the genetic code, complementing the more focused classical experimental approaches. Extracting useful information and managing the complexity resulting from the sheer volume and the high dimensionality of the data produced by these genomic assays has emerged as a major challenge, which we attempt to address in this work by developing computational methods and tools specifically designed for the study of gene regulatory processes in this new global genomic context. First, we focus on the genome-wide discovery of physical interactions between regulatory sequence regions and their cognate proteins at both the DNA and RNA level. We present a motif analysis framework that leverages the genome-wide evidence for sequence-specific interactions between trans-acting factors and their preferred cis-acting regulatory regions. The utility of the proposed framework is demonstrated on DNA and RNA cross-linking high-throughput data. A second goal of this thesis is the development of scalable approaches to dimension reduction based on spectral decomposition and their application to the study of population structure in massive high-dimensional genetic data sets. We have developed computational tools and have performed theoretical and empirical analyses of their statistical properties, with particular emphasis on the analysis of individual genetic variation measured by Single Nucleotide Polymorphism (SNP) microarrays. / Dissertation
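A brief sketch of scalable spectral decomposition in the spirit described above, using randomized SVD to approximate the top principal components of a large genotype matrix without forming the full covariance. This is an illustrative stand-in under that assumption, not the tools developed in the thesis.

```python
import numpy as np

def randomized_pca(G, k=10, oversample=10, seed=0):
    # G: (n_individuals, n_snps) centered genotype matrix
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((G.shape[1], k + oversample))
    Y = G @ Omega                       # sample the column space of G
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for that space
    B = Q.T @ G                         # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k] * s[:k]             # approximate PC scores for the individuals
```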
|
48 |
Analysis of Modeling, Training, and Dimension Reduction Approaches for Target Detection in Hyperspectral Imagery. Farrell, Michael D., Jr. 03 November 2005 (has links)
Whenever a new sensor or system comes online, engineers and analysts responsible for processing the measured data turn first to methods that are tried and true on existing systems. This is a natural, if not wholly logical, approach, and is exactly what has happened with the advent of hyperspectral imagery (HSI) exploitation. However, a closer look at the assumptions made by the approaches published in the literature has not been undertaken.
This thesis analyzes three key aspects of HSI exploitation: statistical data modeling, covariance estimation from training data, and dimension reduction. These items are part of standard processing schemes, and it is worthwhile to understand and quantify the impact that various assumptions for these items have on target detectability and detection statistics.
First, the accuracy and applicability of the standard Gaussian (i.e., Normal) model is evaluated, and it is shown that the elliptically contoured t-distribution (EC-t) sometimes offers a better statistical model for HSI data. A finite mixture approach for EC-t is developed in which all parameters are estimated simultaneously without a priori information. Then the effects of making a poor covariance estimate are shown by including target samples in the training data. Multiple test cases with ground targets are explored. They show that the magnitude of the deleterious effect of covariance contamination on detection statistics depends on algorithm type and target signal characteristics. Next, the two most widely used dimension reduction approaches are tested. It is demonstrated that, in many cases, significant dimension reduction can be achieved with only a minor loss in detection performance.
In addition, a concise development of key HSI detection algorithms is presented, and the state-of-the-art in adaptive detectors is benchmarked for land mine targets. Methods for detection and identification of airborne gases using hyperspectral imagery are discussed, and this application is highlighted as an excellent opportunity for future work.
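A compact sketch of one standard adaptive detector of the kind benchmarked in this line of work, with the background mean and covariance estimated from training data. The formula is the textbook adaptive matched filter, quoted here as background rather than taken from the thesis; the regularization term is an illustrative assumption.

```python
import numpy as np

def adaptive_matched_filter(pixels, target, background):
    """pixels: (n, bands) test spectra; target: (bands,) signature;
    background: (m, bands) target-free training data for mean/covariance."""
    mu = background.mean(axis=0)
    Sigma = np.cov(background, rowvar=False)
    Sinv = np.linalg.inv(Sigma + 1e-6 * np.eye(Sigma.shape[0]))  # regularized inverse
    s = target - mu
    x = pixels - mu
    return (x @ Sinv @ s) ** 2 / (s @ Sinv @ s)   # AMF statistic per pixel
```

Including target-contaminated samples in `background` degrades Sigma and hence the statistic, which is the covariance-contamination effect studied in the thesis.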
|
49 |
Feature Reduction and Multi-label Classification Approaches for Document Data. Jiang, Jung-Yi. 08 August 2011 (has links)
This thesis proposes some novel approaches for feature reduction and multi-label classification for text datasets. In text processing, the bag-of-words model is commonly used, with each document modeled as a vector in a high dimensional space. This model is often called the vector-space model. Usually, the dimensionality of the document vector is huge. Such high-dimensionality can be a severe obstacle for text processing algorithms. To improve the performance of text processing algorithms, we propose a feature clustering approach to reduce the dimensionality of document vectors. We also propose an efficient algorithm for text classification.
Feature clustering is a powerful method for reducing the dimensionality of feature vectors for text classification. We propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with a statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. With this algorithm, the derived membership functions closely match and properly describe the real distribution of the training data. Moreover, the user need not specify the number of extracted features in advance, so trial and error for determining the appropriate number of extracted features can be avoided. Experimental results show that our method runs faster and obtains better extracted features than other methods.
We also propose a fuzzy similarity clustering scheme for multi-label text categorization, in which a document can belong to one or more categories. First, feature transformation is performed: an input document is transformed into a fuzzy-similarity vector. Next, the relevance degrees of the input document to a collection of clusters are calculated, and these are combined to obtain the relevance degree of the input document to each participating category. Finally, the input document is classified into a category if the associated relevance degree exceeds a threshold. In text categorization, the number of terms involved is usually huge, so an automatic classification system may suffer from large memory requirements and poor efficiency. Our scheme avoids these difficulties. In addition, we allow the region a category covers to be a combination of several sub-regions that are not necessarily connected. The effectiveness of our proposed scheme is demonstrated by the results of several experiments.
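A simplified sketch of the self-constructing clustering step described above: each word's pattern (e.g., its distribution over classes) either joins the most similar existing cluster or starts a new one, and each cluster yields one extracted feature as a combination of its member words. The cosine-similarity test and threshold here replace the thesis's fuzzy Gaussian membership functions and are purely illustrative.

```python
import numpy as np

def self_constructing_clusters(word_patterns, threshold=0.9):
    # word_patterns: (n_words, n_classes) ndarray of per-word class statistics
    clusters = []            # each cluster: list of word indices
    means = []               # running mean pattern per cluster
    for i, w in enumerate(word_patterns):
        sims = [w @ m / (np.linalg.norm(w) * np.linalg.norm(m) + 1e-12)
                for m in means]
        if sims and max(sims) >= threshold:
            j = int(np.argmax(sims))           # join the most similar cluster
            clusters[j].append(i)
            means[j] = np.mean(word_patterns[clusters[j]], axis=0)
        else:                                   # otherwise start a new cluster
            clusters.append([i])
            means.append(w.astype(float))
    return clusters

def extract_features(doc_term, clusters):
    # One extracted feature per cluster: sum of the member word counts.
    return np.stack([doc_term[:, c].sum(axis=1) for c in clusters], axis=1)
```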
|
50 |
Principal Components Analysis for Binary Data. Lee, Seokho. May 2009 (has links)
Principal components analysis (PCA) has been widely used as a statistical tool for the dimension reduction of multivariate data in various application areas and has been extensively studied in the long history of statistics. One limitation of the PCA machinery is that it can be applied only to continuous variables. Recent advances in information technology across applied areas have created numerous large, diverse data sets with high-dimensional feature spaces, including high-dimensional binary data. In spite of this great demand, only a few methodologies tailored to such binary data sets have been suggested. The methodology we develop is a model-based approach that generalizes PCA to binary data. We develop a statistical model for binary PCA and propose two stable estimation procedures using an MM algorithm and a variational method. By incorporating a regularization technique, the selection of important variables is achieved automatically. We also propose an efficient algorithm for model selection, including the choice of the number of principal components and the regularization parameter.
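A bare-bones sketch of model-based binary PCA in the logistic-PCA flavor: model P(x_ij = 1) = sigmoid((U V^T)_ij) and fit U, V by gradient ascent on the Bernoulli log-likelihood. This illustrates the model class only; it does not reproduce the MM or variational estimation procedures, and the step sizes are arbitrary assumptions.

```python
import numpy as np

def binary_pca(X, k=2, lr=0.05, iters=500, seed=0):
    # X: (n, p) binary data matrix with entries in {0, 1}
    rng = np.random.default_rng(seed)
    n, p = X.shape
    U = 0.01 * rng.standard_normal((n, k))
    V = 0.01 * rng.standard_normal((p, k))
    for _ in range(iters):
        theta = U @ V.T
        resid = X - 1.0 / (1.0 + np.exp(-theta))   # X - sigmoid(theta)
        U += lr * resid @ V / p                    # scaled gradient ascent steps
        V += lr * resid.T @ U / n
    return U, V                                    # scores and loadings
```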
|