Spelling suggestions: "subject:"high dimensional"" "subject:"igh dimensional""
171 |
Some questions in risk management and high-dimensional data analysisWang, Ruodu 04 May 2012 (has links)
This thesis addresses three topics in the area of statistics and
probability, with applications in risk management. First, for the
testing problems in the high-dimensional (HD) data analysis, we
present a novel method to formulate empirical likelihood tests and
jackknife empirical likelihood tests by splitting the sample into
subgroups. New tests are constructed to test the equality of two HD
means, the coefficient in the HD linear models and the HD covariance
matrices. Second, we propose jackknife empirical likelihood methods
to formulate interval estimations for important quantities in
actuarial science and risk management, such as the risk-distortion
measures, Spearman's rho and parametric copulas. Lastly, we
introduce the theory of completely mixable (CM) distributions. We
give properties of the CM distributions, show that a few classes of
distributions are CM and use the new technique to find the bounds
for the sum of individual risks with given marginal distributions
but unspecific dependence structure. The result partially solves a
problem that had been a challenge for decades, and directly leads to
the bounds on quantities of interest in risk management, such as the
variance, the stop-loss premium, the price of the European options
and the Value-at-Risk associated with a joint portfolio.
|
172 |
Statistical Modeling of High-Dimensional Nonlinear Systems: A Projection Pursuit SolutionSwinson, Michael D. 28 November 2005 (has links)
Despite recent advances in statistics, artificial neural network theory, and machine learning, nonlinear function estimation in high-dimensional space remains a nontrivial problem. As the response surface becomes more complicated and the dimensions of the input data increase, the dreaded "curse of dimensionality" takes hold, rendering the best of function approximation methods ineffective. This thesis takes a novel approach to solving the high-dimensional function estimation problem. In this work, we propose and develop two distinct parametric projection pursuit learning networks with wide-ranging applicability. Included in this work is a discussion of the choice of basis functions used as well as a description of the optimization schemes utilized to find the parameters that enable each network to best approximate a response surface.
The essence of these new modeling methodologies is to approximate functions via the superposition of a series of piecewise one-dimensional models that are fit to specific directions, called projection directions. The key to the effectiveness of each model lies in its ability to find efficient projections for reducing the dimensionality of the input space to best fit an underlying response surface. Moreover, each method is capable of effectively selecting appropriate projections from the input data in the presence of relatively high levels of noise. This is accomplished by rigorously examining the theoretical conditions for approximating each solution space and taking full advantage of the principles of optimization to construct a pair of algorithms, each capable of effectively modeling high-dimensional nonlinear response surfaces to a higher degree of accuracy than previously possible.
|
173 |
Dirty statistical modelsJalali, Ali, 1982- 11 July 2012 (has links)
In fields across science and engineering, we are increasingly faced with problems where the number of variables or features we need to estimate is much larger than the number of observations. Under such high-dimensional scaling, for any hope of statistically consistent estimation, it becomes vital to leverage any potential structure in the problem such as sparsity, low-rank structure or block sparsity. However, data may deviate significantly from any one such statistical model. The motivation of this thesis is: can we simultaneously leverage more than one such statistical structural model, to obtain consistency in a larger number of problems, and with fewer samples, than can be obtained by single models? Our approach involves combining via simple linear superposition, a technique we term dirty models. The idea is very simple: while any one structure might not capture the data, a superposition of structural classes might. Dirty models thus searches for a parameter that can be decomposed into a number of simpler structures such as (a) sparse plus block-sparse, (b) sparse plus low-rank and (c) low-rank plus block-sparse. In this thesis, we propose dirty model based algorithms for different problems such as multi-task learning, graph clustering and time-series analysis with latent factors. We analyze these algorithms in terms of the number of observations we need to estimate the variables. These algorithms are based on convex optimization and sometimes they are relatively slow. We provide a class of low-complexity greedy algorithms that not only can solve these optimizations faster, but also guarantee the solution. Other than theoretical results, in each case, we provide experimental results to illustrate the power of dirty models. / text
|
174 |
Analysis of high dimensional repeated measures designs: The one- and two-sample test statistics / Entwicklung von Verfahren zur Analyse von hochdimensionalen Daten mit MesswiederholungenAhmad, Muhammad Rauf 07 July 2008 (has links)
No description available.
|
175 |
Integration of computational methods and visual analytics for large-scale high-dimensional dataChoo, Jae gul 20 September 2013 (has links)
With the increasing amount of collected data, large-scale high-dimensional data analysis is becoming essential in many areas. These data can be analyzed either by using fully computational methods or by leveraging human capabilities via interactive visualization. However, each method has its drawbacks. While a fully computational method can deal with large amounts of data, it lacks depth in its understanding of the data, which is critical to the analysis. With the interactive visualization method, the user can give a deeper insight on the data but suffers when large amounts of data need to be analyzed.
Even with an apparent need for these two approaches to be integrated, little progress has been made. As ways to tackle this problem, computational methods have to be re-designed both theoretically and algorithmically, and the visual analytics system has to expose these computational methods to users so that they can choose the proper algorithms and settings. To achieve an appropriate integration between computational methods and visual analytics, the thesis focuses on essential computational methods for visualization, such as dimension reduction and clustering, and it presents fundamental development of computational methods as well as visual analytic systems involving newly developed methods.
The contributions of the thesis include (1) the two-stage dimension reduction framework that better handles significant information loss in visualization of high-dimensional data, (2) efficient parametric updating of computational methods for fast and smooth user interactions, and (3) an iteration-wise integration framework of computational methods in real-time visual analytics. The latter parts of the thesis focus on the development of visual analytics systems involving the presented computational methods, such as (1) Testbed: an interactive visual testbed system for various dimension reduction and clustering methods, (2) iVisClassifier: an interactive visual classification system using supervised dimension reduction, and (3) VisIRR: an interactive visual information retrieval and recommender system for large-scale document data.
|
176 |
Contributions to Simulation-based High-dimensional Sequential Decision MakingHoock, Jean-Baptiste 10 April 2013 (has links) (PDF)
My thesis is entitled "Contributions to Simulation-based High-dimensional Sequential Decision Making". The context of the thesis is about games, planning and Markov Decision Processes. An agent interacts with its environment by successively making decisions. The agent starts from an initial state until a final state in which the agent can not make decision anymore. At each timestep, the agent receives an observation of the state of the environment. From this observation and its knowledge, the agent makes a decision which modifies the state of the environment. Then, the agent receives a reward and a new observation. The goal is to maximize the sum of rewards obtained during a simulation from an initial state to a final state. The policy of the agent is the function which, from the history of observations, returns a decision. We work in a context where (i) the number of states is huge, (ii) reward carries little information, (iii) the probability to reach quickly a good final state is weak and (iv) prior knowledge is either nonexistent or hardly exploitable. Both applications described in this thesis present these constraints : the game of Go and a 3D simulator of the european project MASH (Massive Sets of Heuristics). In order to take a satisfying decision in this context, several solutions are brought : 1. Simulating with the compromise exploration/exploitation (MCTS) 2. Reducing the complexity by local solving (GoldenEye) 3. Building a policy which improves itself (RBGP) 4. Learning prior knowledge (CluVo+GMCTS) Monte-Carlo Tree Search (MCTS) is the state of the art for the game of Go. From a model of the environment, MCTS builds incrementally and asymetrically a tree of possible futures by performing Monte-Carlo simulations. The tree starts from the current observation of the agent. The agent switches between the exploration of the model and the exploitation of decisions which statistically give a good cumulative reward. We discuss 2 ways for improving MCTS : the parallelization and the addition of prior knowledge. The parallelization does not solve some weaknesses of MCTS; in particular some local problems remain challenges. We propose an algorithm (GoldenEye) which is composed of 2 parts : detection of a local problem and then its resolution. The algorithm of resolution reuses some concepts of MCTS and it solves difficult problems of a classical database. The addition of prior knowledge by hand is laborious and boring. We propose a method called Racing-based Genetic Programming (RBGP) in order to add automatically prior knowledge. The strong point is that RBGP rigorously validates the addition of a prior knowledge and RBGP can be used for building a policy (instead of only optimizing an algorithm). In some applications such as MASH, simulations are too expensive in time and there is no prior knowledge and no model of the environment; therefore Monte-Carlo Tree Search can not be used. So that MCTS becomes usable in this context, we propose a method for learning prior knowledge (CluVo). Then we use pieces of prior knowledge for improving the rapidity of learning of the agent and for building a model, too. We use from this model an adapted version of Monte-Carlo Tree Search (GMCTS). This method solves difficult problems of MASH and gives good results in an application to a word game.
|
177 |
Maximum-likelihood kernel density estimation in high-dimensional feature spaces /| C.M. van der WaltVan der Walt, Christiaan Maarten January 2014 (has links)
With the advent of the internet and advances in computing power, the collection of very large high-dimensional datasets has become feasible { understanding and modelling high-dimensional data has thus become a crucial activity, especially in the field of pattern recognition. Since non-parametric density estimators are data-driven and do not require or impose a pre-defined probability density function on data, they are very powerful tools for probabilistic data modelling and analysis. Conventional non-parametric density estimation methods, however, originated from the field of statistics and were not originally intended to perform density estimation in high-dimensional features spaces { as is often encountered in real-world pattern recognition tasks. Therefore we address the fundamental problem of non-parametric density estimation in high-dimensional feature spaces in this study. Recent advances in maximum-likelihood (ML) kernel density estimation have shown that kernel density estimators hold much promise for estimating nonparametric probability density functions in high-dimensional feature spaces. We therefore derive two new iterative kernel bandwidth estimators from the maximum-likelihood (ML) leave one-out objective function and also introduce a new non-iterative kernel bandwidth estimator (based on the theoretical bounds of the ML bandwidths) for the purpose of bandwidth initialisation. We name the iterative kernel bandwidth estimators the minimum leave-one-out entropy (MLE) and global MLE estimators, and name the non-iterative kernel bandwidth estimator the MLE rule-of-thumb estimator. We compare the performance of the MLE rule-of-thumb estimator and conventional kernel density estimators on artificial data with data properties that are varied in a controlled fashion and on a number of representative real-world pattern recognition tasks, to gain a better understanding of the behaviour of these estimators in high-dimensional spaces and to determine whether these estimators are suitable for initialising the bandwidths of iterative ML bandwidth estimators in high dimensions. We find that there are several regularities in the relative performance of conventional kernel density estimators across different tasks and dimensionalities and that the Silverman rule-of-thumb bandwidth estimator performs reliably across most tasks and dimensionalities of the pattern recognition datasets considered, even in high-dimensional feature spaces. Based on this empirical evidence and the intuitive theoretical motivation that the Silverman estimator optimises the asymptotic mean integrated squared error (assuming a Gaussian reference distribution), we select this estimator to initialise the bandwidths of the iterative ML kernel bandwidth estimators compared in our simulation studies. We then perform a comparative simulation study of the newly introduced iterative MLE estimators and other state-of-the-art iterative ML estimators on a number of artificial and real-world high-dimensional pattern recognition tasks. We illustrate with artificial data (guided by theoretical motivations) under what conditions certain estimators should be preferred and we empirically confirm on real-world data that no estimator performs optimally on all tasks and that the optimal estimator depends on the properties of the underlying density function being estimated. We also observe an interesting case of the bias-variance trade-off where ML estimators with fewer parameters than the MLE estimator perform exceptionally well on a wide variety of tasks; however, for the cases where these estimators do not perform well, the MLE estimator generally performs well. The newly introduced MLE kernel bandwidth estimators prove to be a useful contribution to the field of pattern recognition, since they perform optimally on a number of real-world pattern recognition tasks investigated and provide researchers and
practitioners with two alternative estimators to employ for the task of kernel density
estimation. / PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014
|
178 |
Distributed high-dimensional similarity search with music information retrieval applicationsFaghfouri, Aidin 29 August 2011 (has links)
Today, the advent of networking technologies and computer hardware have enabled more and more inexpensive PCs, various mobile devices, smart phones, PDAs, sensors and cameras to be linked to the Internet with better connectivity. In recent years, we have witnessed the emergence of several instances of distributed applications, providing infrastructures for social interactions over large-scale wide-area networks and facilitating the ways users share and publish data. User generated data today range from simple text files to (semi-) structured documents and multimedia content. With the emergence of Semantic Web, the number of features (associated with a content) that are used in order to index those large amounts of heterogenous pieces of data is growing dramatically. The feature sets associated with each content type can grow continuously as we discover new ways of describing a content in formulated terms.
As the number of dimensions in the feature data grow (as high as 100 to 1000), it becomes harder and harder to search for information in a dataset due to the curse of dimensionality and it is not appropriate to use naive search methods, as their performance degrade to linear search. As an alternative, we can distribute the content and the query processing load to a set of peers in a distributed Peer-to-Peer (P2P) network and incorporate high-dimensional distributed search techniques to attack the problem.
Currently, a large percentage of Internet traffic consists of video and music files shared and exchanged over P2P networks. In most present services, searching for music is performed through keyword search and naive string-matching algorithms using collaborative filtering techniques which mostly use tag based approaches. In music information retrieval (MIR) systems, the main goal is to make recommendations similar to the music that the user listens to. In these systems, techniques based on acoustic feature extraction can be employed to achieve content-based music similarity search (i.e., searching through music based on what can be heard from the music track). Using these techniques we can devise an automated measure of similarity that can replace the need for human experts (or users) who assign descriptive genre tags and meta-data to each recording and solve the famous cold-start problem associated with the collaborative filtering techniques.
In this work we explore the advantages of distributed structures by efficiently distributing the content features and query processing load on the peers in a P2P network. Using a family of Locality Sensitive Hash (LSH) functions based on p-stable distributions we propose an efficient, scalable and load-balanced system, capable of performing K-Nearest-Neighbor (KNN) and Range queries. We also propose a new load-balanced indexing algorithm and evaluate it using our Java based simulator.
Our results show that this P2P design ensures load-balancing and guarantees logarithmic number of hops for query processing. Our system is extensible to be used with all types of multi-dimensional feature data and it can also be employed as the main indexing scheme of a multipurpose recommendation system. / Graduate
|
179 |
Numerical Solution Methods in Stochastic Chemical KineticsEngblom, Stefan January 2008 (has links)
This study is concerned with the numerical solution of certain stochastic models of chemical reactions. Such descriptions have been shown to be useful tools when studying biochemical processes inside living cells where classical deterministic rate equations fail to reproduce actual behavior. The main contribution of this thesis lies in its theoretical and practical investigation of different methods for obtaining numerical solutions to such descriptions. In a preliminary study, a simple but often quite effective approach to the moment closure problem is examined. A more advanced program is then developed for obtaining a consistent representation of the high dimensional probability density of the solution. The proposed method gains efficiency by utilizing a rapidly converging representation of certain functions defined over the semi-infinite integer lattice. Another contribution of this study, where the focus instead is on the spatially distributed case, is a suggestion for how to obtain a consistent stochastic reaction-diffusion model over an unstructured grid. Here it is also shown how to efficiently collect samples from the resulting model by making use of a hybrid method. In a final study, a time-parallel stochastic simulation algorithm is suggested and analyzed. Efficiency is here achieved by moving parts of the solution phase into the deterministic regime given that a parallel architecture is available. Necessary background material is developed in three chapters in this summary. An introductory chapter on an accessible level motivates the purpose of considering stochastic models in applied physics. In a second chapter the actual stochastic models considered are developed in a multi-faceted way. Finally, the current state-of-the-art in numerical solution methods is summarized and commented upon.
|
180 |
Analyse de données de cytometrie de flux pour un grand nombre d'échantillons / Automated flow cytometric analysis across a large number of samplesChen, Xiaoyi 06 October 2015 (has links)
Cette thèse a conduit à la mise au point de deux nouvelles approches statistiques pour l'identification automatique de populations cellulaires en cytometrie de flux multiparamétrique, et ceci pour le traitement d'un grand nombre d'échantillons, chaque échantillon étant prélevé sur un donneur particulier. Ces deux approches répondent à des besoins exprimés dans le cadre du projet Labex «Milieu Intérieur». Dix panels cytométriques de 8 marqueurs ont été sélectionnés pour la quantification des populations principales et secondaires présentes dans le sang périphérique. Sur la base de ces panels, les données ont été acquises et analysées sur une cohorte de 1000 donneurs sains.Tout d'abord, nous avons recherché une quantification robuste des principales composantes cellulaires du système immunitaire. Nous décrivons une procédure computationnelle, appelée FlowGM, qui minimise l'intervention de l'utilisateur. Le cœur statistique est fondé sur le modèle classique de mélange de lois gaussiennes. Ce modèle est tout d'abord utilisé pour obtenir une classification initiale, le nombre de classes étant déterminé par le critère d'information BIC. Après cela, une méta-classification, qui consiste en l'étiquetage des classes et la fusion de celles qui ont la même étiquette au regard de la référence, a permis l'identification automatique de 24 populations cellulaires sur quatre panels. Ces identifications ont ensuite été intégrées dans les fichiers de cytométrie de flux standard (FCS), permettant ainsi la comparaison avec l'analyse manuelle opérée par les experts. Nous montrons que la qualité est similaire entre FlowGM et l'analyse manuelle classique pour les lymphocytes, mais notamment que FlowGM montre une meilleure discrimination des sous-populations de monocytes et de cellules dendritiques (DC), qui sont difficiles à obtenir manuellement. FlowGM fournit ainsi une analyse rapide de phénotypes cellulaires et se prête à des études de cohortes.A des fins d'évaluation, de diagnostic et de recherche, une analyse tenant compte de l'influence de facteurs, comme par exemple les effets du protocole, l'effet de l'âge et du sexe, a été menée. Dans le contexte du projet MI, les 1000 donneurs sains ont été stratifiés selon le sexe et l'âge. Les résultats de l'analyse quantitative faite avec FlowGM ont été jugés concordants avec l'analyse manuelle qui est considérée comme l'état de l'art. On note surtout une augmentation de la précision pour les populations CD16+ et CDC1, où les sous-populations CD14loCD16hi et HLADRhi CDC1 ont été systématiquement identifiées. Nous démontrons que les effectifs de ces deux populations présentent une corrélation significative avec l'âge. En ce qui concerne les populations qui sont connues pour être associées à l'âge, un modèle de régression linéaire multiple a été considéré qui fournit un coefficient de régression renforcé. Ces résultats établissent une base efficace pour l'évaluation de notre procédure FlowGM.Lors de l'utilisation de FlowGM pour la caractérisation détaillée de certaines sous-populations présentant de fortes variations au travers des différents échantillons, par exemple les cellules T, nous avons constaté que FlowGM était en difficulté. En effet, dans ce cas, l'algorithme EM classique initialisé avec la classification de l'échantillon de référence est insuffisant pour garantir l'alignement et donc l'identification des différentes classes entre tous échantillons. Nous avons donc amélioré FlowGM en une nouvelle procédure FlowGMP. Pour ce faire, nous avens ajouté au modèle de mélange, une distribution a priori sur les paramètres de composantes, conduisant à un algorithme EM contraint. Enfin, l'évaluation de FlowGMP sur un panel difficile de cellules T a été réalisée, en effectuant une comparaison avec l'analyse manuelle. Cette comparaison montre que notre procédure Bayésienne fournit une identification fiable et efficace des onze sous-populations de cellules T à travers un grand nombre d'échantillons. / In the course of my Ph.D. work, I have developed and applied two new computational approaches for automatic identification of cell populations in multi-parameter flow cytometry across a large number of samples. Both approaches were motivated and taken by the LabEX "Milieu Intérieur" study (hereafter MI study). In this project, ten 8-color flow cytometry panels were standardized for assessment of the major and minor cell populations present in peripheral whole blood, and data were collected and analyzed from 1,000 cohorts of healthy donors.First, we aim at robust characterization of major cellular components of the immune system. We report a computational pipeline, called FlowGM, which minimizes operator input, is insensitive to compensation settings, and can be adapted to different analytic panels. A Gaussian Mixture Model (GMM) - based approach was utilized for initial clustering, with the number of clusters determined using Bayesian Information Criterion. Meta-clustering in a reference donor, by which we mean labeling clusters and merging those with the same label in a pre-selected representative donor, permitted automated identification of 24 cell populations across four panels. Cluster labels were then integrated into Flow Cytometry Standard (FCS) files, thus permitting comparisons to human expert manual analysis. We show that cell numbers and coefficient of variation (CV) are similar between FlowGM and conventional manual analysis of lymphocyte populations, but notably FlowGM provided improved discrimination of "hard-to-gate" monocyte and dendritic cell (DC) subsets. FlowGM thus provides rapid, high-dimensional analysis of cell phenotypes and is amenable to cohort studies.After having cell counts across a large number of cohort donors, some further analysis (for example, the agreement with other methods, the age and gender effect, etc.) are required naturally for the purpose of comprehensive evaluation, diagnosis and discovery. In the context of the MI project, the 1,000 healthy donors were stratified across gender (50% women and 50% men) and age (20-69 years of age). Analysis was streamlined using our established approach FlowGM, the results were highly concordant with the state-of-art gold standard manual gating. More important, further precision of the CD16+ monocytes and cDC1 population was achieved using FlowGM, CD14loCD16hi monocytes and HLADRhi cDC1 cells were consistently identified. We demonstrate that the counts of these two populations show a significant correlation with age. As for the cell populations that are well-known to be related to age, a multiple linear regression model was considered, and it is shown that our results provided higher regression coefficient. These findings establish a strong foundation for comprehensive evaluation of our previous work.When extending this FlowGM method for detailed characterization of certain subpopulations where more variations are revealed across a large number of samples, for example the T cells, we find that the conventional EM algorithm initiated with reference clustering is insufficient to guarantee the alignment of clusters between all samples due to the presence of technical and biological variations. We then improved FlowGM and presented FlowGMP pipeline to address this specific panel. We introduce a Bayesian mixture model by assuming a prior distribution of component parameters and derive a penalized EM algorithm. Finally the performance of FlowGMP on this difficult T cell panel with a comparison between automated and manual analysis shows that our method provides a reliable and efficient identification of eleven T cell subpopulations across a large number of samples.
|
Page generated in 0.0685 seconds