161

Order in the random forest

Karlsson, Isak January 2017
In many domains, repeated measurements are systematically collected to obtain the characteristics of objects or situations that evolve over time or other logical orderings. Although the classification of such data series shares many similarities with traditional multidimensional classification, inducing accurate machine learning models using traditional algorithms is typically infeasible, since the order of the values must be considered. In this thesis, the challenges related to inducing predictive models from data series using a class of algorithms known as random forests are studied for the purpose of efficiently and effectively classifying (i) univariate, (ii) multivariate and (iii) heterogeneous data series, either directly in their sequential form or indirectly as transformed to sparse and high-dimensional representations. In the thesis, methods are developed to address the challenges of (a) handling sparse and high-dimensional data, (b) data series classification and (c) early time series classification using random forests. The proposed algorithms are empirically evaluated in large-scale experiments and practically evaluated in the context of detecting adverse drug events. In the first part of the thesis, it is demonstrated that minor modifications to the random forest algorithm and the use of a random projection technique can improve the effectiveness of random forests when faced with discrete data series projected to sparse and high-dimensional representations. In the second part of the thesis, an algorithm for inducing random forests directly from univariate, multivariate and heterogeneous data series using phase-independent patterns is introduced and shown to be highly effective in terms of both computational and predictive performance. Then, leveraging the notion of phase-independent patterns, the random forest is extended to allow for early classification of time series and is shown to perform favorably when compared to alternatives. The conclusions of the thesis not only reaffirm the empirical effectiveness of random forests for traditional multidimensional data but also indicate that the random forest framework can successfully be extended to sequential data representations.
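As a rough illustration of classifying time series with a random forest over phase-independent patterns, the sketch below extracts randomly sampled subsequences (shapelet-like patterns) from the training series, uses each series' minimum distance to every pattern as features, and fits a standard random forest. This is a simplified toy transform under assumed data, not the algorithm developed in the thesis; the pattern count, window length and forest settings are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def random_patterns(X, n_patterns=50, length=20):
    # Sample random subsequences ("phase-independent" candidate patterns) from training series.
    n, m = X.shape
    rows = rng.integers(0, n, n_patterns)
    starts = rng.integers(0, m - length + 1, n_patterns)
    return np.stack([X[r, s:s + length] for r, s in zip(rows, starts)])

def min_distances(X, patterns):
    # Feature = minimum Euclidean distance of each pattern to any window of the series.
    length = patterns.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(X, length, axis=1)
    feats = np.empty((X.shape[0], len(patterns)))
    for j, p in enumerate(patterns):
        feats[:, j] = np.sqrt(((windows - p) ** 2).sum(-1)).min(axis=1)
    return feats

def make_series(n, m=150):
    # Toy data: class 0 carries an upward bump, class 1 a downward one, at a random position.
    X = rng.normal(0, 0.3, (n, m))
    y = rng.integers(0, 2, n)
    for i in range(n):
        pos = rng.integers(10, m - 20)
        X[i, pos:pos + 10] += 2.0 if y[i] == 0 else -2.0
    return X, y

X_tr, y_tr = make_series(200)
X_te, y_te = make_series(100)
patterns = random_patterns(X_tr)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(min_distances(X_tr, patterns), y_tr)
print("test accuracy:", clf.score(min_distances(X_te, patterns), y_te))
```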
162

Évaluation de la performance du score de propension à hautes dimensions dans le cadre d’études observationnelles québécoises

Guertin, Jason Robert 12 1900
Propensity scores (PS) are frequently used to adjust for confounders leading to indication bias. However, PS are limited by the fact that they can only adjust for known and measured confounders. High-dimensional propensity scores (hdPS), a variant of the PS, select the covariates they adjust for by means of a standardized selection algorithm. Thanks to this selection algorithm, the hdPS could potentially adjust for all types of confounders. This thesis aims to evaluate the hdPS's performance in adjusting for indication bias in the context of an observational study of the potential diabetogenic effect of statins. The first article aimed to identify whether exposure to statins was associated with the risk of diabetes. Its results suggest that exposure to statins is associated with an increased risk of diabetes and that this association is dose-dependent and reversible over time. After having identified this association, we examined in a second article whether the hdPS outperforms the PS in adjusting for indication bias. The two methods' performance was compared by means of the adjusted measures of association obtained and by means of the standardized differences on 19 characteristics following the creation of two matched sub-cohorts (each matched on either patients' PS or patients' hdPS). The results of this second article show that the two methods could not be differentiated by the first approach but that, based on the second approach, the hdPS outperforms the PS in its adjustment for indication bias. The last article evaluated whether the hdPS could adjust for known and measured confounders that were hidden from the selection algorithm. Its results suggest that the hdPS method can adjust, at least partially, for hidden confounders and that it could therefore potentially adjust for unmeasured confounders. As a whole, these results indicate that the hdPS may be superior to the PS in its ability to adjust for indication bias and support its use in future observational studies based on medico-administrative databases.
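To make the propensity-score machinery concrete, here is a minimal sketch of conventional PS adjustment, not the hdPS variable-selection algorithm evaluated in the thesis: a logistic regression estimates each patient's probability of treatment from measured covariates, patients are 1:1 greedy nearest-neighbour matched on that score without replacement, and standardized differences are checked before and after matching. The simulated covariates and effect sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Simulated cohort: two measured covariates influence treatment assignment.
n = 2000
age = rng.normal(60, 10, n)
ldl = rng.normal(3.5, 1.0, n)
p_treat = 1 / (1 + np.exp(-(-8 + 0.1 * age + 0.5 * ldl)))
treated = rng.binomial(1, p_treat)

X = np.column_stack([age, ldl])
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Greedy 1:1 nearest-neighbour matching on the propensity score, without replacement.
t_idx = np.where(treated == 1)[0]
c_idx = list(np.where(treated == 0)[0])
pairs = []
for i in t_idx:
    j = min(c_idx, key=lambda k: abs(ps[i] - ps[k]))
    pairs.append((i, j))
    c_idx.remove(j)

def std_diff(x, g):
    # Standardized difference of covariate x between treated (g=1) and controls (g=0).
    m1, m0 = x[g == 1].mean(), x[g == 0].mean()
    s = np.sqrt((x[g == 1].var(ddof=1) + x[g == 0].var(ddof=1)) / 2)
    return (m1 - m0) / s

matched = np.array(pairs).ravel()
for name, x in [("age", age), ("ldl", ldl)]:
    print(name,
          "before:", round(std_diff(x, treated), 3),
          "after:", round(std_diff(x[matched], treated[matched]), 3))
```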
163

Conception d'heuristiques d'optimisation pour les problèmes de grande dimension : application à l'analyse de données de puces à ADN / Heuristics implementation for high-dimensional problem optimization : application in microarray data analysis

Gardeux, Vincent 30 November 2011
This PhD thesis addresses the recent issue of solving high-dimensional problems. We present methods designed to solve them, and their applications to feature selection problems in the data mining field. In the first part of this thesis, we introduce the stakes of solving high-dimensional problems. We mainly investigate line search methods, which we consider particularly suitable for solving such problems. We then present the methods we developed based on this principle: CUS, EUS and EM323. We emphasize, in particular, the very high convergence speed of CUS and EUS and their simplicity of implementation. The EM323 method is a hybridization of EUS with a one-dimensional optimization algorithm developed by F. Glover: the 3-2-3 algorithm. We show that EM323 obtains more accurate results, especially for non-separable problems, which are the weakness of line-search-based methods. In the second part, we focus on data mining problems, and especially microarray data analysis. The objectives are to classify the data and to predict the behavior of new samples. A collaboration with the Tenon Hospital in Paris allowed us to analyze private breast cancer data. To this end, we developed an exact method, called delta-test, later enhanced by a method designed to automatically select the optimal number of variables. We then developed a heuristic feature selection method, named ABEUS, based on optimizing the performance of the DLDA classifier. The results obtained on publicly available data show that our methods select very small subsets of variables, which is an important criterion for avoiding overfitting.
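The line-search idea behind these methods can be illustrated with a generic coordinate-wise scheme: optimize one variable at a time with a one-dimensional search while holding the others fixed. The sketch below is a plain coordinate line search on a toy test function, not an implementation of CUS, EUS or EM323; the test function, bounds and sweep count are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_line_search(f, x0, sweeps=50, bounds=(-5.0, 5.0)):
    # Repeatedly minimize f along each coordinate with a bounded 1-D search.
    x = np.array(x0, dtype=float)
    for _ in range(sweeps):
        for i in range(len(x)):
            def f_i(t, i=i):
                z = x.copy()
                z[i] = t
                return f(z)
            x[i] = minimize_scalar(f_i, bounds=bounds, method="bounded").x
    return x

# Separable test problem: a sphere function shifted away from the origin.
target = np.arange(1, 11) / 10.0
sphere = lambda z: float(np.sum((z - target) ** 2))

x_star = coordinate_line_search(sphere, x0=np.zeros(10))
print("best value found:", sphere(x_star))
```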
164

Three Essays in Functional Time Series and Factor Analysis

Nisol, Gilles 20 December 2018
The thesis is dedicated to time series analysis for functional data and contains three original parts. In the first part, we derive statistical tests for the presence of a periodic component in a time series of functions. We consider both the traditional setting in which the periodic functional signal is contaminated by functional white noise, and a more general setting of a contaminating process which is weakly dependent. Several forms of the periodic component are considered. Our tests are motivated by the likelihood principle and fall into two broad categories, which we term multivariate and fully functional. Overall, for the functional series that motivate this research, the fully functional tests exhibit a superior balance of size and power. Asymptotic null distributions of all tests are derived and their consistency is established. Their finite sample performance is examined and compared by numerical studies and an application to pollution data. In the second part, we consider vector autoregressive processes (VARs) with innovations having a singular covariance matrix (in short, singular VARs). These objects appear naturally in the context of dynamic factor models. The Yule-Walker estimator of such a VAR is problematic, because the solution of the corresponding equation system tends to be numerically rather unstable. For example, if we overestimate the order of the VAR, then the singularity of the innovations renders the Yule-Walker equation system singular as well. Moreover, even with a correctly selected order, the Yule-Walker system tends to be close to singular in finite samples. We show that this has a severe impact on predictions. While the asymptotic rate of the mean square prediction error (MSPE) can be just like in the regular (non-singular) case, the finite sample behavior suffers. This effect turns out to be particularly dramatic in the context of dynamic factor models, where we do not directly observe the so-called common components which we aim to predict. Then, when the data are sampled with some additional error, the MSPE often gets severely inflated. We explain the reason for this phenomenon and show how to overcome the problem. Our numerical results underline that it is very important to adapt prediction algorithms accordingly. In the third part, we set up theoretical foundations and a practical method to forecast multiple functional time series (FTS). In order to do so, we generalize the static factor model to the case where the cross-section units are FTS. We first derive a representation result. We show that if the first r eigenvalues of the covariance operator of the cross-section of n FTS are unbounded as n diverges and if the (r+1)th eigenvalue is bounded, then we can represent each FTS as the sum of a common component driven by r factors and an idiosyncratic component. We suggest a method of estimation and prediction for such a model and assess its performance through a simulation study. Finally, we show that by applying our method to a cross-section of volatility curves of the stocks of the S&P100, we obtain better prediction accuracy than by limiting the analysis to individual FTS.
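The numerical issue described for singular VARs can be seen in a toy example: when the innovation covariance is (nearly) singular, the lag-0 autocovariance entering the Yule-Walker equations is ill-conditioned, and a small regularization keeps the coefficient estimate from blowing up. The sketch below is a generic illustration of that effect for a VAR(1), not the remedy developed in the thesis; the dimensions, rank and ridge parameter are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# VAR(1) whose innovations are essentially rank 2 in 6 dimensions
# (a tiny full-rank component stands in for measurement noise).
d, r, T = 6, 2, 500
A = 0.5 * np.eye(d)
B = rng.normal(size=(d, r))
X = np.zeros((T, d))
for t in range(1, T):
    X[t] = A @ X[t - 1] + B @ rng.normal(size=r) + 0.01 * rng.normal(size=d)

# Sample autocovariances at lags 0 and 1.
Xc = X - X.mean(axis=0)
C0 = Xc[:-1].T @ Xc[:-1] / (T - 1)
C1 = Xc[1:].T @ Xc[:-1] / (T - 1)
print("condition number of C0:", np.linalg.cond(C0))

# Yule-Walker estimate A_hat = C1 C0^{-1}; a small ridge term tames the ill-conditioning.
ridge = 1e-2 * np.trace(C0) / d * np.eye(d)
A_plain = C1 @ np.linalg.inv(C0)
A_ridge = C1 @ np.linalg.inv(C0 + ridge)
print("coefficient error, plain Yule-Walker:", np.linalg.norm(A_plain - A))
print("coefficient error, ridge-stabilized :", np.linalg.norm(A_ridge - A))
```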
165

Generation of semantic layouts for interactive multidimensional data visualization / Geração de layouts semânticos para a visualização interativa de dados multidimensionais

Gomez Nieto, Erick Mauricio 24 February 2017
Visualization methods make use of interactive graphical representations embedded in a display area in order to enable data exploration and analysis. These typically rely on geometric primitives for representing data or for building more sophisticated representations that assist the visual analysis process. One of the most challenging tasks in this context is to determine an optimal layout of these primitives that turns out to be effective and informative. Existing algorithms for building layouts from geometric primitives are typically designed to cope with requirements such as orthogonal alignment, overlap removal, optimal area usage, hierarchical organization and dynamic update, among others. However, most techniques are able to tackle just a few of those requirements simultaneously, impairing their use and flexibility. In this dissertation, we propose a set of approaches for building layouts from geometric primitives that concurrently addresses a wider range of requirements. Relying on multidimensional projection and optimization formulations, our methods arrange geometric objects in the visual space so as to generate well-structured layouts that preserve the semantic relation among objects while still making efficient use of the display area. A comprehensive set of quantitative comparisons against existing methods for layout generation, and applications to text, image and video data set visualization, prove the effectiveness of our approaches.
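A minimal sketch of the projection-plus-layout idea: high-dimensional items are projected to 2D, each item is given a small rectangle, and overlapping rectangles are pushed apart iteratively so that neighbourhood structure is roughly preserved while overlaps shrink. This toy force-based overlap removal only illustrates the general pipeline, not the optimization formulations proposed in the dissertation; the data, box size and iteration count are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# High-dimensional items from three clusters, projected to the plane.
Xhd = np.vstack([rng.normal(c, 1.0, (30, 20)) for c in (0.0, 3.0, 6.0)])
pos = PCA(n_components=2).fit_transform(Xhd)

w, h = 0.8, 0.5  # every item is drawn as a w-by-h rectangle centred at its position

def remove_overlaps(pos, w, h, iters=200, step=0.5):
    # Push apart every pair of overlapping rectangles, a little at a time.
    pos = pos.copy()
    for _ in range(iters):
        moved = False
        for i in range(len(pos)):
            for j in range(i + 1, len(pos)):
                dx, dy = pos[j] - pos[i]
                ox, oy = w - abs(dx), h - abs(dy)   # overlap extent along each axis
                if ox > 0 and oy > 0:
                    moved = True
                    # Resolve along the axis needing the smaller displacement.
                    if ox < oy:
                        shift = np.array([step * ox * np.sign(dx or 1.0), 0.0])
                    else:
                        shift = np.array([0.0, step * oy * np.sign(dy or 1.0)])
                    pos[i] -= shift / 2
                    pos[j] += shift / 2
        if not moved:
            break
    return pos

def overlap_count(pos):
    d = np.abs(pos[:, None, :] - pos[None, :, :])
    hit = (d[..., 0] < w) & (d[..., 1] < h)
    return int(np.triu(hit, 1).sum())

layout = remove_overlaps(pos, w, h)
print("overlapping pairs before:", overlap_count(pos), "after:", overlap_count(layout))
```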
166

A concentration inequality based statistical methodology for inference on covariance matrices and operators

Kashlak, Adam B. January 2017
In the modern era of high and infinite dimensional data, classical statistical methodology is often rendered inefficient and ineffective when confronted with such big data problems as arise in genomics, medical imaging, speech analysis, and many other areas of research. Many problems manifest when the practitioner is required to take into account the covariance structure of the data during his or her analysis, which takes on the form of either a high dimensional low rank matrix or a finite dimensional representation of an infinite dimensional operator acting on some underlying function space. Thus, novel methodology is required to estimate, analyze, and make inferences concerning such covariances. In this manuscript, we propose using tools from the concentration of measure literature, a theory that arose in the latter half of the 20th century from connections between geometry, probability, and functional analysis, to construct rigorous descriptive and inferential statistical methodology for covariance matrices and operators. A variety of concentration inequalities are considered, which allow for the construction of nonasymptotic dimension-free confidence sets for the unknown matrices and operators. Given such confidence sets, a wide range of estimation and inferential procedures can be and are subsequently developed. For high dimensional data, we propose a method to search a concentration-inequality-based confidence set using a binary search algorithm for the estimation of large sparse covariance matrices. Both sub-Gaussian and sub-exponential concentration inequalities are considered and applied to both simulated data and a set of gene expression data from a study of small round blue-cell tumours. For infinite dimensional data, also referred to as functional data, we use a celebrated result, Talagrand's concentration inequality, in the Banach space setting to construct confidence sets for covariance operators. From these confidence sets, three different inferential techniques emerge: the first is a k-sample test for equality of covariance operators; the second is a functional data classifier, which makes its decisions based on the covariance structure of the data; the third is a functional data clustering algorithm, which incorporates the concentration inequality based confidence sets into the framework of an expectation-maximization algorithm. These techniques are applied to simulated data and to speech samples from a set of spoken phoneme data. Lastly, we take a closer look at a key tool used in the construction of concentration based confidence sets: Rademacher symmetrization. The symmetrization inequality, which arises in the probability in Banach spaces literature, is shown to be connected with optimal transport theory and specifically the Wasserstein distance. This insight is used to improve the symmetrization inequality, resulting in tighter concentration bounds to be used in the construction of nonasymptotic confidence sets. A variety of other applications are considered, including tests for data symmetry and tightening inequalities in Banach spaces. An R package for inference on covariance operators is briefly discussed in an appendix chapter.
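To give a concrete feel for inference on covariance structure, the sketch below runs a simple permutation test for equality of two covariance matrices, using the operator-norm distance between sample covariances as the test statistic. This is a baseline illustration only; it is not the concentration-inequality or Talagrand-based methodology developed in the thesis, and the simulated data and permutation count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def cov_distance(X, Y):
    # Operator (spectral) norm of the difference of the two sample covariance matrices.
    return np.linalg.norm(np.cov(X, rowvar=False) - np.cov(Y, rowvar=False), ord=2)

def permutation_test(X, Y, n_perm=500):
    observed = cov_distance(X, Y)
    pooled = np.vstack([X, Y])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if cov_distance(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

# Group A: independent coordinates; group B: strongly correlated coordinates.
d = 10
X = rng.normal(size=(100, d))
L = np.linalg.cholesky(0.6 * np.ones((d, d)) + 0.4 * np.eye(d))
Y = rng.normal(size=(100, d)) @ L.T

stat, p = permutation_test(X, Y)
print(f"operator-norm distance = {stat:.3f}, permutation p-value = {p:.4f}")
```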
167

Méthodes et modèles numériques appliqués aux risques du marché et à l’évaluation financière / Numerical methods and models in market risk and financial valuations area

Infante Acevedo, José Arturo 09 December 2013
This work is organized around two themes: (i) a novel numerical method to price options on many assets, and (ii) liquidity risk, limit order book modeling and market microstructure. First theme: greedy algorithms and applications for solving partial differential equations in high dimension. Many problems of interest in various applications (material sciences, finance, etc.) involve high-dimensional partial differential equations (PDEs). The typical example in finance is the pricing of a basket option, which can be obtained by solving the Black-Scholes PDE whose dimension is the number of underlying assets. We investigate an algorithm that was recently proposed and analyzed in [ACKM06, BLM09] to solve such problems and try to circumvent the curse of dimensionality. The idea is to represent the solution as a sum of tensor products and to compute the terms of this sum iteratively using a greedy algorithm. The resolution of high-dimensional partial differential equations is closely related to the representation of high-dimensional functions. In Chapter 1, we describe various approaches existing in the literature to represent high-dimensional functions and we introduce the high-dimensional problems in finance addressed in this work. The method studied in this manuscript is a non-linear approximation method called the Proper Generalized Decomposition (PGD). Chapter 2 shows the application of this method to approximate the solution of a linear PDE (the Poisson problem) and to approximate a square-integrable function by a sum of tensor products. A numerical study of this last problem is presented in Chapter 3. The Poisson problem and the approximation of a square-integrable function serve as the basis in Chapter 4 for solving the Black-Scholes equation using the PGD approach. In numerical experiments, we obtain results for up to 10 underlyings. Besides approximating the solution of the Black-Scholes equation, we also propose a variance reduction method for classical Monte Carlo pricing of financial options. Second theme: liquidity risk, limit order book modeling and market microstructure. Liquidity risk and market microstructure have become important topics in mathematical finance in recent years. One possible reason is the deregulation of markets and the competition between them to attract as many investors as possible. Thus, quotation rules are changing and, in general, more information is available. In particular, it is possible to know at each time the awaiting orders on some stocks and to have a record of all past transactions. In this work we study how to use this information to optimally execute buy or sell orders, which is linked to the behaviour of traders who want to minimize their trading cost. In [AFS10], Alfonsi, Fruth and Schied proposed a simple limit order book (LOB) model in which it is possible to derive explicitly the optimal strategy for buying (or selling) a given amount of shares before a given deadline. Basically, one has to split the large buy (or sell) order into smaller ones in order to find the best trade-off between attracting new orders and the price of the orders. Here, we focus on an extension of the LOB model with general shape introduced by Alfonsi, Fruth and Schied. The additional feature is a time-varying LOB depth, a characteristic of the LOB highlighted in [JJ88, GM92, HH95, KW96]. We solve the optimal execution problem in this framework for both discrete and continuous time strategies. This gives, in particular, sufficient conditions to exclude price manipulations in the sense of Huberman and Stanzl [HS04] or transaction-triggered price manipulations (see Alfonsi, Schied and Slynko). These conditions give interesting qualitative insights into how market makers may create price manipulations.
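The greedy tensor-product idea can be illustrated in the simplest discrete setting: approximate a function of two variables, sampled on a grid, by a sum of rank-one terms r_k(x)s_k(y), adding one term at a time via an alternating fixed-point iteration. This is only a toy analogue of the PGD approach described above, not the thesis' PDE solver; the target function, grid size and number of terms are arbitrary assumptions.

```python
import numpy as np

# Target: a 2-D function sampled on a grid, to be approximated by a sum of tensor products.
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(0.0, 1.0, 200)
F = np.exp(-10 * (x[:, None] - y[None, :]) ** 2) + np.sin(3 * x)[:, None] * np.cos(2 * y)[None, :]

def greedy_rank_one(F, n_terms=5, inner_iters=50):
    # Greedily add rank-one terms r s^T, each minimizing the squared error of the residual.
    R = F.copy()
    terms = []
    for _ in range(n_terms):
        s = np.ones(F.shape[1])
        for _ in range(inner_iters):
            # Alternating least-squares updates: r given s, then s given r.
            r = R @ s / (s @ s)
            s = R.T @ r / (r @ r)
        terms.append((r, s))
        R = R - np.outer(r, s)
        print("relative error after adding a term:", np.linalg.norm(R) / np.linalg.norm(F))
    return terms

terms = greedy_rank_one(F)
```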
168

A New Independence Measure and Its Applications in High Dimensional Data Analysis

Ke, Chenlu 01 January 2019
This dissertation comprises three consecutive topics. First, we propose a novel class of independence measures for testing independence between two random vectors based on the discrepancy between the conditional and the marginal characteristic functions. If one of the variables is categorical, our asymmetric index extends the typical ANOVA to a kernel ANOVA that can test a more general hypothesis of equal distributions among groups. The index is also applicable when both variables are continuous. Second, we develop a sufficient variable selection procedure based on the new measure in a large-p-small-n setting. Our approach incorporates marginal information between each predictor and the response as well as joint information among predictors. As a result, our method is more capable of selecting all truly active variables than marginal selection methods. Furthermore, our procedure can handle both continuous and discrete responses with mixed-type predictors. We establish the sure screening property of the proposed approach under mild conditions. Third, we focus on a model-free sufficient dimension reduction approach using the new measure. Our method does not require strong assumptions on predictors and responses. An algorithm is developed to find dimension reduction directions using sequential quadratic programming. We illustrate the advantages of our new measure and its two applications in high dimensional data analysis by numerical studies across a variety of settings.
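As a rough illustration of measuring dependence through characteristic functions, the sketch below compares each group's empirical characteristic function of a continuous response with the marginal (pooled) one, averages the squared discrepancy over a grid of frequencies, and calibrates the statistic with a permutation test. This generic statistic only conveys the flavour of conditional-versus-marginal comparisons; it is not the index proposed in the dissertation, and the frequency grid and simulated data are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def cf_discrepancy(y, g, freqs=np.linspace(-3, 3, 25)):
    # Weighted average over groups of |phi_group(t) - phi_marginal(t)|^2, averaged over t.
    phi_all = np.exp(1j * np.outer(freqs, y)).mean(axis=1)
    stat = 0.0
    for lvl in np.unique(g):
        yk = y[g == lvl]
        phi_k = np.exp(1j * np.outer(freqs, yk)).mean(axis=1)
        stat += (len(yk) / len(y)) * np.mean(np.abs(phi_k - phi_all) ** 2)
    return stat

def permutation_pvalue(y, g, n_perm=999):
    observed = cf_discrepancy(y, g)
    hits = sum(cf_discrepancy(y, rng.permutation(g)) >= observed for _ in range(n_perm))
    return observed, (hits + 1) / (n_perm + 1)

# Categorical predictor with three levels; the response distribution differs across levels.
g = rng.integers(0, 3, 300)
y = rng.normal(loc=0.5 * g, scale=1.0 + 0.2 * g)

stat, p = permutation_pvalue(y, g)
print(f"discrepancy statistic = {stat:.4f}, permutation p-value = {p:.3f}")
```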
169

Spatiotemporal Sensing and Informatics for Complex Systems Monitoring, Fault Identification and Root Cause Diagnostics

Liu, Gang 16 September 2015
In order to cope with system complexity and dynamic environments, modern industries are investing in a variety of sensor networks and data acquisition systems to increase information visibility. Multi-sensor systems bring a proliferation of high-dimensional functional big data that capture rich information on the evolving dynamics of natural and engineered processes. With spatially and temporally dense data readily available, there is an urgent need to develop advanced methodologies and associated tools that will enable and assist (i) the handling of the big data communicated by contemporary complex systems, (ii) the extraction and identification of pertinent knowledge about the environmental and operational dynamics driving these systems, and (iii) the exploitation of the acquired knowledge for more enhanced design, analysis, monitoring, diagnostics and control. The methodological and theoretical research in this dissertation, as well as a considerable portion of the applied and collaborative work, aims at addressing such high-dimensional functional big data. An innovative contribution of this work is the establishment of a series of systematic methodologies to investigate the spatiotemporal informatics of complex systems, from multi-dimensional modeling, feature extraction and selection to model-driven monitoring, fault identification and root cause diagnostics. In particular, we developed a multiscale adaptive basis function model to represent and characterize high-dimensional nonlinear functional profiles, thereby reducing the large amount of data to a parsimonious set of variables (i.e., model parameters) while preserving the information. Furthermore, the complex interdependence structure among variables is identified by a novel self-organizing network algorithm, in which homogeneous variables are clustered into sub-network communities. We then minimize the redundancy of variables in each cluster and integrate the new set of clustered variables with predictive models to identify a sparse set of sensitive variables for process monitoring and fault diagnostics. We evaluated and validated our methodologies using real-world case studies that extract parameters from representation models of vectorcardiogram (VCG) signals for the diagnosis of myocardial infarctions. The proposed systematic methodologies are generally applicable for modeling, monitoring and diagnosis in many disciplines that involve a large number of highly redundant variables extracted from big data. The self-organizing approach was also innovatively developed to derive the steady geometric structure of a network from the recurrence-based adjacency matrix. As such, novel network-theoretic measures can be achieved based on actual node-to-node distances in the self-organized network topology.
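To illustrate the general idea of condensing a dense sensor profile into a few model parameters, the sketch below fits a small Gaussian radial-basis expansion to a noisy waveform by least squares, so that the whole profile is summarized by a handful of coefficients. This is a generic basis-function regression, not the multiscale adaptive model or the self-organizing network developed in the dissertation; the waveform, number of basis functions and kernel width are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# A dense, noisy functional profile (stand-in for one sensor channel).
t = np.linspace(0, 1, 500)
profile = np.sin(2 * np.pi * 3 * t) * np.exp(-3 * t) + 0.05 * rng.normal(size=t.size)

# Gaussian radial basis functions on a coarse grid of centres.
centres = np.linspace(0, 1, 12)
width = 0.06
Phi = np.exp(-((t[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))

# Least-squares fit: the 12 coefficients summarize the 500-point profile.
coef, *_ = np.linalg.lstsq(Phi, profile, rcond=None)
reconstruction = Phi @ coef

rel_err = np.linalg.norm(profile - reconstruction) / np.linalg.norm(profile)
print("number of parameters:", coef.size, "relative reconstruction error:", round(rel_err, 3))
```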
170

Some questions in risk management and high-dimensional data analysis

Wang, Ruodu 04 May 2012
This thesis addresses three topics in the area of statistics and probability, with applications in risk management. First, for testing problems in high-dimensional (HD) data analysis, we present a novel method to formulate empirical likelihood tests and jackknife empirical likelihood tests by splitting the sample into subgroups. New tests are constructed to test the equality of two HD means, the coefficient in HD linear models and HD covariance matrices. Second, we propose jackknife empirical likelihood methods to formulate interval estimates for important quantities in actuarial science and risk management, such as risk-distortion measures, Spearman's rho and parametric copulas. Lastly, we introduce the theory of completely mixable (CM) distributions. We give properties of CM distributions, show that a few classes of distributions are CM, and use the new technique to find bounds for the sum of individual risks with given marginal distributions but unspecified dependence structure. The result partially solves a problem that had been a challenge for decades, and directly leads to bounds on quantities of interest in risk management, such as the variance, the stop-loss premium, the price of European options and the Value-at-Risk associated with a joint portfolio.
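The idea of bounding a functional of a sum of risks with fixed marginals but unknown dependence can be sketched with a simple rearrangement heuristic: starting from sampled (discretized) marginals, repeatedly reorder each column so that it is oppositely ordered to the sum of the other columns, which drives the variance of the total down towards its lower bound; sorting all columns together (the comonotonic coupling) gives the upper bound. This heuristic is a standard illustration of the problem, not the complete-mixability theory developed in the thesis; the choice of marginals and the discretization are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Discretized marginals: three different distributions, each represented by n sampled values.
n = 10_000
X = np.column_stack([
    rng.exponential(1.0, n),
    rng.lognormal(0.0, 0.5, n),
    rng.uniform(0.0, 2.0, n),
])

def min_variance_rearrangement(X, sweeps=20):
    # Reorder each column to be counter-monotonic with the sum of the other columns;
    # every column keeps exactly the same values, only their pairing changes.
    X = X.copy()
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            others = X.sum(axis=1) - X[:, j]
            order = np.argsort(others)
            X[order, j] = np.sort(X[:, j])[::-1]
    return X

upper = np.sort(X, axis=0).sum(axis=1).var()    # comonotonic coupling (maximal variance)
independent = X.sum(axis=1).var()               # the original independent sampling
lower = min_variance_rearrangement(X).sum(axis=1).var()

print("variance of the sum (comonotonic):", round(upper, 3))
print("variance of the sum (independent):", round(independent, 3))
print("variance of the sum (rearranged) :", round(lower, 3))
```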
