Global ETD Search

81	Functional Principal Component Analysis for Discretely Observed Functional Data and Sparse Fisher’s Discriminant Analysis with Thresholded Linear Constraints Wang, Jing 01 December 2016 (has links) We propose a new method to perform functional principal component analysis (FPCA) for discretely observed functional data by solving successive optimization problems. The new framework can be applied to both regularly and irregularly observed data, and to both dense and sparse data. Our method does not require estimates of the individual sample functions or the covariance functions. Hence, it can be used to analyze functional data with multidimensional arguments (e.g. random surfaces). Furthermore, it can be applied to many processes and models with complicated or nonsmooth covariance functions. In our method, smoothness of eigenfunctions is controlled by directly imposing roughness penalties on eigenfunctions, which makes it more efficient and flexible to tune the smoothness. Efficient algorithms for solving the successive optimization problems are proposed. We provide the existence and characterization of the solutions to the successive optimization problems. The consistency of our method is also proved. Through simulations, we demonstrate that our method performs well in the cases with smooth sample curves, with discontinuous sample curves and nonsmooth covariance, and with sample functions having two-dimensional arguments (random surfaces), respectively. We apply our method to classification problems of retinal pigment epithelial cells in eyes of mice and to longitudinal CD4 counts data. In the second part of this dissertation, we propose a sparse Fisher’s discriminant analysis method with thresholded linear constraints. Various regularized linear discriminant analysis (LDA) methods have been proposed to address the problems of the LDA in high-dimensional settings. Asymptotic optimality has been established for some of these methods when there are only two classes. 
A difficulty in the asymptotic study of multiclass classification is that, in the two-class case, the classification boundary is a hyperplane and an explicit formula for the classification error exists, whereas in the multiclass case the boundary is usually complicated and no explicit formula for the error generally exists. Another difficulty in proving asymptotic consistency and optimality for sparse Fisher’s discriminant analysis is that the covariance matrix is involved in the constraints of the optimization problems for higher-order components. It is not easy to estimate a general high-dimensional covariance matrix. Thus, we propose a sparse Fisher’s discriminant analysis method that avoids estimating the covariance matrix, and we provide asymptotic consistency results and the corresponding convergence rates for all components. To prove asymptotic optimality, we provide an asymptotic upper bound for a general linear classification rule in the multiclass case, which we apply to our method to obtain its asymptotic optimality and the corresponding convergence rate. In the special case of two classes, our method achieves convergence rates that are the same as or better than those of the existing method. The proposed method is applied to multivariate functional data with wavelet transformations. Functional PCA discretely observed functional data successive optimization problems roughness penalty consistency sparse Fisher’s discriminant analysis thresholded linear constraints asymptotic consistency asymptotic optimality convergence rate
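The abstract above notes that an explicit error formula exists in the two-class case. As a minimal sketch of the textbook version of that formula (not the dissertation's own derivation), assume two Gaussian classes with equal priors and a common covariance matrix, and a linear rule that thresholds at the midpoint of the class means. The function names below are hypothetical and chosen only for illustration.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def linear_rule_error(w, mu1, mu2, sigma):
    """Error of the rule 'assign x to class 1 if w'(x - m) > 0', where m is the
    midpoint of the class means. Assumes equal priors and Gaussian classes
    N(mu1, sigma) and N(mu2, sigma); both class-wise errors then coincide."""
    w, mu1, mu2 = (np.asarray(v, dtype=float) for v in (w, mu1, mu2))
    sigma = np.asarray(sigma, dtype=float)
    signal = w @ (mu1 - mu2)          # separation of the projected means
    noise = math.sqrt(w @ sigma @ w)  # standard deviation of the projection
    return normal_cdf(-signal / (2.0 * noise))

def bayes_direction(mu1, mu2, sigma):
    """Optimal (Fisher/LDA) direction sigma^{-1} (mu1 - mu2)."""
    return np.linalg.solve(np.asarray(sigma, dtype=float),
                           np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float))

# Illustrative inputs (assumed, not taken from the dissertation).
mu1, mu2, sigma = [1.0, 0.0], [-1.0, 0.0], np.eye(2)
err_bayes = linear_rule_error(bayes_direction(mu1, mu2, sigma), mu1, mu2, sigma)
err_other = linear_rule_error([1.0, 1.0], mu1, mu2, sigma)
print(err_bayes, err_other)
```

With the optimal direction the error reduces to the normal CDF evaluated at minus half the Mahalanobis distance between the class means; any other direction gives a larger error. No comparably simple closed form is available for the multiclass boundary, which is the difficulty the abstract describes.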
82	Sur l'estimation semi paramétrique robuste pour statistique fonctionnelle / On the semiparametric robust estimation in functional statistic Attaoui, Said 10 December 2012 (has links) Dans cette thèse, nous nous proposons d'étudier quelques paramètres fonctionnels lorsque les données sont générées à partir d'un modèle de régression à indice simple. Nous étudions deux paramètres fonctionnels. Dans un premier temps nous supposons que la variable explicative est à valeurs dans un espace de Hilbert (dimension infinie) et nous considérons l'estimation de la densité conditionnelle par la méthode de noyau. Nous traitons les propriétés asymptotiques de cet estimateur dans les deux cas indépendant et dépendant. Pour le cas où les observations sont indépendantes identiquement distribuées (i.i.d.), nous obtenons la convergence ponctuelle et uniforme presque complète avec vitesse de l'estimateur construit. Comme application nous discutons l'impact de ce résultat en prévision non paramétrique fonctionnelle à partir de l'estimation de mode conditionnelle. La dépendance est modélisée via la corrélation quasi-associée. Dans ce contexte nous établissons la convergence presque complète ainsi que la normalité asymptotique de l'estimateur à noyau de la densité condtionnelle convenablement normalisée. Nous donnons de manière explicite la variance asymptotique. Notons que toutes ces propriétés asymptotiques ont été obtenues sous des conditions standard et elles mettent en évidence le phénomène de concentration de la mesure de probabilité de la variable fonctionnelle sur des petites boules. Dans un second temps, nous supposons que la variable explicative est vectorielle et nous nous intéressons à un modèle de prévision assez général qui est la régression robuste. A partir d'observations quasi-associées, on construit un estimateur à noyau pour ce paramètre fonctionnel. Comme résultat asymptotique on établit la vitesse de convergence presque complète uniforme de l'estimateur construit. 
Nous insistons sur le fait que les deux modèles étudiés dans cette thèse pourraient être utilisés pour l'estimation de l'indice simple lorsque ce dernier est inconnu, en utilisant la méthode d'M-estimation ou la méthode de pseudo-maximum de vraisemblance, qui est un cas particulier de la première méthode. / In this thesis, we study some functional parameters when the data are generated from a single-index regression model. We study two functional parameters. First, we suppose that the explanatory variable takes its values in a Hilbert space (an infinite-dimensional space) and we consider the estimation of the conditional density by the kernel method. We establish some asymptotic properties of this estimator in both the independent and dependent cases. For the case where the observations are independent and identically distributed (i.i.d.), we obtain the pointwise and uniform almost complete convergence, with rates, of the estimator. As an application, we discuss the impact of this result on functional nonparametric prediction based on estimation of the conditional mode. In the dependent case, we model the dependence via quasi-associated correlation. In this setting we establish the almost complete convergence as well as the asymptotic normality of the suitably normalized kernel estimator of the conditional density, and we give the asymptotic variance explicitly. Note that all these asymptotic properties are obtained under standard conditions, and they highlight the concentration of the probability measure of the functional variable on small balls. Second, we suppose that the explanatory variable takes values in a finite-dimensional space, and we are interested in a rather general prediction model, namely robust regression. From the quasi-associated data, we build a kernel estimator for this functional parameter. As an asymptotic result, we establish the uniform almost complete convergence rate of the estimator. 
We point out that the two models studied in this thesis could be used to estimate the single index of the model when it is unknown, using M-estimation or the pseudo-maximum likelihood method, which is a particular case of the former. Statistique fonctionnelle Estimation semi-paramétrique Estimation non paramétrique Indice simple Régression robuste Functional data Semiparametric estimation Nonparametric estimation Single index Quasi-associated dependent variables
83	Méthodes d’analyse fonctionnelle et multivariée appliquées à l’étude du fonctionnement écologique des assemblages phytoplanctoniques de l’étang de Berre Malkassian, Anthony 03 December 2012 (has links) L'étude de la relation entre les variations d'abondance du phytoplancton et les facteurs environnementaux (naturels ou anthropiques) dans les zones saumâtres peu profondes est essentielle à la compréhension et à la gestion de cet écosystème complexe. Les relations existant entre les variables physico-chimiques (température, salinité et les nutriments) et les assemblages de phytoplancton de l'étang de Berre ont été analysées à partir d'un suivi écologique mensuel de 16 années (1994-2010). A l'aide des données recueillies par cette étude à long terme, des questions en relation avec la gestion de ce milieu ont été abordées grâce à l'application d'analyses statistiques et à la représentation originale des données. Depuis 2004, la nouvelle politique de relargage d'eau douce a provoqué de forts changements dans la salinité globale de la lagune : une diminution de la stratification et une raréfaction des phénomènes d'anoxie dans sa partie la plus profonde. Un changement dans la structure de la communauté phytoplanctonique a également été observé en association avec l'évolution des conditions environnementales. Une augmentation de la richesse spécifique phytoplanctonique, et plus précisément, l'émergence d'espèces à affinité marine a permis de mettre en évidence la première étape d'une marinisation de la lagune. Ces résultats soulignent l'impact significatif d'un nouvelle politique de gestion de cette zone côtière particulière. Nous nous sommes ensuite intéressés à la dynamique du phytoplancton à l'échelle de la journée reflet des variations rapides de l'environnement. 
/ The study of the relationship between variations in phytoplankton abundance and environmental forces (natural or anthropogenic) in shallow brackish areas is essential to both understanding and managing this complex ecosystem. Over a 16 year (1994-2011) monthly monitoring program the relationships between physicochemical variables (temperature, salinity and nutrients) and phytoplankton assemblages of the Berre Lagoon were analyzed. Using data collected from this long-term study, we have addressed environmental management issues through the application of advanced statistical analyses and original data displays. These analyses and data displays can readily be applied to other data sets related to the environment, with the aim of informing both researcher and practitioner. Since 2004, a new policy for freshwater discharge has induced strong changes in the global salinity of the lagoon : a weakened stratification and a rarefaction of anoxia phenomena in its deepest part. A shift in the structure of the phytoplankton community has been observed in association with changes in environmental conditions. An increase of phytoplanktonic species richness, and more precisely, the emergence of species with marine affinity highlights the first step of a marinization of the lagoon. The results underline the significant impact of a new management policy in this specific coastal zone. We then focused on the response of phytoplankton to quick environmental variations. An original approach for automated high frequency analysis of phytoplankton was adopted with the use of an autonomous flow cytometer (CytoSense). Phytoplancton Monitoring Analyse multivariée Analyse de données fonctionnelles Diversité Etang de Berre Suivi écologique Phytoplankton Monitoring Multivariate analysis Functional data analysis Diversity Berre Lagoon Ecological survey
84	Using functional boxplots to visualize reflectance data and distinguish between areas of native grasses and invasive old world bluestems in a Kansas tall grass prairie Highland, Garth January 1900 (has links) Master of Science / Department of Statistics / Leigh Murray / Remotely sensed reflectance data are an appealing tool for rangeland managers seeking to control invasive grass species. Recent developments in functional data analysis include the functional boxplot (FBP), which is shown here to be a useful tool for visualizing reflectance data. Functional boxplots are a novel method for visually inspecting functional data and detecting outliers. Implementation and interpretation of FBPs are both straightforward and intuitive. The goal of this study is to examine the use of FBPs for visualizing reflectance data and to determine how effectively the FBP distinguishes between native tall grasses and invasive Old World Bluestem (OWB, Bothriochloa spp.) monocultures in a Kansas prairie. Validation trials were conducted to determine the stability of the FBP when used to analyze spectral data. FBPs were shown to be highly stable for both native and OWB grasses at all times and subsets of wavelengths tested. Identification trials were conducted by introducing a single OWB observation into a test set of native tall grass observations and constructing an FBP. Results indicate that, using observations recorded early in the growing season, the functional boxplot successfully identifies the OWB observation as an outlier in a test set of native tall grass observations with estimated probabilities of 100% and 95.45% when considering the visible and cellular spectra, respectively. A 95% lower bound for the probability of successfully identifying the OWB observation using the cellular spectrum in May is found to be 89.67%. 
Functional boxplot Reflectance data Spectral reflectance Functional data analysis Data visualization Agronomy (0285) Range Management (0777) Statistics (0463)
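The identification trial described in this record (one unusual curve added to a set of similar curves, then flagged by a functional boxplot) can be sketched as follows. This is a minimal illustration, assuming curves observed on a common grid, the modified band depth computed from pointwise ranks, and the usual 1.5 inflation factor for the fences. The synthetic curves, the function names, and the index of the "invasive" curve are all assumptions made for the example; they are not the thesis data or code.

```python
import numpy as np

def modified_band_depth(curves):
    """Modified band depth (bands formed by pairs of curves).
    curves: array of shape (n_curves, n_points) on a common grid, no ties."""
    n = curves.shape[0]
    # Rank of every curve at every grid point, from 1 (lowest) to n (highest).
    ranks = curves.argsort(axis=0).argsort(axis=0) + 1
    # Number of curve pairs whose band contains curve i at each grid point.
    pairs = (ranks - 1) * (n - ranks) + (n - 1)
    return pairs.mean(axis=1) / (n * (n - 1) / 2)

def functional_boxplot_outliers(curves, factor=1.5):
    """Indices of curves that leave the fences of the functional boxplot."""
    depth = modified_band_depth(curves)
    deepest_first = np.argsort(depth)[::-1]
    # Central region: envelope of the deepest half of the curves.
    central = curves[deepest_first[: int(np.ceil(len(curves) / 2))]]
    lower, upper = central.min(axis=0), central.max(axis=0)
    spread = upper - lower
    outside = (curves < lower - factor * spread) | (curves > upper + factor * spread)
    return np.flatnonzero(outside.any(axis=1))

# Synthetic stand-in for reflectance curves: 21 similar curves plus one shifted curve.
grid = np.linspace(0.0, np.pi, 50)
native = np.sin(grid)[None, :] + np.linspace(-0.1, 0.1, 21)[:, None]
invasive = np.sin(grid) + 1.0
curves = np.vstack([native, invasive])

outliers = functional_boxplot_outliers(curves)
print(outliers)
```

In this toy setting the shifted curve (row 21) is the only one flagged. Real reflectance curves cross each other and carry noise, so the detection rates reported in the abstract depend on the season and wavelength range rather than on anything this sketch can show.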
85	Modélisation statistique de données fonctionnelles environnementales : application à l'analyse de profils océanographiques. / Statistical modeling of environmental functional data : application to the analyse of oceanographic profiles. Bayle, Severine 12 June 2014 (has links) Afin d'étudier les processus biogéochimiques de l'Océan Austral, des balises posées sur des éléphants de mer ont permis de récolter en 2009-2010 des profils de variables océanographiques (Chlorophylle a (Chl a), température, salinité, lumière) dans une zone s'étalant du sud des îles Kerguelen jusqu'au continent Antarctique. Cette thèse se penche en particulier sur les données de Chl a, car celle-ci est contenue dans les organismes photosynthétiques qui jouent un rôle essentiel de pompe à carbone. Mais les profils verticaux de Chl a, récoltés peu fréquemment, ne permettent pas d'obtenir une cartographie de cette variable dans cette zone de l'océan. Cependant, nous disposons de profils de lumière, échantillonnés plus souvent. L'objectif était alors de développer une méthodologie permettant de reconstruire de manière indirecte les profils de Chl a à partir des profils de lumière, et qui prenne en compte les caractéristiques de ce type de données qui se présentent naturellement comme des données fonctionnelles. Pour cela, nous avons abordé la décomposition des profils à reconstruire ou explicatifs sur une base de splines, ainsi que les questions d'ajustement associées. Un modèle linéaire fonctionnel a été utilisé, permettant de prédire des profils de Chl a à partir des dérivées des profils de lumière. Il est montré que l'utilisation d'un tel modèle permet d'obtenir une bonne qualité de reconstruction pour accéder aux variations hautes fréquences des profils de Chl a à fine échelle. Enfin, une interpolation par krigeage fonctionnel permet de prédire la concentration en Chl a de nuit, car les mesures de lumière acquises à ce moment-là ne peuvent pas être exploitées. 
/ To study biogeochemical processes in the Southern Ocean, tags placed on elephant seals were used in 2009-2010 to collect profiles of oceanographic variables (chlorophyll a (Chl a), temperature, salinity, light) in an area ranging from south of the Kerguelen Islands to the Antarctic continent. This thesis focuses on the Chl a data, because Chl a is contained in photosynthetic organisms, which play an essential role in the oceanic carbon cycle. The vertical Chl a profiles, collected infrequently, do not allow this variable to be mapped in this area of the ocean. However, light profiles, sampled more often, are available. The aim of this thesis was therefore to develop a methodology for indirectly reconstructing Chl a profiles from light profiles, one that takes into account the characteristics of this kind of data, which arise naturally as functional data. To this end, we addressed the decomposition of the profiles to be reconstructed and of the explanatory profiles on a spline basis, as well as the associated fitting issues. A functional linear model was used to predict Chl a profiles from the derivatives of the light profiles. It was shown that such a model provides good reconstruction quality and gives access to the high-frequency variations of Chl a profiles at fine scale. Finally, functional kriging interpolation was used to predict the Chl a concentration at night, since light measurements acquired at that time cannot be exploited. In the future, the methodology is intended to be applicable to any type of functional data. Analyse de Données Fonctionnelles Modèle linéaire fonctionnel Spline Chlorophylle-A Krigeage fonctionnel Océan Austral Mésoéchelle Functional Data Analysis Functional linear model Spline Chlorophyll-A Functional kriging Southern Ocean Mesoscale 550
86	Modélisation statistique pour données fonctionnelles : approches non-asymptotiques et méthodes adaptatives / Statistical modeling for functional data : non-asymptotic approaches and adaptive methods Roche, Angelina 07 July 2014 (has links) L'objet principal de cette thèse est de développer des estimateurs adaptatifs en statistique pour données fonctionnelles. Dans une première partie, nous nous intéressons au modèle linéaire fonctionnel et nous définissons un critère de sélection de la dimension pour des estimateurs par projection définis sur des bases fixe ou aléatoire. Les estimateurs obtenus vérifient une inégalité de type oracle et atteignent la vitesse de convergence minimax pour le risque lié à l'erreur de prédiction. Pour les estimateurs définis sur une collection de modèles aléatoires, des outils de théorie de la perturbation ont été utilisés pour contrôler les projecteurs aléatoires de manière non-asymptotique. D'un point de vue numérique, cette méthode de sélection de la dimension est plus rapide et plus stable que les méthodes usuelles de validation croisée. Dans une seconde partie, nous proposons un critère de sélection de fenêtre inspiré des travaux de Goldenshluger et Lepski, pour des estimateurs à noyau de la fonction de répartition conditionnelle lorsque la covariable est fonctionnelle. Le risque de l'estimateur obtenu est majoré de manière non-asymptotique. Des bornes inférieures sont prouvées ce qui nous permet d'établir que notre estimateur atteint la vitesse de convergence minimax, à une perte logarithmique près. Dans une dernière partie, nous proposons une extension au cadre fonctionnel de la méthodologie des surfaces de réponse, très utilisée dans l'industrie. Ce travail est motivé par une application à la sûreté nucléaire. 
/ The main purpose of this thesis is to develop adaptive estimators for functional data. In the first part, we focus on the functional linear model and we propose a dimension selection device for projection estimators defined on both fixed and data-driven bases. The prediction error of the resulting estimators satisfies an oracle-type inequality and reaches the minimax rate of convergence. For the estimator defined on a data-driven approximation space, tools of perturbation theory are used to solve the problems related to the random nature of the collection of models. From a numerical point of view, this method of dimension selection is faster and more stable than the usual methods of cross validation. In a second part, we consider the problem of bandwidth selection for kernel estimators of the conditional cumulative distribution function when the covariate is functional. The method is inspired by the work of Goldenshluger and Lepski. The risk of the estimator is non-asymptotically upper-bounded. We also prove lower-bounds and establish that our estimator reaches the minimax convergence rate, up to an extra logarithmic term. In the last part, we propose an extension to a functional context of the response surface methodology, widely used in the industry. This work is motivated by an application to nuclear safety. Données fonctionnelles Estimateurs adaptatifs Régression Sélection de modèle Méthode de Goldenshluger-Lepski Méthode des surfaces de réponse Functional data analysis Adaptive estimators Regression Model selection Goldenshluger and Lepski's method Response surface methodology
87	Essays on econometric modelling of temporal networks / Essais sur la modélisation économétrique des réseaux temporels Iacopini, Matteo 05 July 2018 (has links) La théorie des graphes a longtemps été étudiée en mathématiques et en probabilité en tant qu’outil pour décrire la dépendance entre les nœuds. Cependant, ce n’est que récemment qu’elle a été mise en œuvre sur des données, donnant naissance à l’analyse statistique des réseaux réels. La topologie des réseaux économiques et financiers est remarquablement complexe: elle n’est généralement pas observée, et elle nécessite ainsi des procédures inférentielles adéquates pour son estimation, d’ailleurs non seulement les nœuds, mais la structure de la dépendance elle-même évolue dans le temps. Des outils statistiques et économétriques pour modéliser la dynamique de changement de la structure du réseau font défaut, malgré leurs besoins croissants dans plusieurs domaines de recherche. En même temps, avec le début de l’ère des “Big data”, la taille des ensembles de données disponibles devient de plus en plus élevée et leur structure interne devient de plus en plus complexe, entravant les processus inférentiels traditionnels dans plusieurs cas. Cette thèse a pour but de contribuer à ce nouveau champ littéraire qui associe probabilités, économie, physique et sociologie en proposant de nouvelles méthodologies statistiques et économétriques pour l’étude de l’évolution temporelle des structures en réseau de moyenne et haute dimension. / Graph theory has long been studied in mathematics and probability as a tool for describing dependence between nodes. However, only recently has it been applied to data, giving birth to the statistical analysis of real networks. The topology of economic and financial networks is remarkably complex: it is generally unobserved, thus requiring adequate inferential procedures for its estimation; moreover, not only the nodes but the structure of dependence itself evolves over time. 
Statistical and econometric tools for modelling the dynamics of change of the network structure are lacking, despite the growing need for them in several fields of research. At the same time, with the beginning of the era of “Big data”, the size of available datasets is becoming increasingly large and their internal structure is growing in complexity, hampering traditional inferential processes in many cases. This thesis aims to contribute to this newborn field of literature, which joins probability, economics, physics and sociology, by proposing novel statistical and econometric methodologies for the study of the temporal evolution of network structures of medium-high dimension. Théorie des graphes Analyse statistique Réseaux réels Tensor calculus Bayesian statistics High-dimension Networks Functional data analysis Nonparametric statistics Copula Time series 510
88	Three Essays in Functional Time Series and Factor Analysis Nisol, Gilles 20 December 2018 (has links) (PDF) The thesis is dedicated to time series analysis for functional data and contains three original parts. In the first part, we derive statistical tests for the presence of a periodic component in a time series of functions. We consider both the traditional setting in which the periodic functional signal is contaminated by functional white noise, and a more general setting of a contaminating process which is weakly dependent. Several forms of the periodic component are considered. Our tests are motivated by the likelihood principle and fall into two broad categories, which we term multivariate and fully functional. Overall, for the functional series that motivate this research, the fully functional tests exhibit a superior balance of size and power. Asymptotic null distributions of all tests are derived and their consistency is established. Their finite sample performance is examined and compared by numerical studies and application to pollution data. In the second part, we consider vector autoregressive processes (VARs) with innovations having a singular covariance matrix (in short singular VARs). These objects appear naturally in the context of dynamic factor models. The Yule-Walker estimator of such a VAR is problematic, because the solution of the corresponding equation system tends to be numerically rather unstable. For example, if we overestimate the order of the VAR, then the singularity of the innovations renders the Yule-Walker equation system singular as well. Moreover, even with correctly selected order, the Yule-Walker system tends to be close to singular in finite samples. We show that this has a severe impact on predictions. While the asymptotic rate of the mean square prediction error (MSPE) can be just like in the regular (non-singular) case, the finite-sample behavior suffers. 
This effect turns out to be particularly dramatic in the context of dynamic factor models, where we do not directly observe the so-called common components which we aim to predict. Then, when the data are sampled with some additional error, the MSPE often gets severely inflated. We explain the reason for this phenomenon and show how to overcome the problem. Our numerical results underline that it is very important to adapt prediction algorithms accordingly. In the third part, we set up theoretical foundations and a practical method to forecast multiple functional time series (FTS). In order to do so, we generalize the static factor model to the case where cross-section units are FTS. We first derive a representation result. We show that if the first r eigenvalues of the covariance operator of the cross-section of n FTS are unbounded as n diverges and if the (r+1)th eigenvalue is bounded, then we can represent each FTS as the sum of a common component driven by r factors and an idiosyncratic component. We suggest a method of estimation and prediction for such a model. We assess the performance of the method through a simulation study. Finally, we show that by applying our method to a cross-section of volatility curves of the stocks of S&P100, we have a better prediction accuracy than by limiting the analysis to individual FTS. / Doctorat en Sciences économiques et de gestion / info:eu-repo/semantics/nonPublished Statistique mathématique Statistique appliquée Factor Analysis Functional Data Analysis Functional Time Series Dynamic Factor Model High-dimensional statistics
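The claim in this record that overestimating the order of a singular VAR makes the Yule-Walker system singular can be checked at the population level with a small sketch. The example assumes a two-dimensional VAR(1) whose innovations have a rank-one covariance matrix; the coefficient matrix, the loading vector, and the function name are illustrative assumptions, not values or code from the thesis.

```python
import numpy as np

# Assumed example: X_t = A X_{t-1} + b * u_t with scalar white noise u_t,
# so the innovation covariance Sigma = b b' has rank one.
A = np.array([[0.5, 0.2],
              [0.1, 0.3]])
b = np.array([[1.0], [1.0]])
Sigma = b @ b.T

def stationary_covariance(A, Sigma):
    """Solve Gamma0 = A Gamma0 A' + Sigma through its vectorized form."""
    d = A.shape[0]
    vec = np.linalg.solve(np.eye(d * d) - np.kron(A, A), Sigma.reshape(-1))
    return vec.reshape(d, d)

Gamma0 = stationary_covariance(A, Sigma)  # Cov(X_t, X_t)
Gamma1 = A @ Gamma0                       # Cov(X_t, X_{t-1})

# Yule-Walker coefficient matrices for the correct order (1) and for order 2.
yw_order1 = Gamma0
yw_order2 = np.block([[Gamma0, Gamma1],
                      [Gamma1.T, Gamma0]])

min_eig_order1 = np.linalg.eigvalsh(yw_order1).min()
min_eig_order2 = np.linalg.eigvalsh(yw_order2).min()
print(min_eig_order1, min_eig_order2)
```

The order-2 matrix is singular because its Schur complement equals Gamma0 - A Gamma0 A', which is exactly the rank-deficient Sigma. With estimated autocovariances the smallest eigenvalue would be small rather than exactly zero, which is consistent with the numerical instability the abstract describes.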
89	Análise de dados funcionais aplicada à geração de descritores de assinaturas de dimensão fractal multiescala / Functional Data Analysis Applied to Descriptors Generation of Multiscale Fractal Dimension Signatures. Florindo, João Batista 19 January 2009 (has links) Esta dissertação faz um estudo da aplicação da técnica estatística denominada Análise de Dados Funcionais (ADF) à geração de descritores usados em reconhecimento de padrões, mais especificamente, no reconhecimento de objetos de interesse em imagens. Estes objetos podem ser representados por vetores de características, também chamados de assinaturas, obtidos por uma técnica chamada de Dimensão Fractal Multiescala (DFM). Ocorre que estes vetores apresentam alta dimensionalidade (número de elementos), fazendo-se assim necessário o uso de uma abordagem que reduza este número de valores, sem que haja uma grande perda da informação transmitida pela assinatura. Neste contexto, diversas técnicas de extração de um reduzido conjunto de descritores da assinatura são apresentadas pela literatura. Entre estas, as mais populares são Fourier e wavelets, ambas relativamente simples de se apresentar e com resultados satisfatórios. A proposta aqui apresentada é de se utilizar ADF em combinação com DFM na geração de descritores de padrões. Os resultados obtidos com o uso desta abordagem na geração de descritores demonstraram que a técnica possibilita bons resultados, mesmo em situações em que não é possível o uso de muitos descritores. Os experimentos demonstraram que ADF apresenta um bom potencial para aplicação neste tipo de problema, permitindo que o método de classificação alcance bons resultados mesmo com poucos descritores. São sugeridos trabalhos futuros em que ADF possa ser usada, pesquisando-se por métodos ainda mais eficazes. / This work studies the application of a statistical technique named Functional Data Analysis (FDA) for the generation of descriptors. 
These descriptors can be used for pattern recognition, more specifically, for the recognition of objects of interest in an image. These objects can be represented by feature vectors, also known as signatures, obtained by a technique named Multi-scale Fractal Dimension (MFD). These vectors have high dimensionality (number of elements), which makes it necessary to use an approach that reduces the number of values without a large loss of the information carried by the signature. In this context, several techniques for extracting a reduced set of signature descriptors are presented in the literature. Among these, the most popular are Fourier and wavelets, both relatively simple to present and providing satisfactory results. The proposal presented here is to use FDA combined with MFD for the generation of pattern descriptors. The results obtained with this approach showed that the technique yields good results even in situations in which it is not possible to use many descriptors. FDA was also applied to the extraction of descriptors of MFD texture signatures. In this case as well, the results were interesting. The experiments showed that FDA has good potential for application to this type of problem, allowing good results to be obtained even with only a few descriptors. Future work is suggested in which FDA can be used, in the search for still more efficient methods.
90	A concentration inequality based statistical methodology for inference on covariance matrices and operators Kashlak, Adam B. January 2017 (has links) In the modern era of high and infinite dimensional data, classical statistical methodology is often rendered inefficient and ineffective when confronted with such big data problems as arise in genomics, medical imaging, speech analysis, and many other areas of research. Many problems manifest when the practitioner is required to take into account the covariance structure of the data during his or her analysis, which takes on the form of either a high dimensional low rank matrix or a finite dimensional representation of an infinite dimensional operator acting on some underlying function space. Thus, novel methodology is required to estimate, analyze, and make inferences concerning such covariances. In this manuscript, we propose using tools from the concentration of measure literature, a theory that arose in the latter half of the 20th century from connections between geometry, probability, and functional analysis, to construct rigorous descriptive and inferential statistical methodology for covariance matrices and operators. A variety of concentration inequalities are considered, which allow for the construction of nonasymptotic dimension-free confidence sets for the unknown matrices and operators. Given such confidence sets a wide range of estimation and inferential procedures can be and are subsequently developed. For high dimensional data, we propose a method to search a concentration inequality based confidence set using a binary search algorithm for the estimation of large sparse covariance matrices. Both sub-Gaussian and sub-exponential concentration inequalities are considered and applied to both simulated data and to a set of gene expression data from a study of small round blue-cell tumours. 
For infinite dimensional data, which is also referred to as functional data, we use a celebrated result, Talagrand’s concentration inequality, in the Banach space setting to construct confidence sets for covariance operators. From these confidence sets, three different inferential techniques emerge: the first is a k-sample test for equality of covariance operators; the second is a functional data classifier, which makes its decisions based on the covariance structure of the data; the third is a functional data clustering algorithm, which incorporates the concentration inequality based confidence sets into the framework of an expectation-maximization algorithm. These techniques are applied to simulated data and to speech samples from a set of spoken phoneme data. Lastly, we take a closer look at a key tool used in the construction of concentration based confidence sets: Rademacher symmetrization. The symmetrization inequality, which arises in the probability in Banach spaces literature, is shown to be connected with optimal transport theory and specifically the Wasserstein distance. This insight is used to improve the symmetrization inequality, resulting in tighter concentration bounds to be used in the construction of nonasymptotic confidence sets. A variety of other applications are considered, including tests for data symmetry and tightening inequalities in Banach spaces. An R package for inference on covariance operators is briefly discussed in an appendix chapter. 519.5
