Global ETD Search

111	Scalable Estimation and Testing for Complex, High-Dimensional Data Lu, Ruijin 22 August 2019 (has links) With modern high-throughput technologies, scientists can now collect high-dimensional data of various forms, including brain images, medical spectrum curves, engineering signals, etc. These data provide a rich source of information on disease development, cell evolvement, engineering systems, and many other scientific phenomena. To achieve a clearer understanding of the underlying mechanism, one needs a fast and reliable analytical approach to extract useful information from the wealth of data. The goal of this dissertation is to develop novel methods that enable scalable estimation, testing, and analysis of complex, high-dimensional data. It contains three parts: parameter estimation based on complex data, powerful testing of functional data, and the analysis of functional data supported on manifolds. The first part focuses on a family of parameter estimation problems in which the relationship between data and the underlying parameters cannot be explicitly specified using a likelihood function. We introduce a wavelet-based approximate Bayesian computation approach that is likelihood-free and computationally scalable. This approach will be applied to two applications: estimating mutation rates of a generalized birth-death process based on fluctuation experimental data and estimating the parameters of targets based on foliage echoes. The second part focuses on functional testing. We consider using multiple testing in basis-space via p-value guided compression. Our theoretical results demonstrate that, under regularity conditions, the Westfall-Young randomization test in basis space achieves strong control of family-wise error rate and asymptotic optimality. Furthermore, appropriate compression in basis space leads to improved power as compared to point-wise testing in data domain or basis-space testing without compression. 
The effectiveness of the proposed procedure is demonstrated through two applications: the detection of regions of spectral curves associated with pre-cancer using 1-dimensional fluorescence spectroscopy data and the detection of disease-related regions using 3-dimensional Alzheimer's Disease neuroimaging data. The third part focuses on analyzing data measured on the cortical surfaces of monkeys' brains during their early development, and subjects are measured on misaligned time markers. In this analysis, we examine the asymmetric patterns and increase/decrease trend in the monkeys' brains across time. / Doctor of Philosophy / With modern high-throughput technologies, scientists can now collect high-dimensional data of various forms, including brain images, medical spectrum curves, engineering signals, and biological measurements. These data provide a rich source of information on disease development, engineering systems, and many other scientific phenomena. The goal of this dissertation is to develop novel methods that enable scalable estimation, testing, and analysis of complex, high-dimensional data. It contains three parts: parameter estimation based on complex biological and engineering data, powerful testing of high-dimensional functional data, and the analysis of functional data supported on manifolds. The first part focuses on a family of parameter estimation problems in which the relationship between data and the underlying parameters cannot be explicitly specified using a likelihood function. We introduce a computation-based statistical approach that achieves efficient parameter estimation scalable to high-dimensional functional data. The second part focuses on developing a powerful testing method for functional data that can be used to detect important regions. We will show nice properties of our approach. 
The effectiveness of this testing approach will be demonstrated using two applications: the detection of regions of the spectrum that are related to pre-cancer using fluorescence spectroscopy data and the detection of disease-related regions using brain image data. The third part focuses on analyzing brain cortical thickness data, measured on the cortical surfaces of monkeys’ brains during early development. Subjects are measured on misaligned time markers. By using functional data estimation and testing approaches, we are able to: (1) identify asymmetric regions between their right and left brains across time, and (2) identify spatial regions on the cortical surface that reflect an increase or decrease in cortical measurements over time. Functional data testing randomization method basis decomposition approximate Bayesian computation (ABC) wavelet decomposition Gaussian Process surrogate model fluctuation analysis mutation probability estimation birth-death process model reg
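The likelihood-free estimation idea mentioned in the first part of entry 111 can be illustrated with a minimal sketch of plain rejection ABC. This is not the wavelet-based procedure of the dissertation: the Gaussian toy simulator, the uniform prior, the sample-mean summary statistic, the tolerance, and every function name below are assumptions chosen only for illustration.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy simulator (assumed for illustration): n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def abc_rejection(observed, n_proposals, tolerance, rng, prior=(-10.0, 10.0)):
    """Plain rejection ABC with the sample mean as summary statistic."""
    target = statistics.fmean(observed)
    accepted = []
    for _ in range(n_proposals):
        theta = rng.uniform(*prior)                 # draw a candidate from the prior
        fake = simulate(theta, len(observed), rng)  # simulate data; no likelihood is evaluated
        if abs(statistics.fmean(fake) - target) < tolerance:
            accepted.append(theta)                  # keep candidates whose summary is close
    return accepted


rng = random.Random(0)
observed = simulate(3.0, 50, rng)                   # pretend these are the real data
posterior_sample = abc_rejection(observed, n_proposals=5000, tolerance=0.2, rng=rng)
posterior_mean = statistics.fmean(posterior_sample)
```

The accepted values of `theta` approximate a posterior sample; a smaller tolerance gives a closer approximation at the cost of fewer accepted draws, which is the trade-off that scalable ABC variants try to improve.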
112	Statistical Analysis of Structured High-dimensional Data Sun, Yizhi 05 October 2018 (has links) High-dimensional data such as multi-modal neuroimaging data and large-scale networks carry a vast amount of information, and can be used to test various scientific hypotheses or discover important patterns in complicated systems. While considerable efforts have been made to analyze high-dimensional data, existing approaches often rely on simple summaries which could miss important information, and many challenges in modeling complex structures in data remain unaddressed. In this proposal, we focus on analyzing structured high-dimensional data, including functional data with important local regions and network data with community structures. The first part of this dissertation concerns the detection of “important” regions in functional data. We propose a novel Bayesian approach that enables region selection in the functional data regression framework. The selection of regions is achieved through encouraging sparse estimation of the regression coefficient, where nonzero regions correspond to regions that are selected. To achieve sparse estimation, we adopt a compactly supported and potentially over-complete basis to capture local features of the regression coefficient function, and assume a spike-slab prior on the coefficients of the basis functions. To encourage continuous shrinkage of nearby regions, we assume an Ising hyper-prior which takes into account the neighboring structure of the basis functions. This neighboring structure is represented by an undirected graph. We perform posterior sampling through Markov chain Monte Carlo algorithms. The practical performance of the proposed approach is demonstrated through simulations as well as near-infrared and sonar data. 
The second part of this dissertation focuses on constructing diversified portfolios using stock return data in the Center for Research in Security Prices (CRSP) database maintained by the University of Chicago. Diversification is a risk management strategy that involves mixing a variety of financial assets in a portfolio. This strategy helps reduce the overall risk of the investment and improve performance of the portfolio. To construct portfolios that effectively diversify risks, we first construct a co-movement network using the correlations between stock returns over a training time period. Correlation characterizes the synchrony among stock returns and thus helps us understand whether two or multiple stocks have common risk attributes. Based on the co-movement network, we apply multiple network community detection algorithms to detect groups of stocks with common co-movement patterns. Stocks within the same community tend to be highly correlated, while stocks across different communities tend to be less correlated. A portfolio is then constructed by selecting stocks from different communities. The average return of the constructed portfolio over a testing time period is finally compared with the S&P 500 market index. Our constructed portfolios demonstrate outstanding performance during a non-crisis period (2004-2006) and good performance during a financial crisis period (2008-2010). / PHD / High-dimensional data, which are composed of data points with a tremendous number of features (a.k.a. attributes, independent variables, explanatory variables), bring challenges to statistical analysis due to their “high-dimensionality” and complicated structure. In this dissertation work, I consider two types of high-dimensional data. The first type is functional data in which each observation is a function. The second type is network data whose internal structure can be described as a network. 
I aim to detect “important” regions in functional data by using a novel statistical model, and I treat stock market data as network data to construct quality portfolios efficiently. Bayesian Variable Selection Community Detection Compactly Supported Basis Functional Data Analysis Ising Prior MCMC Network Data Analysis Portfolio Theory Region Selection
113	Modèles de mélange pour la régression en grande dimension, application aux données fonctionnelles / High-dimensional mixture regression models, application to functional data Devijver, Emilie 02 July 2015 (has links) Les modèles de mélange pour la régression sont utilisés pour modéliser la relation entre la réponse et les prédicteurs, pour des données issues de différentes sous-populations. Dans cette thèse, on étudie des prédicteurs de grande dimension et une réponse de grande dimension. Tout d’abord, on obtient une inégalité oracle ℓ1 satisfaite par l’estimateur du Lasso. On s’intéresse à cet estimateur pour ses propriétés de régularisation ℓ1. On propose aussi deux procédures pour pallier ce problème de classification en grande dimension. La première procédure utilise l’estimateur du maximum de vraisemblance pour estimer la densité conditionnelle inconnue, en se restreignant aux variables actives sélectionnées par un estimateur de type Lasso. La seconde procédure considère la sélection de variables et la réduction de rang pour diminuer la dimension. Pour chaque procédure, on obtient une inégalité oracle, qui explicite la pénalité nécessaire pour sélectionner un modèle proche de l’oracle. On étend ces procédures au cas des données fonctionnelles, où les prédicteurs et la réponse peuvent être des fonctions. Dans ce but, on utilise une approche par ondelettes. Pour chaque procédure, on fournit des algorithmes, et on applique et évalue nos méthodes sur des simulations et des données réelles. En particulier, on illustre la première méthode par des données de consommation électrique. / Finite mixture regression models are useful for modeling the relationship between a response and predictors, arising from different subpopulations. In this thesis, we focus on high-dimensional predictors and a high-dimensional response. First of all, we provide an ℓ1-oracle inequality satisfied by the Lasso estimator. 
We focus on this estimator for its ℓ1-regularization properties rather than for the variable selection procedure. We also propose two procedures to deal with this high-dimensional clustering problem. The first procedure estimates the unknown conditional mixture density by a maximum likelihood estimator, restricted to the relevant variables selected by an ℓ1-penalized maximum likelihood estimator. The second procedure jointly considers predictor selection and rank reduction to obtain lower-dimensional approximations of the parameter matrices. For each procedure, we get an oracle inequality, which derives the penalty shape of the criterion, depending on the complexity of the random model collection. We extend these procedures to the functional case, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each situation, we provide algorithms, and we apply and evaluate our methods on both simulations and real datasets. In particular, we illustrate the first procedure on an electricity load consumption dataset. Modèles de mélange en régression Classification non supervisée Grande dimension Sélection de variables Sélection de modèles Inégalité oracle Données fonctionnelles Consommation électrique Ondelettes Mixture regression models Clustering High dimension Variable selection Model selection Oracle inequality Functional data Electricity consumption Wavelets
114	Modélisation flexible du risque d’événements iatrogènes radio-induits / Flexible modeling of radiation-induced adverse events risk Benadjaoud, Mohamed Amine 27 March 2015 (has links) La radiothérapie occupe une place majeure dans l’arsenal thérapeutique des cancers. Malgré des progrès technologiques importants depuis près de vingt ans, des tissus sains au voisinage ou à distance de la tumeur cible continuent à être inévitablement irradiés à des niveaux de doses très différents. Ces doses sont à l’origine d’effets secondaires précoces (Œdème, radionécrose, Dysphagie, Cystite) ou tardifs (rectorragies, télangiectasie, effets carcinogènes, les pathologie cérébrovasculaires). Il est donc primordial de quantifier et de prévenir ces effets secondaires afin d'améliorer la qualité de vie des patients pendant et après leur traitement. La modélisation du risque d'événements iatrogènes radio-induits repose sur la connaissance précise de la distribution de doses au tissu sain d'intérêt ainsi que sur un modèle de risque capable d'intégrer un maximum d'informations sur le profil d'irradiation et des autres facteurs de risques non dosimétriques. L'objectif de ce travail de thèse a été de développer des méthodes de modélisation capables de répondre à des questions spécifiques aux deux aspects, dosimétriques et statistiques, intervenant dans la modélisation du risque de survenue d'événements iatrogènes radio-induits. Nous nous sommes intéressé dans un premier temps au développement d'un modèle de calcul permettant de déterminer avec précision la dose à distance due au rayonnements de diffusion et de fuite lors d'un traitement par radiothérapie externe et ce, pour différentes tailles des champs et à différentes distances de l'axe du faisceau. Ensuite, nous avons utilisé des méthodes d'analyse de données fonctionnelles pour développer un modèle de risque de toxicité rectales après irradiation de la loge prostatique. 
Le modèle proposé a montré des performances supérieures aux modèles de risque existants particulièrement pour décrire le risque de toxicités rectales de grade 3. Dans le contexte d'une régression de Cox flexible sur données réelles, nous avons proposé une application originale des méthodes de statistique fonctionnelle permettant d'améliorer les performances d'une modélisation via fonctions B-splines de la relation dose-effet entre la dose de radiation à la thyroïde. Nous avons également proposé dans le domaine de la radiobiologie une méthodes basée sur l’analyse en composantes principales multiniveau pour quantifier la part de la variabilité expérimentale dans la variabilité des courbes de fluorescence mesurées. / Radiotherapy plays a major role in the therapeutic arsenal against cancer. Despite significant advances in technology over nearly twenty years, healthy tissues near or away from the target tumor remain inevitably irradiated at very different dose levels. These doses are at the origin of early side effects (edema, radiation necrosis, dysphagia, cystitis) or late ones (rectal bleeding, telangiectasia, carcinogenic effects, cerebrovascular diseases). It is therefore essential to quantify and prevent these side effects to improve patients' quality of life after their cancer treatment. The objective of this thesis was to propose modelling methods able to answer specific questions raised by both aspects, dosimetry and statistics, involved in modeling the risk of developing radiation-induced iatrogenic pathologies. Our purpose was firstly to assess the out-of-field dose component related to head scatter radiation in high-energy photon therapy beams and then derive a multisource model for this dose component. For measured doses under out-of-field conditions, the average local difference between the calculated and measured photon dose is 10%, including doses as low as 0.01% of the maximum dose on the beam axis. 
Second, we described a novel method to explore radiation dose-volume effects. Functional data analysis is used to investigate the information contained in differential dose-volume histograms. The method is applied to the normal tissue complication probability modeling of rectal bleeding for In the flexible Cox model context, we proposed a new dimension reduction technique based on a functional principal component analysis to estimate a dose-response relationship. A two-stage knot selection scheme was performed: a potential set of knots is chosen based on information from the rotated functional principal components, and the final knot selection is then based on statistical model selection. Finally, a multilevel functional principal component analysis was applied to radiobiological data in order to quantify the experimental variability for replicate measurements of fluorescence signals of telomere length. Rayonnements ionisants Physique médicale Analyse de données fonctionnelles Estimation non paramétrique Analyse de survie Événements iatrogènes radio-induits Dose à distance Radio-induced iatrogenic events Out-of-field radiation dosimetry Functional data analysis Normal tissue complication probability Flexible survival analysis Radiobiology
115	Méthodologie de traitement et d'analyse de signaux expérimentaux d'émission acoustique : application au comportement d'un élément combustible en situation accidentelle / Methodology of treatment and analysis of experimental acoustic emission signals : application to the behavior of a fuel element in accident situation Traore, Oumar Issiaka 15 January 2018 (has links) L’objectif de cette thèse est de contribuer à l’amélioration du processus de dépouillement d’essais de sûreté visant étudier le comportement d'un combustible nucléaire en contexte d’accident d’injection de réactivité (RIA), via la technique de contrôle par émission acoustique. Il s’agit notamment d’identifier clairement les mécanismes physiques pouvant intervenir au cours des essais à travers leur signature acoustique. Dans un premier temps, au travers de calculs analytiques et des simulation numériques conduites au moyen d’une méthode d’éléments finis spectraux, l’impact du dispositif d’essais sur la propagation des ondes est étudié. Une fréquence de résonance du dispositif est identifiée. On établit également que les mécanismes basses fréquences ne sont pas impactés par le dispositif d'essais. En second lieu, diverses techniques de traitement du signal (soustraction spectrale, analyse spectrale singulière, ondelettes. . . ) sont expérimentées, afin de proposer des outils permettant de traiter différent types de bruit survenant lors des essais RIA. La soustraction spectrale s’avère être la méthode la plus robuste aux changements de nature du bruit, avec un fort potentiel d’amélioration du rapport signal-à-bruit. Enfin, des méthodes d’analyse de données multivariées et d’analyse de données fonctionnelles ont été appliquées, afin de proposer un algorithme de classification statistique permettant de mieux comprendre la phénoménologie des accidents de type RIA et d’identifier les mécanismes physiques. 
Selon l’approche (multivariée ou fonctionnelle), les algorithmes obtenus permettent de reconnaître le mécanisme associé à une salve dans plus de 80% des cas. / The objective of the thesis is to contribute to the improvement of the monitoring process of nuclear safety experiments dedicated to studying the behavior of the nuclear fuel in a reactivity initiated accident (RIA) context, by using the acoustic emission technique. In particular, we want to identify the physical mechanisms occurring during the experiments through their acoustic signatures. Firstly, analytical derivations and numerical simulations using the spectral finite element method have been performed in order to evaluate the impact of the wave travel path in the test device on the recorded signals. A resonant frequency has been identified and it has been shown that the geometry and the configuration of the test device may not influence the wave propagation in the low frequency range. Secondly, signal processing methods (spectral subtraction, singular spectrum analysis, wavelets, …) have been explored in order to propose different denoising strategies according to the type of noise observed during the experiments. If we consider only the global SNR improvement ratio, the spectral subtraction method is the most robust to changes in the stochastic behavior of noise. Finally, classical multivariate and functional data analysis tools are used in order to create a machine learning algorithm that contributes to a better understanding of the phenomenology of RIA accidents. Depending on the method (multivariate or functional), the obtained algorithms identify the mechanisms in more than 80% of cases. 
Émission Acoustique Soustraction spectrale Analyse de données fonctionnelles Analyse de données multivariées Data mining Clustering Environnement nucléaire Modélisation numérique Acoustic emission Spectral subtraction Functional data analysis Multivariate data analysis Data mining Clustering Nuclear environment Numerical modeling 534
116	L'approche Support Vector Machines (SVM) pour le traitement des données fonctionnelles / Support Vector Machines (SVM) for Fonctional Data Analysis Henchiri, Yousri 16 October 2013 (has links) L'Analyse des Données Fonctionnelles est un domaine important et dynamique en statistique. Elle offre des outils efficaces et propose de nouveaux développements méthodologiques et théoriques en présence de données de type fonctionnel (fonctions, courbes, surfaces, ...). Le travail exposé dans cette thèse apporte une nouvelle contribution aux thèmes de l'apprentissage statistique et des quantiles conditionnels lorsque les données sont assimilables à des fonctions. Une attention particulière a été réservée à l'utilisation de la technique Support Vector Machines (SVM). Cette technique fait intervenir la notion d'Espace de Hilbert à Noyau Reproduisant. Dans ce cadre, l'objectif principal est d'étendre cette technique non-paramétrique d'estimation aux modèles conditionnels où les données sont fonctionnelles. Nous avons étudié les aspects théoriques et le comportement pratique de la technique présentée et adaptée sur les modèles de régression suivants. Le premier modèle est le modèle fonctionnel de quantiles de régression quand la variable réponse est réelle, les variables explicatives sont à valeurs dans un espace fonctionnel de dimension infinie et les observations sont i.i.d.. Le deuxième modèle est le modèle additif fonctionnel de quantiles de régression où la variable d'intérêt réelle dépend d'un vecteur de variables explicatives fonctionnelles. Le dernier modèle est le modèle fonctionnel de quantiles de régression quand les observations sont dépendantes. Nous avons obtenu des résultats sur la consistance et les vitesses de convergence des estimateurs dans ces modèles. Des simulations ont été effectuées afin d'évaluer la performance des procédures d'inférence. Des applications sur des jeux de données réelles ont été considérées. 
Le bon comportement de l'estimateur SVM est ainsi mis en évidence. / Functional Data Analysis is an important and dynamic area of statistics. It offers effective new tools and proposes new methodological and theoretical developments in the presence of functional-type data (functions, curves, surfaces, ...). The work outlined in this dissertation provides a new contribution to the themes of statistical learning and quantile regression when data can be considered as functions. Special attention is devoted to the use of the Support Vector Machines (SVM) technique, which involves the notion of a Reproducing Kernel Hilbert Space. In this context, the main goal is to extend this nonparametric estimation technique to conditional models that take into account functional data. We investigated the theoretical aspects and practical behavior of the proposed technique, adapted to the following regression models. The first model is the conditional quantile functional model when the covariate takes its values in a bounded subspace of the functional space of infinite dimension, the response variable takes its values in a compact subset of the real line, and the observations are i.i.d. The second model is the functional additive quantile regression model where the response variable depends on a vector of functional covariates. The last model is the conditional quantile functional model in the dependent functional data case. We obtained the weak consistency and a convergence rate of these estimators. Simulation studies are performed to evaluate the performance of the inference procedures. Applications to chemometrics, environmental and climatic data analysis are considered. The good behavior of the SVM estimator is thus highlighted. 
Analyse des Données Fonctionnelles Support Vector Machines Quantiles de régression Apprentissage statistique Apprentissage supervisé Espace de Hilbert à noyau reproduisant Functional Data Analysis Support Vector Machines Quantile Regression Statistical learning Supervised learning Reproducing kernel Hilbert space
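Kernel-based quantile regression estimators are typically built on the check (pinball) loss; that the thesis in entry 116 uses exactly this loss is an assumption here, since the abstract does not say so. The sketch below shows only the loss itself and the fact that minimizing it over a constant recovers an empirical quantile. It does not implement the RKHS estimator for functional covariates, and the function names are hypothetical.

```python
def pinball_loss(residual, tau):
    """Check (pinball) loss: tau * r if r >= 0, else (tau - 1) * r."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def empirical_quantile_by_loss(values, tau):
    """Return the sample value that minimizes the total pinball loss."""
    def total_loss(q):
        # Residuals are y - q: under-prediction is weighted by tau,
        # over-prediction by (1 - tau).
        return sum(pinball_loss(y - q, tau) for y in values)
    return min(values, key=total_loss)


data = list(range(101))                       # 0, 1, ..., 100
q90 = empirical_quantile_by_loss(data, 0.9)   # constant minimizing the 0.9-level loss
```

Replacing the constant `q` by a function of the covariate taken from a reproducing kernel Hilbert space, and adding a norm penalty, gives the general shape of SVM-type quantile regression.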
117	A multi-scale assessment of spatial-temporal change in the movement ecology and habitat of a threatened Grizzly Bear (Ursus arctos) population in Alberta, Canada Bourbonnais, Mathieu Louis 31 August 2018 (has links) Given current rates of anthropogenic environmental change, combined with the increasing lethal and non-lethal mortality threat that human activities pose, there is a vital need to understand wildlife movement and behaviour in human-dominated landscapes to help inform conservation efforts and wildlife management. As long-term monitoring of wildlife populations using Global Positioning System (GPS) telemetry increases, there are new opportunities to quantify change in wildlife movement and behaviour. The objective of this PhD research is to develop novel methodological approaches for quantifying change in spatial-temporal patterns of wildlife movement and habitat by leveraging long time series of GPS telemetry and remotely sensed data. Analyses were focused on the habitat and movement of individuals in the threatened grizzly bear (Ursus arctos) population of Alberta, Canada, which occupies a human-dominated and heterogeneous landscape. Using methods in functional data analysis, a multivariate regionalization approach was developed that effectively summarizes complex spatial-temporal patterns associated with landscape disturbance, as well as recovery, which is often left unaccounted in studies quantifying patterns associated with disturbance. Next, the quasi-experimental framework afforded by a hunting moratorium was used to compare the influence of lethal (i.e., hunting) and non-lethal (i.e., anthropogenic disturbance) human-induced risk on antipredator behaviour of an apex predator, the grizzly bear. In support of the predation risk allocation hypothesis, male bears significantly decrease risky daytime behaviours by 122% during periods of high lethal human-induced risk. 
Rapid behavioural restoration occurred following the end of the hunt, characterized by diel bimodal movement patterns which may promote coexistence of large predators in human-dominated landscapes. A multi-scale approach using hierarchical Bayesian models, combined with post hoc trend tests and change point detection, was developed to test the influence of landscape disturbance and conditions on grizzly bear home range and movement selection over time. The results, representing the first longitudinal empirical analysis of grizzly bear habitat selection, revealed selection for habitat security at broad scales and for resource availability and habitat permeability at finer spatial scales, which has influenced potential landscape connectivity over time. Finally, combining approaches in movement ecology and conservation physiology, a body condition index was used to characterize how the physiological condition (i.e., internal state) of grizzly bears influences behavioral patterns due to costs and benefits associated with risk avoidance and resource acquisition. The results demonstrated individuals in poorer condition were more likely to engage in risky behaviour associated with anthropogenic disturbance, which highlights complex challenges for carnivore conservation and management of human-carnivore conflict. 
In summary, this dissertation contributes 1) a multivariate regionalization approach for quantifying spatial-temporal patterns of landscape disturbance and recovery applicable across diverse natural systems, 2) support for the growing theory that apex predators modify behavioural patterns to account for temporal overlap with lethal and non-lethal human-induced risk associated with humans, 3) an integrated approach for considering multi-scale spatial-temporal change in patterns of wildlife habitat selection and landscape connectivity associated with landscape change, 4) a cross-disciplinary framework for considering the impacts of the internal state on behavioural patterns and risk tolerance. / Graduate Movement ecology Habitat Bayesian models Functional data analysis Geographic Information Sciences Remote Sensing Landsat Spatial-temporal patterns Grizzly bears Landscape disturbance and change Risk Hidden Markov model Hunting GPS telemetry
118	Algorithmes stochastiques pour la statistique robuste en grande dimension / Stochastic algorithms for robust statistics in high dimension Godichon-Baggioni, Antoine 17 June 2016 (has links) Cette thèse porte sur l'étude d'algorithmes stochastiques en grande dimension ainsi qu'à leur application en statistique robuste. Dans la suite, l'expression grande dimension pourra aussi bien signifier que la taille des échantillons étudiés est grande ou encore que les variables considérées sont à valeurs dans des espaces de grande dimension (pas nécessairement finie). Afin d'analyser ce type de données, il peut être avantageux de considérer des algorithmes qui soient rapides, qui ne nécessitent pas de stocker toutes les données, et qui permettent de mettre à jour facilement les estimations. Dans de grandes masses de données en grande dimension, la détection automatique de points atypiques est souvent délicate. Cependant, ces points, même s'ils sont peu nombreux, peuvent fortement perturber des indicateurs simples tels que la moyenne ou la covariance. On va se concentrer sur des estimateurs robustes, qui ne sont pas trop sensibles aux données atypiques. Dans une première partie, on s'intéresse à l'estimation récursive de la médiane géométrique, un indicateur de position robuste, et qui peut donc être préférée à la moyenne lorsqu'une partie des données étudiées est contaminée. Pour cela, on introduit un algorithme de Robbins-Monro ainsi que sa version moyennée, avant de construire des boules de confiance non asymptotiques et d'exhiber leurs vitesses de convergence $L^{p}$ et presque sûre.La deuxième partie traite de l'estimation de la "Median Covariation Matrix" (MCM), qui est un indicateur de dispersion robuste lié à la médiane, et qui, si la variable étudiée suit une loi symétrique, a les mêmes sous-espaces propres que la matrice de variance-covariance. 
Ces dernières propriétés rendent l'étude de la MCM particulièrement intéressante pour l'Analyse en Composantes Principales Robuste. On va donc introduire un algorithme itératif qui permet d'estimer simultanément la médiane géométrique et la MCM ainsi que les $q$ principaux vecteurs propres de cette dernière. On donne, dans un premier temps, la forte consistance des estimateurs de la MCM avant d'exhiber les vitesses de convergence en moyenne quadratique. Dans une troisième partie, en s'inspirant du travail effectué sur les estimateurs de la médiane et de la "Median Covariation Matrix", on exhibe les vitesses de convergence presque sûre et $L^{p}$ des algorithmes de gradient stochastiques et de leur version moyennée dans des espaces de Hilbert, avec des hypothèses moins restrictives que celles présentes dans la littérature. On présente alors deux applications en statistique robuste: estimation de quantiles géométriques et régression logistique robuste. Dans la dernière partie, on cherche à ajuster une sphère sur un nuage de points répartis autour d'une sphère complète ou tronquée. Plus précisément, on considère une variable aléatoire ayant une distribution sphérique tronquée, et on cherche à estimer son centre ainsi que son rayon. Pour ce faire, on introduit un algorithme de gradient stochastique projeté et son moyenné. Sous des hypothèses raisonnables, on établit leurs vitesses de convergence en moyenne quadratique ainsi que la normalité asymptotique de l'algorithme moyenné. / This thesis focuses on stochastic algorithms in high dimension as well as their application in robust statistics. In what follows, the expression high dimension may be used when the size of the studied sample is large or when the variables we consider take values in high-dimensional spaces (not necessarily finite). 
In order to analyze these kind of data, it can be interesting to consider algorithms which are fast, which do not need to store all the data, and which allow to update easily the estimates. In large sample of high dimensional data, outliers detection is often complicated. Nevertheless, these outliers, even if they are not many, can strongly disturb simple indicators like the mean and the covariance. We will focus on robust estimates, which are not too much sensitive to outliers.In a first part, we are interested in the recursive estimation of the geometric median, which is a robust indicator of location which can so be preferred to the mean when a part of the studied data is contaminated. For this purpose, we introduce a Robbins-Monro algorithm as well as its averaged version, before building non asymptotic confidence balls for these estimates, and exhibiting their $L^{p}$ and almost sure rates of convergence.In a second part, we focus on the estimation of the Median Covariation Matrix (MCM), which is a robust dispersion indicator linked to the geometric median. Furthermore, if the studied variable has a symmetric law, this indicator has the same eigenvectors as the covariance matrix. This last property represent a real interest to study the MCM, especially for Robust Principal Component Analysis. We so introduce a recursive algorithm which enables us to estimate simultaneously the geometric median, the MCM, and its $q$ main eigenvectors. We give, in a first time, the strong consistency of the estimators of the MCM, before exhibiting their rates of convergence in quadratic mean.In a third part, in the light of the work on the estimates of the median and of the Median Covariation Matrix, we exhibit the almost sure and $L^{p}$ rates of convergence of averaged stochastic gradient algorithms in Hilbert spaces, with less restrictive assumptions than in the literature. 
Then, two applications in robust statistics are given: estimation of the geometric quantiles and application in robust logistic regression.In the last part, we aim to fit a sphere on a noisy points cloud spread around a complete or truncated sphere. More precisely, we consider a random variable with a truncated spherical distribution, and we want to estimate its center as well as its radius. In this aim, we introduce a projected stochastic gradient algorithm and its averaged version. We establish the strong consistency of these estimators as well as their rates of convergence in quadratic mean. Finally, the asymptotic normality of the averaged algorithm is given. Grande Dimension Données Fonctionnelles Algorithmes Stochastiques Algorithmes Récursifs Algorithmes de Gradient Stochastiques Moyennisation Statistique Robuste Médiane Géométrique High Dimension Functional Data Stochastic Algorithms Recursive Algorithms Stochastic Gradient Algorithms Averaging Robust Statistics Geometric Median 519
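The averaged Robbins-Monro recursion mentioned in the abstract of entry 118 can be illustrated with a minimal sketch. It assumes the standard form of the recursion, m_{n+1} = m_n + gamma_n (X_{n+1} - m_n) / ||X_{n+1} - m_n||, with step size gamma_n = c n^(-alpha). The function name, the default values of `c` and `alpha`, and the simulated data are illustrative choices, not taken from the thesis.

```python
import math
import random


def averaged_robbins_monro_median(samples, c=1.0, alpha=0.66):
    """Return (last iterate, averaged iterate) of a Robbins-Monro recursion
    for the geometric median, reading the samples one at a time."""
    stream = iter(samples)
    m = list(next(stream))   # initialise at the first observation (a choice)
    m_bar = list(m)          # running average of the iterates
    for n, x in enumerate(stream, start=1):
        diff = [xi - mi for xi, mi in zip(x, m)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0.0:
            step = c / n ** alpha   # gamma_n = c * n^(-alpha), with 1/2 < alpha < 1
            # Move a distance `step` along the unit vector pointing at x.
            m = [mi + step * d / norm for mi, d in zip(m, diff)]
        # Online mean of the iterates m_0, ..., m_n.
        m_bar = [bi + (mi - bi) / (n + 1) for bi, mi in zip(m_bar, m)]
    return m, m_bar


# Hypothetical data: Gaussian noise around `center`, about 5% shifted outliers.
rng = random.Random(0)
center = [1.0, -2.0, 0.5]
data = []
for i in range(20000):
    point = [ci + rng.gauss(0.0, 1.0) for ci in center]
    if i > 0 and rng.random() < 0.05:
        point = [p + 20.0 for p in point]   # contaminated observation
    data.append(point)

_, median_estimate = averaged_robbins_monro_median(data)
mean_estimate = [sum(col) / len(data) for col in zip(*data)]
```

Each observation is used once and then discarded, so the estimate is updated without storing the data. In this simulation the averaged iterate should stay close to `center`, while the empirical mean is pulled toward the outliers.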
119	Exploration de données pour l'optimisation de trajectoires aériennes / Data analysis for aircraft trajectory optimization Rommel, Cédric 26 October 2018 (has links) This thesis deals with the use of flight data for the optimization of climb trajectories with respect to fuel consumption.
We first focus on identifying models of the aircraft dynamics, in order to use them to pose the trajectory optimization problem to be solved. We begin with a static formulation of the identification problem, which we interpret as a multi-task regression problem with latent structure and for which we propose a parametric model. The parameters are estimated with several variations of the maximum likelihood method. In this context we also suggest using variable selection methods to build a data-dependent polynomial regression model structure. The proposed approach extends the bootstrap Lasso to a structured multi-task setting. It allows a consistent selection of the monomials despite the high correlations among them, while preserving the problem structure that comes from domain knowledge.
Next, we consider the problem of characterizing the solutions of the trajectory optimization problem relative to the validity region of the identified models. To this end, we propose a probabilistic criterion for quantifying the closeness between an arbitrary curve and a set of trajectories sampled from the same stochastic process. We propose a class of estimators of this quantity and prove their consistency in some sense. A nonparametric implementation based on kernel density estimators and a parametric implementation based on Gaussian mixtures are presented. We introduce the latter as a penalty term in the trajectory optimization criterion. This lets us control the trade-off between trajectory acceptability and consumption reduction, and directly obtain trajectories that consume little fuel without straying too far from the validity regions. Trajectory optimization System identification Structured feature selection Multi-Task learning Density estimation Functional data analysis 519
120	Contributions à la modélisation de données spatiales et fonctionnelles : applications / Contributions to modeling spatial and functional data : applications Ternynck, Camille 28 November 2014 (has links) In this dissertation, we are interested in the nonparametric modeling of spatial and/or functional data, more specifically with kernel methods. In general, the samples considered for establishing the asymptotic properties of the proposed estimators consist of dependent variables. The specificity of the methods studied is that the estimators take into account the dependence structure of the data.
In the first part, we study spatially dependent real-valued variables. We propose a new kernel approach for estimating spatial probability density and regression functions, as well as the mode. The distinctive feature of this approach is that it takes into account both the proximity between observations and the proximity between sites. We study the asymptotic behavior of the proposed estimators and their application to simulated and real data.
In the second part, we are interested in modeling data taking values in an infinite-dimensional space, so-called "functional data". First, we adapt the nonparametric regression model introduced in the first part to spatially dependent functional data, and give asymptotic as well as numerical results. Second, we study a time series regression model in which the explanatory variables are functional and the innovation process is autoregressive. We propose a procedure that takes into account the information contained in the error process. After studying the asymptotic behavior of the proposed kernel estimator, we analyze its performance on simulated and then real data.
The third part is devoted to applications. First, we present unsupervised classification results for simulated and real (multivariate) spatial data. The classification method is based on estimating the spatial mode, obtained from the spatial density estimator introduced in the first part of this thesis. Then, we apply this mode-based classification method, along with other unsupervised classification methods from the literature, to hydrological data of a functional nature. Lastly, this classification of the hydrological data led us to apply change point detection tools to these functional data. Nonparametric estimation Kernel estimate Probability density Mode Regression Spatial statistics Functional data Time series Unsupervised classification Change point detection
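As a rough illustration of a kernel density estimate that accounts for both the proximity between observations and the proximity between sites (entry 120), the sketch below weights each observation by a kernel on the distance between its site and the target site. The product-kernel form, the Gaussian kernels, the normalisation by the sum of site weights, and all names (`spatial_kernel_density`, `h`, `rho`) are assumptions made for illustration. The estimator studied in the thesis may differ.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel on the real line."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def spatial_kernel_density(x, site, values, sites, h, rho):
    """Density estimate at value x for location `site`.

    h   : bandwidth on the observed values
    rho : bandwidth on the distances between sites
    """
    # Site weights: observations recorded close to `site` count more.
    weights = [gaussian_kernel(math.dist(site, s) / rho) for s in sites]
    total = sum(weights)
    # Weighted average of value kernels; it integrates to 1 in x.
    return sum(
        w * gaussian_kernel((x - v) / h) / h for w, v in zip(weights, values)
    ) / total


# Hypothetical data: two spatial clusters with different value levels.
values = [-0.5, 0.0, 0.5, 4.5, 5.0, 5.5]
sites = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

density_at_0 = spatial_kernel_density(0.0, (0, 0), values, sites, h=0.5, rho=2.0)
density_at_5 = spatial_kernel_density(5.0, (0, 0), values, sites, h=0.5, rho=2.0)
```

At the site `(0, 0)`, values near 0 receive most of the mass. The observations around 5 were recorded far away and contribute almost nothing, which a density estimate that ignores the sites would not capture.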
