61

Pénalités minimales pour la sélection de modèle / Minimal penalties for model selection

Sorba, Olivier 09 February 2017 (has links)
L. Birgé and P. Massart proved that the minimal penalty phenomenon occurs in Gaussian model selection when the model family arises from complete variable selection among independent variables. We extend some of their results to discrete Gaussian signal segmentation when the model family corresponds to a sufficiently rich family of partitions of the signal's support, as is the case for regression trees. We show that the same phenomenon occurs in the context of density estimation. The richness of the model family can be related to a certain form of isotropy, and in this respect the minimal penalty phenomenon is intrinsic. To corroborate and illustrate this point of view, we show that the minimal penalty phenomenon also occurs when the models are chosen randomly under an isotropic law.
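The dimension-jump behaviour behind minimal penalties can be illustrated numerically. The sketch below uses complete variable selection among independent Gaussian variables with a simplified penalty shape pen(k) = C·σ²·k·(1 + 2 log(n/k)); the exact penalty forms and constants in Birgé and Massart's results differ, so this illustrates the phenomenon rather than their theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1000, 1.0
theta = np.zeros(n)
theta[:20] = 4.0                                 # 20 true signal coordinates
y = theta + sigma * rng.standard_normal(n)

# For complete variable selection, the best model of dimension k keeps
# the k largest y_i^2, so the fit term is a cumulative sum.
y2 = np.sort(y ** 2)[::-1]
fit = np.cumsum(y2)
k = np.arange(1, n + 1)

for C in [0.5, 0.8, 1.0, 1.2, 2.0]:
    pen = C * sigma ** 2 * k * (1 + 2 * np.log(n / k))   # simplified penalty shape
    k_hat = int(np.argmax(fit - pen)) + 1
    print(f"penalty constant C = {C:3.1f} -> selected dimension {k_hat}")
```

Below the minimal constant the selected dimension stays near n; above it, it collapses to roughly the number of true signal coordinates — the dimension jump.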
62

Non-global regression modelling

Huang, Yunkai 21 June 2016 (has links)
In this dissertation, a new non-global regression model - the partial linear threshold regression model (PLTRM) - is proposed, and various issues related to it are discussed. In the first main section of the dissertation (Chapter 2), we define what is meant by the term "non-global regression model" and briefly review the current literature on such models, focusing on their advantages and disadvantages in terms of statistical properties. Because the existing non-global regression models have some weaknesses, we propose the PLTRM. The PLTRM combines non-parametric modelling with the traditional threshold regression models (TRMs), and hence can be thought of as an extension of the latter. We verify the performance of the PLTRM through a series of Monte Carlo simulation experiments. These experiments use a simulated data set that exhibits partly linear and partly nonlinear characteristics, and the PLTRM outperforms several competing parametric and non-parametric models in terms of the mean squared error (MSE) of the within-sample fit. In the second main section of this dissertation (Chapter 3), we propose a method of estimation for the PLTRM. This requires estimating the parameters of the parametric part of the model, estimating the threshold, and fitting the non-parametric component. An "unbalanced penalized least squares" approach is used: restricted penalized regression spline and smoothing spline techniques for the non-parametric component of the model, the least squares method for the linear parametric part, and a search procedure to estimate the threshold value. This estimation procedure is discussed for three mutually exclusive situations, classified according to the way in which the two components of the PLTRM "join" at the threshold. Bootstrap sampling distributions of the estimators are provided using the parametric bootstrap technique. The various estimators appear to have good sampling properties in most of the situations considered. Inference issues such as hypothesis testing and confidence-interval construction for the PLTRM are also investigated. In the third main section of the dissertation (Chapter 4), we illustrate the usefulness of the PLTRM, and the application of the proposed estimation methods, by modelling various real-world data sets. These examples demonstrate both the good statistical performance and the great application potential of the PLTRM. / Graduate
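A highly simplified sketch of the kind of profile search such threshold estimation involves: fit a linear model below a candidate threshold and a smoothing spline above it, and pick the threshold minimizing the total sum of squared errors. The PLTRM's actual procedure (restricted penalized regression splines, join conditions at the threshold, the unbalanced penalty) is richer; everything here, including the smoothing parameter, is an illustrative assumption.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
n = 300
x = np.sort(rng.uniform(0, 10, n))
tau_true = 4.0
y = np.where(x <= tau_true, 1.0 + 0.5 * x, 3.0 + np.sin(x)) + 0.2 * rng.standard_normal(n)

def profile_sse(tau):
    left, right = x <= tau, x > tau
    if left.sum() < 10 or right.sum() < 10:        # keep enough points on each side
        return np.inf
    # linear fit on the left of the threshold (the parametric component)
    coef = np.polyfit(x[left], y[left], 1)
    sse = np.sum((y[left] - np.polyval(coef, x[left])) ** 2)
    # smoothing-spline fit on the right (the non-parametric component)
    spl = UnivariateSpline(x[right], y[right], s=right.sum() * 0.04)
    sse += np.sum((y[right] - spl(x[right])) ** 2)
    return sse

grid = np.quantile(x, np.linspace(0.1, 0.9, 81))   # candidate thresholds
tau_hat = min(grid, key=profile_sse)
print(f"estimated threshold: {tau_hat:.2f} (true {tau_true})")
```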
63

High-dimensional inference of ordinal data with medical applications

Jiao, Feiran 01 May 2016 (has links)
Ordinal response variables abound in scientific and quantitative analyses. Their outcomes comprise a few categorical values that admit a natural ordering, so they are often represented by non-negative integers, for instance pain score (0-10) or disease severity (0-4) in medical research. Ordinal variables differ from rational variables in that their values delineate qualitative rather than quantitative differences. In this thesis, we develop new statistical methods for variable selection in a high-dimensional cumulative link regression model with an ordinal response. Our study is partly motivated by the need to explore the association structure between disease phenotype and high-dimensional medical covariates. The cumulative link regression model specifies that the ordinal response of interest results from an order-preserving quantization of some latent continuous variable that bears a linear regression relationship with a set of covariates. Commonly used error distributions in the latent regression include the normal distribution, the logistic distribution, the Cauchy distribution, and the standard Gumbel (minimum) distribution. The cumulative link model with normal (logistic, Gumbel) errors is also known as the ordered probit (ordered logit, complementary log-log) model. While the likelihood function has a closed form for these error distributions, its strong nonlinearity sometimes causes direct optimization of the likelihood to fail. To mitigate this problem, and to facilitate the extension to penalized likelihood estimation, we propose specific minorization-maximization (MM) algorithms for maximum likelihood estimation of a cumulative link model for each of the four error distributions. Penalized ordinal regression models play a role when variable selection needs to be performed. In some applications, covariates may be grouped in some meaningful way, but some groups may be mixed in that they contain both relevant and irrelevant variables, i.e., variables whose coefficients are non-zero and zero, respectively. It is therefore pertinent to develop a consistent method for simultaneously selecting the relevant groups and the relevant variables within each selected group, the so-called bi-level selection problem. We propose a penalized maximum likelihood approach with a composite bridge penalty to solve the bi-level selection problem in a cumulative link model, and develop an MM algorithm, specific to each of the four error distributions, for implementing the proposed method. The proposed approach is shown to enjoy a number of desirable theoretical properties, including bi-level selection consistency and oracle properties, under suitable regularity conditions. Simulations demonstrate that the proposed method enjoys good empirical performance, and we illustrate the proposed methods with several real medical applications.
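The cumulative link likelihood described above is easy to write down directly. The sketch below fits an ordered probit by generic numerical optimization rather than by the thesis's MM algorithms — precisely the kind of direct optimization the abstract notes can fail — and is included only to make the model concrete. The log-increment parameterization of the cutpoints is an implementation choice, not from the source.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n, p, K = 500, 3, 4                          # K ordinal categories 0..K-1
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -0.5, 0.0])
cuts_true = np.array([-1.0, 0.0, 1.2])
z = X @ beta_true + rng.standard_normal(n)   # latent variable, probit link
y = np.searchsorted(cuts_true, z)            # order-preserving quantization

def nll(params):
    beta = params[:p]
    # strictly increasing cutpoints via a log-increment parameterization
    cuts = np.cumsum(np.concatenate(([params[p]], np.exp(params[p + 1:]))))
    eta = X @ beta
    upper = np.append(cuts, np.inf)[y] - eta
    lower = np.insert(cuts, 0, -np.inf)[y] - eta
    probs = norm.cdf(upper) - norm.cdf(lower)
    return -np.sum(np.log(np.clip(probs, 1e-12, None)))

fit = minimize(nll, np.zeros(p + K - 1), method="BFGS")
print("beta estimate:", np.round(fit.x[:p], 2))
```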
64

Bayesian Semiparametric Models For Nonignorable Missing Data Mechanisms In Logistic Regression

Ozturk, Olcay 01 May 2011 (has links) (PDF)
In this thesis, Bayesian semiparametric models are developed for the missing-data mechanisms of nonignorably missing covariates in logistic regression. In the missing-data literature, a fully parametric approach is used to model nonignorable missing-data mechanisms: a probit or logit link of the conditional probability of the covariate being missing is modeled as a linear combination of all variables, including the missing covariate itself. However, nonignorably missing covariates may not be linearly related to the probit (or logit) of this conditional probability. In our study, the relationship between the probit of the probability of the covariate being missing and the missing covariate itself is modeled using a semiparametric approach based on penalized spline regression. An efficient Markov chain Monte Carlo (MCMC) sampling algorithm to estimate the parameters is established, and WinBUGS code is constructed to sample from the full conditional posterior distributions of the parameters by Gibbs sampling. Monte Carlo simulation experiments under different true missing-data mechanisms are used to compare the bias and efficiency of the resulting estimators with those from the fully parametric approach. These simulations show that estimators for logistic regression using semiparametric missing-data models have better bias and efficiency properties than those using fully parametric missing-data models when the true relationship between the missingness and the missing covariate is nonlinear; the two are comparable when this relationship is linear.
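For intuition, here is a frequentist stand-in for the semiparametric missingness model: a logit (rather than the thesis's Bayesian probit) of the missingness probability, modeled as a penalized spline in the covariate and fitted by penalized IRLS instead of MCMC. The basis, knot placement, and penalty value are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 800
x = rng.uniform(-2, 2, n)                            # covariate subject to missingness
p_miss = 1 / (1 + np.exp(-(np.sin(2 * x) - 0.5)))    # nonlinear true mechanism
r = rng.binomial(1, p_miss)                          # missingness indicator

# truncated-line spline basis: intercept, x, (x - k)_+ for each knot k
knots = np.quantile(x, np.linspace(0.1, 0.9, 15))
B = np.column_stack([np.ones(n), x] + [np.maximum(x - k, 0) for k in knots])

lam = 5.0
P = np.zeros((B.shape[1], B.shape[1]))
P[2:, 2:] = lam * np.eye(len(knots))                 # penalize only the knot coefficients

beta = np.zeros(B.shape[1])
for _ in range(50):                                  # penalized IRLS (Newton) iterations
    mu = 1 / (1 + np.exp(-(B @ beta)))
    H = B.T @ (B * (mu * (1 - mu))[:, None]) + P     # penalized Hessian
    g = B.T @ (r - mu) - P @ beta                    # penalized score
    step = np.linalg.solve(H, g)
    beta += step
    if np.max(np.abs(step)) < 1e-8:
        break

i0 = np.argmin(np.abs(x))                            # point closest to x = 0
print("fitted P(missing) near x=0:", 1 / (1 + np.exp(-(B[i0] @ beta))))
```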
65

Développement d'outils statistiques pour l'analyse de données transcriptomiques par les réseaux de co-expression de gènes / A systemic approach to the statistical analysis of transcriptomic data through co-expression network analysis

Brunet, Anne-Claire 17 June 2016 (has links)
Today, new biotechnologies offer the opportunity to collect a large variety and volume of biological data (genomic, proteomic, metagenomic...), thus opening up new avenues for research into biological processes. In this thesis, we are specifically interested in transcriptomic data, which indicate the activity or expression level of several tens of thousands of genes in a given cell. The aim of the thesis was to propose appropriate statistical tools to analyse these high-dimensional data (n << p), collected on samples of very limited size relative to the very large number of variables (here, gene expression variables). The first part of the thesis is devoted to supervised learning methods, such as Breiman's random forests and penalized regression models, used in the high-dimensional setting to select the genes (expression variables) most relevant to the pathology under study. We discuss the limits of these methods for selecting genes that are relevant not only statistically but also biologically, in particular when selecting within groups of highly correlated variables, that is, within groups of co-expressed genes. Common supervised learning methods treat each gene as if it could act in isolation in the model, which is rarely realistic. An observable biological trait results from a set of reactions within a complex system in which genes interact with one another, and genes involved in the same biological function tend to be co-expressed (correlated expression). In a second part, we therefore turn to gene co-expression networks, in which two genes are linked if they are co-expressed. More precisely, we seek to identify communities of genes on these networks, that is, groups of co-expressed genes, and then to select the communities most relevant to the pathology, as well as the "key genes" of these communities. This aids biological interpretation, since a community of co-expressed genes can often be associated with a biological function. We propose an original and efficient approach that treats simultaneously the problem of modelling the gene co-expression network and that of detecting the gene communities in the network. We demonstrate the performance of our approach by comparing it with existing, popular methods for analysing gene co-expression networks (WGCNA and spectral methods). Finally, in the last part of the thesis, the analysis of a real data set shows that our approach yields results that are biologically convincing, more amenable to interpretation, and more robust than those obtained with classical supervised learning methods.
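A generic sketch of the co-expression-network pipeline the abstract describes: correlate expression profiles, link strongly co-expressed genes, and detect communities. This uses simple thresholding plus modularity maximization (via networkx), not the thesis's original simultaneous modelling-and-detection approach, nor WGCNA; the threshold and simulated data are assumptions.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(4)
n_samples, n_genes = 40, 60
# two blocks of co-expressed genes driven by shared latent factors
f1, f2 = rng.standard_normal((2, n_samples))
expr = rng.standard_normal((n_samples, n_genes)) * 0.7
expr[:, :25] += f1[:, None]
expr[:, 25:50] += f2[:, None]

corr = np.corrcoef(expr.T)                    # gene-gene co-expression matrix
G = nx.Graph()
G.add_nodes_from(range(n_genes))
thresh = 0.5
for i in range(n_genes):
    for j in range(i + 1, n_genes):
        if abs(corr[i, j]) > thresh:          # link co-expressed genes
            G.add_edge(i, j, weight=abs(corr[i, j]))

communities = greedy_modularity_communities(G, weight="weight")
for c, genes in enumerate(sorted(communities, key=len, reverse=True)[:3]):
    print(f"community {c}: {len(genes)} genes")
```

Within each detected community, hub genes (highest weighted degree) would be natural candidates for the "key genes" the abstract mentions.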
66

Schätzverfahren für individuelles Preissetzungsverhalten im Lebensmitteleinzelhandel / Estimation methods for individual price-setting behavior in the retail sector

Schulze Bisping, Christin 17 November 2017 (has links)
No description available.
67

Investigating Gene-Gene and Gene-Environment Interactions in the Association Between Overnutrition and Obesity-Related Phenotypes

Tessier, François January 2017 (has links)
Introduction – Animal studies have suggested that the NFKB1, SOCS3, and IKBKB genes could be involved in the association between overnutrition and obesity. This study investigates interactions involving these genes and nutrition that affect obesity-related phenotypes. Methods – We used multifactor dimensionality reduction (MDR) and penalized logistic regression (PLR) to detect gene-gene and gene-environment interactions in data from the Toronto Nutrigenomics and Health Study (n=1639), using dichotomized body mass index (BMI) and waist circumference (WC) as obesity-related phenotypes. Exposure variables included genotypes at 54 single nucleotide polymorphisms, dietary factors, and ethnicity. Results – MDR identified interactions between SOCS3 rs6501199 and rs4969172, and IKBKB rs3747811, affecting BMI in whites; between SOCS3 rs6501199 and NFKB1 rs1609798, affecting WC in whites; and between SOCS3 rs4436839 and IKBKB rs3747811, affecting WC in South Asians. PLR found a main effect of SOCS3 rs12944581 on BMI among South Asians. Conclusion – MDR and PLR gave different results, but both support some findings from previous studies.
68

Change-point detection and kernel methods / Détection de ruptures et méthodes à noyaux

Garreau, Damien 12 October 2017 (has links)
In this thesis, we focus on a method for detecting abrupt changes in a sequence of independent observations belonging to an arbitrary set on which a positive semidefinite kernel is defined. That method, kernel change-point detection, is a kernelized version of a penalized least-squares procedure. Our main contribution is to show that, for any kernel satisfying reasonably mild hypotheses, this procedure outputs a segmentation close to the true segmentation with high probability. This result is obtained under a boundedness assumption on the kernel, for a linear penalty as well as for another penalty function coming from model selection. The proofs rely on a concentration result for bounded random variables in Hilbert spaces, and we prove a weaker result under relaxed hypotheses (a finite-variance assumption). In the asymptotic setting, we show that we recover the usual minimax rate for the change-point locations without additional hypotheses on the segment sizes. We provide empirical evidence supporting these claims.
Another contribution of this thesis is a detailed presentation of the different notions of distance between segmentations; in particular, we prove that these notions coincide for sufficiently close segmentations. From a practical point of view, we demonstrate that the so-called dimension jump heuristic can be a reasonable choice of penalty constant when using kernel change-point detection with a linear penalty. We also show how a key quantity depending on the kernel, which appears in our theoretical results, influences the performance of kernel change-point detection in the case of a single change-point. When the kernel is translation-invariant and parametric assumptions are made, this quantity can be computed in closed form. Thanks to these computations, some of them novel, we are able to study precisely the behaviour of the maximal penalty constant. Finally, we study the median heuristic, a popular tool for setting the bandwidth of radial basis function kernels. For large sample sizes, we show that it behaves approximately as the median of a distribution that we describe completely in the settings of the kernel two-sample test and kernel change-point detection; more precisely, we show that the median heuristic is asymptotically normal around this value.
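A compact sketch of penalized kernel change-point detection with the median heuristic, assuming a Gaussian kernel, a linear penalty with an ad hoc constant (in practice one would tune it, e.g., by the dimension jump heuristic mentioned above), and a plain O(n²) dynamic program.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 60), rng.normal(3, 1, 60), rng.normal(0, 1, 60)])
n = len(x)

d = squareform(pdist(x[:, None]))
h = np.median(pdist(x[:, None]))             # the median heuristic bandwidth
K = np.exp(-d ** 2 / (2 * h ** 2))           # Gaussian (RBF) kernel matrix

S = np.cumsum(np.cumsum(K, axis=0), axis=1)  # 2-D prefix sums of K
diagc = np.cumsum(np.diag(K))

def block(a, b):                             # sum of K[a:b, a:b]
    tot = S[b - 1, b - 1]
    if a > 0:
        tot -= S[a - 1, b - 1] + S[b - 1, a - 1] - S[a - 1, a - 1]
    return tot

def cost(a, b):                              # kernel least-squares segment cost
    dsum = diagc[b - 1] - (diagc[a - 1] if a > 0 else 0.0)
    return dsum - block(a, b) / (b - a)

pen = 2.0 * np.log(n)                        # ad hoc linear penalty per segment
best = np.full(n + 1, np.inf); best[0] = 0.0
prev = np.zeros(n + 1, dtype=int)
for e in range(1, n + 1):                    # dynamic program over segmentations
    for s in range(e):
        v = best[s] + cost(s, e) + pen
        if v < best[e]:
            best[e], prev[e] = v, s

cps, e = [], n                               # backtrack the segment starts
while e > 0:
    s = prev[e]; cps.append(s); e = s
print("estimated change-points:", sorted(cps)[1:])   # drop the leading 0
```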
69

Evaluating Bag Of Little Bootstraps On Logistic Regression With Unbalanced Data

Bark, Henrik January 2023 (has links)
The Bag of Little Bootstraps (BLB) was introduced to make the bootstrap method more computationally efficient on massive data samples. Since its introduction, a broad spectrum of research on applications of the BLB has appeared. However, while the BLB has shown promising results for logistic regression, these results have been obtained on well-balanced data. There is therefore an obvious need for further research into how the BLB performs when the dependent variable is unbalanced, and whether possible performance issues can be remedied through methods such as Firth's penalized maximum likelihood estimation (PMLE). This thesis shows that imbalance in the dependent variable severely affects the BLB's performance when applied to logistic regression. Further, it shows that PMLE produces mixed and unreliable results when used to remedy the drops in performance.
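A minimal sketch of the BLB applied to logistic regression on unbalanced data: draw small subsets of size b ≈ n^0.6, resample each to size n via multinomial weights, fit a weighted logistic regression per resample, and average per-subset confidence-interval widths. The subset/resample counts and the simulated imbalance are assumptions, and a recent scikit-learn is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n, p = 20000, 5
X = rng.standard_normal((n, p))
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta - 3.0))))  # intercept -3 -> unbalanced
print("event rate:", y.mean())

b = int(n ** 0.6)            # little-bootstrap subsample size
s, r = 8, 50                 # number of subsets, resamples per subset
widths = []
for _ in range(s):
    idx = rng.choice(n, b, replace=False)            # one subset of size b
    ests = []
    for _ in range(r):
        w = rng.multinomial(n, np.full(b, 1 / b))    # resample weights summing to n
        # penalty=None requires scikit-learn >= 1.2; use penalty="none" on older versions
        m = LogisticRegression(penalty=None, max_iter=1000)
        m.fit(X[idx], y[idx], sample_weight=w)
        ests.append(m.coef_[0])
    lo, hi = np.percentile(np.array(ests), [2.5, 97.5], axis=0)
    widths.append(hi - lo)                           # per-subset 95% CI widths
print("BLB 95% CI widths per coefficient:", np.mean(widths, axis=0).round(3))
```

With a rare outcome, some subsets may contain very few events, which is exactly the failure mode the thesis investigates.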
70

Crash Risk Analysis of Coordinated Signalized Intersections

Qiming Guo (17582769) 08 December 2023 (has links)
<p dir="ltr">The emergence of time-dependent data provides researchers with unparalleled opportunities to investigate disaggregated levels of safety performance on roadway infrastructures. A disaggregated crash risk analysis uses both time-dependent data (e.g., hourly traffic, speed, weather conditions and signal controls) and fixed data (e.g., geometry) to estimate hourly crash probability. Despite abundant research on crash risk analysis, coordinated signalized intersections continue to require further investigation due to both the complexity of the safety problem and the relatively small number of past studies that investigated the risk factors of coordinated signalized intersections. This dissertation aimed to develop robust crash risk prediction models to better understand the risk factors of coordinated signalized intersections and to identify practical safety countermeasures. The crashes first were categorized into three types (same-direction, opposite-direction, and right-angle) within several crash-generating scenarios. The data needed were organized in hourly observations and included the following factors: road geometric features, traffic movement volumes, speeds, weather precipitation and temperature, and signal control settings. Assembling hourly observations for modeling crash risk was achieved by synchronizing and linking data sources organized at different time resolutions. Three different non-crash sampling strategies were applied to the following three statistical models (Conditional Logit, Firth Logit, and Mixed Logit) and two machine learning models (Random Forest and Penalized Support Vector Machine). Important risk factors, such as the presence of light rain, traffic volume, speed variability, and vehicle arrival pattern of downstream, were identified. The Firth Logit model was selected for implementation to signal coordination practice. This model turned out to be most robust based on its out-of-sample prediction performance and its inclusion of important risk factors. The implementation examples of the recommended crash risk model to building daily risk profiles and to estimating the safety benefits of improved coordination plans demonstrated the model’s practicality and usefulness in improving safety at coordinated signals by practicing engineers.</p>
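Since Firth's logit is singled out here (and in the preceding record), a short sketch of Firth's penalized likelihood for logistic regression may be useful: the score is adjusted by the hat values, which amounts to penalizing the likelihood by half the log-determinant of the Fisher information and keeps estimates finite under rare outcomes or separation. This is a generic textbook-style implementation under assumptions of my own, not code from the dissertation.

```python
import numpy as np

def firth_logit(X, y, n_iter=100, tol=1e-8):
    """Firth's penalized ML for logistic regression (Jeffreys-prior penalty)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-(X @ beta)))
        W = mu * (1 - mu)
        XW = X * W[:, None]
        I = X.T @ XW                               # Fisher information X'WX
        # hat values h_i of W^(1/2) X (X'WX)^-1 X' W^(1/2)
        h = np.einsum("ij,ij->i", XW @ np.linalg.inv(I), X)
        U = X.T @ (y - mu + h * (0.5 - mu))        # Firth-adjusted score
        step = np.linalg.solve(I, U)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# tiny demo with a rare outcome, where ordinary ML may be unstable
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(200), rng.standard_normal(200)])
y = rng.binomial(1, 1 / (1 + np.exp(-(-3.0 + 2.0 * X[:, 1]))))
print("Firth estimates:", firth_logit(X, y).round(2))
```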
