  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
231

Information-theoretic variable selection and network inference from microarray data

Meyer, Patrick E. 16 December 2008
Statisticians are used to modeling interactions between variables on the basis of observed data. In many emerging fields, such as bioinformatics, they are confronted with datasets having thousands of variables, a lot of noise, non-linear dependencies and only tens of samples. Detecting functional relationships when the data contain such uncertainty constitutes a major challenge. Our work focuses on variable selection and network inference from datasets having many variables and few samples (high variable-to-sample ratio), such as microarray data. Variable selection is the topic of machine learning whose objective is to select, among a set of input variables, those that lead to the best predictive model. Applying variable selection methods to gene expression data makes it possible, for example, to improve cancer diagnosis and prognosis by identifying a new molecular signature of the disease. Network inference consists in representing the dependencies between the variables of a dataset by a graph. Hence, when applied to microarray data, network inference can reverse-engineer the transcriptional regulatory network of a cell with a view to discovering new drug targets to cure diseases. In this work, two original tools are proposed: MASSIVE (Matrix of Average Sub-Subset Information for Variable Elimination), a new method of feature selection, and MRNET (Minimum Redundancy NETwork), a new algorithm of network inference. Both tools rely on the computation of mutual information, an information-theoretic measure of dependency. More precisely, MASSIVE and MRNET use approximations of the mutual information between a subset of variables and a target variable, based on combinations of mutual information terms between sub-subsets of variables and the target. These approximations make it possible to estimate a series of low-variate densities instead of one large multivariate density. Low-variate densities are well suited to datasets with a high variable-to-sample ratio, since they are rather cheap in terms of computational cost and do not require a large number of samples to be estimated accurately. Numerous experimental results show the competitiveness of these new approaches. Finally, our thesis has led to freely available source code for MASSIVE and an open-source R and Bioconductor package for network inference. / Doctorat en sciences, Spécialisation Informatique / info:eu-repo/semantics/nonPublished
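The abstract above rests on pairwise (low-variate) mutual information estimates. As an illustration only, the following minimal Python sketch computes the plug-in estimate of mutual information between two discrete variables and ranks candidate variables by their relevance to a target. It is not the MASSIVE or MRNET algorithm; the function names, the toy gene labels and the data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in (empirical) mutual information, in nats, of two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), with relative frequencies as probabilities
        mi += (c / n) * log(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_by_relevance(features, target):
    """Rank feature names by their pairwise mutual information with the target."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: one informative and one uninformative binary variable.
target = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "gene_a": [0, 0, 1, 1, 0, 0, 1, 1],  # identical to the target
    "gene_b": [0, 1, 0, 1, 0, 1, 0, 1],  # empirically independent of the target
}
ranking = rank_by_relevance(features, target)
print(ranking)
```

Each estimate involves only two variables at a time, which is why such quantities remain estimable when samples are scarce; the plug-in estimator itself is biased for small samples, so a real analysis would typically use a corrected estimator.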
232

Biosignals for driver's stress level assessment : functional variable selection and fractal characterization / Biosignaux pour l’évaluation du niveau de stress du conducteur : sélection des variables fonctionnelles et caractérisation fractale de l’activité électrodermale

El Haouij, Neska 04 July 2018
La sécurité et le confort dans une tâche de conduite automobile sont des facteurs clés qui intéressent plusieurs acteurs (constructeurs, urbanistes, départements de transport), en particulier dans le contexte actuel d’urbanisation croissante. Il devient dès lors crucial d'évaluer l'état affectif du conducteur lors de la conduite, en particulier son niveau de stress qui influe sur sa prise de décision et donc sur ses performances de conduite. Dans cette thèse, nous nous concentrons sur l'étude des changements de niveau de stress ressenti durant une expérience de conduite réelle qui alterne ville, autoroute et repos. Les méthodes classiques sont basées sur des descripteurs proposés par des experts, appliquées sur des signaux physiologiques. Ces signaux sont prétraités, les descripteurs ad-hoc sont extraits et sont fusionnés par la suite pour reconnaître le niveau de stress. Dans ce travail, nous avons adapté une méthode de sélection de variables fonctionnelles, basée sur les forêts aléatoires, avec élimination récursive des descripteurs (RF-RFE). En effet, les biosignaux, considérés comme variables fonctionnelles, sont tout d’abord projetés sur une base d’ondelettes. L’algorithme RF-RFE est ensuite utilisé pour sélectionner les groupes d’ondelettes, correspondant aux variables fonctionnelles, selon un score d’endurance. Le choix final de ces variables est basé sur ce score proposé afin de quantifier la capacité d’une variable à être sélectionnée et dans les premiers rangs. Dans une première étape, nous avons analysé la fréquence cardiaque (HR), électromyogramme (EMG), fréquence respiratoire (BR) et activité électrodermale (EDA), issus de 10 expériences de conduite menées à Boston, de la base de données du MIT, drivedb. Dans une seconde étape, nous avons conduit 13 expériences in-vivo similaires, en alternant conduite dans la ville et sur autoroute dans la région de Grand Tunis. 
La base de données résultante, AffectiveROAD contient -comme dans drivedb- les biosignaux tels que le HR, BR, EDA mais aussi la posture. Le prototype de plateforme de réseau de capteurs développé, a permis de collecter des données environnementales à l’intérieur du véhicule (température, humidité, pression, niveau sonore et GPS) qui sont également inclues dans AffectiveROAD. Une métrique subjective de stress, basée sur l’annotation d’un observateur et validée a posteriori par le conducteur au vu des enregistrements vidéo acquis lors de l’expérience de conduite, complète cette base de données. Nous définissons ici la notion de stress par ce qui résume excitation, attention, charge mentale, perception de complexité de l'environnement par le conducteur. La sélection de variables fonctionnelles dans le cas de drivedb a révélé que l'EDA mesurée au pied est l'indicateur le plus révélateur du niveau de stress du conducteur, suivi de la fréquence respiratoire. La méthode RF-RFE associée à des descripteurs non experts, conduit à des performances comparables à celles obtenues par la méthode basée sur les descripteurs sélectionnés par les experts. En analysant les données d’AffectiveROAD, la posture et l’EDA mesurée sur le poignet droit du conducteur ont émergé comme les variables les plus pertinentes. Une analyse plus approfondie de l'EDA a par la suite été menée car ce signal a été retenu, pour les deux bases de données, parmi les variables fonctionnelles sélectionnées pour la reconnaissance du niveau de stress. Ceci est cohérent avec diverses études sur la physiologie humaine qui voient l’EDA comme un indicateur clé des émotions. Nous avons ainsi exploré le caractère fractal de ce biosignal à travers une analyse d'auto-similarité et une estimation de l'exposant de Hurst basée sur les ondelettes. L'analyse montre un comportement d’auto-similarité des enregistrements de l'EDA pour les deux bases de données, sur une large gamme d’échelles. 
Ceci en fait un indicateur potentiel temps réel du stress du conducteur durant une expérience de conduite réelle. / Safety and comfort in a driving task are key factors of interest to several actors (vehicle manufacturers, urban space designers, and transportation service providers), especially in a context of increasing urbanization. It is thus crucial to assess the driver’s affective state while driving, in particular the stress level, which affects decision making and thus driving performance. In this thesis, we focus on the study of stress level changes experienced during real-world driving in city versus highway areas. Classical methods are based on features selected by experts and applied to physiological signals. These signals are preprocessed using specific tools for each signal, then ad-hoc features are extracted, and finally data fusion is performed for stress-level recognition. In this work, we adapted a functional variable selection method based on Random Forests with Recursive Feature Elimination (RF-RFE). The biosignals, considered as functional variables, are first decomposed on a wavelet basis. The RF-RFE algorithm is then used to select groups of wavelet coefficients, corresponding to the functional variables, according to an endurance score. The final choice of variables relies on this proposed score, which quantifies the ability of a variable to be selected, and to be selected among the first ranks. In a first stage, we analyzed physiological signals, namely Heart Rate (HR), Electromyogram (EMG), Breathing Rate (BR), and Electrodermal Activity (EDA), from 10 driving experiments carried out in the Boston area and extracted from MIT’s open database, drivedb. In a second stage, we designed and conducted similar city and highway driving experiments in the greater Tunis area. 
The resulting database, AffectiveROAD, includes, as in drivedb, biosignals such as HR, BR and EDA, plus an additional measurement of the driver’s posture. The prototype sensor network platform that was developed also gathered data characterizing the vehicle’s internal environment (temperature, humidity, pressure, sound level, and geographical coordinates), which are included in the AffectiveROAD database. A subjective stress metric, based on an observer’s annotation and validated afterwards by the driver from video recordings, is also included in the AffectiveROAD database. We define the term stress here as the human affective state, including affect arousal, attention, mental workload, and the driver’s perception of the complexity of the environment. The functional variable selection applied to drivedb revealed that the EDA captured on the foot, followed by the BR, is relevant to the classification of the driver’s stress level. The RF-RFE method with non-expert features offered performance comparable to that of the classical method. When analyzing the AffectiveROAD data, the posture and the EDA captured on the driver’s right wrist emerged as the most enduring variables. For both databases, the placement of the EDA sensor came out as an important consideration in stress level assessment. A deeper analysis of the EDA was carried out, since it emerged as a key indicator in stress level recognition for both databases. This is consistent with various human physiology studies reporting that the EDA is a key indicator of emotions. We therefore investigated the fractal properties of this biosignal using a self-similarity analysis of EDA measurements based on the Hurst exponent (H), estimated with a wavelet-based method. This study shows that EDA recordings exhibit self-similar behavior over a wide range of scales for both databases. This suggests that EDA can be considered a potential real-time indicator of stress in real-world driving.
233

Spectral and textural analysis of high resolution data for the automatic detection of grape vine diseases / Analyses spectrale et texturale de données haute résolution pour la détection automatique des maladies de la vigne

Al saddik, Hania 04 July 2019
La Flavescence dorée est une maladie contagieuse et incurable de la vigne détectable sur les feuilles. Le projet DAMAV (Détection Automatique des MAladies de la Vigne) a été mis en place, avec pour objectif de développer une solution de détection automatisée des maladies de la vigne à l’aide d’un micro-drone. Cet outil doit permettre la recherche des foyers potentiels de la Flavescence dorée, puis plus généralement de toute maladie détectable sur le feuillage à l’aide d’un outil multispectral dédié haute résolution. Dans le cadre de ce projet, cette thèse a pour objectif de participer à la conception et à l’implémentation du système d’acquisition multispectral et de développer les algorithmes de prétraitement d’images basés sur les caractéristiques spectrales et texturales les plus pertinentes reliées à la Flavescence dorée. Plusieurs variétés de vigne ont été considérées telles que des variétés rouges et blanches; de plus, d’autres maladies que ‘Flavescence dorée’ (FD) telles que Esca et ‘Bois noir’ (BN) ont également été testées dans des conditions de production réelles. Le travail de doctorat a été essentiellement réalisé au niveau feuille et a impliqué une étape d’acquisition suivie d’une étape d’analyse des données. La plupart des techniques d'imagerie, même multispectrales, utilisées pour détecter les maladies dans les grandes cultures ou les vignobles, opèrent dans le domaine du visible. Dans DAMAV, il est conseillé que la maladie soit détectée le plus tôt possible. Des informations spectrales sont nécessaires, notamment dans l’infrarouge. Les réflectances des feuilles des plantes peuvent être obtenues sur les longueurs d'onde les plus courtes aux plus longues. Ces réflectances sont intimement liées aux composants internes des feuilles. 
Cela signifie que la présence d'une maladie peut modifier la structure interne des feuilles et donc altérer sa signature. Un spectromètre a été utilisé sur le terrain pour caractériser les signatures spectrales des feuilles à différents stades de croissance. Afin de déterminer les réflectances optimales pour la détection des maladies (FD, Esca, BN), une nouvelle méthodologie de conception d'indices de maladies basée sur deux techniques de réduction de dimensions, associées à un classifieur, a été mise en place. La première technique de sélection de variables utilise les Algorithmes Génétiques (GA) et la seconde s'appuie sur l'Algorithme de Projections Successives (SPA). Les nouveaux indices de maladies résultants surpassent les indices de végétation traditionnels et GA était en général meilleur que SPA. Les variables finalement choisies peuvent ainsi être mises en oeuvre en tant que filtres dans le capteur MS. Les informations de réflectance étaient satisfaisantes pour la recherche d’infections (plus que 90% de précision pour la meilleure méthode) mais n’étaient pas suffisantes. Ainsi, les images acquises par l’appareil MS peuvent être ensuite traitées par des techniques bas-niveau basées sur le calcul de paramètres de texture puis injectés dans un classifieur. Plusieurs techniques de traitement de texture ont été testées mais uniquement sur des images couleur. Une nouvelle méthode combinant plusieurs paramètres texturaux a été élaborée pour en choisir les meilleurs. Nous avons constaté que les informations texturales pouvaient constituer un moyen complémentaire non seulement pour différencier les feuilles de vigne saines des feuilles infectées (plus que 85% de précision), mais également pour classer le degré d’infestation des maladies (plus que 74% de précision) et pour distinguer entre les maladies (plus que 75% de précision). Ceci conforte l’hypothèse qu’une caméra multispectrale permet la détection et l’identification de maladies de la vigne en plein champ. 
/ ‘Flavescence dorée’ is a contagious and incurable grapevine disease detectable on the leaves. The DAMAV project (Automatic Detection of Vine Diseases) aims to develop a solution for automated detection of vine diseases using a micro-drone. The goal is to offer a turnkey solution for wine growers. This tool will allow the search for potential foci of infection, and then more generally for any type of vine disease detectable on the foliage. To enable this diagnosis, the foliage is to be studied using a dedicated high-resolution multispectral camera. The objective of this PhD thesis in the context of DAMAV is to participate in the design and implementation of a Multi-Spectral (MS) image acquisition system and to develop the image pre-processing algorithms, based on the most relevant spectral and textural characteristics related to ‘Flavescence dorée’. Several grapevine varieties were considered, including red-berried and white-berried ones; furthermore, diseases other than ‘Flavescence dorée’ (FD), such as Esca and ‘Bois noir’ (BN), were also tested under real production conditions. The PhD work was mainly performed at the leaf level and involved an acquisition step followed by a data analysis step. Most imaging techniques, even MS ones, used to detect diseases in field crops or vineyards operate in the visible range of the electromagnetic spectrum. In DAMAV, the disease should be detected as early as possible. It is therefore necessary to investigate broader spectral information, in particular in the infrared. Reflectance responses of plant leaves can be obtained from short to long wavelengths. These reflectance signatures are closely linked to the internal constituents of leaves. This means that the presence of a disease can modify the internal structure of a leaf and hence alter its reflectance signature. A spectrometer is used in our study to characterize reflectance responses of leaves in the field. Several samples at different growth stages were used for the tests. 
To define optimal reflectance features for grapevine disease detection (FD, Esca, BN), a new methodology that designs spectral disease indices based on two dimension reduction techniques, coupled with a classifier, has been developed. The first feature selection technique uses Genetic Algorithms (GA) and the second relies on the Successive Projections Algorithm (SPA). The resulting spectral disease indices outperformed traditional vegetation indices, and GA generally performed better than SPA. The features finally chosen can thus be implemented as filters in the MS sensor. In general, the reflectance information was satisfactory for finding infections (more than 90% accuracy for the best method) but was not sufficient on its own. Thus, the images acquired with the developed MS device can be further processed by low-level techniques based on the calculation of texture parameters that are fed into a classifier. Several texture processing techniques were tested, but only on color images. A method that combines many texture features was developed to select the best ones. We found that the combination of optimal textural information could provide a complementary means not only of differentiating healthy from infected grapevine leaves (more than 85% accuracy), but also of grading disease severity stages (more than 73% accuracy) and of discriminating among diseases (more than 72% accuracy). This is in accordance with the hypothesis that a multispectral camera can enable the detection and identification of diseases in grapevine fields.
234

Computing Random Forests Variable Importance Measures (VIM) on Mixed Numerical and Categorical Data / Beräkning av Random Forests variable importance measures (VIM) på kategoriska och numeriska prediktorvariabler

Hjerpe, Adam January 2016
The Random Forest model is commonly used as a predictor function and has proven useful in a variety of applications. Its popularity stems from its high prediction accuracy, its ability to model high-dimensional complex data, and its applicability under predictor correlations. This report investigates the random forest variable importance measure (VIM) as a means of ranking important variables. The robustness of the VIM under added categorical noise, and its capability to differentiate informative predictors from non-informative variables, are investigated. Variable selection may improve the robustness of the predictor, improve prediction accuracy, reduce computational time, and serve as an exploratory data analysis tool. In addition, the partial dependence plot obtained from the random forest model is examined as a means of finding underlying relations in a non-linear simulation study. / Random Forest (RF) är en populär prediktormodell som visat goda resultat vid en stor uppsättning applikationsstudier. Modellen ger hög prediktionsprecision, har förmåga att modellera komplex högdimensionell data och modellen har vidare visat goda resultat vid interkorrelerade prediktorvariabler. Detta projekt undersöker ett mått, variabel importance measure (VIM) erhållna från RF modellen, för att beräkna graden av association mellan prediktorvariabler och målvariabeln. Projektet undersöker känsligheten hos VIM vid kvalitativt prediktorbrus och undersöker VIMs förmåga att differentiera prediktiva variabler från variabler som endast, med avseende på målvariabeln, beskriver brus. Att differentiera prediktiva variabler vid övervakad inlärning kan användas till att öka robustheten hos klassificerare, öka prediktionsprecisionen, reducera data dimensionalitet och VIM kan användas som ett verktyg för att utforska relationer mellan prediktorvariabler och målvariabeln.
235

A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data

Zuber, Verena 27 June 2012
In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine, which depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease, or molecular changes related to disease. From a statistical point of view, the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under investigation, e.g. genes, genetic variation, or metabolites; and we refer to observations as the specific samples whose attributes we investigate, e.g. patients and controls. Variable selection in high-dimensional omics data is a complicated challenge due to the characteristic structure of omics data. For one, omics data is high-dimensional, comprising cellular information in unprecedented detail. Moreover, there is an intricate correlation structure among the variables due to, e.g., internal cellular regulation or external, latent factors. Variable selection for uncorrelated data is well established. In contrast, there is no consensus on how to approach variable selection under correlation. Here, we introduce a multivariate framework for variable selection that explicitly accounts for the correlation among markers. In particular, we present two novel quantities for variable importance: the correlation-adjusted t (CAT) score for classification, and the correlation-adjusted (marginal) correlation (CAR) score for regression. The CAT score is defined as the Mahalanobis-decorrelated t-score vector, and the CAR score as the Mahalanobis-decorrelated correlation between the predictor variables and the outcome. We derive the CAT and CAR scores from a predictive point of view in linear discriminant analysis and regression; both quantities assess the weight of a decorrelated and standardized variable on the prediction rule. 
Furthermore, we discuss properties of both scores and relations to established quantities. Above all, the CAT score decomposes Hotelling’s T² and the CAR score decomposes the proportion of variance explained. Notably, the decomposition of total variance into explained and unexplained variance in the linear model can be rewritten in terms of CAR scores. To render our approach applicable to high-dimensional omics data, we devise an efficient algorithm for shrinkage estimates of the CAT and CAR scores. Subsequently, we conduct extensive simulation studies to investigate the performance of our novel approaches in ranking and prediction under correlation. Here, CAT and CAR scores consistently improve over marginal approaches in terms of more true positives selected and a lower model error. Finally, we illustrate the application of the CAT and CAR scores to real omics data. In particular, we analyze genomics, transcriptomics, and metabolomics data. We ascertain that the CAT and CAR scores are competitive with or outperform state-of-the-art techniques in terms of true positives detected and prediction error.
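For readers unfamiliar with Mahalanobis decorrelation, the sketch below computes CAR scores as ω = P^(-1/2) P_xy in the special case of two predictors, where the inverse square root of the 2×2 predictor correlation matrix P has a closed form. This is a toy illustration under stated assumptions (population correlations supplied directly, no shrinkage estimation as in the thesis); the function name and the numbers are hypothetical. It also checks numerically that the squared CAR scores sum to the proportion of variance explained, the decomposition mentioned in the abstract.

```python
from math import sqrt


def car_scores_two_predictors(r_x1y, r_x2y, r_x1x2):
    """CAR scores omega = P^(-1/2) P_xy for exactly two predictors.

    r_x1y, r_x2y: marginal correlations of each predictor with the response.
    r_x1x2: correlation between the two predictors.
    """
    if not -1.0 < r_x1x2 < 1.0:
        raise ValueError("predictor correlation must lie strictly between -1 and 1")
    # P = [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r with
    # eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2), so P^(-1/2) is
    # [[diag, off], [off, diag]] with the entries below.
    inv_sqrt_plus = 1.0 / sqrt(1.0 + r_x1x2)
    inv_sqrt_minus = 1.0 / sqrt(1.0 - r_x1x2)
    diag = 0.5 * (inv_sqrt_plus + inv_sqrt_minus)
    off = 0.5 * (inv_sqrt_plus - inv_sqrt_minus)
    return (diag * r_x1y + off * r_x2y, off * r_x1y + diag * r_x2y)


# Hypothetical correlations: both predictors are relevant and mutually correlated.
omega = car_scores_two_predictors(0.6, 0.5, 0.4)
sum_of_squares = sum(w * w for w in omega)

# Textbook R^2 of a two-predictor linear model, for comparison.
r_squared = (0.6**2 + 0.5**2 - 2 * 0.4 * 0.6 * 0.5) / (1 - 0.4**2)
print(omega, sum_of_squares, r_squared)
```

When the predictors are uncorrelated (r_x1x2 = 0), the CAR scores reduce to the marginal correlations, which is the sense in which they generalize marginal ranking.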
236

Variable selection in discrete survival models

Mabvuu, Coster 27 February 2020
MSc (Statistics) / Department of Statistics / Selection of variables is vital in high-dimensional statistical modelling, as it aims to identify the right subset model. However, variable selection for discrete survival analysis poses many challenges due to a complicated data structure. Survival data might have unobserved heterogeneity, leading to biased estimates when it is not taken into account. Conventional variable selection methods have stability problems. A simulation approach was used to assess and compare the performance of the Least Absolute Shrinkage and Selection Operator (Lasso) and gradient boosting on discrete survival data. Parameter-related mean squared errors (MSEs) and false positive rates suggest that Lasso performs better than gradient boosting. Frailty models outperform discrete survival models that do not account for unobserved heterogeneity. The two methods were also applied to Zimbabwe Demographic Health Survey (ZDHS) 2016 data on age at first marriage and did not select exactly the same variables. Gradient boosting retained more variables in the model. Based on Lasso, place of residence, highest educational level attained and age cohort are the major influential factors of age at first marriage in Zimbabwe. / NRF
237

Exploring relevant features associated with measles nonvaccination using a machine learning approach

Olaya Bucaro, Orlando January 2020
Measles is resurging, and large outbreaks have been observed in several parts of the world. In 2019 the Philippines suffered a major measles outbreak, partly due to low immunization rates in certain parts of the population. There is currently limited research on how to identify and reach pockets of unvaccinated individuals effectively. This thesis aims to find important factors associated with non-vaccination against measles using a machine learning approach and data from the 2017 Philippine National Demographic and Health Survey. In the analyzed sample (n = 4006), 74.84% of children aged 9 months to 3 years had received their first dose of measles vaccine, and 25.16% had not. A logistic regression with all 536 candidate features was fit with the regularized regression method Elastic Net, which is capable of automatically selecting relevant features. The final model consists of 32 predictors, related to access to and contact with healthcare, the region of residence, wealth, education, religion, ethnicity, sanitary conditions, the ideal number of children, husbands’ occupation, age and weight of the child, and features relating to pre- and postnatal care. The total accuracy of the final model is 79.02% [95% confidence interval: (76.37%, 81.5%)], with sensitivity 97.73%, specificity 23.41% and area under the receiver operating characteristic curve 0.81. The results indicate that socioeconomic differences determine measles vaccination to a degree. However, the difficulty of classifying non-vaccinated children (the low specificity) using only health and demographic characteristics suggests that factors other than those available in the analyzed data, possibly vaccine hesitancy, could have a large effect on measles non-vaccination. Based on the results, efforts should be made to ensure access to facility-based delivery for all mothers, regardless of socioeconomic status, to improve measles vaccination rates in the Philippines.
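The reported accuracy can be related to the sensitivity, specificity and class balance with a short calculation. The sketch below assumes that "vaccinated" is the positive class and that the evaluation set has the same 74.84% share of vaccinated children as the full sample; neither assumption is stated in the abstract, and the function name is hypothetical.

```python
def expected_accuracy(sensitivity, specificity, prevalence):
    """Overall accuracy implied by sensitivity, specificity and positive-class share."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


# Figures quoted in the abstract; prevalence is assumed to carry over to the test set.
prevalence_vaccinated = 0.7484
accuracy = expected_accuracy(0.9773, 0.2341, prevalence_vaccinated)

# Accuracy of a trivial rule that always predicts the majority class.
majority_baseline = max(prevalence_vaccinated, 1.0 - prevalence_vaccinated)
print(round(accuracy, 4), majority_baseline)
```

Under these assumptions the implied accuracy is about 0.79, in line with the reported 79.02% and only a few points above the 0.7484 obtained by always predicting "vaccinated". This is consistent with the abstract's remark that non-vaccinated children are hard to classify.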
238

Le lasso linéaire : une méthode pour des données de petites et grandes dimensions en régression linéaire

Watts, Yan 04 1900
Dans ce mémoire, nous nous intéressons à une façon géométrique de voir la méthode du Lasso en régression linéaire. Le Lasso est une méthode qui, de façon simultanée, estime les coefficients associés aux prédicteurs et sélectionne les prédicteurs importants pour expliquer la variable réponse. Les coefficients sont calculés à l’aide d’algorithmes computationnels. Malgré ses vertus, la méthode du Lasso est forcée de sélectionner au maximum n variables lorsque nous nous situons en grande dimension (p > n). De plus, dans un groupe de variables corrélées, le Lasso sélectionne une variable “au hasard”, sans se soucier du choix de la variable. Pour adresser ces deux problèmes, nous allons nous tourner vers le Lasso Linéaire. Le vecteur réponse est alors vu comme le point focal de l’espace et tous les autres vecteurs de variables explicatives gravitent autour du vecteur réponse. Les angles formés entre le vecteur réponse et les variables explicatives sont supposés fixes et nous serviront de base pour construire la méthode. L’information contenue dans les variables explicatives est projetée sur le vecteur réponse. La théorie sur les modèles linéaires normaux nous permet d’utiliser les moindres carrés ordinaires (MCO) pour les coefficients du Lasso Linéaire. Le Lasso Linéaire (LL) s’effectue en deux étapes. Dans un premier temps, des variables sont écartées du modèle basé sur leur corrélation avec la variable réponse; le nombre de variables écartées (ou ordonnées) lors de cette étape dépend d’un paramètre d’ajustement γ. Par la suite, un critère d’exclusion basé sur la variance de la distribution de la variable réponse est introduit pour retirer (ou ordonner) les variables restantes. Une validation croisée répétée nous guide dans le choix du modèle final. Des simulations sont présentées pour étudier l’algorithme en fonction de différentes valeurs du paramètre d’ajustement γ. 
Des comparaisons sont effectuées entre le Lasso Linéaire et des méthodes compétitrices en petites dimensions (Ridge, Lasso, SCAD, etc.). Des améliorations dans l’implémentation de la méthode sont suggérées, par exemple l’utilisation de la règle du 1se nous permettant d’obtenir des modèles plus parcimonieux. Une implémentation de l’algorithme LL est fournie dans la fonction R intitulée linlasso, disponible au https://github.com/yanwatts/linlasso. / In this thesis, we are interested in a geometric way of looking at the Lasso method in the context of linear regression. The Lasso is a method that simultaneously estimates the coefficients associated with the predictors and selects the important predictors to explain the response variable. The coefficients are calculated using computational algorithms. Despite its virtues, the Lasso method is forced to select at most n variables in high-dimensional contexts (p > n). Moreover, in a group of correlated variables, the Lasso selects a variable “at random”, without regard to which variable is chosen. To address these two problems, we turn to the Linear Lasso. The response vector is then seen as the focal point of the space, and all the explanatory variable vectors orbit around the response vector. The angles formed between the response vector and the explanatory variables are assumed to be fixed, and are used as a basis for constructing the method. The information contained in the explanatory variables is projected onto the response vector. The theory of normal linear models allows us to use ordinary least squares (OLS) for the coefficients of the Linear Lasso. The Linear Lasso (LL) is performed in two steps. First, variables are dropped from the model based on their correlation with the response variable; the number of variables dropped (or ordered) in this step depends on a tuning parameter γ. 
Then, an exclusion criterion based on the variance of the distribution of the response variable is introduced to remove (or order) the remaining variables. A repeated cross-validation guides the choice of the final model. Simulations are presented to study the algorithm for different values of the tuning parameter γ. Comparisons are made between the Linear Lasso and competing methods in small dimensions (Ridge, Lasso, SCAD, etc.). Improvements in the implementation of the method are suggested, for example the use of the 1se rule, which yields more parsimonious models. An implementation of the LL algorithm is provided in the R function linlasso, available at https://github.com/yanwatts/linlasso.
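The first step of the Linear Lasso described above screens predictors by their correlation with the response. The sketch below illustrates that idea only, assuming that γ acts as a threshold on the absolute Pearson correlation; the thesis may define γ differently, and this is not the linlasso implementation. All function names and data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Sample Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def screen_by_correlation(predictors, response, gamma):
    """Keep predictors with |correlation| >= gamma, ordered by decreasing |correlation|.

    Assumption: gamma is a plain threshold; this is an illustrative choice,
    not the rule used in the thesis.
    """
    kept = {}
    for name, column in predictors.items():
        r = pearson(column, response)
        if abs(r) >= gamma:
            kept[name] = r
    return sorted(kept.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical toy data.
response = [1.0, 2.0, 3.0, 4.0, 5.0]
predictors = {
    "x1": [2.0, 4.1, 5.9, 8.2, 9.9],    # strongly, but not perfectly, correlated
    "x2": [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated
    "x3": [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with the response
}
print(screen_by_correlation(predictors, response, gamma=0.5))
```

Because the screening uses only pairwise correlations, it is not limited to n selected variables when p > n, one of the two Lasso limitations the abstract sets out to address.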
239

Statistical Analysis of Structured High-dimensional Data

Sun, Yizhi 05 October 2018 (has links)
High-dimensional data such as multi-modal neuroimaging data and large-scale networks carry a large amount of information and can be used to test various scientific hypotheses or discover important patterns in complicated systems. While considerable effort has been devoted to analyzing high-dimensional data, existing approaches often rely on simple summaries that can miss important information, and many challenges in modeling complex structures in data remain unaddressed. In this dissertation, we focus on analyzing structured high-dimensional data, including functional data with important local regions and network data with community structures. The first part of this dissertation concerns the detection of "important" regions in functional data. We propose a novel Bayesian approach that enables region selection in the functional data regression framework. Regions are selected by encouraging sparse estimation of the regression coefficient function, where the nonzero regions correspond to the selected regions. To achieve sparse estimation, we adopt a compactly supported and potentially over-complete basis to capture local features of the regression coefficient function, and place a spike-and-slab prior on the coefficients of the basis functions. To encourage continuous shrinkage of nearby regions, we assume an Ising hyper-prior that takes into account the neighboring structure of the basis functions. This neighboring structure is represented by an undirected graph. We perform posterior sampling through Markov chain Monte Carlo algorithms. The practical performance of the proposed approach is demonstrated through simulations as well as near-infrared and sonar data. The second part of this dissertation focuses on constructing diversified portfolios using stock return data in the Center for Research in Security Prices (CRSP) database maintained by the University of Chicago. 
Diversification is a risk management strategy that involves mixing a variety of financial assets in a portfolio. This strategy helps reduce the overall risk of the investment and improve the performance of the portfolio. To construct portfolios that effectively diversify risks, we first construct a co-movement network using the correlations between stock returns over a training time period. Correlation characterizes the synchrony among stock returns and thus helps us understand whether two or more stocks share common risk attributes. Based on the co-movement network, we apply multiple network community detection algorithms to detect groups of stocks with common co-movement patterns. Stocks within the same community tend to be highly correlated, while stocks across different communities tend to be less correlated. A portfolio is then constructed by selecting stocks from different communities. The average return of the constructed portfolio over a testing time period is finally compared with the S&P 500 market index. Our constructed portfolios demonstrate outstanding performance during a non-crisis period (2004-2006) and good performance during a financial crisis period (2008-2010). / PHD / High-dimensional data, which are composed of data points with a tremendous number of features (a.k.a. attributes, independent variables, explanatory variables), bring challenges to statistical analysis due to their “high-dimensionality” and complicated structure. In this dissertation work, I consider two types of high-dimensional data. The first type is functional data, in which each observation is a function. The second type is network data, whose internal structure can be described as a network. I aim to detect “important” regions in functional data by using a novel statistical model, and I treat stock market data as network data to construct quality portfolios efficiently.
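The portfolio pipeline described in this abstract (correlations, co-movement network, communities, one pick per community) can be illustrated with a rough Python sketch. Several parts are assumptions rather than the dissertation's method: the correlation threshold of 0.5 is hypothetical, connected components of the thresholded graph stand in for the unspecified community detection algorithms, and picking the stock with the highest mean training return in each community is an invented selection rule. The return series are synthetic.

```python
import numpy as np


def comovement_communities(returns, threshold=0.5):
    """Group stocks whose training-period returns move together.

    returns: array of shape (n_periods, n_stocks).
    Two stocks are linked when their correlation is at least `threshold`;
    each connected component of that graph is treated as one community
    (a simple stand-in for a real community detection algorithm).
    """
    returns = np.asarray(returns, dtype=float)
    corr = np.corrcoef(returns, rowvar=False)  # stocks are columns
    n_stocks = corr.shape[0]
    adjacency = corr >= threshold

    unvisited = set(range(n_stocks))
    communities = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack, members = [start], [start]
        while stack:  # depth-first search over linked stocks
            node = stack.pop()
            for neighbor in range(n_stocks):
                if neighbor in unvisited and adjacency[node, neighbor]:
                    unvisited.remove(neighbor)
                    members.append(neighbor)
                    stack.append(neighbor)
        communities.append(sorted(members))
    return communities


def pick_diversified_portfolio(returns, threshold=0.5):
    """Pick one stock per community: the highest mean training return.

    The selection rule is a hypothetical choice made for illustration.
    """
    returns = np.asarray(returns, dtype=float)
    mean_returns = returns.mean(axis=0)
    groups = comovement_communities(returns, threshold)
    return [max(group, key=lambda i: mean_returns[i]) for group in groups]


# Synthetic example: stocks 0-1 follow one pattern, stocks 2-3 another.
t = np.arange(200)
f1 = np.sin(2 * np.pi * t / 20)
f2 = np.cos(2 * np.pi * t / 20)
demo_returns = np.column_stack([f1, 2 * f1 + 0.01, f2 + 0.02, 0.5 * f2])
print(comovement_communities(demo_returns))
print(pick_diversified_portfolio(demo_returns))
```

Choosing across communities rather than within them is what delivers the diversification: the selected stocks are, by construction, only weakly correlated with one another over the training period.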
240

Statistical Methods for Multivariate Functional Data Clustering, Recurrent Event Prediction, and Accelerated Degradation Data Analysis

Jin, Zhongnan 12 September 2019 (has links)
In this dissertation, we introduce three projects in machine learning and reliability applications, following the general introduction in Chapter 1. The first project concentrates on multivariate sensory data, the second project concerns bivariate recurrent processes, and the third project introduces thermal index (TI) estimation for accelerated destructive degradation test (ADDT) data, for which an R package is developed. All three projects are related to, and can be used to solve, certain reliability problems. Specifically, in Chapter 2, we introduce a clustering method for multivariate functional data. In order to cluster the customized events extracted from multivariate functional data, we apply functional principal component analysis (FPCA) and use a model-based clustering method on a transformed matrix. A penalty term is imposed on the likelihood so that variable selection is performed automatically. In Chapter 3, we propose a covariate-adjusted model to predict the next event in a bivariate recurrent event system. Inspired by geyser eruptions in Yellowstone National Park, we consider two event types and model the relationship between their event gap times. External systematic conditions are taken into account in the model through covariates. The proposed covariate-adjusted recurrent process (CARP) model is applied to the Yellowstone National Park geyser data. In Chapter 4, we compare estimation methods for TI. In ADDT, TI is an important index indicating the reliability of materials when the accelerating variable is temperature. Three methods are introduced for TI estimation: the least-squares method, the parametric model, and the semi-parametric model. An R package is implemented for all three methods. Applications of the R functions are introduced in Chapter 5 with publicly available ADDT datasets. Chapter 6 includes conclusions and areas for future work. 
/ Doctor of Philosophy / This dissertation focuses on three projects that are all related to machine learning and reliability. Specifically, in the first project, we propose a clustering method designed for events extracted from multivariate sensory data. When the customized events correspond to reliability issues, such as aging procedures, clustering results can help us learn different event characteristics by examining events belonging to the same group. Applications include driving behavior segmentation based on vehicle sensory data, where multiple sensors measure vehicle conditions simultaneously and events are defined as vehicle stoppages. In this project, we also propose to conduct sensor selection through three different penalties: individual, variable, and group. Our method can be applied to multi-dimensional sensory data clustering when optimal sensor design is also an objective. The second project introduces a covariate-adjusted model tailored to a bivariate recurrent event process system. In such systems, events can occur repeatedly, and the occurrences of each event type can affect those of the other through some form of dependence. Events in the system can be mechanical failures, which relate to reliability, and predictions of the next event time and type are usually of interest. Precise predictions of the next event time and type can help prevent serious safety and economic consequences of the upcoming event. We propose two CARP models that characterize both the marginal behaviors and the dependence structure of the bivariate system. We also incorporate external information into the model to improve its results. The proposed model is evaluated in simulation studies and applied to geyser data from Yellowstone National Park. In the third project, we comprehensively discuss three estimation methods for the thermal index: the least-squares method, the parametric model, and the semi-parametric model. 
When temperature is the accelerating variable, the thermal index indicates the temperature at which a material can hold up for a specified time. In practice, a precise estimate of the thermal index can help prolong the lifetime of a product by guiding the choice of usage temperature. The methods are evaluated in a simulation study and applied to publicly available datasets.
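To make the thermal index definition concrete, the following Python sketch shows one simple way a least-squares estimate could be computed. It is not the dissertation's R package. It assumes an Arrhenius-type relation, log10(failure time) = b0 + b1 / (temperature in kelvin), which the abstract does not state, and it assumes that a failure time has already been obtained at each test temperature. The function name, the 100,000-hour target, and the data are all hypothetical.

```python
import numpy as np

KELVIN_OFFSET = 273.15  # degrees Celsius to kelvin


def thermal_index_least_squares(temps_c, failure_times_h, target_time_h=100_000.0):
    """Estimate the temperature (Celsius) at which life equals target_time_h.

    Fits log10(failure time) = b0 + b1 / (T + 273.15) by ordinary least
    squares (an assumed Arrhenius-type relation), then solves the fitted
    line for the temperature whose predicted life is target_time_h.
    """
    temps_c = np.asarray(temps_c, dtype=float)
    failure_times_h = np.asarray(failure_times_h, dtype=float)

    x = 1.0 / (temps_c + KELVIN_OFFSET)  # reciprocal absolute temperature
    y = np.log10(failure_times_h)        # log failure time
    slope, intercept = np.polyfit(x, y, deg=1)

    # Invert log10(target) = intercept + slope / (TI + 273.15) for TI.
    return slope / (np.log10(target_time_h) - intercept) - KELVIN_OFFSET


# Made-up failure times (hours) at three accelerated test temperatures.
demo_temps_c = [250.0, 260.0, 270.0]
demo_failure_times_h = [2000.0, 1000.0, 520.0]
demo_ti = thermal_index_least_squares(demo_temps_c, demo_failure_times_h)
print(f"estimated thermal index: {demo_ti:.1f} degrees C")
```

Because the estimate extrapolates well below the tested temperatures, small changes in the fitted slope move the thermal index noticeably, which is one reason more than one estimation method is worth comparing.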
