81

Identification of biomarkers predicting the outcome and the treatment effect in the presence of high-dimensional data

Ternes, Nils 05 October 2016 (has links)
With the recent revolution in genomics and stratified medicine, the development of molecular signatures is becoming more and more important for predicting the prognosis (prognostic biomarkers) and the treatment effect (predictive biomarkers) of each patient. However, the large quantity of available information has made false positives more and more frequent in biomedical research. The high-dimensional setting (i.e. number of biomarkers ≫ sample size) raises several statistical challenges, such as the identifiability of the models, the instability of the selected biomarkers, and the multiple testing issue. The aim of this thesis was to propose and evaluate statistical methods for identifying these biomarkers and for predicting individual survival probabilities for new patients, in the context of the Cox regression model. For variable selection in a high-dimensional setting, the lasso penalty is commonly used. In the prognostic setting, an empirical extension of the lasso penalty is proposed that is more stringent in the estimation of the tuning parameter λ, in order to select fewer false positives. In the predictive setting, the focus is on biomarker-by-treatment interactions in randomized clinical trials. Twelve approaches for selecting these interactions are evaluated, including the lasso (standard, adaptive, grouped, or ridge+lasso), boosting, dimension reduction of the main effects, and a model incorporating arm-specific biomarker effects. Finally, several strategies are studied for obtaining an individual survival prediction with a corresponding confidence interval for a future patient from a penalized regression model, while limiting potential overfitting. The performance of the approaches is evaluated through simulation studies combining null and alternative scenarios, and the methods are illustrated on several data sets containing gene expression data in breast cancer.
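To make the interaction-selection task concrete, the following Python sketch selects treatment-by-biomarker interactions with scikit-learn's LassoCV on a continuous proxy outcome; it is a hedged illustration, not the thesis's penalized Cox implementation, and all dimensions and effect sizes are assumptions.

# Sketch: lasso selection of treatment-by-biomarker interactions
# (linear proxy outcome; the thesis itself works with Cox survival models).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 300                                # sample size << number of biomarkers
X = rng.standard_normal((n, p))                # biomarker measurements
t = rng.integers(0, 2, n).astype(float)        # randomized treatment arm (0/1)

beta_main = np.zeros(p); beta_main[:5] = 1.0   # prognostic effects
beta_int = np.zeros(p); beta_int[:3] = 1.5     # predictive (interaction) effects
y = X @ beta_main + t * (X @ beta_int) + rng.standard_normal(n)

D = np.hstack([X, t[:, None], X * t[:, None]]) # main effects, treatment, interactions
fit = LassoCV(cv=5).fit(D, y)
selected = np.flatnonzero(fit.coef_[p + 1:])   # indices of selected interaction terms
print("selected interaction biomarkers:", selected)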
82

High-performance exploration and querying of selected multi-dimensional spaces in life sciences

Kratochvíl, Miroslav January 2020 (has links)
This thesis studies, implements, and experiments with application-oriented approaches for exploring and querying multi-dimensional datasets. The first part scrutinizes indexing of the complex space of chemical compounds and details the design of a high-performance retrieval system for small molecules. The resulting system is then utilized within the wider context of federated search in heterogeneous data and metadata related to chemical datasets. In the second part, the thesis focuses on fast visualization and exploration of many-dimensional data that originate from single-cell cytometry. Self-organizing maps are used to derive fast methods for analysis of the datasets and serve as the basis for a novel data visualization algorithm. Finally, a similar approach is utilized for highly interactive exploration of multimedia datasets. The main contributions of the thesis comprise advances in optimization and methods for querying chemical data implemented in the Sachem database cartridge; the federated, SPARQL-based interface to Sachem that provides the heterogeneous search support; the dimensionality-reduction algorithm EmbedSOM; the design and implementation of an EmbedSOM-backed analysis tool for flow and mass cytometry; and the design and implementation of the multimedia...
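The SOM machinery this abstract builds on can be shown compactly. Below is a minimal, self-contained self-organizing map in plain numpy; it is not the EmbedSOM or Sachem code, and the grid size, learning schedule, and data dimensions are assumptions chosen for illustration.

# Toy self-organizing map (SOM): a sketch of the mechanism, not EmbedSOM itself.
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 20))         # e.g. 20 cytometry channels
side = 8                                       # 8x8 grid of prototype vectors
codes = rng.standard_normal((side * side, 20))
grid = np.array([(i, j) for i in range(side) for j in range(side)], float)

for epoch in range(10):
    sigma = 3.0 * (1 - epoch / 10) + 0.5       # shrinking neighbourhood radius
    lr = 0.5 * (1 - epoch / 10) + 0.01         # decaying learning rate
    for x in data:
        bmu = np.argmin(((codes - x) ** 2).sum(axis=1))   # best-matching unit
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)        # distances on the grid
        h = np.exp(-d2 / (2 * sigma ** 2))                # neighbourhood weights
        codes += lr * h[:, None] * (x - codes)            # pull codes toward x

# Each cell maps to its best-matching unit's grid position, the starting
# point for a 2-D embedding in the style of EmbedSOM.
cells_bmu = np.argmin(((data[:, None, :] - codes[None]) ** 2).sum(-1), axis=1)
print(cells_bmu[:10])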
83

False Discovery Rates, Higher Criticism and Related Methods in High-Dimensional Multiple Testing

Klaus, Bernd 09 January 2013 (has links)
The technical advances in genomics, functional magnetic resonance imaging, and other areas of scientific research seen in the last two decades have led to a burst of interest in multiple testing procedures. A driving factor for innovations in the field has been the problem of large-scale simultaneous testing, where the goal is to uncover lower-dimensional signals in high-dimensional data. Mathematically speaking, this means that the dimension d is usually in the thousands while the sample size n is relatively small (at most 100 in general, often due to cost constraints), a characteristic commonly abbreviated as d >> n. In this thesis I look at several multiple testing problems and corresponding procedures from a false discovery rate (FDR) perspective, a methodology originally introduced in the seminal paper by Benjamini and Hochberg (1995). FDR analysis starts by fitting a two-component mixture model to the observed test statistics: a null model density and an alternative component density from which the interesting cases are assumed to be drawn. I propose a new approach, called log-FDR, to the estimation of false discovery rates. Specifically, truncated maximum likelihood estimation yields accurate null model estimates, complemented by constrained maximum likelihood estimation of the alternative density using log-concave density estimation. A recent competitor to the FDR is the method of "Higher Criticism", which has been strongly advocated in the context of variable selection in classification, a problem deeply linked to multiple comparisons. Hence, I also look at variable selection in class prediction, which can be viewed as a special signal identification problem. Both FDR methods and Higher Criticism can be highly useful for signal identification; this is discussed in the context of variable selection in linear discriminant analysis (LDA), a popular classification method. FDR methods are not only useful for multiple testing in the strict sense, they are also applicable to related problems. I consider several applications of FDR in linear classification, presenting and extending statistical techniques for effect size estimation using false discovery rates and showing how to use these for variable selection. The resulting fdr-effect method for effect size estimation is shown to work as well as competing approaches while being conceptually simple and computationally inexpensive. Additionally, I apply the fdr-effect method to variable selection by minimizing the misclassification rate and show that it works very well and leads to compact and interpretable feature sets.
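For readers unfamiliar with the FDR machinery this abstract builds on, the Benjamini-Hochberg (1995) step-up procedure can be stated in a few lines. The sketch below applies it to simulated z-scores; the thesis's log-FDR mixture estimation is not reproduced, and the simulation parameters are assumptions.

# Benjamini-Hochberg (1995) step-up procedure, the starting point of FDR analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d = 5000                                    # d >> n regime: many simultaneous tests
z = rng.standard_normal(d)
z[:250] += 3.0                              # 5% true signals
pvals = 2 * stats.norm.sf(np.abs(z))        # two-sided p-values

def benjamini_hochberg(p, q=0.05):
    """Return a boolean mask of rejections at FDR level q."""
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m    # step-up thresholds q*k/m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, bool)
    reject[order[:k]] = True                # reject the k smallest p-values
    return reject

rej = benjamini_hochberg(pvals)
print("rejections:", rej.sum(), "false rejections:", rej[250:].sum())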
84

New methods for large-scale unsupervised learning

Tiomoko ali, Hafiz 24 September 2018 (has links)
Spurred by recent advances in the theoretical analysis of the performance of machine learning algorithms, this thesis tackles the performance analysis and improvement of clustering for high-dimensional data and graphs. Specifically, in the first, larger part of the thesis, advanced tools from random matrix theory are used to analyze the performance of spectral methods on dense realistic graph models and on high-dimensional kernel random matrices, through the study of the eigenvalues and eigenvectors of the similarity matrices characterizing those data. New improved methods are proposed and shown through extensive simulations to outperform state-of-the-art approaches. In the second part, a new algorithm is proposed for the detection of heterogeneous communities across the layers of a multi-layer graph with several interaction types, using a variational Bayes approach to approximate the posterior distribution of the latent variables of the model. The proposed methods are applied to synthetic benchmarks as well as real-world datasets and are shown to outperform standard clustering approaches in those specific contexts.
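The spectral pipeline whose performance the first part analyzes can be sketched briefly. The following is a minimal illustration in which the Gaussian kernel, the random-walk normalization, and the simulated two-class data are assumed choices, not the thesis's specific models.

# Minimal spectral clustering sketch: leading eigenvectors of a kernel
# similarity matrix, then k-means, the kind of pipeline analyzed in the thesis.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n, p, k = 200, 100, 2
means = np.vstack([np.zeros(p), np.full(p, 0.7)])     # two close classes
labels = rng.integers(0, k, n)
X = means[labels] + rng.standard_normal((n, p))

d2 = ((X[:, None, :] - X[None]) ** 2).sum(-1)         # pairwise squared distances
K = np.exp(-d2 / (2 * p))                             # Gaussian kernel/affinity matrix
Dinv = np.diag(1 / K.sum(axis=1))
L = Dinv @ K                                          # random-walk normalization
vals, vecs = np.linalg.eig(L)
top = vecs[:, np.argsort(-vals.real)[:k]].real        # leading eigenvectors
pred = KMeans(n_clusters=k, n_init=10).fit_predict(top)
print("agreement:", max((pred == labels).mean(), (pred != labels).mean()))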
85

Statistical Inference for Change Points in High-Dimensional Offline and Online Data

Li, Lingjun 07 April 2020 (has links)
No description available.
86

Essays on Volatility and Returns Predictability

Ferreira, Iuri Honda 18 August 2022 (has links)
This thesis is composed of three papers on financial econometrics. The first two papers explore the relation between intraday equity market returns and implied volatility, represented by the CBOE Volatility Index (VIX). In both papers, we estimate one-minute-ahead forecasts using rolling windows within a day. In the first paper, the estimates indicate that our volatility factor models outperform traditional benchmarks in high-frequency time-series analysis, even when crisis periods are excluded. We also find that the models have better out-of-sample performance on days without macroeconomic announcements; interestingly, these results are amplified when we remove the crisis period. The second paper proposes a machine learning approach to the same forecasting exercise. We implement a minute-by-minute rolling-window intraday estimation method using two nonlinear models: Long Short-Term Memory (LSTM) neural networks and Random Forests (RF). Our estimations show that the VIX is the strongest candidate predictor of intraday market returns in our analysis, especially when implemented through the LSTM model, which also significantly improves the performance of the lagged market return as a predictive variable. Finally, the third paper explores a multivariate extension of the FarmPredict method, combining factor-augmented vector autoregressive (FAVAR) and sparse models in a high-dimensional environment. Using a three-stage procedure, we estimate and forecast factors and their loadings, which can be observed, unobserved, or both, as well as a weakly sparse idiosyncratic structure. We apply this methodology to a panel of daily realized volatilities, and the accuracy of the stepwise method indicates improvements over consolidated benchmarks.
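The rolling-window forecasting exercise of the first two papers can be mimicked on simulated data. The sketch below uses a Random Forest (one of the two models in the second paper) on a simulated VIX-like series; the data-generating process, window length, and hyperparameters are assumptions, and the LSTM and factor models are not reproduced.

# Sketch of a one-minute-ahead rolling-window forecast with a Random Forest,
# on simulated data: illustrative only, not the papers' actual data or models.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
T, window = 390, 60                                    # one trading day of minutes
vix = 20 + np.cumsum(0.05 * rng.standard_normal(T))    # VIX-like level
ret = -0.01 * np.diff(vix, prepend=vix[0]) + 0.05 * rng.standard_normal(T)

preds, actual = [], []
for t in range(window, T - 1):
    Xw = np.column_stack([vix[t - window:t], ret[t - window:t]])  # past features
    yw = ret[t - window + 1:t + 1]                     # next-minute return targets
    rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xw, yw)
    preds.append(rf.predict([[vix[t], ret[t]]])[0])    # one-minute-ahead forecast
    actual.append(ret[t + 1])

preds, actual = np.array(preds), np.array(actual)
oos_r2 = 1 - ((actual - preds) ** 2).sum() / ((actual - actual.mean()) ** 2).sum()
print("out-of-sample R^2:", round(float(oos_r2), 4))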
87

Geometry of high dimensional Gaussian data

Mossberg, Olof Samuel January 2024 (has links)
Collected data may simultaneously be of low sample size and high dimension. Such data exhibit geometric regularities: a single observation is close to a rotation on a sphere, and a pair of observations is close to orthogonal. This thesis investigates these geometric properties in some detail. Background is provided and various approaches to the result are discussed. An approach based on the mean value theorem is eventually chosen, being the only candidate investigated that gives explicit convergence bounds. The bounds are tested using Monte Carlo simulation and found to be adequate.
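Both regularities are easy to check numerically, in the spirit of the Monte Carlo tests mentioned in the abstract. A minimal numpy sketch, with sample sizes and dimensions chosen arbitrarily:

# Monte Carlo check of the two geometric regularities for standard Gaussian
# vectors: norms concentrate near sqrt(d) (a sphere) and independent pairs
# become nearly orthogonal as the dimension d grows.
import numpy as np

rng = np.random.default_rng(5)
for d in (10, 100, 10_000):
    X = rng.standard_normal((1000, d))
    Y = rng.standard_normal((1000, d))
    norms = np.linalg.norm(X, axis=1) / np.sqrt(d)          # should approach 1
    cosines = (X * Y).sum(axis=1) / (
        np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1))
    print(f"d={d:6d}  norm/sqrt(d): {norms.mean():.3f} +/- {norms.std():.3f}  "
          f"mean |cos angle|: {np.abs(cosines).mean():.3f}")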
88

Adaptive Mixture Estimation and Subsampling PCA

Liu, Peng January 2009 (has links)
No description available.
89

Partial EM Procedure for Big-Data Linear Mixed Effects Model, and Generalized PPE for High-Dimensional Data in Julia

Cho, Jang Ik 31 August 2018 (has links)
No description available.
90

Sparse Principal Component Analysis for High-Dimensional Data: A Comparative Study

Bonner, Ashley J. 10 1900 (has links)
Background: Through unprecedented advances in technology, high-dimensional datasets have exploded into many fields of observational research. For example, it is now common to expect thousands or millions of genetic variables (p) with only a limited number of study participants (n). Determining the important features proves statistically difficult, as multivariate analysis techniques break down and become mathematically insufficient when n < p. Principal Component Analysis (PCA) is a commonly used multivariate method for dimension reduction and data visualization, but it suffers from these issues. A collection of Sparse PCA methods has been proposed to counter these flaws but has not been tested in comparative detail. Methods: The performance of three Sparse PCA methods was evaluated through simulations. Data were generated for 56 different data structures, varying p, the number of underlying groups, and the variance structure within them. Estimation and interpretability of the principal components (PCs) were rigorously tested. The Sparse PCA methods were also applied to a real gene expression dataset. Results: All Sparse PCA methods showed improvements upon classical PCA. Some methods were best at obtaining an accurate leading PC only, whereas others were better for subsequent PCs. The optimal choice of Sparse PCA method changes as within-group correlation and across-group variances vary; thankfully, one method repeatedly worked well under the most difficult scenarios. When the methods were applied to real data, concise groups of gene expressions were detected by the most sparse methods. Conclusions: Sparse PCA methods provide a new, insightful way to detect important features amidst complex high-dimensional data. / Master of Science (MSc)
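A minimal version of such a comparison can be run with scikit-learn's SparsePCA standing in for the three methods studied in the thesis (which the abstract does not name); the simulated group structure below is an assumption for illustration.

# Sketch of the PCA vs. sparse PCA comparison on simulated grouped variables.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(6)
n, p = 50, 500                         # n < p, as in genetic studies
scores = rng.standard_normal((n, 2))   # two underlying group factors
load = np.zeros((2, p))
load[0, :20] = 1.0                     # group 1: first 20 variables
load[1, 20:40] = 1.0                   # group 2: next 20 variables
X = scores @ load + 0.5 * rng.standard_normal((n, p))

pca = PCA(n_components=2).fit(X)
spca = SparsePCA(n_components=2, alpha=2.0, random_state=0).fit(X)

# Classical loadings are dense; sparse loadings should isolate the groups.
print("nonzero loadings, PCA:      ", (np.abs(pca.components_) > 1e-8).sum(axis=1))
print("nonzero loadings, SparsePCA:", (spca.components_ != 0).sum(axis=1))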
