21

New Paradigms and Optimality Guarantees in Statistical Learning and Estimation

Wang, Yu-Xiang 01 December 2017 (has links)
Machine learning (ML) has become one of the most powerful classes of tools for artificial intelligence, personalized web services, and data science problems across fields. Within the field of machine learning itself, there have been a number of paradigm shifts caused by the explosion of data size, computing power, modeling tools, and the new ways people collect, share, and make use of data sets. Data privacy, for instance, was much less of a problem before the availability of personal information online that could be used to identify users in anonymized data sets. Images, videos, and observations generated over social networks often have highly localized structures that cannot be captured by standard nonparametric models. Moreover, the “common task framework” adopted by many sub-disciplines of AI has made it possible for many people to collaboratively and repeatedly work on the same data set, leading to implicit overfitting on public benchmarks. In addition, data collected in many internet services, e.g., web search and targeted ads, are not i.i.d., but rather feedback specific to the deployed algorithm. This thesis presents technical contributions under a number of new mathematical frameworks designed to partially address these new paradigms.
• Firstly, we consider the problem of statistical learning with privacy constraints. Under Vapnik’s general learning setting and the formalism of differential privacy (DP), we establish simple conditions that characterize private learnability, revealing a mixture of positive and negative insights. We then identify generic methods that reuse existing randomness to effectively solve private learning in practice, and discuss weaker notions of privacy that allow for more favorable privacy-utility tradeoffs.
• Secondly, we develop several generalizations of trend filtering, a locally adaptive nonparametric regression technique that is minimax optimal in 1D, to the multivariate setting and to graphs. We also study specific instances of these problems, e.g., total variation denoising on d-dimensional grids, more closely; the results reveal interesting statistical-computational trade-offs.
• Thirdly, we investigate two problems in sequential interactive learning: a) off-policy evaluation in contextual bandits, which aims to use data collected from one algorithm to evaluate the performance of a different algorithm; b) adaptive data analysis, which uses randomization to prevent adversarial data analysts from a form of “p-hacking” through multiple steps of sequential data access.
In the above problems, we provide not only performance guarantees for the algorithms but also certain notions of optimality. Whenever applicable, careful empirical studies on synthetic and real data are also included.
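The off-policy evaluation problem in part a) above has a standard baseline, inverse propensity scoring (IPS), which reweights logged rewards by the ratio of target to logging action probabilities. A minimal sketch (not the thesis's estimator; the function name and toy data are illustrative):

```python
import numpy as np

def ips_estimate(rewards, actions, target_probs, logging_probs):
    """Inverse propensity scoring (IPS) estimate of a target policy's value
    from logged contextual-bandit data.

    rewards[i]       reward observed for the logged action
    actions[i]       index of the logged action (kept for alignment only)
    target_probs[i]  probability the target policy assigns to that action
    logging_probs[i] probability the logging policy assigned to it
    """
    weights = target_probs / logging_probs  # importance weights
    return np.mean(weights * rewards)

# Toy check: if target and logging policies coincide, IPS reduces to the
# plain average of observed rewards.
rewards = np.array([1.0, 0.0, 1.0, 1.0])
p = np.array([0.5, 0.5, 0.5, 0.5])
print(ips_estimate(rewards, None, p, p))  # 0.75
```

The estimator is unbiased when the logging probabilities are known and positive wherever the target policy puts mass, but its variance grows with the mismatch between the two policies.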
22

Functional Data Models for Raman Spectral Data and Degradation Analysis

Do, Quyen Ngoc 16 August 2022 (has links)
Functional data analysis (FDA) studies data in the form of measurements over a domain as whole entities. Our first focus is on post-hoc analysis with pairwise and contrast comparisons for the popular functional ANOVA model comparing groups of functional data. Existing contrast tests assume independent functional observations within groups. In reality, this assumption may not be satisfied, since functional data are often collected continually over time on the same subject. In this work, we introduce a new linear contrast test that accounts for time dependency among functional group members. When a contrast test is significant, it can be beneficial to identify the region of significant difference. In the second part, we propose a nonparametric regression procedure to obtain a locally sparse estimate of the functional contrast. Our work is motivated by a biomedical study using Raman spectroscopy to monitor hemodialysis treatment in near real time. With the contrast test and sparse estimation, practitioners can monitor the progress of hemodialysis within a session and identify important chemicals for dialysis adequacy monitoring. In the third part, we propose a functional data model for degradation analysis of functional data. Motivated by a degradation analysis of rechargeable Li-ion batteries, we combine state-of-the-art functional linear models to produce fully functional predictions for curves on heterogeneous domains. Simulation studies and data analysis demonstrate the advantage of the proposed method over existing aggregation-based methods in predicting the degradation measure. / Doctor of Philosophy / Functional data analysis (FDA) studies complex data structures in the form of curves and shapes. Our work is motivated by two applications concerning data from Raman spectroscopy and a battery degradation study.
Raman spectra of a liquid sample are curves with measurements over a domain of wavelengths that can identify the chemical composition, and whose values signify the constituent concentrations in the sample. We first propose a statistical procedure to test the significance of a functional contrast formed by spectra collected at the beginning and at later time points during a dialysis session. A follow-up procedure is then developed to produce a locally sparse representation of the functional contrast with clearly identified zero and nonzero regions. Applying this method to contrasts formed by Raman spectra of used dialysate collected at different time points during hemodialysis sessions can be adapted for evaluating treatment efficacy in real time. In a third project, we apply state-of-the-art methodologies from FDA to a degradation study of rechargeable Li-ion batteries. Our proposed methods produce fully functional predictions of voltage discharge curves, allowing flexibility in monitoring battery health.
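The contrast comparisons described above operate on a linear combination of group mean curves. A minimal sketch of computing such a functional contrast on discretized curves, assuming equal grids across groups (the test statistic and the thesis's dependence correction are omitted):

```python
import numpy as np

def functional_contrast(groups, coefs):
    """Pointwise contrast sum_k coefs[k] * (mean curve of group k).

    groups : list of (n_k, T) arrays, each row a discretized curve
    coefs  : contrast coefficients summing to zero, e.g. (1, -1)
    """
    assert abs(sum(coefs)) < 1e-12, "a contrast's coefficients must sum to zero"
    means = [g.mean(axis=0) for g in groups]   # group mean functions
    return sum(c * m for c, m in zip(coefs, means))

# Two groups of identical curves except for a constant vertical shift of 0.3;
# the (1, -1) contrast recovers that shift at every grid point.
t = np.linspace(0, 1, 50)
early = np.sin(2 * np.pi * t) + 0.3 + np.zeros((10, 50))
late = np.sin(2 * np.pi * t) + np.zeros((10, 50))
d = functional_contrast([early, late], (1, -1))
print(np.allclose(d, 0.3))
```

In practice one would pair this contrast curve with pointwise or simultaneous bands to locate regions of significant difference, which is what the locally sparse estimate above formalizes.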
23

Precision Aggregated Local Models

Edwards, Adam Michael 28 January 2021 (has links)
Large-scale Gaussian process (GP) regression is infeasible for larger data sets due to the cubic scaling of flops and quadratic storage involved in working with covariance matrices. Remedies in recent literature focus on divide-and-conquer, e.g., partitioning into sub-problems and inducing functional (and thus computational) independence. Such approximations can be speedy, accurate, and sometimes even more flexible than ordinary GPs. However, a big downside is loss of continuity at partition boundaries. Modern methods like local approximate GPs (LAGPs) imply effectively infinite partitioning and are thus pathologically good and bad in this regard. Model averaging, an alternative to divide-and-conquer, can maintain absolute continuity but often over-smooths, diminishing accuracy. Here I propose putting LAGP-like methods into a local-experts-like framework, blending partition-based speed with model-averaging continuity, as a flagship example of what I call precision aggregated local models (PALM). Using N_C LAGPs, each selecting n from N data pairs, I illustrate a scheme that is at most cubic in n, quadratic in N_C, and linear in N, drastically reducing computational and storage demands. Extensive empirical illustration shows that PALM is at least as accurate as LAGP, can be much faster, and furnishes continuous predictive surfaces. Finally, I propose a sequential updating scheme that greedily refines a PALM predictor up to a computational budget, and several variations on the basic PALM that may provide predictive improvements. / Doctor of Philosophy / Occasionally, when describing the relationship between two variables, it may be helpful to use a so-called "non-parametric" regression that is agnostic to the function that connects them.
Gaussian processes (GPs) are a popular method of non-parametric regression used for their relative flexibility and interpretability, but they have the unfortunate drawback of being computationally infeasible for large data sets. Past work on solving the scaling issues for GPs has focused on "divide and conquer" style schemes that spread the data out across multiple smaller GP models. While these models make GP methods much more accessible for large data sets, they do so at the expense of either local predictive accuracy or global surface continuity. Precision aggregated local models (PALM) is a novel divide-and-conquer method for GP models that is scalable to large data while maintaining local accuracy and a smooth global model. I demonstrate that PALM can be built quickly and performs well predictively compared to other state-of-the-art methods. This document also provides a sequential algorithm for selecting the location of each local model, and variations on the basic PALM methodology.
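The LAGP idea referenced above fits a small GP on only the training points nearest each prediction location, so each prediction costs a solve of size n rather than N. A rough numpy sketch for 1-D inputs under fixed, hand-picked hyperparameters (`lengthscale` and `noise` are illustrative, not estimated as in the actual methods):

```python
import numpy as np

def local_gp_predict(X, y, x_star, n=20, lengthscale=0.5, noise=1e-4):
    """LAGP-style prediction: fit a small GP on the n training points
    nearest to x_star and return the posterior mean there.
    """
    idx = np.argsort(np.abs(X - x_star))[:n]   # local design
    Xn, yn = X[idx], y[idx]

    def k(a, b):                               # squared-exponential kernel
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * lengthscale ** 2))

    K = k(Xn, Xn) + noise * np.eye(n)          # nugget for conditioning
    ks = k(Xn, np.array([x_star]))
    return (ks.T @ np.linalg.solve(K, yn)).item()

X = np.linspace(0, 10, 200)
y = np.sin(X)
pred = local_gp_predict(X, y, 5.0)
print(pred)  # close to sin(5.0)
```

Each such prediction is independent of the others, which is what makes LAGP-style methods fast but also what breaks continuity between neighboring predictions; PALM's aggregation is aimed at exactly that gap.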
24

Jump estimation for noisy blurred step functions / Sprungschätzung für verrauschte Beobachtungen von verschmierten Treppenfunktionen

Boysen, Leif 09 May 2006 (has links)
No description available.
25

Topics in Modern Bayesian Computation

Qamar, Shaan January 2015 (has links)
Collections of large volumes of rich and complex data have become ubiquitous in recent years, posing new challenges in methodological and theoretical statistics alike. Today, statisticians are tasked with developing flexible methods capable of adapting to the degree of complexity and noise in increasingly rich data gathered across a variety of disciplines and settings. This has spurred the need for novel multivariate regression techniques that can efficiently capture a wide range of naturally occurring predictor-response relations, identify important predictors and their interactions, and do so even when the number of predictors is large but the sample size remains limited.
Meanwhile, efficient model-fitting tools must evolve quickly to keep pace with the rapidly growing dimension and complexity of the data they are applied to. Aided by the tremendous success of modern computing, Bayesian methods have gained great popularity in recent years. These methods provide a natural probabilistic characterization of uncertainty in the parameters and in predictions. In addition, they provide a practical way of encoding model structure that can lead to large gains in statistical estimation and more interpretable results. However, this flexibility is often hindered in applications to modern data, which are increasingly high dimensional, both in the number of observations n and the number of predictors p. Here, computational complexity and the curse of dimensionality typically render posterior computation inefficient. In particular, Markov chain Monte Carlo (MCMC) methods, which remain the workhorse for Bayesian computation (owing to their generality and asymptotic accuracy guarantees), typically suffer data-processing and computational bottlenecks as a consequence of (i) the need to hold the entire dataset (or available sufficient statistics) in memory at once; and (ii) having to evaluate the (often expensive to compute) data likelihood at each sampling iteration.
This thesis divides into two parts. The first part concerns developing efficient MCMC methods for posterior computation in the high-dimensional large-n, large-p setting. In particular, we develop an efficient and widely applicable approximate inference algorithm that extends MCMC to the online data setting, and separately propose a novel stochastic search sampling scheme for variable selection in high-dimensional predictor settings. The second part of this thesis develops novel methods for structured sparsity in the high-dimensional large-p, small-n regression setting. Here, statistical methods should scale well with the predictor dimension and be able to efficiently identify low-dimensional structure so as to facilitate optimal statistical estimation in the presence of limited data. Importantly, these methods must be flexible enough to accommodate potentially complex relationships between the response and its associated explanatory variables. The first work proposes a nonparametric additive Gaussian process model to learn predictor-response relations that may be highly nonlinear and include numerous lower-order interaction effects, possibly in different parts of the predictor space. A second work proposes a novel class of Bayesian shrinkage priors for multivariate regression with a tensor-valued predictor. Dimension reduction is achieved using a low-rank additive decomposition for the latter, enabling a highly flexible and rich structure within which excellent cell estimation and region selection may be obtained through state-of-the-art shrinkage methods. In addition, the methods developed in these works come with strong theoretical guarantees. / Dissertation
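The MCMC bottleneck described above, one full-data likelihood evaluation per iteration, is visible in even the simplest sampler. A generic random-walk Metropolis sketch (illustrative only, not the online algorithm the thesis develops):

```python
import numpy as np

def metropolis(log_post, x0, n_iter=20000, step=1.0, seed=0):
    """Random-walk Metropolis, the generic MCMC workhorse the text refers to.
    Each iteration evaluates the (possibly expensive) log posterior once,
    which is the per-iteration cost motivating online/approximate variants.
    """
    rng = np.random.default_rng(seed)
    x, lp = x0, log_post(x0)
    samples = np.empty(n_iter)
    for i in range(n_iter):
        prop = x + step * rng.normal()         # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept/reject
            x, lp = prop, lp_prop
        samples[i] = x
    return samples

# Target: a standard normal posterior; sample moments should match it.
s = metropolis(lambda x: -0.5 * x * x, x0=0.0)
print(round(s.mean(), 2), round(s.var(), 2))  # mean near 0, variance near 1
```

When `log_post` touches all n observations, each of the `n_iter` iterations pays that full cost, which is exactly bottleneck (ii) in the abstract.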
26

Nonparametric statistical inference for dependent censored data

El Ghouch, Anouar 05 October 2007 (has links)
A frequent problem that appears in practical survival data analysis is censoring. A censored observation occurs when observation of the event time (duration or survival time) is prevented by the occurrence of an earlier competing event (the censoring time). Censoring may be due to different causes, for example the loss of some subjects under study, the end of the follow-up period, dropout or termination of the study, and limited sensitivity of a measurement instrument. The literature about censored data focuses largely on the i.i.d. case. However, in many real applications the data are collected sequentially in time or space, so the assumption of independence does not hold. Here we give some typical examples from the literature involving correlated data which are subject to censoring. In the clinical trials domain, it frequently happens that patients from the same hospital have correlated survival times due to unmeasured variables like the quality of the hospital equipment. Censored correlated data are also a common problem in the domain of environmental and spatial (geographical or ecological) statistics. In fact, due to the process used in the data sampling procedure, e.g. the analytical equipment, only measurements which exceed some thresholds, for example the method detection limits or the instrumental detection limits, can be included in the data analysis. Many other examples can also be found in other fields like econometrics and financial statistics. Observations on the duration of unemployment, for example, may be right censored and are typically correlated. When the data are not independent and are subject to censoring, estimation and inference become more challenging mathematical problems with a wide area of applications. In this context, we propose some new and flexible tools based on a nonparametric approach.
More precisely, allowing dependence between individuals, our main contribution to this domain concerns the following aspects. First, we are interested in developing more suitable confidence intervals for a general class of functionals of a survival distribution via the empirical likelihood method. Secondly, we study the problem of conditional mean estimation using the local linear technique. Thirdly, we develop and study a new estimator of the conditional quantile function also based on the local linear method. In this dissertation, for each proposed method, asymptotic results like consistency and asymptotic normality are derived and the finite sample performance is evaluated in a simulation study.
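The local linear technique mentioned for conditional mean estimation solves a kernel-weighted least squares problem at each evaluation point. A minimal i.i.d., uncensored sketch (the thesis's contribution is precisely the extension beyond this simple setting):

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear estimate of E[Y | X = x0]: weighted least squares of Y
    on (1, X - x0) with Gaussian kernel weights; the intercept is the fit.
    """
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)           # kernel weights
    Z = np.column_stack([np.ones_like(X), X - x0])   # local design matrix
    beta = np.linalg.solve((Z.T * w) @ Z, (Z.T * w) @ Y)
    return beta[0]

# Local linear regression reproduces straight lines exactly, whatever h:
X = np.linspace(0, 1, 101)
Y = 2.0 + 3.0 * X                        # noiseless linear trend
print(round(local_linear(0.5, X, Y, h=0.2), 6))  # 3.5 = 2 + 3*0.5
```

The same weighted design, with a different response or check-function loss, underlies the conditional quantile estimator in the third contribution.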
28

Econometric studies on flexible modeling of developing countries in growth analysis / Ökonometrische Studien über Wachstumsanalysen von Entwicklungsländern

Köhler, Max 02 May 2012 (has links)
No description available.
29

比較使用Kernel和Spline法的傘型迴歸估計 / Compare the Estimation on Umbrella Function by Using Kernel and Spline Regression Method

賴品霖, Lai, Pin Lin Unknown Date (has links)
This study examines two commonly used nonparametric regression methods, kernel regression and spline regression, under an umbrella-order constraint, comparing the constrained and unconstrained estimates of an umbrella function. We also examine how different error variances affect the estimates, and further compare the two constrained methods against each other. Two performance measures are used: the difference between the estimated and true peak locations, and the sum of squared errors. Bandwidths for the kernel estimator and the number of knots for the spline estimator are selected by leave-one-out cross-validation, and several R packages are used in the simulations. The simulation results show that the constrained kernel estimator locates the peak better when the error variance is large, but worse when the error variance shrinks; the constrained B-spline estimator behaves similarly. Comparing the two methods, for small error variance the kernel estimator locates the peak less accurately than the spline estimator, yet is not at much of a disadvantage in overall sum of squared errors; for large error variance, the kernel estimator's peak location estimates improve while its overall sum of squared errors remains competitive.
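The leave-one-out cross-validation and peak-location workflow described above can be sketched for the unconstrained kernel (Nadaraya-Watson) estimator; the umbrella constraint itself is omitted, and the bandwidth grid and toy data are illustrative:

```python
import numpy as np

def nw(x0, X, Y, h):
    """Nadaraya-Watson kernel regression estimate at x0."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

def loocv_bandwidth(X, Y, grid):
    """Pick h minimizing the leave-one-out squared prediction error."""
    def cv(h):
        keep = ~np.eye(len(X), dtype=bool)   # row i drops observation i
        return np.mean([(Y[i] - nw(X[i], X[keep[i]], Y[keep[i]], h)) ** 2
                        for i in range(len(X))])
    return min(grid, key=cv)

# Umbrella-shaped signal with its vertex (peak) at x = 0.5.
rng = np.random.default_rng(1)
X = np.linspace(0, 1, 80)
Y = -(X - 0.5) ** 2 + rng.normal(0, 0.01, 80)
h = loocv_bandwidth(X, Y, [0.02, 0.05, 0.1, 0.2])
grid = np.linspace(0, 1, 201)
peak = grid[np.argmax([nw(g, X, Y, h) for g in grid])]
print(h, peak)  # estimated vertex should land near 0.5
```

The study's first performance measure is then simply `abs(peak - 0.5)`; imposing the umbrella (unimodality) constraint on the fitted curve is the extra step the thesis investigates.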
30

Développement de modèles non paramétriques et robustes : application à l’analyse du comportement de bivalves et à l’analyse de liaison génétique / Development of nonparametric and robust models: application to the analysis of bivalve behavior and to genetic linkage analysis

Sow, Mohamedou 20 May 2011 (has links)
Le développement des approches robustes et non paramétriques pour l’analyse et le traitement statistique de gros volumes de données présentant une forte variabilité, comme dans les domaines de l’environnement et de la génétique, est fondamental. Nous modélisons ici des données complexes de biologie appliquées à l’étude du comportement de bivalves et à l’analyse de liaison génétique. L’application des mathématiques à l’analyse du comportement de mollusques bivalves nous a permis d’aller vers une quantification et une traduction mathématique de comportements d’animaux in situ, en milieu proche ou lointain. Nous avons proposé un modèle de régression non paramétrique et comparé 3 estimateurs non paramétriques, récursifs ou non, de la fonction de régression pour optimiser le meilleur estimateur. Nous avons ensuite caractérisé des rythmes biologiques, formalisé l’évolution d’états d’ouvertures, proposé des méthodes de discrimination de comportements, utilisé la méthode des shot-noises pour caractériser différents états d’ouverture-fermetures transitoires et développé une méthode originale de mesure de croissance en ligne. En génétique, nous avons abordé un cadre plus général de statistiques robustes pour l’analyse de liaison génétique. Nous avons développé des estimateurs robustes aux hypothèses de normalité et à la présence de valeurs aberrantes ; nous avons aussi utilisé une approche statistique où nous avons abordé la dépendance entre variables aléatoires via la théorie des copules. Nos principaux résultats ont montré l’intérêt pratique de ces estimateurs sur des données réelles de QTL et eQTL. / The development of robust and nonparametric approaches for the analysis and statistical treatment of high-dimensional data sets exhibiting high variability, as seen in the environmental and genetic fields, is instrumental. Here, we model complex biological data with application to the analysis of bivalves’ behavior and to genetic linkage analysis.
The application of mathematics to the analysis of mollusk bivalves’ behavior allowed us to quantify and translate mathematically the animals’ behavior in situ, in close or far field. We proposed a nonparametric regression model and compared three nonparametric estimators (recursive or not) of the regression function to optimize the best estimator. We then characterized the biological rhythms, formalized the states of opening, proposed methods able to discriminate behaviors, used shot-noise analysis to characterize various opening/closing transitory states, and developed an original approach for measuring growth online. In genetics, we proposed a more general framework of robust statistics for linkage analysis. We developed estimators robust to distributional assumptions and to the presence of outlier observations. We also used a statistical approach where the dependence between random variables is specified through copula theory. Our main results showed the practical interest of these estimators on real QTL and eQTL data.
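The copula-based treatment of dependence mentioned above separates the margins of a joint distribution from its dependence structure. Rank-based measures such as Spearman's correlation depend only on the copula, which is one source of the robustness to outliers and to normality assumptions discussed here. A small illustration on synthetic data:

```python
import numpy as np

def rank_correlation(x, y):
    """Spearman rank correlation: a copula-based dependence measure,
    invariant to monotone transforms of the margins and robust to outliers.
    (Ties are ignored here for brevity.)
    """
    rx = np.argsort(np.argsort(x))   # ranks of x
    ry = np.argsort(np.argsort(y))   # ranks of y
    return np.corrcoef(rx, ry)[0, 1]

# A monotone distortion plus one gross outlier barely move the rank
# correlation, while the ordinary Pearson correlation collapses.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = x + 0.1 * rng.normal(size=200)
y_distorted = np.exp(y)              # monotone transform of the margin
y_distorted[0] = 1e6                 # one gross outlier
print(round(rank_correlation(x, y_distorted), 3))  # stays high
print(round(np.corrcoef(x, y_distorted)[0, 1], 3))  # degrades badly
```

This margins-versus-copula split is what allows dependence between random variables to be modeled without committing to normal (or any particular) marginal distributions.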
