Global ETD Search

41	Stochastic Approximation Algorithms with Set-valued Dynamics: Theory and Applications Ramaswamy, Arunselvan January 2016 Stochastic approximation algorithms encompass a class of iterative schemes that converge to a sought value through a series of successive approximations. Such algorithms converge even when the observations are erroneous. Errors in observations may arise due to the stochastic nature of the problem at hand or due to extraneous noise. In other words, stochastic approximation algorithms are self-correcting schemes, in that the errors are wiped out in the limit and the algorithms still converge to the sought values. The first stochastic approximation algorithm was developed by Robbins and Monro in 1951 to solve the root-finding problem. In 1977 Ljung showed that the asymptotic behavior of a stochastic approximation algorithm can be studied by associating with it a deterministic ODE, called the associated ODE, and studying its asymptotic behavior instead. This is commonly referred to as the ODE method. In 1996 Benaïm, and Benaïm and Hirsch, [1] [2] used the dynamical systems approach to develop a framework for analyzing generalized stochastic approximation algorithms, given by the following recursion: $x_{n+1} = x_n + a(n)\,[h(x_n) + M_{n+1}]$, (1) where $x_n \in \mathbb{R}^d$ for all $n$; $h : \mathbb{R}^d \to \mathbb{R}^d$ is Lipschitz continuous; $\{a(n)\}_{n \ge 0}$ is the given step-size sequence; and $\{M_{n+1}\}_{n \ge 0}$ is the martingale difference noise. The assumptions of [1] later became the 'standard assumptions for convergence'. One bottleneck in deploying this framework is the requirement on stability (almost sure boundedness) of the iterates. In 1999 Borkar and Meyn developed a unified set of assumptions that guaranteed both stability and convergence of stochastic approximations. However, the aforementioned frameworks did not account for scenarios with set-valued mean fields.
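As a minimal sketch of recursion (1), the following runs a scalar stochastic approximation with step sizes a(n) = 1/(n+1) and i.i.d. Gaussian noise (a simple instance of martingale difference noise). The mean field h(x) = 2 - x, the noise level, and the function name `robbins_monro` are illustrative assumptions, not taken from the thesis. The associated ODE x' = 2 - x has the globally asymptotically stable equilibrium x* = 2, which is where the iterates should settle.

```python
import random


def robbins_monro(h, x0, noise_std=1.0, n_steps=20000, seed=0):
    """Iterate x_{n+1} = x_n + a(n) * (h(x_n) + M_{n+1}) with a(n) = 1/(n+1)."""
    rng = random.Random(seed)
    x = x0
    for n in range(n_steps):
        a_n = 1.0 / (n + 1)                # sum a(n) = inf, sum a(n)^2 < inf
        noise = rng.gauss(0.0, noise_std)  # zero-mean i.i.d. noise (assumed)
        x = x + a_n * (h(x) + noise)
    return x


if __name__ == "__main__":
    # Hypothetical mean field whose unique root is x* = 2.
    print(robbins_monro(lambda x: 2.0 - x, x0=10.0))
```

With this particular step size the iterate is a running average of the noisy observations, which makes the self-correcting behavior easy to see.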
In 2005 Benaïm, Hofbauer and Sorin [3] showed that the dynamical systems approach to stochastic approximations can be extended to scenarios with set-valued mean fields. Again, stability of the iterates was assumed. Note that stochastic approximation algorithms with set-valued mean fields are also called stochastic recursive inclusions (SRIs).
The Borkar-Meyn theorem for SRIs [10]
As stated earlier, in many applications stability of the iterates is a hard assumption to verify. In Chapter 2 of the thesis, we present an extension of the original theorem of Borkar and Meyn to include SRIs. Specifically, we present two different (yet related) easily verifiable sets of assumptions for both stability and convergence of SRIs. An SRI is given by the following recursion in $\mathbb{R}^d$: $x_{n+1} = x_n + a(n)\,[y_n + M_{n+1}]$, (2) where $y_n \in H(x_n)$ for all $n$ and $H : \mathbb{R}^d \to \{\text{subsets of } \mathbb{R}^d\}$ is a given Marchaud map. As a corollary to one of our main results, a natural generalization of the original Borkar and Meyn theorem is seen to follow. We also present two applications of our framework. First, we use our framework to provide a solution to the 'approximate drift problem'. This problem can be stated as follows. When an experimenter runs a traditional stochastic approximation algorithm such as (1), the exact value of the drift $h$ cannot be accurately calculated at every stage. In other words, the recursion run by the experimenter is given by (2), where $y_n$ is an approximation of $h(x_n)$ at stage $n$. A natural question arises: do the errors due to approximations accumulate and wreak havoc with the long-term behavior (convergence) of the algorithm? Using our framework, we show the following: if a stochastic approximation algorithm without errors can be guaranteed to be stable, then its 'approximate version' with errors is also stable, provided the errors are bounded at every stage.
For the second application, we use our framework to relax the stability assumptions involved in the original Borkar-Meyn theorem, hence making the framework more widely applicable. It may be noted that the contents of Chapter 2 are based on [10].
Analysis of gradient descent methods with non-diminishing, bounded errors [9]
Let us consider a continuously differentiable function $f$. If we are interested in finding a minimizer of $f$, then a gradient descent (GD) scheme may be employed to find a local minimum. Such a scheme is given by the following recursion in $\mathbb{R}^d$: $x_{n+1} = x_n - a(n)\,\nabla f(x_n)$. (3) GD is an important implementation tool for many machine learning algorithms, such as the backpropagation algorithm to train neural networks. For the sake of convenience, experimenters often employ gradient estimators such as the Kiefer-Wolfowitz estimator, simultaneous perturbation stochastic approximation, etc. These estimators provide an estimate of the gradient $\nabla f(x_n)$ at stage $n$. Since these estimators only provide an approximation of the true gradient, the experimenter is essentially running the recursion given by (2), where $y_n$ is a 'gradient estimate' at stage $n$. Such gradient methods with errors have been previously studied by Bertsekas and Tsitsiklis [5]. However, the assumptions involved are rather restrictive and hard to verify. In particular, the gradient errors are required to vanish asymptotically at a prescribed rate. This may not hold true in many scenarios. In Chapter 3 of the thesis, the results of [5] are extended to GD with bounded, non-diminishing errors, given by the following recursion in $\mathbb{R}^d$: $x_{n+1} = x_n - a(n)\,[\nabla f(x_n) + \epsilon(n)]$, (4) where $\|\epsilon(n)\| \le \epsilon$ for some fixed $\epsilon > 0$. As stated earlier, previous literature required $\|\epsilon(n)\| \to 0$ as $n \to \infty$ at a 'prescribed rate'. Sufficient conditions are presented for both stability and convergence of (4).
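Gradient descent with bounded, non-diminishing errors can be illustrated with a short sketch. The objective f(x) = (x - 3)^2 / 2, the error sequence eps * cos(n), and the step sizes a(n) = 1/(n+10) are illustrative assumptions chosen so that the error never vanishes but stays within the bound; they are not the thesis's experiments. For this quadratic, the distance to the minimizer beyond the error bound contracts by a factor (1 - a(n)) at each step, so the iterate ends up in a small neighborhood of the minimum whose size is governed by the error bound.

```python
import math


def gd_with_bounded_errors(grad, x0, eps, n_steps=5000):
    """Iterate x_{n+1} = x_n - a(n) * (grad(x_n) + err_n) with |err_n| <= eps."""
    x = x0
    for n in range(n_steps):
        a_n = 1.0 / (n + 10)        # diminishing step sizes (assumed choice)
        err = eps * math.cos(n)     # bounded error that does not decay to zero
        x = x - a_n * (grad(x) + err)
    return x


if __name__ == "__main__":
    # Hypothetical objective f(x) = (x - 3)^2 / 2 with gradient x - 3.
    print(gd_with_bounded_errors(lambda x: x - 3.0, x0=10.0, eps=0.1))
```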
The conditions presented in Chapter 3 thus ensure that the errors 'do not accumulate' and wreak havoc with the stability or convergence of GD. Further, we show that (4) converges to a small neighborhood of the minimum set, whose size in turn depends on the error bound $\epsilon$. To the best of our knowledge this is the first time that GD with bounded non-diminishing errors has been analyzed. As an application, we use our framework to present a simplified implementation of simultaneous perturbation stochastic approximation (SPSA), a popular gradient descent method introduced by Spall [13]. Traditional convergence analysis of SPSA involves assumptions that 'couple' the 'sensitivity parameters' of SPSA and the step sizes. These assumptions restrict the choice of step sizes available to the experimenter. In the context of machine learning, the learning rate may be adversely affected. We present an implementation of SPSA using 'constant sensitivity parameters', thereby 'decoupling' the step sizes and sensitivity parameters. Further, we show that SPSA with constant sensitivity parameters can be analyzed using our framework. Finally, we present experimental results to support our theory. It may be noted that the contents of Chapter 3 are based on [9].
Stochastic recursive inclusions with two timescales [12]
There are many scenarios wherein the traditional single timescale framework cannot be used to analyze the algorithm at hand. Consider, for example, the adaptive heuristic critic approach to reinforcement learning, which requires a stationary value iteration (for a fixed policy) to be executed between two policy iterations. To analyze such schemes Borkar [6] introduced the two timescale framework, along with a set of sufficient conditions that guarantee their convergence. Perkins and Leslie [8] extended the framework of Borkar to include set-valued mean fields. However, the assumptions involved were still very restrictive and not easily verifiable.
In Chapter 4 of the thesis, we present a generalization of the aforementioned frameworks. The framework presented is more general than those of [6] and [8], and the assumptions involved are easily verifiable. An SRI with two timescales is given by the following coupled iteration: $x_{n+1} = x_n + a(n)\,[u_n + M^1_{n+1}]$, (5) $y_{n+1} = y_n + b(n)\,[v_n + M^2_{n+1}]$, (6) where $x_n \in \mathbb{R}^d$ and $y_n \in \mathbb{R}^k$ for all $n \ge 0$; $u_n \in h(x_n, y_n)$ and $v_n \in g(x_n, y_n)$ for all $n \ge 0$, where $h : \mathbb{R}^d \times \mathbb{R}^k \to \{\text{subsets of } \mathbb{R}^d\}$ and $g : \mathbb{R}^d \times \mathbb{R}^k \to \{\text{subsets of } \mathbb{R}^k\}$ are two given Marchaud maps; $\{a(n)\}_{n \ge 0}$ and $\{b(n)\}_{n \ge 0}$ are the step-size sequences satisfying $b(n)/a(n) \to 0$ as $n \to \infty$; and $\{M^1_{n+1}\}_{n \ge 0}$ and $\{M^2_{n+1}\}_{n \ge 0}$ constitute the martingale noise terms. Our main contribution is in the weakening of the key assumption that 'couples' the behavior of the $x$ and $y$ iterates. As an application of our framework we analyze the two timescale algorithm which solves the 'constrained Lagrangian dual optimization problem'. The problem can be stated as follows: given two functions $f : \mathbb{R}^d \to \mathbb{R}$ and $g : \mathbb{R}^d \to \mathbb{R}^k$, we want to minimize $f(x)$ subject to the condition that $g(x) \le 0$. This problem can be stated in the following primal form: $\inf_{x \in \mathbb{R}^d} \sup_{\lambda \in \mathbb{R}^k_{\ge 0}} \big[ f(x) + \lambda^T g(x) \big]$. (7) Under strong duality, solving the above problem is equivalent to solving its dual: $\sup_{\lambda \in \mathbb{R}^k_{\ge 0}} \inf_{x \in \mathbb{R}^d} \big[ f(x) + \lambda^T g(x) \big]$. (8) The corresponding two timescale algorithm to solve the dual is given by: $x_{n+1} = x_n - a(n)\,\big[ \nabla_x \big( f(x_n) + \lambda_n^T g(x_n) \big) + M^2_{n+1} \big]$, $\lambda_{n+1} = \lambda_n + b(n)\,\big[ \nabla_\lambda \big( f(x_n) + \lambda_n^T g(x_n) \big) + M^1_{n+1} \big]$. (9) We use our framework to show that (9) converges to a solution of the dual given by (8). Further, as a consequence of our framework, the class of objective and constraint functions for which (9) can be analyzed is greatly enlarged. It may be noted that the contents of Chapter 4 are based on [12].
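The two timescale primal-dual scheme can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the scalar problem (minimize x^2 subject to 1 - x <= 0), the step sizes a(n) = (n+1)^(-0.6) and b(n) = (n+1)^(-0.9) (so that b(n)/a(n) tends to 0), the Gaussian noise, and the projection of the multiplier onto the non-negative reals, which is added here to keep the multiplier feasible and is not stated in the abstract. For this toy problem the constrained minimizer is x* = 1 with multiplier 2.

```python
import random


def primal_dual_two_timescale(n_steps=50000, noise_std=0.1, seed=0):
    """Fast primal descent and slow dual ascent on L(x, lam) = x^2 + lam * (1 - x)."""
    rng = random.Random(seed)
    x, lam = 0.0, 0.0
    for n in range(n_steps):
        a_n = (n + 1) ** -0.6      # fast timescale (primal variable)
        b_n = (n + 1) ** -0.9      # slow timescale (multiplier); b_n / a_n -> 0
        grad_x = 2.0 * x - lam     # derivative of L with respect to x
        grad_lam = 1.0 - x         # derivative of L with respect to lam, i.e. g(x)
        x_next = x - a_n * (grad_x + rng.gauss(0.0, noise_std))
        # Projection onto lam >= 0 is an assumption of this sketch.
        lam_next = max(0.0, lam + b_n * (grad_lam + rng.gauss(0.0, noise_std)))
        x, lam = x_next, lam_next
    return x, lam


if __name__ == "__main__":
    print(primal_dual_two_timescale())
```

Because the multiplier moves on the slower timescale, the primal iterate sees it as almost constant and tracks the minimizer of the Lagrangian for the current multiplier, while the multiplier slowly climbs toward the dual optimum.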
Stochastic approximation driven by 'controlled Markov' process and temporal difference learning [11]
In the field of reinforcement learning, one encounters stochastic approximation algorithms that are driven by Markov processes. The groundwork for analyzing the long-term behavior of such algorithms was laid by Benveniste et al. [4]. Borkar [7] extended the results of [4] to include algorithms driven by 'controlled Markov' processes, i.e., algorithms where the 'state process' is in turn driven by a time-varying 'control' process. Another important extension was that multiple stationary distributions were allowed; see [7] for details. The convergence analysis of [7] assumed that the iterates were stable. In reinforcement learning applications, stability is a hard assumption to verify. Hence, the stability assumption poses a bottleneck when deploying the aforementioned framework for the analysis of reinforcement learning algorithms. In Chapter 5 of the thesis we present sufficient conditions for both stability and convergence of stochastic approximations driven by 'controlled Markov' processes. As an application of our framework, sufficient conditions for stability of the temporal difference (TD) learning algorithm, an important policy-evaluation method, are presented that are compatible with existing conditions for convergence. The conditions are weakened two-fold in that (a) the Markov process is no longer required to evolve in a finite state space and (b) the state process is not required to be ergodic under a given stationary policy. It may be noted that the contents of Chapter 5 are based on [11]. Stochastic Approximation Algorithms Set-Valued Dynamical Systems Stochastic Recursive Inclusions Stability Theorem Controlled Markov Process Borkar-Meyn Theorem Stochastic Approximations Computer Science
42	Distributed Inference using Bounded Transmissions January 2013 abstract: Distributed inference has applications in a wide range of fields such as source localization, target detection, environment monitoring, and healthcare. In this dissertation, distributed inference schemes which use bounded transmit power are considered. The performance of the proposed schemes is studied for a variety of inference problems. In the first part of the dissertation, a distributed detection scheme where the sensors transmit with constant modulus signals over a Gaussian multiple access channel is considered. The deflection coefficient of the proposed scheme is shown to depend on the characteristic function of the sensing noise, and the error exponent for the system is derived using large deviation theory. Optimization of the deflection coefficient and error exponent is considered with respect to a transmission phase parameter for a variety of sensing noise distributions, including impulsive ones. The proposed scheme is also favorably compared with existing amplify-and-forward (AF) and detect-and-forward (DF) schemes. The effect of fading is shown to be detrimental to the detection performance, and simulations are provided to corroborate the analytical results. The second part of the dissertation studies a distributed inference scheme which uses bounded transmission functions over a Gaussian multiple access channel. The conditions on the transmission functions under which consistent estimation and reliable detection are possible are characterized. For the distributed estimation problem, an estimation scheme that uses bounded transmission functions is proved to be strongly consistent provided that the variance of the noise samples is bounded and that the transmission function is one-to-one. The proposed estimation scheme is compared with the amplify-and-forward technique and its robustness to impulsive sensing noise distributions is highlighted.
It is also shown that bounded transmissions suffer from inconsistent estimates if the sensing noise variance goes to infinity. For the distributed detection problem, similar results are obtained by studying the deflection coefficient. Simulations corroborate our analytical results. In the third part of this dissertation, the problem of estimating the average of samples distributed at the nodes of a sensor network is considered. A distributed average consensus algorithm in which every sensor transmits with bounded peak power is proposed. In the presence of communication noise, it is shown that the nodes reach consensus asymptotically to a finite random variable whose expectation is the desired sample average of the initial observations with a variance that depends on the step size of the algorithm and the variance of the communication noise. The asymptotic performance is characterized by deriving the asymptotic covariance matrix using results from stochastic approximation theory. It is shown that using bounded transmissions results in slower convergence compared to the linear consensus algorithm based on the Laplacian heuristic. Simulations corroborate our analytical findings. Finally, a robust distributed average consensus algorithm in which every sensor performs a nonlinear processing at the receiver is proposed. It is shown that non-linearity at the receiver nodes makes the algorithm robust to a wide range of channel noise distributions including the impulsive ones. It is shown that the nodes reach consensus asymptotically and similar results are obtained as in the case of transmit non-linearity. Simulations corroborate our analytical findings and highlight the robustness of the proposed algorithm. / Dissertation/Thesis / Ph.D. 
Electrical Engineering 2013 Electrical engineering Distributed Detection Multiple Access Channel Constant Modulus Deflection Coefficient Error Exponent Distributed Consensus Sensor Networks Bounded Transmissions Asymptotic Covariance Stochastic Approximation Markov Processes.
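For entry 42, the idea of average consensus with bounded transmissions can be sketched as follows. This is an illustration under stated assumptions rather than the dissertation's exact scheme: the sensors sit on a ring, each one transmits tanh of its state (a bounded, odd, increasing function chosen here for convenience), the step size is 1/(n+2), and the function name `bounded_consensus` is hypothetical. On an undirected graph the noise-free update preserves the sum of the states, so the states agree on the initial average; with communication noise switched on, the common limit becomes random.

```python
import math
import random


def bounded_consensus(x0, n_steps=5000, noise_std=0.0, seed=0):
    """Ring-network consensus where node i only receives tanh(x_j) from neighbours."""
    rng = random.Random(seed)
    x = list(x0)
    m = len(x)
    for n in range(n_steps):
        alpha = 1.0 / (n + 2)                  # decaying step size (assumed choice)
        tx = [math.tanh(v) for v in x]         # bounded transmitted signals
        new_x = []
        for i in range(m):
            left, right = tx[(i - 1) % m], tx[(i + 1) % m]
            drift = (left - tx[i]) + (right - tx[i])
            noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
            new_x.append(x[i] + alpha * (drift + noise))
        x = new_x
    return x


if __name__ == "__main__":
    # Hypothetical initial observations with sample average 0.2.
    print(bounded_consensus([1.0, -0.5, 0.3, 0.8, -0.6]))
```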
43	Quantile regression for mixed-effects models = Regressão quantílica para modelos de efeitos mistos Galarza Morales, Christian Eduardo, 1988- 27 August 2018 Advisor: Víctor Hugo Lachos Dávila / Master's dissertation - Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica / Previous issue date: 2015 / Summary: Longitudinal data are frequently analyzed using normal mixed-effects models. Moreover, traditional estimation methods are based on regression on the mean of the assumed distribution, which leads to non-robust parameter estimation when the error distribution is not normal. Compared with the conventional mean regression approach, quantile regression (QR) can characterize the entire conditional distribution of the response variable and is more robust in the presence of outliers and misspecification of the error distribution. This thesis develops a likelihood-based approach for analyzing QR models for correlated continuous longitudinal data via the asymmetric Laplace distribution (ALD). Exploiting the convenient hierarchical representation of the ALD, our classical approach follows the stochastic approximation of the EM algorithm (SAEM) to derive exact maximum likelihood (ML) estimates of the fixed effects and variance components in linear and nonlinear mixed-effects models. We evaluate the finite-sample performance of the algorithm and the asymptotic properties of the ML estimates through empirical experiments and applications to four real datasets.
The proposed SAEM algorithms are implemented in the R packages qrLMM() and qrNLMM(), respectively. / Abstract: Longitudinal data are frequently analyzed using normal mixed-effects models. Moreover, the traditional estimation methods are based on mean regression, which leads to non-robust parameter estimation for non-normal error distributions. Compared to the conventional mean regression approach, quantile regression (QR) can characterize the entire conditional distribution of the outcome variable and is more robust to the presence of outliers and misspecification of the error distribution. This thesis develops a likelihood-based approach to analyzing QR models for correlated continuous longitudinal data via the asymmetric Laplace distribution (ALD). Exploiting the convenient hierarchical representation of the ALD, our classical approach follows the stochastic approximation of the EM (SAEM) algorithm for deriving exact maximum likelihood (ML) estimates of the fixed effects and variance components in linear and nonlinear mixed-effects models. We evaluate the finite-sample performance of the algorithm and the asymptotic properties of the ML estimates through empirical experiments and applications to four real-life datasets. The proposed SAEM algorithms are implemented in the R packages qrLMM() and qrNLMM(), respectively. / Master's / Statistics / Master in Statistics Modelos lineares (Estatistica) Modelos não lineares (Estatística) Estimativa de parâmetro Análise de regressão Aproximação estocástica Linear models (Statistics) Nonlinear models (Statistics) Parameter estimation Regression analysis Stochastic approximation
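For entry 43, the connection between quantile regression and the asymmetric Laplace distribution rests on the check loss rho_tau(u) = u * (tau - 1{u < 0}): the ALD density with location mu is proportional to exp(-rho_tau(y - mu)) for unit scale, so maximizing that likelihood in mu is the same as minimizing the summed check loss. The sketch below illustrates only this fixed-effect-free special case; the function names are hypothetical and it does not reproduce the qrLMM or qrNLMM packages.

```python
def check_loss(u, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u < 0}) used in quantile regression."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def quantile_by_check_loss(ys, tau):
    """Return the observation that minimises the summed check loss over the sample."""
    return min(ys, key=lambda q: sum(check_loss(y - q, tau) for y in ys))


if __name__ == "__main__":
    sample = list(range(1, 12))               # hypothetical data 1, 2, ..., 11
    print(quantile_by_check_loss(sample, 0.5))
    print(quantile_by_check_loss(sample, 0.9))
```

Positive residuals are weighted by tau and negative ones by 1 - tau, which is what moves the minimizer from the median toward the chosen quantile.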
44	Stochastic approximation in Hilbert spaces / Approximation stochastique dans les espaces de Hilbert Dieuleveut, Aymeric 28 September 2017 The goal of supervised learning is to infer relationships between a phenomenon one wishes to predict and "explanatory" variables. To that end, observations of multiple realizations of the phenomenon are available, from which a prediction rule is proposed. The recent emergence of very large-scale data sources, both in the number of observations (in image analysis, for example) and in the large number of explanatory variables (in genetics), has brought out two difficulties: on the one hand, it becomes hard to avoid the pitfall of over-fitting when the number of explanatory variables is much larger than the number of observations; on the other hand, the algorithmic aspect becomes decisive, since merely solving a linear system in the spaces involved can become a major difficulty. Algorithms derived from stochastic approximation methods offer a simultaneous answer to both difficulties: using a stochastic method drastically reduces the algorithmic cost without degrading the quality of the proposed prediction rule, while naturally avoiding over-fitting. In particular, the core of this thesis concerns stochastic gradient methods. The very popular parametric methods propose as predictions linear functions of a chosen set of explanatory variables. However, these methods often yield an imprecise approximation of the underlying statistical structure. In the non-parametric setting, which is one of the central themes of this thesis, the restriction to linear predictors is lifted. The class of functions in which the predictor is constructed itself depends on the observations.
In practice, non-parametric methods are crucial for various applications, in particular for the analysis of non-vectorial data, which can be associated with a vector in a function space through the use of a positive definite kernel. This permits the use of algorithms designed for vectorial data, but requires an understanding of these algorithms in the associated non-parametric space: the reproducing kernel Hilbert space. Moreover, the analysis of non-parametric estimation also sheds revealing light on the parametric setting when the number of predictors greatly exceeds the number of observations. The first contribution of this thesis is a detailed analysis of stochastic approximation in the non-parametric setting, in particular in reproducing kernel Hilbert spaces. This analysis yields optimal convergence rates for the averaged stochastic gradient descent algorithm. The proposed analysis applies to many settings, and particular attention is paid to using minimal assumptions, as well as to studying settings where the number of observations is known in advance or may grow. The second contribution is an algorithm, based on an acceleration principle, that converges at an optimal rate from both the optimization and the statistical points of view. In the non-parametric setting, this improves convergence up to the optimal rate in certain regimes for which the first algorithm analyzed remained sub-optimal. Finally, the third contribution of the thesis is the extension of the framework beyond the least-squares loss: the stochastic gradient descent algorithm is analyzed as a Markov chain. This approach results in an intuitive interpretation and highlights the differences between the quadratic setting and the general setting.
A simple method that substantially improves convergence is also proposed. / The goal of supervised machine learning is to infer relationships between a phenomenon one seeks to predict and “explanatory” variables. To that end, multiple occurrences of the phenomenon are observed, from which a prediction rule is constructed. The last two decades have witnessed the emergence of very large datasets, both in terms of the number of observations (e.g., in image analysis) and in terms of the number of explanatory variables (e.g., in genetics). This has raised two challenges: first, avoiding the pitfall of over-fitting, especially when the number of explanatory variables is much higher than the number of observations; and second, dealing with the computational constraints, such as when merely solving a linear system becomes a difficulty of its own. Algorithms that take their roots in stochastic approximation methods tackle both of these difficulties simultaneously: these stochastic methods dramatically reduce the computational cost, without degrading the quality of the proposed prediction rule, and they can naturally avoid over-fitting. As a consequence, the core of this thesis is the study of stochastic gradient methods. The popular parametric methods give predictors which are linear functions of a set of explanatory variables. However, they often result in an imprecise approximation of the underlying statistical structure. In the non-parametric setting, which is central to this thesis, this restriction is lifted. The class of functions from which the predictor is chosen depends on the observations. In practice, these methods have multiple purposes, and are essential for learning with non-vectorial data, which can be mapped onto a vector in a function space using a positive definite kernel.
This makes it possible to use algorithms designed for vectorial data, but requires the analysis to be carried out in the associated non-parametric space: the reproducing kernel Hilbert space. Moreover, the analysis of non-parametric regression also sheds some light on the parametric setting when the number of predictors is much larger than the number of observations. The first contribution of this thesis is to provide a detailed analysis of stochastic approximation in the non-parametric setting, specifically in reproducing kernel Hilbert spaces. This analysis proves optimal convergence rates for the averaged stochastic gradient descent algorithm. As we take special care in using minimal assumptions, it applies to numerous situations, and covers both the settings in which the number of observations is known a priori and situations in which the learning algorithm works in an on-line fashion. The second contribution is an algorithm based on acceleration, which converges at optimal speed, both from the optimization point of view and from the statistical one. In the non-parametric setting, this can improve the convergence rate up to optimality, even in particular regimes for which the first algorithm remains sub-optimal. Finally, the third contribution of the thesis consists in an extension of the framework beyond the least-squares loss. The stochastic gradient descent algorithm is analyzed as a Markov chain. This point of view leads to an intuitive and insightful interpretation that highlights the differences between the quadratic setting and the more general setting. A simple method resulting in provable improvements in the convergence is then proposed. Approximation stochastique Optimisation convexe Apprentissage supervisé Estimation non-paramétrique Stochastic approximation Convex optimization Supervised learning Nonparametric estimation Reproducing kernel Hilbert spaces 510
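For entry 44, averaged stochastic gradient descent can be sketched in a finite-dimensional least-squares problem. This is a deliberate simplification of the reproducing kernel Hilbert space setting studied in the thesis, and every specific below is an assumption for illustration: inputs uniform on [-1, 1]^2, Gaussian label noise, a constant step size of 0.125 (suited to inputs with squared norm at most 2), and the hypothetical function name. Each sample is used once, and the estimate returned is the running average of the iterates (Polyak-Ruppert averaging).

```python
import random


def averaged_sgd_least_squares(theta_star, n_samples=20000, step=0.125,
                               noise_std=0.1, seed=0):
    """Single-pass SGD on the squared loss, returning the average of the iterates."""
    rng = random.Random(seed)
    d = len(theta_star)
    theta = [0.0] * d        # current SGD iterate
    theta_bar = [0.0] * d    # running average of the iterates
    for n in range(1, n_samples + 1):
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        y = sum(t * xi for t, xi in zip(theta_star, x)) + rng.gauss(0.0, noise_std)
        residual = sum(t * xi for t, xi in zip(theta, x)) - y
        theta = [t - step * residual * xi for t, xi in zip(theta, x)]
        theta_bar = [tb + (t - tb) / n for tb, t in zip(theta_bar, theta)]
    return theta_bar


if __name__ == "__main__":
    # Hypothetical true parameter used to generate the synthetic data.
    print(averaged_sgd_least_squares([1.0, -2.0]))
```

The constant step size keeps the individual iterates noisy, and the averaging is what removes most of that noise.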
45	Efficacité de l’algorithme EM en ligne pour des modèles statistiques complexes dans le contexte des données massives Martel, Yannick 11 1900 The EM algorithm (Dempster et al., 1977) constructs a sequence of estimators that converges to the maximum likelihood estimator for missing-data models in which the maximum likelihood estimator cannot be computed. This algorithm is remarkable given its many applications in statistical learning. However, it can carry a heavy computational cost. Cappé and Moulines (2009) proposed an online version of this algorithm for models belonging to the exponential family, which yields substantial gains in computational efficiency on large datasets. However, computing the posterior expectation of the sufficient statistic, which is required in the version of Cappé and Moulines (2009), is rarely possible for complex models and/or when the dimension of the missing data is large. It must then be replaced by an estimator. Several questions arise naturally: do the convergence results for the original algorithm remain valid when the expectation is replaced by an estimator? In particular, what can be said about the asymptotic normality of the resulting sequence of estimators, the asymptotic variance, and the rate of convergence? How is the variance of the estimator of the expectation reflected in the asymptotic variance of the EM estimator? Can one work with Monte Carlo or MCMC estimators? Can one borrow popular variance reduction tools such as control variates? These questions are studied through examples of latent variable models.
The main contributions of this master's thesis are a unified presentation of stochastic approximation EM algorithms, an illustration of the impact on the variance when the posterior expectation is estimated in online EM algorithms, and the introduction of online EM algorithms that reduce the additional variance caused by estimating the posterior expectation. / The EM algorithm (Dempster et al., 1977) yields a sequence of estimators that converges to the maximum likelihood estimator for missing-data models whose maximum likelihood estimator is not directly tractable. The EM algorithm is remarkable given its numerous applications in statistical learning. However, it may suffer from its computational cost. Cappé and Moulines (2009) proposed an online version of the algorithm, for models whose likelihood belongs to the exponential family, that improves computational efficiency on large data sets. However, the conditional expected value of the sufficient statistic is often intractable for complex models and/or when the missing data is of high dimension. In those cases, it is replaced by an estimator. Many questions then arise naturally: do the convergence results pertaining to the initial estimator hold when the expected value is replaced by an estimator? In particular, does the asymptotic normality property remain in this case? How does the variance of the estimator of the expected value affect the asymptotic variance of the EM estimator? Are Monte Carlo and MCMC estimators suitable in this situation? Could variance reduction tools such as control variates provide variance relief? These questions are tackled by means of examples of latent data models.
This master’s thesis’ main contributions are the presentation of a unified framework for stochastic approximation EM algorithms, an illustration of the impact that the estimation of the conditional expected value has on the variance and the introduction of online EM algorithms which reduce the additional variance stemming from the estimation of the conditional expected value. Algorithme EM Approximation stochastique Réduction de variance Statistique computationnelle Algorithme en ligne EM algorithm Stochastic approximation Variance reduction Computational statistics Online algorithm
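For entry 45, the online EM update of Cappé and Moulines replaces the batch E-step by a stochastic approximation recursion on the sufficient statistics, s_{n+1} = s_n + gamma_{n+1} * (sbar(Y_{n+1}; theta_n) - s_n), followed by the usual M-step map applied to s_{n+1}. The sketch below applies it to a toy model chosen for illustration only: a two-component Gaussian mixture with known equal weights and unit variances, where the conditional expectation is available in closed form. The thesis is concerned with the harder case in which that expectation must itself be estimated, which this sketch does not cover. The function names, step sizes, and data are assumptions.

```python
import math
import random


def online_em_two_means(data, mu_init=(-1.0, 1.0)):
    """Online EM for the two means of an equal-weight, unit-variance Gaussian mixture."""
    mu = list(mu_init)
    # Running averages of the statistics (1{Z=k}, 1{Z=k} * Y) for each component k.
    s = [[0.5, 0.5 * m] for m in mu]
    for n, y in enumerate(data, start=1):
        gamma = (n + 1) ** -0.6                  # step size (assumed choice)
        # E-step for a single observation: posterior responsibilities.
        logw = [-0.5 * (y - m) ** 2 for m in mu]
        top = max(logw)
        w = [math.exp(v - top) for v in logw]
        total = sum(w)
        w = [wk / total for wk in w]
        # Stochastic approximation update of the sufficient statistics.
        for k in range(2):
            s[k][0] += gamma * (w[k] - s[k][0])
            s[k][1] += gamma * (w[k] * y - s[k][1])
        # M-step: each mean is the ratio of its two statistics.
        mu = [s[k][1] / s[k][0] for k in range(2)]
    return mu


def sample_mixture(n, means=(-2.0, 3.0), seed=0):
    """Draw n points from the equal-weight mixture with the given (hypothetical) means."""
    rng = random.Random(seed)
    return [rng.gauss(rng.choice(means), 1.0) for _ in range(n)]


if __name__ == "__main__":
    print(online_em_two_means(sample_mixture(20000)))
```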
46	Non-Convex Optimization for Latent Data Models: Algorithms, Analysis and Applications / Optimisation Non Convexe pour Modèles à Données Latentes : Algorithmes, Analyse et Applications Karimi, Belhal 19 September 2019 Many problems in statistical learning consist of minimizing a non-convex and non-smooth function defined on a Euclidean space. Examples include likelihood maximization and empirical risk minimization problems. The optimization algorithms used to solve such problems have been widely studied for convex functions and are extensively used in practice. However, the growing number of observations involved in evaluating this empirical risk, together with the use of increasingly sophisticated loss functions, presents obstacles. These obstacles call for improving existing algorithms with cheaper updates, ideally independent of the number of observations, and for guaranteeing their theoretical behavior under less restrictive assumptions, such as non-convexity of the function to be optimized. In this thesis manuscript, we are interested in the minimization of objective functions for latent data models, i.e., when the data are partially observed, which includes the conventional sense of missing data but is a more general notion. In the first part, we consider the minimization of a (possibly) non-convex and non-smooth function using incremental and online updates. We propose and analyze several algorithms through a number of applications. In the second part, we focus on the non-convex likelihood maximization problem, using the EM algorithm and its stochastic variants.
We analyze several fast and cheaper versions of it and propose two new EM-type algorithms aimed at accelerating the convergence of the estimated parameters. / Many problems in machine learning involve minimizing a possibly non-convex and non-smooth function defined on a Euclidean space. Examples include topic models, neural networks, and sparse logistic regression. Optimization methods used to solve those problems have been widely studied in the literature for convex objective functions and are extensively used in practice. However, recent breakthroughs in statistical modeling, such as deep learning, coupled with an explosion of data samples, require improvements to non-convex optimization procedures for large datasets. This thesis attempts to address those two challenges by developing algorithms with cheaper updates, ideally independent of the number of samples, and by improving the theoretical understanding of non-convex optimization, which remains rather limited. In this manuscript, we are interested in the minimization of such objective functions for latent data models, i.e., when the data are partially observed, which includes the conventional sense of missing data but is much broader than that. In the first part, we consider the minimization of a (possibly) non-convex and non-smooth objective function using incremental and online updates. To that end, we propose several algorithms exploiting the latent structure to efficiently optimize the objective and illustrate our findings with numerous applications. In the second part, we focus on the maximization of a non-convex likelihood using the EM algorithm and its stochastic variants. We analyze several faster and cheaper algorithms and propose two new variants aimed at speeding up the convergence of the estimated parameters.
Approximation Stochastique Optimisation Non Convexe Somme-Finie Grande-Echelle Données Latentes MCMC Incrémental En ligne Stochastic Approximation Non-Convex Optimization Finite-Sum Large-Scale Latent Data MCMC Incremental Online 519.22
47	Random monotone operators and application to stochastic optimization / Opérateurs monotones aléatoires et application à l'optimisation stochastique Salim, Adil 26 November 2018 (has links) This thesis is mainly concerned with the study of optimization algorithms. The programming problems arising in machine learning or signal processing are in many cases composite, that is, they are constrained or regularized by non-smooth terms. Proximal methods are a class of very efficient algorithms for solving such problems. However, in modern data science applications, the functions to be minimized are often expressed as a mathematical expectation that is difficult or impossible to evaluate. This is the case in online learning problems, in problems involving a large amount of data, and in distributed computing problems. To solve these, we study in this thesis stochastic proximal methods, which adapt proximal algorithms to the case of functions written as an expectation. Stochastic proximal methods are first studied with a constant step size, using stochastic approximation techniques. More precisely, the Ordinary Differential Equation method is adapted to the case of differential inclusions. To establish the asymptotic behavior of the algorithms, the stability of the sequences of iterates (viewed as Markov chains) is studied. Then, generalizations of the stochastic proximal gradient algorithm with decreasing step sizes are developed to solve composite problems. All the quantities that describe the problems to be solved are written as expectations. This includes a primal-dual algorithm for regularized and linearly constrained problems as well as an optimization algorithm over large graphs. / This thesis mainly studies optimization algorithms.
Programming problems arising in signal processing and machine learning are composite in many cases, i.e., they exhibit constraints and non-smooth regularization terms. Proximal methods are known to be efficient for solving such problems. However, in modern data science applications, the functions to be minimized are often represented as statistical expectations whose evaluation is intractable. This covers the case of online learning, big data problems, and distributed computation problems. To solve these problems, we study in this thesis stochastic proximal methods, which generalize proximal algorithms to the case of cost functions written as expectations. Stochastic proximal methods are first studied with a constant step size, using stochastic approximation techniques. More precisely, the Ordinary Differential Equation method is adapted to the case of differential inclusions. In order to study the asymptotic behavior of the algorithms, the stability of the sequences of iterates (seen as Markov chains) is studied. Then, generalizations of the stochastic proximal gradient algorithm with decreasing step sizes are designed to solve composite problems. Every quantity used to define the optimization problem is written as an expectation. This includes a primal-dual algorithm to solve regularized and linearly constrained problems and an algorithm for optimization over large graphs. Optimisation distribuée Apprentissage statistique Approximation stochastique Opérateurs monotones aléatoires Algorithmes proximaux Distributed optimization Machine learning Stochastic approximation Random monotone operators Proximal algorithms
48	Simulation Based Algorithms For Markov Decision Process And Stochastic Optimization Abdulla, Mohammed Shahid 05 1900 (has links) In Chapter 2, we propose several two-timescale simulation-based actor-critic algorithms for the solution of infinite-horizon Markov Decision Processes (MDPs) with finite state space under the average cost criterion. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and averaged for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, a memory-efficient implementation using a feature-vector representation of the state space and TD(0) learning along the faster timescale is discussed. A three-timescale simulation-based algorithm for the solution of infinite-horizon discounted-cost MDPs via the Value Iteration approach is also proposed. An approximation of the Dynamic Programming operator T is applied to the value function iterates. A sketch of convergence explaining the dynamics of the algorithm using associated ODEs is presented. Numerical experiments on rate-based flow control on a bottleneck node using a continuous-time queueing model are presented using the proposed algorithms. Next, in Chapter 3, we develop three simulation-based algorithms for finite-horizon MDPs (FH-MDPs). The first algorithm is developed for finite state and compact action spaces while the other two are for finite state and finite action spaces. Convergence analysis is briefly sketched. We then concentrate on methods to mitigate the curse of dimensionality that affects FH-MDPs severely, as there is one probability transition matrix per stage.
Two parametrized actor-critic algorithms for FH-MDPs with compact action sets are proposed, the ‘critic’ in both algorithms learning the policy gradient. We show w.p.1 convergence to a set satisfying the necessary condition for constrained optima. Further, a third algorithm for stochastic control of stopping time processes is presented. Numerical experiments with the proposed finite-horizon algorithms are shown for a problem of flow control in communication networks. Towards stochastic optimization, in Chapter 4, we propose five algorithms which are variants of SPSA. The original one-measurement SPSA uses an estimate of the gradient of the objective function L containing an additional bias term not seen in two-measurement SPSA. We propose a one-measurement algorithm that eliminates this bias and has asymptotic convergence properties that make for easier comparison with the two-measurement SPSA. The algorithm, under certain conditions, outperforms both forms of SPSA, with the only overhead being the storage of a single measurement. We also propose a similar algorithm that uses perturbations obtained from normalized Hadamard matrices. The convergence w.p.1 of both algorithms is established. We extend measurement reuse to design three second-order SPSA algorithms, sketch the convergence analysis, and present simulation results on an illustrative minimization problem. We then propose several stochastic approximation implementations for related algorithms in flow control of communication networks, beginning with a discrete-time implementation of Kelly’s primal flow-control algorithm. Convergence with probability 1 is shown, even in the presence of communication delays and stochastic effects seen in link congestion indications. Two relevant enhancements are then pursued: a) an implementation of the primal algorithm using second-order information, and b) an implementation where edge-routers rectify misbehaving flows.
Also, discrete-time implementations of Kelly’s dual algorithm and primal-dual algorithm are proposed. Simulation results a) verifying the proposed algorithms and b) comparing stability properties with an algorithm in the literature are presented. Markov Processes - Data Processing Algorithms Simulation Markov Decision Processes (MDPs) Finite Horizon Markov Decision Processes Stochastic Approximation - Algorithms Network Flow-Control FH-MDP Algorithms Stochastic Optimization Reinforcement Learning Algorithms Computational Mathematics
49	Online Learning and Simulation Based Algorithms for Stochastic Optimization Lakshmanan, K January 2012 (has links) (PDF) In many optimization problems, the relationship between the objective and parameters is not known. The objective function itself may be stochastic, such as a long-run average over some random cost samples. In such cases, finding the gradient of the objective is not possible. It is in this setting that stochastic approximation algorithms are used. These algorithms use some estimates of the gradient and are stochastic in nature. Amongst gradient estimation techniques, the Simultaneous Perturbation Stochastic Approximation (SPSA) and Smoothed Functional (SF) schemes are widely used. In this thesis we have proposed a novel multi-timescale quasi-Newton based smoothed functional (QN-SF) algorithm for unconstrained as well as constrained optimization. The algorithm uses the smoothed functional scheme for estimating the gradient and the quasi-Newton method to solve the optimization problem. The algorithm is shown to converge with probability one. We have also provided here experimental results on the problem of optimal routing in a multi-stage network of queues. Policies like Join the Shortest Queue or Least Work Left assume knowledge of the queue length values, which can change rapidly or be hard to estimate. If the only information available is the expected end-to-end delay, as in our case, such policies cannot be used. The QN-SF based probabilistic routing algorithm uses only the total end-to-end delay for tuning the probabilities. We observe from the experiments that the QN-SF algorithm has better performance than the gradient and Jacobi versions of Newton-based smoothed functional algorithms. Next we consider constrained routing in a similar queueing network. We extend the QN-SF algorithm to this case. We study the convergence behavior of the algorithm and observe that the constraints are satisfied at the point of convergence.
We provide experimental results for the constrained routing setup as well. Next we study reinforcement learning algorithms, which are useful for solving Markov Decision Processes (MDPs) when precise information on transition probabilities is not known. When the state and action sets are very large, it is not possible to store all the state-action tuples. In such cases, function approximators like neural networks have been used. The popular Q-learning algorithm is known to diverge when used with linear function approximation due to the ‘off-policy’ problem. Hence developing learning algorithms that remain stable when used with function approximation is an important problem. We present in this thesis a variant of Q-learning with linear function approximation that is based on two-timescale stochastic approximation. The Q-value parameters for a given policy in our algorithm are updated on the slower timescale while the policy parameters themselves are updated on the faster scale. We perform a gradient search in the space of policy parameters. Since the objective function and hence the gradient are not analytically known, we employ the efficient one-simulation simultaneous perturbation stochastic approximation (SPSA) gradient estimates that employ Hadamard matrix based deterministic perturbations. Our algorithm has the advantage that, unlike Q-learning, it does not suffer from high oscillations due to the off-policy problem when using function approximators. Whereas it is difficult to prove convergence of regular Q-learning with linear function approximation because of the off-policy problem, we prove that our algorithm, which is on-policy, is convergent. Numerical results on a multi-stage stochastic shortest path problem show that our algorithm exhibits significantly better performance and is more robust than Q-learning. Future work would be to compare it with other policy-based reinforcement learning algorithms.
Finally, we develop an online actor-critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints. We consider the long-run average cost Markov decision process (MDP) framework in which both the objective and the constraint functions are suitable policy-dependent long-run averages of certain sample path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also provide the results of numerical experiments on a problem of routing in a multistage queueing network with constraints on long-run average queue lengths. We observe that our algorithm exhibits good performance in this setting and converges to a feasible point. Stochastic Approximation Algorithms Stochastic Optimization Markov Decision Process Reinforcement Learning Algorithm Queueing Networks Queuing Theory Online Q-Learning Algorithm Online Actor-Critic Algorithm Markov Decision Processes Q-learning Algorithm Linear Function Approximation Computer Science
50	Développement de méthodes d'analyse de données en ligne / Development of methods to analyze data steams Bar, Romain 29 November 2013 (has links) We assume that high-dimensional data vectors arriving online are independent observations of a random vector. In the second chapter, this vector, denoted Z, is partitioned into two vectors R and S, and the observations are assumed to be identically distributed. A recursive method is then defined for sequentially estimating the first r factors of the projected PCA of R with respect to S. We then study the particular cases of canonical analysis, discriminant factor analysis, and finally correspondence factor analysis. In each of these cases, several processes specific to the analysis in question are defined. In the third chapter, we assume that the expectation En of the random vector Zn from which the observations are drawn varies with time. We write Rn = Zn - En and assume that the vectors Rn form an independent and identically distributed sample of a random vector R. Several stochastic approximation processes are defined to estimate direction vectors of the principal axes of a partial principal component analysis (PCA) of R. This result is then applied to the particular case of partial generalized canonical analysis (GCA), after defining a Robbins-Monro-type stochastic approximation process for the inverse of a covariance matrix. In the fourth chapter, we consider the case where both the expectation and the covariance matrix of Zn vary with time. Finally, simulation results are given in chapter 5. / High-dimensional data are assumed to be independent online observations of a random vector. In the second chapter, the latter is denoted by Z and partitioned into two random vectors R and S, and the data are assumed to be identically distributed.
A recursive method for sequentially estimating the factors of the projected PCA of R with respect to S is defined. Next, some particular cases are investigated: canonical correlation analysis, canonical discriminant analysis, and canonical correspondence analysis; in each case, several specific methods for the estimation of the factors are proposed. In the third chapter, data are observations of the random vector Zn, whose expectation En varies with time. Let Rn = Zn - En and suppose that the vectors Rn form an independent and identically distributed sample of a random vector R. Stochastic approximation processes are used to estimate online the direction vectors of the principal axes of a partial principal components analysis (PCA) of R. This is applied next to the particular case of a partial generalized canonical correlation analysis (gCCA) after defining a stochastic approximation process of the Robbins-Monro type to estimate recursively the inverse of a covariance matrix. In the fourth chapter, the case where both the expectation and the covariance matrix of Zn vary with time n is considered. Finally, simulation results are given in chapter 5. Big Data Flux de données Analyse en composantes principales (ACP) ACP projetée Analyse canonique généralisée (ACG) Approximation stochastique Big data Data streams Principal components analysis (PCA) Projected PCA Stochastic approximation 519.5

Search results