Global ETD Search

1	Randomized coordinate descent methods for big data optimization Takac, Martin January 2014 (has links) This thesis consists of 5 chapters. We develop new serial (Chapter 2), parallel (Chapter 3), distributed (Chapter 4) and primal-dual (Chapter 5) stochastic (randomized) coordinate descent methods, analyze their complexity and conduct numerical experiments on synthetic and real data of huge sizes (GBs/TBs of data, millions/billions of variables). In Chapter 2 we develop a randomized coordinate descent method for minimizing the sum of a smooth and a simple nonsmooth separable convex function and prove that it obtains an ε-accurate solution with probability at least 1 - p in at most O((n/ε) log(1/p)) iterations, where n is the number of blocks. This extends recent results of Nesterov [43], which cover the smooth case, to composite minimization, while at the same time improving the complexity by the factor of 4 and removing ε from the logarithmic term. More importantly, in contrast with the aforementioned work in which the author achieves the results by applying the method to a regularized version of the objective function with an unknown scaling factor, we show that this is not necessary, thus achieving first true iteration complexity bounds. For strongly convex functions the method converges linearly. In the smooth case we also allow for arbitrary probability vectors and non-Euclidean norms. Our analysis is also much simpler. In Chapter 3 we show that the randomized coordinate descent method developed in Chapter 2 can be accelerated by parallelization. The speedup, as compared to the serial method, and referring to the number of iterations needed to approximately solve the problem with high probability, is equal to the product of the number of processors and a natural and easily computable measure of separability of the smooth component of the objective function. In the worst case, when no degree of separability is present, there is no speedup; in the best case, when the problem is separable, the speedup is equal to the number of processors. Our analysis also works in the mode when the number of coordinates being updated at each iteration is random, which allows for modeling situations with variable (busy or unreliable) number of processors. We demonstrate numerically that the algorithm is able to solve huge-scale l1-regularized least squares problems with a billion variables. In Chapter 4 we extended coordinate descent into a distributed environment. We initially partition the coordinates (features or examples, based on the problem formulation) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently from the other computers, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how it depends on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix. Finally, in Chapter 5, we address the issue of using mini-batches in stochastic optimization of Support Vector Machines (SVMs). We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SCDA) methods and use it to derive novel variants of mini-batched (parallel) SDCA. Our guarantees for both methods are expressed in terms of the original nonsmooth primal problem based on the hinge-loss. Our results in Chapters 2 and 3 are cast for blocks (groups of coordinates) instead of coordinates, and hence the methods are better described as block coordinate descent methods. While the results in Chapters 4 and 5 are not formulated for blocks, they can be extended to this setting. 519.6
2	Multiple Kernel Learning with Many Kernels Afkanpour, Arash Unknown Date No description available. Multiple Kernel Learning Stochastic Gradient Descent Greedy Coordinate Descent
3	Optimization for Supervised Machine Learning: Randomized Algorithms for Data and Parameters Hanzely, Filip 20 August 2020 (has links) Many key problems in machine learning and data science are routinely modeled as optimization problems and solved via optimization algorithms. With the increase of the volume of data and the size and complexity of the statistical models used to formulate these often ill-conditioned optimization tasks, there is a need for new efficient algorithms able to cope with these challenges. In this thesis, we deal with each of these sources of difficulty in a different way. To efficiently address the big data issue, we develop new methods which in each iteration examine a small random subset of the training data only. To handle the big model issue, we develop methods which in each iteration update a random subset of the model parameters only. Finally, to deal with ill-conditioned problems, we devise methods that incorporate either higher-order information or Nesterov’s acceleration/momentum. In all cases, randomness is viewed as a powerful algorithmic tool that we tune, both in theory and in experiments, to achieve the best results. Our algorithms have their primary application in training supervised machine learning models via regularized empirical risk minimization, which is the dominant paradigm for training such models. However, due to their generality, our methods can be applied in many other fields, including but not limited to data science, engineering, scientific computing, and statistics. optimization machine learning stochastic gradient variance reduction coordinate descent
4	Paving the Randomized Gauss-Seidel Wu, Wei 01 January 2017 (has links) The Randomized Gauss-Seidel Method (RGS) is an iterative algorithm that solves overdetermined systems of linear equations Ax = b. This paper studies an update on the RGS method, the Randomized Block Gauss-Seidel Method. At each step, the algorithm greedily minimizes the objective function L(x) = kAx bk2 with respect to a subset of coordinates. This paper describes a Randomized Block Gauss-Seidel Method (RBGS) which uses a randomized control method to choose a subset at each step. This algorithm is the first block RGS method with an expected linear convergence rate which can be described by the properties of the matrix A and its column submatrices. The analysis demonstrates that RBGS improves RGS more when given appropriate column-paving of the matrix, a partition of the columns into well-conditioned blocks. The main result yields a RBGS method that is more e cient than the simple RGS method. Numerical Analysis and Computation
5	Distributed Statistical Learning under Communication Constraints El Gamal, Mostafa 21 June 2017 (has links) "In this thesis, we study distributed statistical learning, in which multiple terminals, connected by links with limited capacity, cooperate to perform a learning task. As the links connecting the terminals have limited capacity, the messages exchanged between the terminals have to be compressed. The goal of this thesis is to investigate how to compress the data observations at multiple terminals and how to use the compressed data for inference. We first focus on the distributed parameter estimation problem, in which terminals send messages related to their local observations using limited rates to a fusion center that will obtain an estimate of a parameter related to the observations of all terminals. It is well known that if the transmission rates are in the Slepian-Wolf region, the fusion center can fully recover all observations and hence can construct an estimator having the same performance as that of the centralized case. One natural question is whether Slepian-Wolf rates are necessary to achieve the same estimation performance as that of the centralized case. In this thesis, we show that the answer to this question is negative. We then examine the optimality of data dimensionality reduction via sufficient statistics compression in distributed parameter estimation problems. The data dimensionality reduction step is often needed especially if the data has a very high dimension and the communication rate is not as high as the one characterized above. We show that reducing the dimensionality by extracting sufficient statistics of the parameter to be estimated does not degrade the overall estimation performance in the presence of communication constraints. We further analyze the optimal estimation performance in the presence of communication constraints and we verify the derived bound using simulations. Finally, we study distributed optimization problems, for which we examine the randomized distributed coordinate descent algorithm with quantized updates. In the literature, the iteration complexity of the randomized distributed coordinate descent algorithm has been characterized under the assumption that machines can exchange updates with an infinite precision. We consider a practical scenario in which the messages exchange occurs over channels with finite capacity, and hence the updates have to be quantized. We derive sufficient conditions on the quantization error such that the algorithm with quantized update still converge." convex optimization MVUE sufficient statistics Distributed learning parameter estimation coordinate descent
6	Regularized methods for high-dimensional and bi-level variable selection Breheny, Patrick John 01 July 2009 (has links) Many traditional approaches cease to be useful when the number of variables is large in comparison with the sample size. Penalized regression methods have proved to be an attractive approach, both theoretically and empirically, for dealing with these problems. This thesis focuses on the development of penalized regression methods for high-dimensional variable selection. The first part of this thesis deals with problems in which the covariates possess a grouping structure that can be incorporated into the analysis to select important groups as well as important members of those groups. I introduce a framework for grouped penalization that encompasses the previously proposed group lasso and group bridge methods, sheds light on the behavior of grouped penalties, and motivates the proposal of a new method, group MCP. The second part of this thesis develops fast algorithms for fitting models with complicated penalty functions such as grouped penalization methods. These algorithms combine the idea of local approximation of penalty functions with recent research into coordinate descent algorithms to produce highly efficient numerical methods for fitting models with complicated penalties. Importantly, I show these algorithms to be both stable and linear in the dimension of the feature space, allowing them to be efficiently scaled up to very large problems. In the third part of this thesis, I extend the idea of false discovery rates to penalized regression. The Karush-Kuhn-Tucker conditions describing penalized regression estimates provide testable hypotheses involving partial residuals. I use these hypotheses to connect the previously disparate elds of multiple comparisons and penalized regression, develop estimators for the false discovery rates of methods such as the lasso and elastic net, and establish theoretical results. Finally, the methods from all three sections are studied in a number of simulations and applied to real data from gene expression and genetic association studies. coordinate descent false discovery rate group lasso lasso penalized regression Biostatistics
7	Contribution à la décomposition de données multimodales avec des applications en apprentisage de dictionnaires et la décomposition de tenseurs de grande taille. / Contribution to multimodal data processing with applications to dictionary learning and large-scale decomposition Traoré, Abraham 26 November 2019 (has links) Dans ce travail, on s'intéresse à des outils mathématiques spéciaux appelés tenseurs qui sont formellement définis comme des tableaux multidimensionnels définis sur le produit tensoriel d'espaces vectoriels (chaque espace vectoriel étant muni de son système de coordonnées), le nombre d'espaces vectoriels impliqués dans ce produit étant l'ordre du tenseur. L'intérêt pour les tenseurs est motivé par certains travaux expérimentaux qui ont prouvé, dans divers contextes, que traiter des données multidimensionnelles avec des tenseurs plutôt que des matrices donne un meilleur résultat aussi bien pour des tâches de régression que de classification. Dans le cadre de la thèse, nous nous sommes focalisés sur une décomposition dite de Tucker et avons mis en place une méthode pour l'apprentissage de dictionnaires, une technique pour l'apprentissage en ligne de dictionnaires, une approche pour la décomposition d'un tenseur de grandes tailles et enfin une méthodologie pour la décomposition d'un tenseur qui croît par rapport à tous les modes. De nouveaux résultats théoriques concernant la convergence et la vitesse de convergence sont établis et l'efficacité des algorithmes proposés, reposant soit sur la minimisation alternée, soit sur la descente de gradients par coordonnées, est démontrée sur des problèmes réels / In this work, we are interested in special mathematical tools called tensors, that are multidimensional arrays defined on tensor product of some vector spaces, each of which has its own coordinate system and the number of spaces involved in this product is generally referred to as order. The interest for these tools stem from some empirical works (for a range of applications encompassing both classification and regression) that prove the superiority of tensor processing with respect to matrix decomposition techniques. In this thesis framework, we focused on specific tensor model named Tucker and established new approaches for miscellaneous tasks such as dictionary learning, online dictionary learning, large-scale processing as well as the decomposition of a tensor evolving with respect to each of its modes. New theoretical results are established and the efficiency of the different algorithms, which are based either on alternate minimization or coordinate gradient descent, is proven via real-world problems. Tenseur Apprentissage de dictionnaires Descente de coordonnées Minimisation alternée Tensor Dictionary learning Coordinate descent Alternate minimization 005.73
8	Implementation of a Hardware Coordinate Wise Descend Algorithm with Maximum Likelihood Estimator for Use in mMTC Activity Detection / En hårdvaruimplementation av en koordinatvis minimeringsalgoritm baserat på maximum liklihoodestimering för aktivitetsdetektion i mMT Henriksson, Mikael January 2020 (has links) In this work, a coordinate wise descent algorithm is implemented which serves the purpose of estimating active users in a base station/client wireless communication setup. The implemented algorithm utilizes the sporadic nature of users, which is believed to be the norm with 5G Massive MIMO and Internet of Things, meaning that only a subset of all users are active simultaneously at any given time. This work attempts to estimate the viability of a direct algorithm implementation to test if the performance requirements can be satisfied or if a more sophisticated implementation, such as a parallelized version, needs to be created.The result is an isomorphic ASIC implementation made in a 28 nm FD-SOI process, with proper internal word lengths extracted through simulation. Some techniques to lessen the burden on hardware without losing performance is presented which helps reduce area and increase speed of the implementation. Finally, a parallelized version of the algorithm is proposed, if one should desire to explore an implementation with higher system throughput, at almost no furtherexpense of user estimation error. ASIC 5G Massive MIMO Implementation 28-nm FD-SOI Coordinate Descent Computer Engineering Datorteknik
9	Sparse Ridge Fusion For Linear Regression Mahmood, Nozad 01 January 2013 (has links) For a linear regression, the traditional technique deals with a case where the number of observations n more than the number of predictor variables p (n > p). In the case n < p, the classical method fails to estimate the coefficients. A solution of the problem is the case of correlated predictors is provided in this thesis. A new regularization and variable selection is proposed under the name of Sparse Ridge Fusion (SRF). In the case of highly correlated predictor, the simulated examples and a real data show that the SRF always outperforms the lasso, eleastic net, and the S-Lasso, and the results show that the SRF selects more predictor variables than the sample size n while the maximum selected variables by lasso is n size. Lasso coordinate descent elastic net smooth lasso sparsity collinearity high dimensional data variable selection Statistics and Probability
10	Analysis of Order Strategies for Alternating Algorithms in Optimization Ntiamoah, Daniel 05 June 2023 (has links) No description available. Mathematics Block Coordinate Descent Alternating Least Squares Tensor Approximation Diagonal Valley.

Search results