Scalable sparse machine learning methods for big data

Sparse machine learning models have become increasingly popular in analyzing high-dimensional data. With the evolving era of Big Data, ultrahigh-dimensional, large-scale data sets are constantly collected in many areas such as genetics, genomics, biomedical imaging, social media analysis, and high-frequency finance. Mining valuable information efficiently from these massive data sets requires not only novel statistical models but also advanced computational techniques. This thesis focuses on the development of scalable sparse machine learning methods to facilitate Big Data analytics.
Built upon the feature screening technique, the first part of this thesis proposes a family of hybrid safe-strong rules (HSSR) that incorporate safe screening rules into the sequential strong rule to remove unnecessary computational burden when solving lasso-type models. We present two instances of HSSR, namely SSR-Dome and SSR-BEDPP, for the standard lasso problem. We further extend SSR-BEDPP to the elastic net and group lasso problems to demonstrate the generalizability of the hybrid screening idea. In the second part, we design and implement an R package called biglasso to extend lasso model fitting to Big Data in R. The biglasso package uses memory-mapped files to store the massive data on disk, reading data into memory only when necessary during model fitting, and is thus able to handle data-larger-than-RAM cases seamlessly. Moreover, it is built upon our redesigned algorithm, which incorporates the proposed HSSR screening, making it much more memory- and computation-efficient than existing R packages. Extensive numerical experiments with synthetic and real data sets are conducted in both parts to show the effectiveness of the proposed methods.
In the third part, we consider a novel statistical model, namely the overlapping group logistic regression model, that allows for selecting important groups of features that are associated with binary outcomes in the setting where the features belong to overlapping groups. We conduct systematic simulations and real-data studies to show its advantages in the application of genetic pathway selection. We implement an R package called grpregOverlap that has HSSR screening built in for fitting overlapping group lasso models.
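
For readers unfamiliar with the sequential strong rule that the hybrid rules build on, the following is a minimal Python sketch of that one screening step. It is not the thesis's HSSR and not code from biglasso or grpregOverlap. It assumes the lasso objective is scaled as (1/(2n))||y - Xb||^2 + lambda*||b||_1 with standardized columns. The function name `strong_rule_keep` and the toy data are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def strong_rule_keep(
    columns: Sequence[Sequence[float]],
    residual: Sequence[float],
    lam_prev: float,
    lam_next: float,
) -> List[int]:
    """Indices of features that survive the sequential strong rule.

    Feature j is discarded at lam_next when
        |x_j^T r| / n < 2 * lam_next - lam_prev,
    where r is the residual of the solution at lam_prev.
    (Assumed scaling: objective (1/(2n))||y - Xb||^2 + lam * ||b||_1.)
    """
    n = len(residual)
    threshold = 2.0 * lam_next - lam_prev
    keep = []
    for j, col in enumerate(columns):
        # Absolute scaled inner product between feature j and the residual.
        corr = abs(sum(x * r for x, r in zip(col, residual))) / n
        if corr >= threshold:
            keep.append(j)
    return keep


# Toy data (hypothetical): three features stored column-wise, four observations.
cols = [
    [1.0, -1.0, 1.0, -1.0],
    [1.0, 1.0, -1.0, -1.0],
    [1.0, -1.0, 1.0, 1.0],
]
y = [1.0, -1.0, 1.0, -1.0]

# At lam_max every coefficient is zero, so the residual equals y.
lam_max = max(abs(sum(x * v for x, v in zip(c, y))) / len(y) for c in cols)
survivors = strong_rule_keep(cols, y, lam_prev=lam_max, lam_next=0.8 * lam_max)
print(survivors)
```

The strong rule is a heuristic and can occasionally discard a feature that belongs in the model, so solvers that use it check the KKT conditions on the discarded features afterwards. Safe rules never discard such a feature, which is why combining the two kinds of rule, as the abstract describes, can cut down the work left after screening.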

Big data

Feature screening

High-dimensional statistics

Lasso-type models

Sparse machine learning

Biostatistics

Identifier	oai:union.ndltd.org:uiowa.edu/oai:ir.uiowa.edu:etd-7502
Date	15 December 2017
Creators	Zeng, Yaohui
Contributors	Breheny, Patrick John
Publisher	University of Iowa
Source Sets	University of Iowa
Language	English
Detected Language	English
Type	dissertation
Format	application/pdf
Source	Theses and Dissertations
Rights	Copyright © 2017 Yaohui Zeng
