Global ETD Search

Return to search

Distributed large-scale data storage and processing

This thesis makes progress towards the fundamental understanding of heterogeneous and dynamic information systems and the way that we store and process massive data-sets. Reliable large-scale data storage: Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. However, traditional erasure codes are associated with high repair cost that is often considered an unavoidable price to pay. In this thesis, we show how to overcome these limitations. We construct novel families of erasure codes that are optimal under various repair cost metrics, while achieving the best possible reliability. We show how these modern storage codes significantly outperform traditional erasure codes. Low-rank approximations for large-scale data processing: A central goal in data analytics is extracting useful and interpretable information from massive data-sets. A challenge that arises from the distributed and large-scale nature of the data at hand, is having algorithms that are good in theory but can also scale up gracefully to large problem sizes. Using ideas from prior work, we develop a scalable lowrank optimization framework with provable guarantees for problems like the densest k-subgraph (DkS) and sparse PCA. Our experimental findings indicate that this low-rank framework can outperform the state-of-the art, by offering higher quality and more interpretable solutions, and by scaling up to problem inputs with billions of entries. / text

http://hdl.handle.net/2152/29145

Codes for distributed storage

Big-graph analytics

Identifer	oai:union.ndltd.org:UTEXAS/oai:repositories.lib.utexas.edu:2152/29145
Date	16 March 2015
Creators	Papailiopoulos, Dimitrios
Source Sets	University of Texas
Language	English
Detected Language	English
Type	Thesis
Format	application/pdf

Page generated in 0.002 seconds

Distributed large-scale data storage and processing

Description

Links & Downloads

Tags

Additional Fields