Experimental and observational data emerging from various scientific domains necessitate fast, accurate, and low-cost analysis of the data. While exploring the landscape of big data analytics, multiple challenges arise from three characteristics of big data: the volume, the variety, and the velocity. High volume and velocity of the data warrant a large amount of storage, memory, and compute power while a large variety of data demands cognition across domains. Addressing domain-intrinsic properties of data can help us analyze the data efficiently through the frugal use of high-performance computing (HPC) resources. In this thesis, we present our exploration of the data analytics landscape with domain-aware approximate and incremental algorithm design. We propose three guidelines targeting three properties of big data for domain-aware big data analytics: (1) explore geometric and domain-specific properties of high dimensional data for succinct representation, which addresses the volume property, (2) design domain-aware algorithms through mapping of domain problems to computational problems, which addresses the variety property, and (3) leverage incremental arrival of data through incremental analysis and invention of problem-specific merging methodologies, which addresses the velocity property. We demonstrate these three guidelines through the solution approaches of three representative domain problems.
We present Claret, a fast and portable parallel weighted multi-dimensional scaling (WMDS) tool, to demonstrate the application of the first guideline. It combines algorithmic concepts extended from the stochastic force-based multi-dimensional scaling (SF-MDS) and Glimmer. Claret computes approximate weighted Euclidean distances by combining a novel data mapping called stretching and Johnson Lindestrauss' lemma to reduce the complexity of WMDS from O(f(n)d) to O(f(n) log d). In demonstrating the second guideline, we map the problem of identifying multi-hit combinations of genetic mutations responsible for cancers to weighted set cover (WSC) problem by leveraging the semantics of cancer genomic data obtained from cancer biology. Solving the mapped WSC with an approximate algorithm, we identified a set of multi-hit combinations that differentiate between tumor and normal tissue samples. To identify three- and four-hits, which require orders of magnitude larger computational power, we have scaled out the WSC algorithm on a hundred nodes of Summit supercomputer. In demonstrating the third guideline, we developed a tool iBLAST to perform an incremental sequence similarity search. Developing new statistics to combine search results over time makes incremental analysis feasible. iBLAST performs (1+δ)/δ times faster than NCBI BLAST, where δ represents the fraction of database growth. We also explored various approaches to mitigate catastrophic forgetting in incremental training of deep learning models. / Doctor of Philosophy / Experimental and observational data emerging from various scientific domains necessitate fast, accurate, and low-cost analysis of the data. While exploring the landscape of big data analytics, multiple challenges arise from three characteristics of big data: the volume, the variety, and the velocity. Here volume represents the data's size, variety represents various sources and formats of the data, and velocity represents the data arrival rate. High volume and velocity of the data warrant a large amount of storage, memory, and computational power. In contrast, a large variety of data demands cognition across domains. Addressing domain-intrinsic properties of data can help us analyze the data efficiently through the frugal use of high-performance computing (HPC) resources. This thesis presents our exploration of the data analytics landscape with domain-aware approximate and incremental algorithm design. We propose three guidelines targeting three properties of big data for domain-aware big data analytics: (1) explore geometric (pair-wise distance and distribution-related) and domain-specific properties of high dimensional data for succinct representation, which addresses the volume property, (2) design domain-aware algorithms through mapping of domain problems to computational problems, which addresses the variety property, and (3) leverage incremental data arrival through incremental analysis and invention of problem-specific merging methodologies, which addresses the velocity property.
We demonstrate these three guidelines through the solution approaches of three representative domain problems. We demonstrate the application of the first guideline through the design and development of Claret. Claret is a fast and portable parallel weighted multi-dimensional scaling (WMDS) tool that can reduce the dimension of high-dimensional data points. In demonstrating the second guideline, we identify combinations of cancer-causing gene mutations by mapping the problem to a well known computational problem known as the weighted set cover (WSC) problem. We have scaled out the WSC algorithm on a hundred nodes of Summit supercomputer to solve the problem in less than two hours instead of an estimated hundred years. In demonstrating the third guideline, we developed a tool iBLAST to perform an incremental sequence similarity search. This analysis was made possible by developing new statistics to combine search results over time. We also explored various approaches to mitigate the catastrophic forgetting of deep learning models, where a model forgets to perform machine learning tasks efficiently on older data in a streaming setting.
Identifer | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/99798 |
Date | 20 August 2020 |
Creators | Dash, Sajal |
Contributors | Computer Science, Feng, Wu-chun, Anandakrishnan, Ramamoorthy, Leman, Scotland C., Raghvendra, Sharath, North, Christopher L. |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Dissertation |
Format | ETD, application/pdf, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |
Page generated in 0.0016 seconds