Return to search

Insight Driven Sampling for Interactive Data Intensive Computing

Data Visualization is used to help humans perceive high dimensional data, but it is unable to be applied in real time to data intensive computing applications. Attempts to process and apply traditional information visualization techniques to such applications result in slow or non-responsive applications. For such applications, sampling is often used to reduce big data to smaller data so that the benefits of data visualization can be brought to data intensive applications. Sampling allows data visualization to be used as an interface between humans and insights contained in the big data of data intensive computing. However, sampling introduces error. The objective of sampling is to reduce the amount of data being processed without introducing too much error into the results of the data intensive application. To determine an adequate level of sampling one can use statistical measures like standard error. However, such measures do not translate well for cases involving data visualization. Knowing the standard error of a sample can tell you very little about the visualization of that data. What is needed is a measure that allows system users to make an informed decision on the level of sampling needed to speed up a data intensive application. In this work we introduce an insight based measure for the impact of sampling on the results of visualized data. We develop a framework for the quantification of the level of insight, model the relationship between the level of insight and the amount of sampling, use this model to provide data intensive computing users with the ability to control the amount of sampling as a function of user provided insight requirements, and we develop a prototype that utilizes our framework. This work allows users to speed up data intensive applications with a clear understanding of how the speedup will impact the insights gained from the visualization of this data. Starting with a simple one dimensional data intensive application we apply our framework and work our way to a more complicated computational fluid dynamics case as a proof concept of the application of our framework and insight error feedback measure for those using sampling to speedup data intensive computing. / Doctor of Philosophy / Data Visualization is used to help humans perceive high dimensional data, but it is unable to be applied in real time to computing applications that generate or process vast amounts of data, also known as data intensive computing applications. Attempts to process and apply traditional information visualization techniques to such data result in slow or non-responsive data intensive applications. For such applications, sampling is often used to reduce big data to smaller data so that the benefits of data visualization can be brought to data intensive applications. Sampling allows data visualization to be used as an interface between humans and insights contained in the big data of data intensive computing. However, sampling introduces error. The objective of sampling is to reduce the amount of data being processed without introducing too much error into the results of the data intensive application. This error results from the possibility that a data sample could exclude valuable information that was included in the original data set. To determine an adequate level of sampling one can use statistical measures like standard error. However, such measures do not translate well for cases involving data visualization. Knowing the standard error of a sample can tell you very little about the visualization of that data. What is needed is a measure that allows one to make an informed decision of how much sampling to use in a data intensive application, as a result of knowing how sampling impacts how people gain insights from a visualization of the sampled data. In this work we introduce an insight based measure for the impact of sampling on the results of visualized data. We develop a framework for the quantification of the level of insight, model the relationship between the level of insight and the amount of sampling, use this model to provide data intensive computing users with an insight based feedback measure for each arbitrary sample size they choose for speeding up data intensive computing, and we develop a prototype that utilizes our framework. Our prototype applies our framework and insight based feedback measure to a computational fluid dynamics (CFD) case, but our work starts off with a simple one dimensional data application and works its way up to the more complicated CFD case. This work allows users to speed up data intensive applications with a clear understanding of how the speedup will impact the insights gained from the visualization of this data.

Identiferoai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/107087
Date24 June 2020
CreatorsMasiane, Moeti Moeklesia
ContributorsComputer Science, North, Christopher L., Jacques, Eric Jean-Yves, Feng, Wu-chun, Su, Simon, Luther, Kurt
PublisherVirginia Tech
Source SetsVirginia Tech Theses and Dissertation
Detected LanguageEnglish
TypeDissertation
FormatETD, application/pdf, application/pdf, application/pdf
RightsIn Copyright, http://rightsstatements.org/vocab/InC/1.0/

Page generated in 0.0014 seconds