We present an annotation management system for cloud-based platforms, which is called “CloudNotes�. CloudNotes enables the annotation management feature in the scalable Hadoop and MapRedue platforms. In CloudNotes system, every piece of data may have one or more annotations associate with it, and these annotations will be propagated when the data is being transformed through the MapReduce jobs. Such an annotation management system is important for understanding the provenance and quality of data, especially in applications that deal with integration of scientific and biological data at unprecedented scale and complexity. We propose several extensions to the Hadoop platform that allow end-users to add and retrieve annotations seamlessly. Annotations in CloudNotes will be generated, propagated and managed in a distributed manner. We address several challenges that include attaching annotations to data at various granularities in Hadoop, annotating data in flat files with no known schema until query time, and creating and storing the annotations is a distributed fashion. We also present new storage mechanisms and novel indexing techniques that enable adding the annotations in small increments although Hadoop’s file system is optimized for large batch processing.
Identifer | oai:union.ndltd.org:wpi.edu/oai:digitalcommons.wpi.edu:etd-theses-1272 |
Date | 24 April 2014 |
Creators | Lu, Yue |
Contributors | Mohamed Y. Eltabakh, Advisor, Craig E. Wills, Department Head, , Elke A. Rundensteiner |
Publisher | Digital WPI |
Source Sets | Worcester Polytechnic Institute |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | Masters Theses (All Theses, All Years) |
Page generated in 0.0021 seconds