1

DEPENDABLE CLOUD RESOURCES FOR BIG-DATA BATCH PROCESSING & STREAMING FRAMEWORKS

Bara M Abusalah (10692924) 07 May 2021 (has links)
Observers of cloud computing systems over the last few years have noted the emergence of new Big Data frameworks nearly every year. Since Hadoop appeared in 2007, it has been followed by frameworks such as Spark, Storm, Heron, Apex, Flink, Samza, and Kafka. Each framework is designed to meet particular objectives better than the others do, yet a few functionalities and concerns are shared by all of them. One vital goal they all strive for is better reliability and faster recovery in case of failures. Despite all the advances in making datacenters dependable, failures still happen. This is particularly onerous for long-running "big data" applications, where partial failures can lead to significant losses and lengthy recomputations. It is also crucial for streaming systems, where events are processed and monitored online in real time and any delay in data delivery is a major inconvenience to users.

Another observation is that reliability implementations are often redundant across frameworks. Big data processing frameworks like Hadoop MapReduce include fault tolerance mechanisms, but these are commonly targeted at specific system/failure models and are frequently re-implemented from framework to framework. Encapsulating these implementations into one shared layer would benefit multiple frameworks without the burden of re-implementing the same reliability approach in each one.

These observations motivated us to address the problem with two systems: Guardian and Warden. Guardian is tailored towards batch processing big data systems, while Warden targets stream processing systems. Both are robust, RMS-based, generic, multi-framework, flexible, customizable, low-overhead systems that allow users to run their applications with individually configurable fault tolerance granularity and degree, with only minor changes to their implementation.

Most reliability approaches apply one rigid fault tolerance technique to one system at a time. It is more challenging to provide a reliability approach that is pluggable into multiple Big Data frameworks at once and achieves overheads comparable to single-framework approaches, yet remains flexible and customizable so users can tailor it to their objectives. Genericity is attained by providing an interface that can be used in different applications from different frameworks, at any point in the application code. Low overhead is demonstrated by faster application finish times both with and without failures. Customizability is fulfilled by letting users choose between two fault tolerance guarantees (crash failures or Byzantine failures) and, for streaming systems, combining that choice with two delivery semantics (exactly once or at most once).

In other words, this thesis proposes the paradigm of dependable resources: big data processing frameworks are typically built on top of resource management systems (RMSs), and providing fault tolerance support at the level of such an RMS yields generic fault tolerance mechanisms that can be offered with low overhead by leveraging constraints on resources. To the best of our knowledge, such an approach has not previously been tried across multiple big data batch processing and streaming frameworks.

We demonstrate the benefits of Guardian by evaluating batch processing frameworks such as Hadoop, Tez, Spark, and Pig on a prototype of Guardian running on Amazon EC2, improving completion time by around 68% in the presence of failures while maintaining around 6% overhead. We also built a prototype of Warden on the Flink and Samza (with Kafka) streaming frameworks. Our evaluations of Warden highlight the effectiveness of our approach, both with and without failures, compared to other fault tolerance techniques such as checkpointing.
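To make the configurable guarantees concrete, the sketch below shows, in Python with purely hypothetical names (FaultToleranceConfig, replicas_needed), what a per-application knob at the RMS level might look like: the user picks a fault model and, for streaming jobs, a delivery semantic, and the layer derives how many replicated containers to request. The replica counts follow common replication folklore (f+1 for crash tolerance, 2f+1 for output voting under Byzantine faults) and are assumptions for illustration, not Guardian's or Warden's actual policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FaultModel(Enum):
    CRASH = "crash"          # tolerate fail-stop (crash) failures
    BYZANTINE = "byzantine"  # tolerate arbitrary (Byzantine) failures

class DeliverySemantics(Enum):
    EXACTLY_ONCE = "exactly-once"
    AT_MOST_ONCE = "at-most-once"

@dataclass
class FaultToleranceConfig:
    """User-chosen guarantee and, for streaming jobs, delivery semantics."""
    fault_model: FaultModel
    delivery: Optional[DeliverySemantics] = None  # None for batch jobs

    def replicas_needed(self, max_faults: int) -> int:
        """How many replicated containers an RMS-level layer might request.
        Crash tolerance is commonly taken to need f+1 replicas; Byzantine
        tolerance with voting on outputs is commonly taken to need 2f+1."""
        if self.fault_model is FaultModel.CRASH:
            return max_faults + 1
        return 2 * max_faults + 1

# Hypothetical usage: a streaming job asking for Byzantine tolerance with
# exactly-once delivery, tolerating one faulty replica.
cfg = FaultToleranceConfig(FaultModel.BYZANTINE, DeliverySemantics.EXACTLY_ONCE)
print(cfg.replicas_needed(max_faults=1))  # -> 3
```

The actual Guardian and Warden interfaces and replication policies are described in the thesis itself; this sketch only conveys the kind of per-application, per-guarantee configuration the abstract refers to.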
2

SPECIES- TO COMMUNITY-LEVEL RESPONSES TO CLIMATE CHANGE IN EASTERN U.S. FORESTS

Jonathan A Knott (8797934) 12 October 2021 (has links)
Climate change has dramatically altered the ecological landscape of the eastern U.S., leading to shifts in phenological events and redistribution of tree species. These shifts in phenology and species distributions have implications for the productivity of different populations and the communities these species are a part of. Here, I utilized two studies to quantify the effects of climate change on forests of the eastern U.S. First, I used phenology observations at a common garden of 28 populations of northern red oak (Quercus rubra) across seven years to assess shifts in phenology in response to warming, identify population differences in sensitivity to warming, and correlate sensitivity with the productivity of the populations. Second, I utilized data from the USDA Forest Service's Forest Inventory and Analysis Program to identify forest communities of the eastern U.S., assess shifts in their species compositions and spatial distributions, and determine which climate-related variables are most associated with changes at the community level. In the first study, I found that populations were shifting their spring phenology in response to warming, with the greatest sensitivity in populations from warmer, wetter climates. However, these more sensitive populations did not have the highest productivity; rather, populations closer to the common garden with intermediate levels of sensitivity had the highest productivity. In the second study, I found 12 regional forest communities in the eastern U.S., which varied in how much their species composition shifted over the last three decades. Additionally, all 12 communities shifted their spatial distributions, but their shifts were not correlated with the distance and direction that climate change predicted them to shift. Finally, areas with the greatest change across all 12 communities were associated with warmer, wetter, less temperature-variable climates, generally in the southeastern U.S. Taken together, these studies provide insight into the ways in which forests are responding to climate change and have implications for the management and sustainability of forests in a continuously changing global environment.
3

EXPLOITING THE SPATIAL DIMENSION OF BIG DATA JOBS FOR EFFICIENT CLUSTER JOB SCHEDULING

Akshay Jajoo (9530630) 16 December 2020 (has links)
With the growing business impact of distributed big data analytics jobs, it has become crucial to optimize their execution and resource consumption. In most cases, such jobs consist of multiple sub-entities called tasks and are executed online in a large, shared distributed computing system. The ability to accurately estimate runtime properties and to coordinate the execution of a job's sub-entities allows a scheduler to schedule jobs efficiently. This thesis presents the first study that highlights the spatial dimension, an inherent property of distributed jobs, and underscores its importance in efficient cluster job scheduling. We develop two new classes of spatial-dimension-based algorithms to address the two primary challenges of cluster scheduling. First, we propose, validate, and design two complete systems that employ learning algorithms exploiting the spatial dimension. We demonstrate high similarity in runtime properties between sub-entities of the same job through detailed analysis of four different industrial cluster traces. We identify design challenges and propose principles for a sampling-based learning system in two settings: a coflow scheduler and a cluster job scheduler. Second, we propose, design, and demonstrate the effectiveness of new multi-task scheduling algorithms based on effective synchronization across the spatial dimension. We underline, and validate by experimental analysis, the importance of synchronization between the sub-entities (flows, tasks) of a distributed entity (coflow, data analytics job) for its efficient execution. We also show that scheduling a sub-entity without considering its siblings can lead to sub-optimal overall cluster performance. We propose, design, and implement a full coflow scheduler based on these assertions.
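As an illustration of the sampling-based learning idea, the minimal Python sketch below (with hypothetical names — Job, sample_and_estimate, schedule_order are not the systems built in the thesis) estimates a job's per-task runtime from a small pilot sample of its own tasks, exploiting the runtime similarity among sibling sub-entities, and then orders jobs by estimated remaining work as a shortest-work-first heuristic.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Job:
    name: str
    task_runtimes: List[float]        # true per-task runtimes, unknown to the scheduler
    estimate: Optional[float] = None  # per-task estimate learned from a pilot sample

def sample_and_estimate(job: Job, pilot_fraction: float = 0.05, min_pilot: int = 2) -> float:
    """Run a small pilot subset of the job's tasks and use their mean runtime
    as the estimate for the remaining (similar) sibling tasks."""
    k = max(min_pilot, int(len(job.task_runtimes) * pilot_fraction))
    pilot = random.sample(job.task_runtimes, min(k, len(job.task_runtimes)))
    job.estimate = sum(pilot) / len(pilot)
    return job.estimate

def schedule_order(jobs: List[Job]) -> List[Job]:
    """Order jobs by estimated total remaining work (shortest first), an
    approximation of shortest-job-first without a-priori runtime knowledge."""
    for job in jobs:
        sample_and_estimate(job)
    return sorted(jobs, key=lambda j: j.estimate * len(j.task_runtimes))

# Hypothetical usage: two jobs whose sibling tasks have similar runtimes.
jobs = [
    Job("analytics-A", [9.8, 10.1, 10.0, 9.9] * 25),  # 100 tasks, ~10 s each
    Job("analytics-B", [2.1, 1.9, 2.0, 2.0] * 10),    # 40 tasks, ~2 s each
]
for j in schedule_order(jobs):
    print(j.name, round(j.estimate, 2))
```

In a real cluster scheduler the pilot would be actual task executions and the ordering would also have to respect fairness and data placement; the sketch only conveys why sibling-task similarity makes a small sample informative.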
