511

Design and implementation of the Maestro network control platform

January 2009
Computer network operation is inherently complex because it consists of many functions, such as routing, firewalling, VPN provisioning, traffic load-balancing, and network maintenance. To cope with this complexity, network designers have created modular components to handle each function. Unfortunately, in reality, unavoidable dependencies exist among some of the components, and they may interact accidentally. There is no single mechanism for systematically governing the interactions among the various components. In addition, routing is mainly realized by distributed routing protocols for higher survivability. Other components need to be centralized, either because they have no obvious mapping onto distributed computations or because centralization yields better solutions. Both distributed control and centralized control are therefore necessary in network management, but the interaction between distributed and centralized controls makes the problem even more complicated. No existing study has considered either how to systematically manage the interactions among network control components, or how distributed and centralized control systems could collaborate to leverage the inherent advantages of both. To address these problems, we propose a system called Maestro. Maestro orchestrates the network control components that govern the behavior of a network, and enables collaboration between distributed and centralized control components. Maestro provides abstractions for the modular implementation of network control components, and addresses the fundamental problems arising from their concurrent operation: communication between components, scheduling of component executions, concurrency management, and protection enforcement. Maestro allows distributed and centralized control components to collaborate in four different ways, each with different strengths and weaknesses. In this thesis we present the design and implementation of a prototype of Maestro, and evaluate the performance and effectiveness of Maestro's mechanisms.
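As a rough illustration of the modular abstraction this abstract describes, the sketch below shows what a mediated, event-driven control component could look like in Java. All names (ControlComponent, NetworkView, ViewDelta) are invented for this sketch and are not Maestro's actual API; the point is only that components declare their inputs and never call each other directly, so the platform can order their interactions.

```java
import java.util.List;

/** A read-only snapshot of network state (routes, filters, ...). Hypothetical. */
interface NetworkView {
    List<String> routesFor(String prefix);
}

/** Changes a component proposes; the platform serializes and applies them. */
interface ViewDelta {
    void addRoute(String prefix, String nextHop);
}

/** One pluggable control function (routing, filtering, load balancing, ...). */
interface ControlComponent {
    /** Declared inputs let the platform schedule components and detect
     *  conflicts, instead of letting components interact accidentally. */
    List<String> inputsOfInterest();

    /** Invoked by the platform on relevant events; components never call
     *  each other directly, so all interaction is mediated and ordered. */
    void onEvent(NetworkView view, ViewDelta out);
}
```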
512

Exploring the potential for accelerating sparse matrix-vector product on a Processing-in-Memory architecture

January 2009
As the importance of memory access delays on performance has mushroomed over the past few decades, researchers have begun exploring Processing-in-Memory (PIM) technology, which offers higher memory bandwidth, lower memory latency, and lower power consumption. In this study, we investigate whether an emerging PIM design from Sandia National Laboratories can boost performance for sparse matrix-vector product (SMVP). While SMVP is, in the best case, bandwidth-bound, factors related to matrix structure and representation also limit performance. We analyze SMVP in the context of both an AMD Opteron processor and the Sandia PIM, exploring the performance limiters for each and the degree to which these can be ameliorated by data and code transformations. Over a range of sparse matrices, SMVP on the PIM outperformed the Opteron by a factor of 1.82. On the PIM, computational kernel and data structure transformations improved performance by almost 40% over conventional implementations using compressed-sparse-row format.
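For readers unfamiliar with the kernel, the following is the textbook compressed-sparse-row (CSR) formulation of SMVP, sketched in Java. The thesis evaluates transformed variants of this kernel on the PIM, not this exact code; the inner loop's irregular, indexed access to x is what makes the kernel bandwidth-bound.

```java
/** Textbook CSR sparse matrix-vector product, y = A * x. */
class CsrSpmv {
    static void multiply(int n, int[] rowPtr, int[] colIdx, double[] val,
                         double[] x, double[] y) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            // Nonzeros of row i occupy indices rowPtr[i] .. rowPtr[i+1]-1.
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
                sum += val[k] * x[colIdx[k]]; // irregular, bandwidth-bound access
            }
            y[i] = sum;
        }
    }
}
```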
513

A storage architecture for data-intensive computing

January 2010
The assimilation of computing into our daily lives is enabling the generation of data at unprecedented rates. In 2008, IDC estimated that the "digital universe" contained 486 exabytes of data [9]. The computing industry is being challenged to develop methods for the cost-effective processing of data at these large scales. The MapReduce programming model has emerged as a scalable way to perform data-intensive computations on commodity cluster computers. Hadoop is a popular open-source implementation of MapReduce. To manage storage resources across the cluster, Hadoop uses a distributed user-level filesystem. This filesystem --- HDFS --- is written in Java and designed for portability across heterogeneous hardware and software platforms. The efficiency of a Hadoop cluster depends heavily on the performance of this underlying storage system. This thesis is the first to analyze the interactions between Hadoop and storage. It describes how the user-level Hadoop filesystem, instead of efficiently capturing the full performance potential of the underlying cluster hardware, actually degrades application performance significantly. Architectural bottlenecks in the Hadoop implementation result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Further, HDFS implicitly makes assumptions about how the underlying native platform manages storage resources, even though native filesystems and I/O schedulers vary widely in design and behavior. Methods to eliminate these bottlenecks in HDFS are proposed and evaluated both in terms of their application performance improvement and impact on the portability of the Hadoop framework. In addition to improving the performance and efficiency of the Hadoop storage system, this thesis also focuses on improving its flexibility. The goal is to allow Hadoop to coexist in cluster computers shared with a variety of other applications through the use of virtualization technology. The introduction of virtualization breaks the traditional Hadoop storage architecture, where persistent HDFS data is stored on local disks installed directly in the computation nodes. To overcome this challenge, a new flexible network-based storage architecture is proposed, along with changes to the HDFS framework. Network-based storage enables Hadoop to operate efficiently in a dynamic virtualized environment and furthers the spread of the MapReduce parallel programming model to new applications.
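To make the storage path concrete, here is a minimal read through Hadoop's user-level HDFS client (the file path is illustrative). Every such read traverses the Java filesystem layer whose interaction with the native platform the thesis analyzes, rather than going straight to a kernel filesystem.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml, hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // user-level HDFS client, not a kernel FS
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"))) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) > 0) {
                // process n bytes; the actual I/O is served by a remote DataNode
            }
        }
    }
}
```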
514

Efficient traffic trajectory error detection

January 2010
Our recent survey of publicly reported router bugs shows that many router bugs, once triggered, can cause various traffic trajectory errors, including traffic deviating from its intended forwarding paths, traffic being mistakenly dropped, and unauthorized traffic bypassing packet filters. These traffic trajectory errors are serious problems because they may cause network applications to fail and create security loopholes for network intruders to exploit. Therefore, traffic trajectory errors must be quickly and efficiently detected so that corrective action can be performed in a timely fashion. Detecting traffic trajectory errors requires the real-time tracking of the control states (e.g., forwarding tables, packet filters) of routers and the scalable monitoring of the actual traffic trajectories in the network. Traffic trajectory errors can then be detected by efficiently comparing the observed traffic trajectories against the intended control states. Making such trajectory error detection efficient and practical for large-scale, high-speed networks requires us to address many challenges. First, existing traffic trajectory monitoring algorithms require the simultaneous monitoring of all network interfaces in a network for the packets of interest, which would cause daunting monitoring overhead. To improve the efficiency of traffic trajectory monitoring, we propose a router group monitoring technique that monitors only the peripheral interfaces of a set of selected router groups. We analyze a large number of real network topologies and show that effective router groups with high trajectory error detection rates exist in all cases. We then develop an analytical model for quickly and accurately estimating the detection rates of different router groups. Based on this model, we propose an algorithm to select a set of router groups that achieves complete error detection with low monitoring overhead. Second, maintaining the control states of all the routers in the network requires a significant amount of memory. However, there exist no studies on how to efficiently store multiple complex packet filters. We propose to store multiple packet filters using a shared HyperCuts decision tree. To help decide which subsets of packet filters should share a HyperCuts decision tree, we first identify a number of important factors that collectively impact the efficiency of the resulting shared tree. Based on the identified factors, we then propose to use machine learning techniques to predict whether any pair of packet filters should share a tree. Given the pair-wise prediction matrix, a greedy heuristic algorithm is used to classify packet filters into a number of shared HyperCuts decision trees. Our experiments using both real and synthetic packet filters show that our shared HyperCuts decision trees require considerably less memory while having the same or a slightly greater average height than separate trees. In addition, the shared HyperCuts decision trees enable concurrent lookup of multiple packet filters sharing the same tree. Finally, based on the two proposed techniques, we have implemented a complete prototype system that is compatible with Juniper's JUNOS. We show in the thesis that, to detect traffic trajectory errors, it is sufficient to selectively implement only a small set of key functions of a full-fledged router on our prototype, which makes our prototype simpler and less error-prone.
We conduct both Emulab experiments and micro-benchmark experiments to show that the system can efficiently track router control states, monitor traffic trajectories and detect traffic trajectory errors.
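As a simplified illustration of the shared HyperCuts structure, the sketch below shows one plausible node layout in Java. Real HyperCuts nodes choose cut dimensions and cut counts heuristically, and all names here are invented; the key idea is that a leaf may hold rules drawn from several packet filters that share the same tree.

```java
import java.util.List;

/** Simplified sketch of a HyperCuts-style decision-tree node. Hypothetical. */
class HyperCutsNode {
    int[] cutDims;            // e.g., {SRC_IP, DST_PORT}: cut several dimensions at once
    int[] cutCounts;          // equal-width partitions per cut dimension
    HyperCutsNode[] children; // grid of cutCounts[0] * cutCounts[1] * ... children
    List<Integer> ruleIds;    // at a leaf: rules (possibly from several shared
                              // packet filters) matched by a short linear scan

    boolean isLeaf() { return children == null; }
}
```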
515

Designing type inference for typed object-oriented languages

January 2010
Type-checked object-oriented languages have typically been designed with extremely simple type systems. However, there has recently been intense interest in extending such languages with more sophisticated types and subtyping relationships. JAVA and C# are mainstream languages that have been successfully extended with generic classes and methods; SCALA, FORTRESS, and X10 are new languages that adopt more advanced typing features, such as arrows, tuples, unions, intersections, dependent types, and existentials. At present, the type inference performed by these languages is unstable and evolving. This thesis explores problems arising in the design of a type inference specification for such languages. We first present a formal description of subtyping in the context of a variety of advanced typing features. We then demonstrate how our formal subtyping algorithm can be easily re-expressed to produce a type inference algorithm, and observe that this algorithm is general enough to address a variety of important type-checking problems. Finally, we apply this theory to a case study of the JAVA language's type system. We express JAVA's types and inference algorithm in terms of our formal theory and note a variety of opportunities for improvement. We then describe the results of applying an improved type inference implementation to a selection of existing JAVA code, noting that, without introducing significant backwards-incompatibility problems for these programs, we have managed to significantly reduce the need for annotated method invocations.
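The "annotated method invocations" mentioned above are call sites with explicit type-argument witnesses. A small illustrative Java example (hypothetical, not drawn from the thesis's corpus): whether the second call below type-checks depends on the strength of the inference algorithm. Java 8's improved inference accepts it, while earlier javac versions inferred List<Object> and required the explicit annotation.

```java
import java.util.Collections;
import java.util.List;

public class InferenceExample {
    static void takeStrings(List<String> xs) { /* ... */ }

    public static void main(String[] args) {
        // Annotated invocation: the type argument String is written explicitly.
        takeStrings(Collections.<String>emptyList());

        // Unannotated invocation: a stronger inference algorithm recovers
        // T = String from the target type List<String>; weaker algorithms
        // reject this call, forcing the annotation above.
        takeStrings(Collections.emptyList());
    }
}
```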
516

Workload-aware live storage migration for clouds

January 2010
The emerging open cloud computing model will give users great freedom to dynamically migrate virtualized computing services to, from, and between clouds over the wide area. While this freedom brings many potential benefits, the running services must be minimally disrupted by the migration. Unfortunately, current solutions for wide-area migration incur too much disruption, as they significantly slow down storage I/O operations during migration. The resulting increase in service latency could be very costly to a business. This thesis presents a novel storage migration scheduling algorithm that can greatly improve storage I/O performance during wide-area migration. Our algorithm is unique in that it considers each virtual machine's individual storage I/O workload characteristics, such as temporal locality, spatial locality, and popularity, to compute an efficient data transfer schedule. Using a trace-driven framework, we show that our algorithm provides large performance benefits across a wide range of popular virtual machine workloads.
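The sketch below shows one plausible workload-aware transfer ordering in Java, under the assumption that frequently written ("popular") blocks are sent last so they are less likely to be dirtied and retransmitted. This is a hedged illustration consistent with the abstract, not the thesis's actual algorithm, and the Block fields are invented.

```java
import java.util.Comparator;
import java.util.List;

class Block {
    final long id;
    final long writeCount; // popularity measured over a recent window (assumed metric)
    Block(long id, long writeCount) { this.id = id; this.writeCount = writeCount; }
}

class MigrationScheduler {
    /** Order blocks for transfer: cold blocks first, hot blocks last,
     *  so hot blocks have less time to be re-dirtied before cutover. */
    static void scheduleInPlace(List<Block> blocks) {
        blocks.sort(Comparator.comparingLong(b -> b.writeCount));
    }
}
```

Sorting by write popularity is just one heuristic; a schedule that also exploits spatial and temporal locality would additionally group adjacent blocks and track recency.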
517

Efficient optimization of memory accesses in parallel programs

January 2010
The power, frequency, and memory wall problems have caused a major shift in mainstream computing by introducing processors that contain multiple low-power cores. As multi-core processors become ubiquitous, software trends in both parallel programming languages and dynamic compilation have added new challenges to program compilation for multi-core processors. This thesis proposes a combination of high-level and low-level compiler optimizations to address these challenges. The high-level optimizations introduced in this thesis include new approaches to May-Happen-in-Parallel analysis and Side-Effect analysis for parallel programs, and a novel parallelism-aware Scalar Replacement for Load Elimination transformation. A new Isolation Consistency (IC) memory model is described that permits more scalar replacement transformation opportunities than many existing memory models. The low-level optimizations include a novel approach to register allocation that retains the compile-time and space efficiency of Linear Scan while delivering runtime performance superior to both Linear Scan and Graph Coloring. The allocation phase is modeled as an optimization problem on a Bipartite Liveness Graph (BLG) data structure. The assignment phase focuses on reducing the number of spill instructions by using register-to-register move and exchange instructions wherever possible. Experimental evaluations of our scalar replacement for load elimination transformation in the Jikes RVM dynamic compiler show decreases in dynamic counts for getfield operations of up to 99.99%, and performance improvements of up to 1.76x on 1 core and 1.39x on 16 cores, when compared with the load elimination algorithm available in Jikes RVM. A prototype implementation of our BLG register allocator in Jikes RVM demonstrates runtime performance improvements of up to 3.52x relative to Linear Scan on an x86 processor. When compared to the Graph Coloring register allocator in the GCC compiler framework, our allocator yielded an execution time improvement of up to 5.8%, with an average improvement of 2.3%, on a POWER5 processor. Based on these experimental evaluations and the foundations presented in this thesis, we believe that the proposed high-level and low-level optimizations are useful in addressing some of the new challenges emerging in the optimization of parallel programs for multi-core architectures.
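The getfield reductions reported above come from scalar replacement for load elimination. A minimal, hypothetical Java illustration of the source-level effect of the transformation follows; whether the rewrite is legal when other threads may write p.x concurrently is exactly the question a memory model such as Isolation Consistency addresses.

```java
class Point { int x; }

class LoadElimination {
    static int before(Point p) {
        return p.x + p.x + p.x;  // three getfield loads of p.x
    }

    static int after(Point p) {
        int t = p.x;             // one getfield; the scalar 't' replaces the rest
        return t + t + t;
    }
}
```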
518

Segmentation and visualization of volume maps

January 2010
Volume data is a simple and often-used representation for exchanging and processing data in various scientific domains, such as medicine and molecular biology. The segmentation of volume data is an essential part of data interpretation. Researchers have extensively studied the problem of segmentation, focusing on efficient algorithms for segmenting and rendering volumes. Our contribution is two-fold. First, we propose a tri-linear classification method that can be implemented on the GPU to reduce the artifacts and jaggedness that appear along material boundaries when rendering segmented volumes. Our representation provides sub-voxel accuracy for representing segmented materials. Second, we demonstrate our interactive painting-based segmentation tool, which can be used to rapidly produce an intuitive segmentation. We compare our tool against known results and show that we can generate similar segmentations using a simple and intuitive control scheme.
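For reference, sub-voxel evaluation of this kind rests on standard trilinear interpolation among a voxel cell's eight corner samples, shown below in Java. This is the textbook formula, not the thesis's GPU classifier itself.

```java
class Trilinear {
    /** Interpolates the 8 corner samples cXYZ of a voxel cell at
     *  fractional offsets (fx, fy, fz), each in [0, 1]. */
    static double trilerp(double c000, double c100, double c010, double c110,
                          double c001, double c101, double c011, double c111,
                          double fx, double fy, double fz) {
        double c00 = c000 * (1 - fx) + c100 * fx; // interpolate along x
        double c10 = c010 * (1 - fx) + c110 * fx;
        double c01 = c001 * (1 - fx) + c101 * fx;
        double c11 = c011 * (1 - fx) + c111 * fx;
        double c0  = c00 * (1 - fy) + c10 * fy;   // then along y
        double c1  = c01 * (1 - fy) + c11 * fy;
        return c0 * (1 - fz) + c1 * fz;           // then along z
    }
}
```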
519

Performance analysis for parallel programs from multicore to petascale

January 2010
Cutting-edge science and engineering applications require petascale computing. Petascale computing platforms are characterized by both extreme parallelism (systems of hundreds of thousands to millions of cores) and hybrid parallelism (nodes with multicore chips). Consequently, to effectively use petascale resources, applications must exploit concurrency at both the node and system level --- a difficult problem. The challenge of developing scalable petascale applications is only partially aided by existing languages and compilers. As a result, manual performance tuning is often necessary to identify and resolve poor parallel and serial efficiency. Our thesis is that it is possible to achieve unique, accurate, and actionable insight into the performance of fully optimized parallel programs by measuring them with asynchronous-sampling-based call path profiles; attributing the resulting binary-level measurements to source code structure; analyzing measurements on-the-fly and postmortem to highlight performance inefficiencies; and presenting the resulting context-sensitive metrics in three complementary views. To support this thesis, we have developed several techniques for identifying performance problems in fully optimized serial, multithreaded, and petascale programs. First, we describe how to attribute very precise (instruction-level) measurements to source-level static and dynamic contexts in fully optimized applications --- all for an average run-time overhead of a few percent. We then generalize this work with the development of logical call path profiling and apply it to work-stealing-based applications. Second, we describe techniques for pinpointing and quantifying parallel inefficiencies such as parallel idleness, parallel overhead, and lock contention in multithreaded executions. Third, we show how to diagnose scalability bottlenecks in petascale applications by scaling our measurement, analysis, and presentation tools to support large-scale executions. Finally, we provide a coherent framework for these techniques by sketching a unique and comprehensive performance analysis methodology. This work forms the basis of Rice University's HPCTOOLKIT performance tools.
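As a toy illustration of the asynchronous-sampling idea, the Java sketch below periodically captures a target thread's stack and counts call paths. HPCTOOLKIT itself works quite differently (signal-based sampling of fully optimized native binaries with binary-to-source attribution), so this conveys only the core sampling concept; all names are invented.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SamplingProfiler {
    final Map<String, Long> callPathCounts = new ConcurrentHashMap<>();

    /** Take one sample: the target thread's current call path. */
    void sample(Thread target) {
        StackTraceElement[] frames = target.getStackTrace();
        callPathCounts.merge(Arrays.toString(frames), 1L, Long::sum);
    }

    /** Sample the target every periodMillis until it exits. */
    void attach(Thread target, long periodMillis) {
        Thread sampler = new Thread(() -> {
            while (target.isAlive()) {
                sample(target);
                try { Thread.sleep(periodMillis); } catch (InterruptedException e) { return; }
            }
        });
        sampler.setDaemon(true);
        sampler.start();
    }
}
```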
520

Optimizing network I/O virtualization through guest-driven scheduler bypass

January 2010
Virtualization is increasingly utilized for consolidating server resources to improve efficiency by conserving power and space. However, significant hurdles remain in achieving satisfactory performance in a virtualized system. Notably, virtualization of network I/O continues to be a performance barrier. The driver domain model of I/O virtualization suffers from an inherent network performance disadvantage due to the necessity of scheduling a driver domain. However, this virtualization model is desirable because of its fault tolerance and isolation properties. In this work, I argue that it is possible to overcome the barrier of network I/O performance while maintaining domain protection by providing a mechanism that enables guests to operate the driver domain on their own behalf without the intervention of the scheduler. I describe my implementation of the worldswitch mechanism and evaluate its performance. I show that with the worldswitch enabled, guests achieve higher bandwidth and lower latency than in an unmodified system.
