Global ETD Search

1	Automatic generation of synthetic workloads for multicore systems Ganesan, Karthik 11 July 2012 (has links) When designing a computer system, benchmark programs are used with cycle accurate performance/power simulators and HDL level simulators to evaluate novel architectural enhancements, perform design space exploration, understand the worst-case power characteristics of various designs and find performance bottlenecks. This research effort is directed towards automatically generating synthetic benchmarks to tackle three design challenges: 1) For most of the simulation related purposes, full runs of modern real world parallel applications like the PARSEC, SPLASH suites cannot be used as they take machine weeks of time on cycle accurate and HDL level simulators incurring a prohibitively large time cost 2) The second design challenge is that, some of these real world applications are intellectual property and cannot be shared with processor vendors for design studies 3) The most significant problem in the design stage is the complexity involved in fixing the maximum power consumption of a multicore design, called the Thermal Design Power (TDP). In an effort towards fixing this maximum power consumption of a system at the most optimal point, designers are used to hand-crafting possible code snippets called power viruses. But, this process of trying to manually write such maximum power consuming code snippets is very tedious. All of these aforementioned challenges has lead to the resurrection of synthetic benchmarks in the recent past, serving as a promising solution to all the challenges. During the design stage of a multicore system, availability of a framework to automatically generate system-level synthetic benchmarks for multicore systems will greatly simplify the design process and result in more confident design decisions. The key idea behind such an adaptable benchmark synthesis framework is to identify the key characteristics of real world parallel applications that affect the performance and power consumption of a real program and create synthetic executable programs by varying the values for these characteristics. Firstly, with such a framework, one can generate miniaturized synthetic clones for large target (current and futuristic) parallel applications enabling an architect to use them with slow low-level simulation models (e.g., RTL models in VHDL/Verilog) and helps in tailoring designs to the targeted applications. These synthetic benchmark clones can be distributed to architects and designers even if the original applications are intellectual property, when they are not publicly available. Lastly, such a framework can be used to automatically create maximum power consuming code snippets to be able to help in fixing the TDP, heat sinks, cooling system and other power related features of the system. The workload cloning framework built using the proposed synthetic benchmark generation methodology is evaluated to show its superiority over the existing cloning methodologies for single-core systems by generating miniaturized clones for CPU2006 and ImplantBench workloads with only an average error of 2.9% in performance for up to five orders of magnitude of simulation speedup. The correlation coefficient predicting the sensitivity to design changes is 0.95 and 0.98 for performance and power consumption. The proposed framework is evaluated by cloning parallel applications implemented based on p-threads and OpenMP in the PARSEC benchmark suite. The average error in predicting performance is 4.87% and that of power consumption is 2.73%. The correlation coefficient predicting the sensitivity to design changes is 0.92 for performance. The efficacy of the proposed synthetic benchmark generation framework for power virus generation is evaluation on SPARC, Alpha and x86 ISAs using full system simulators and also using real hardware. The results show that the power viruses generated for single-core systems consume 14-41% more power compared to MPrime on SPARC ISA. Similarly, the power viruses generated for multicore systems consume 45-98%, 40-89% and 41-56% more power than PARSEC workloads, running multiple copies of MPrime and multithreaded SPECjbb respectively. / text Power virus Stressmark Workload cloning Synthetic benchmarks Thermal design power Parallel workloads
2	Performance Analysis and Evaluation of Divisible Load Theory and Dynamic Loop Scheduling Algorithms in Parallel and Distributed Environments Balasubramaniam, Mahadevan 14 August 2015 (has links) High performance parallel and distributed computing systems are used to solve large, complex, and data parallel scientific applications that require enormous computational power. Data parallel workloads which require performing similar operations on different data objects, are present in a large number of scientific applications, such as N-body simulations and Monte Carlo simulations, and are expressed in the form of loops. Data parallel workloads that lack precedence constraints are called arbitrarily divisible workloads, and are amenable to easy parallelization. Load imbalance that arise from various sources such as application, algorithmic, and systemic characteristics during the execution of scientific applications degrades performance. Scheduling of arbitrarily divisible workloads to address load imbalance in order to obtain better utilization of computing resources is a major area of research. Divisible load theory (DLT) and dynamic loop scheduling (DLS) algorithms are two algorithmic approaches employed in the scheduling of arbitrarily divisible workloads. Despite sharing the same goal of achieving load balancing, the two approaches are fundamentally different. Divisible load theory algorithms are linear, deterministic and platform dependent, whereas dynamic loop scheduling algorithms are probabilistic and platform agnostic. Divisible load theory algorithms have been traditionally used for performance prediction in environments characterized by known or expected variation in the system characteristics at runtime. Dynamic loop scheduling algorithms are designed to simultaneously address all the sources of load imbalance that stochastically arise at runtime from application, algorithmic, and systemic characteristics. In this dissertation, an analysis and performance evaluation of DLT and DLS algorithms are presented in the form of a scalability study and a robustness investigation. The effect of network topology on their performance is studied. A hybrid scheduling approach is also proposed that integrates DLT and DLS algorithms. The hybrid approach combines the strength of DLT and DLS algorithms and improves the performance of the scientific applications running in large scale parallel and distributed computing environments, and delivers performance superior to that which can be obtained by applying DLT algorithms in isolation. The range of conditions for which the hybrid approach is useful is also identified and discussed. 3D torus parallel and distributed computing robustness scalability dynamic loop scheduling divisible load theory arbitrarily divisible workloads data parallel workloads cluster

Search results

Automatic generation of synthetic workloads for multicore systems

Performance Analysis and Evaluation of Divisible Load Theory and Dynamic Loop Scheduling Algorithms in Parallel and Distributed Environments