Global ETD Search

11	Programmer-assisted Automatic Parallelization Huang, Diego 08 December 2011 (has links) Parallel software is now required to exploit the abundance of threads and processors in modern multicore computers. Unfortunately, manual parallelization is too time-consuming and error-prone for all but the most advanced programmers. While automatic parallelization promises threaded software with little programmer effort, current auto-parallelizers are easily thwarted by pointers and other forms of ambiguity in the code. In this dissertation we profile the loops in SPEC CPU2006, categorize the loops in terms of available parallelism, and focus on promising loops that are not parallelized by IBM's XL C/C++ V10 auto-parallelizer. For those loops we propose methods of improved interaction between the programmer and compiler that can facilitate their parallelization. In particular, we (i) suggest methods for the compiler to better identify to the programmer the parallelization-blockers; (ii) suggest methods for the programmer to provide guarantees to the compiler that overcome these parallelization-blockers; and (iii) evaluate the resulting impact on performance. compiler automatic parallelization programmer guarantees 0984 0537
12	Efficient design-space exploration of custom instruction-set extensions Zuluaga, Marcela January 2010 (has links) Customization of processors with instruction set extensions (ISEs) is a technique that improves performance through parallelization with a reasonable area overhead, in exchange for additional design effort. This thesis presents a collection of novel techniques that reduce the design effort and cost of generating ISEs by advancing automation and reconfigurability. In addition, these techniques maximize the performance gained as a function of the additional committed resources. Including ISEs into a processor design implies development at many levels. Most prior works on ISEs solve separate stages of the design: identification, selection, and implementation. However, the interactions between these stages also hold important design trade-offs. In particular, this thesis addresses the lack of interaction between the hardware implementation stage and the two previous stages. Interaction with the implementation stage has been mostly limited to accurately measuring the area and timing requirements of the implementation of each ISE candidate as a separate hardware module. However, the need to independently generate a hardware datapath for each ISE limits the flexibility of the design and the performance gains. Hence, resource sharing is essential in order to create a customized unit with multi-function capabilities. Previously proposed resource-sharing techniques aggressively share resources amongst the ISEs, thus minimizing the area of the solution at any cost. However, it is shown that aggressively sharing resources leads to large ISE datapath latency. Thus, this thesis presents an original heuristic that can be parameterized in order to control the degree of resource sharing amongst a given set of ISEs, thereby permitting the exploration of the existing implementation trade-offs between instruction latency and area savings. 
In addition, this thesis introduces an innovative predictive model that is able to quickly expose the optimal trade-offs of this design space. Compared to an exhaustive exploration of the design space, the predictive model is shown to reduce by two orders of magnitude the number of executions of the resource-sharing algorithm that are required in order to find the optimal trade-offs. This thesis presents a technique that is the first one to combine the design spaces of ISE selection and resource sharing in ISE datapath synthesis, in order to offer the designer solutions that achieve maximum speedup and maximum resource utilization using the available area. Optimal trade-offs in the design space are found by guiding the selection process to favour ISE combinations that are likely to share resources with low speedup losses. Experimental results show that this combined approach unveils new trade-offs between speedup and area that are not identified by previous selection techniques; speedups of up to 238% over previous selection techniques were obtained. Finally, multi-cycle ISEs can be pipelined in order to increase their throughput. However, it is shown that traditional ISE identification techniques do not allow this optimization due to control flow overhead. In order to obtain the benefits of overlapping loop executions, this thesis proposes to carefully insert loop control flow statements into the ISEs, thus allowing the ISE to control the iterations of the loop. The proposed ISEs broaden the scope of instruction-level parallelism and obtain higher speedups compared to traditional ISEs, primarily through pipelining, the exploitation of spatial parallelism, and reducing the overhead of control flow statements and branches. A detailed case study of a real application shows that the proposed method achieves 91% higher speedups than the state-of-the-art, with an area overhead of less than 8% in hardware implementation. 004.1
13	Parallel Stochastic Estimation on Multicore Platforms Rosén, Olov January 2015 (has links) The main part of this thesis concerns parallelization of recursive Bayesian estimation methods, both linear and nonlinear. Recursive estimation deals with the problem of extracting information about parameters or states of a dynamical system, given noisy measurements of the system output, and plays a central role in signal processing, system identification, and automatic control. Solving the recursive Bayesian estimation problem is known to be computationally expensive, which often makes the methods infeasible in real-time applications and problems of large dimension. As the computational power of the hardware is today increased by adding more processors on a single chip rather than increasing the clock frequency and shrinking the logic circuits, parallelization is one of the most powerful ways of improving the execution time of an algorithm. It has been found in the work of this thesis that several of the optimal filtering methods are suitable for parallel implementation, in certain ranges of problem sizes. For many of the suggested parallelizations, a linear speedup in the number of cores has been achieved, providing up to 8 times speedup on a double quad-core computer. As the evolution of the parallel computer architectures is unfolding rapidly, many more processors on the same chip will soon become available. The developed methods do not, of course, scale infinitely, but definitely can exploit and harness some of the computational power of the next generation of parallel platforms, allowing for optimal state estimation in real-time applications. / CoDeR-MP Recursive estimation Parallelization Bayesian estimation Anomaly detection
14	Parallel Particle Swarm Optimization and Large Swarms McNabb, Andrew W. 27 January 2011 (has links) (PDF) Optimization is the search for the maximum or minimum of a given objective function. Particle Swarm Optimization (PSO) is a simple and effective evolutionary algorithm, but it may take hours or days to optimize difficult objective functions which are deceptive or expensive. Deceptive functions may be highly multimodal and multidimensional, and PSO requires extensive exploration to avoid being trapped in local optima. Expensive functions, whose computational complexity may arise from dependence on detailed simulations or large datasets, take a long time to evaluate. For deceptive or expensive objective functions, PSO must be parallelized to use multiprocessor systems and clusters efficiently. This thesis investigates the implications of parallelizing PSO and in particular, the details of parallelization and the effects of large swarms. PSO can be expressed naturally in Google's MapReduce framework to develop a simple and robust parallel implementation that automatically includes communication, load balancing, and fault tolerance. This flexible implementation makes it easy to apply modifications to the algorithm, such as those that improve optimization of difficult objective functions and improve parallel performance. Results show that larger swarms help with both of these goals, but they are most effective if arranged into sparse topologies with lower overhead from communication. Additionally, PSO must be modified to use communication more efficiently in a large sparse swarm for objective functions where information ideally flows quickly through a large swarm. Swarm size is usually fixed at a modest number around 50, but particularly in a parallel computational environment, much larger swarms are much more effective for deceptive objective functions. Likewise, swarms much smaller than 50 are more effective for expensive but less deceptive functions. 
In general, swarm size should be carefully chosen using all available information about the objective function and computational environment. particle swarm optimization parallelization Computer Sciences
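As a rough illustration of the algorithm summarized in record 14, the following is a minimal sequential PSO sketch in Python. It is not the thesis's MapReduce implementation: the function names (`sphere`, `pso_minimize`), the fully connected topology, the coefficient values, and the test objective are all assumptions chosen for illustration. The comments only mark which step is independent per particle (map-like) and which step shares information across the swarm (reduce-like).

```python
import random


def sphere(x):
    """Simple convex test objective (an assumption, not from the thesis)."""
    return sum(v * v for v in x)


def pso_minimize(f, dim=2, swarm_size=20, iterations=100, seed=0):
    """Global-best PSO with commonly used constriction-style coefficients."""
    rng = random.Random(seed)
    w, c1, c2 = 0.729, 1.49445, 1.49445  # illustrative parameter choice
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]

    for _ in range(iterations):
        # Reduce-like step: share the best personal best across the swarm.
        g = min(range(swarm_size), key=pbest_val.__getitem__)
        gbest = pbest[g][:]
        # Map-like step: each particle moves and is evaluated independently.
        for i in range(swarm_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest_val[i] = val
                pbest[i] = pos[i][:]

    g = min(range(swarm_size), key=pbest_val.__getitem__)
    return pbest[g], pbest_val[g]


best_position, best_value = pso_minimize(sphere)
```

In a sparse topology, the reduce-like step would combine bests only among each particle's neighbors rather than over the whole swarm, which is where the communication savings described in the abstract would come from.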
15	Optimizing locality and parallelism through program reorganization Krishnamoorthy, Sriram 07 January 2008 (has links) No description available. Computer Science data locality parallelization out-of-core
16	Compile-Time Characterization of Recurrent Patterns in Irregular Computations Singri, Arjun Jagadeesh 03 September 2010 (has links) No description available. Computer Science static analysis llvm parallelization
17	Development and Acceleration of Parallel Chemical Transport Models Eller, Paul Ray 03 August 2009 (has links) Improving chemical transport models for atmospheric simulations relies on future developments of mathematical methods and parallelization methods. Better mathematical methods allow simulations to more accurately model realistic processes and/or to run in a shorter amount of time. Parallelization methods allow simulations to run in much shorter amounts of time, therefore allowing scientists to use more accurate or more detailed simulations (higher resolution grids, smaller time steps). The state-of-the-science GEOS-Chem model is modified to use the Kinetic Pre-Processor (KPP), giving users access to an array of highly efficient numerical integration methods and to a wide variety of user options. Perl parsers are developed to interface GEOS-Chem with KPP, in addition to modifications to KPP allowing KPP integrators to interface with GEOS-Chem. A variety of different numerical integrators are tested on GEOS-Chem, demonstrating that KPP-provided chemical integrators produce more accurate solutions in a given amount of time than the original GEOS-Chem chemical integrator. The STEM chemical transport model provides a large scale end-to-end application to experiment with running chemical integration methods and transport methods on GPUs. GPUs provide high computational power at a fairly cheap cost. The CUDA programming environment simplifies the GPU development process by providing access to powerful functions to execute parallel code. This work demonstrates the acceleration of a large scale end-to-end application on GPUs showing significant speedups. This is achieved by implementing all relevant kernels on the GPU using CUDA. Nevertheless, further improvements to GPUs are needed to allow these applications to fully exploit the power of GPUs. / Master of Science KPP GEOS-Chem STEM Parallelization GPU CUDA
18	Run-time optimization of adaptive irregular applications Yu, Hao 15 November 2004 (has links) Compared to traditional compile-time optimization, run-time optimization could offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of input data, programs' dynamic behaviors, and the underlying execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on programs' input data and their dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of programs' dynamic memory access patterns and the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework to automatically and adaptively select at run-time the best performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line automatically generated prediction models and characteristics (collected and analyzed dynamically) of the algorithm's input data. In this dissertation, we specialize this framework for automatic selection of reduction algorithms. In this research, we have identified a small set of machine independent high-level characterization parameters and then we deployed an off-line, systematic experiment process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors. 
Specifically, for the reduction algorithm selection, the selected performance is within 2% of optimal performance and on average is 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by the OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses efficient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems. compiler optimizations adaptive optimization performance modeling run-time parallelization run-time optimization reduction parallelization
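To make the baseline named in record 18 more concrete, here is a hedged Python sketch of an irregular reduction of the form `out[indices[k]] += values[k]`, parallelized in a replicated-buffer style. It assumes that "Replicated Buffer" means each worker accumulates into its own private full-size copy of the output, with the copies merged at the end; the function names and the use of `ThreadPoolExecutor` are illustrative choices, not the dissertation's compiler implementation. Because of CPython's global interpreter lock, this shows only the structure of the scheme, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_reduction(indices, values, size):
    """Reference version: one shared output array, updated in order."""
    out = [0] * size
    for i, v in zip(indices, values):
        out[i] += v
    return out


def replicated_buffer_reduction(indices, values, size, workers=4):
    """Each worker fills a private copy of the output; copies are then merged."""
    n = len(indices)
    chunk = max(1, (n + workers - 1) // workers)

    def accumulate(start):
        private = [0] * size  # private replica, so no write conflicts
        for k in range(start, min(start + chunk, n)):
            private[indices[k]] += values[k]
        return private

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(accumulate, range(0, n, chunk)))

    # Merge step: its cost grows with size * number of replicas, which is
    # why this scheme can be a poor fit for sparse access patterns.
    out = [0] * size
    for private in partials:
        for j in range(size):
            out[j] += private[j]
    return out


indices = [0, 3, 3, 1, 0, 2, 3, 1]
values = [1, 2, 3, 4, 5, 6, 7, 8]
result = replicated_buffer_reduction(indices, values, size=4)
```

The merge cost noted in the comment is one plausible reason an adaptive selector could prefer a different reduction algorithm when only a few output elements are touched.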
20	Automated Recognition of Algorithmic Patterns in DSP Programs Shafiee Sarvestani, Amin January 2011 (has links) We introduce an extensible knowledge-based tool for idiom (pattern) recognition in DSP (digital signal processing) programs. Our tool utilizes functionality provided by the Cetus compiler infrastructure for detecting certain computation patterns that frequently occur in DSP code. We focus on recognizing patterns for for-loops and statements in their bodies, as these often are the performance-critical constructs in DSP applications for which replacement by highly optimized, target-specific parallel algorithms will be most profitable. For better structuring and efficiency of pattern recognition, we classify patterns by different levels of complexity such that patterns in higher levels are defined in terms of lower-level patterns. The tool works statically on the intermediate representation (IR). It traverses the abstract syntax tree IR in post-order and applies bottom-up pattern matching, at each IR node utilizing information about the patterns already matched for its children or siblings. For better extensibility and abstraction, most of the structural part of recognition rules is specified in XML form to separate the tool implementation from the pattern specifications. Information about detected patterns will later be used for optimized code generation by local algorithm replacement, e.g. for the low-power high-throughput multicore DSP architecture ePUMA. Automatic Parallelization Algorithmic Pattern Recognition Cetus DSP DSP Code Parallelization Compiler Frameworks Computer Sciences Datavetenskap (datalogi)
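The post-order, bottom-up matching described in record 20 can be sketched as follows. This is only an analogy under stated assumptions: it uses Python's standard `ast` module on Python source as a stand-in for the Cetus IR of a C program, and the three pattern names and the function `find_patterns` are hypothetical. It shows how a higher-level pattern can be defined in terms of labels already assigned to child nodes.

```python
import ast


def find_patterns(source):
    """Label AST nodes bottom-up and return the set of pattern names found."""
    labels = {}  # maps AST node -> name of the pattern matched at that node

    def visit(node):
        # Post-order: children are labeled before their parent is examined.
        for child in ast.iter_child_nodes(node):
            visit(child)

        # Level 1: an array element read such as a[i].
        if isinstance(node, ast.Subscript) and isinstance(node.ctx, ast.Load):
            labels[node] = "ARRAY_READ"
        # Level 2: scalar += <ARRAY_READ>, defined via the child's label.
        elif (isinstance(node, ast.AugAssign)
              and isinstance(node.op, ast.Add)
              and isinstance(node.target, ast.Name)
              and labels.get(node.value) == "ARRAY_READ"):
            labels[node] = "ACCUMULATE"
        # Level 3: a for-loop whose body is exactly one ACCUMULATE statement.
        elif (isinstance(node, ast.For)
              and len(node.body) == 1
              and labels.get(node.body[0]) == "ACCUMULATE"):
            labels[node] = "SUM_REDUCTION"

    visit(ast.parse(source))
    return sorted(set(labels.values()))


loop = "for i in range(n):\n    s += a[i]\n"
patterns = find_patterns(loop)
```

A loop labeled `SUM_REDUCTION` is the kind of construct that a later code-generation stage could replace with an optimized routine. In the tool described above, the structural part of such rules is said to live in XML rather than in code, which this sketch does not attempt to reproduce.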
