Global ETD Search

1	Improving OpenMP Productivity with Data Locality Optimizations and High-resolution Performance Analysis Muddukrishna, Ananya January 2016 (has links) The combination of high-performance parallel programming and multi-core processors is the dominant approach to meet the ever increasing demand for computing performance today. The thesis is centered around OpenMP, a popular parallel programming API standard that enables programmers to quickly get started with writing parallel programs. However, in contrast to the quickness of getting started, writing high-performance OpenMP programs requires high effort and saps productivity. Part of the reason for impeded productivity is OpenMP’s lack of abstractions and guidance to exploit the strong architectural locality exhibited in NUMA systems and manycore processors. The thesis contributes with data distribution abstractions that enable programmers to distribute data portably in NUMA systems and manycore processors without being aware of low-level system topology details. Data distribution abstractions are supported by the runtime system and leveraged by the second contribution of the thesis – an architecture-specific locality-aware scheduling policy that reduces data access latencies incurred by tasks, allowing programmers to obtain with minimal effort upto 69% improved performance for scientific programs compared to state-of-the-art work-stealing scheduling. Another reason for reduced programmer productivity is the poor support extended by OpenMP performance analysis tools to visualize, understand, and resolve problems at the level of grains– task and parallel for-loop chunk instances. The thesis contributes with a cost-effective and automatic method to extensively profile and visualize grains. Grain properties and hardware performance are profiled at event notifications from the runtime system with less than 2.5% overheads and visualized using a new method called theGrain Graph. The grain graph shows the program structure that unfolded during execution and highlights problems such as low parallelism, work inflation, and poor parallelization benefit directly at the grain level with precise links to problem areas in source code. The thesis demonstrates that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing tools in standard programs from SPEC OMP 2012, Parsec 3.0 and Barcelona OpenMP Tasks Suite (BOTS). Grain profiles are also applied to study the input sensitivity and similarity of BOTS programs. All thesis contributions are assembled together to create an iterative performance analysis and optimization work-flow that enables programmers to achieve desired performance systematically and more quickly than what is possible using existing tools. This reduces pressure on experts and removes the need for tedious trial-and-error tuning, simplifying OpenMP performance analysis. / <p>QC 20151221</p> OpenMP Performance Analysis Scheduling Locality Optimizations
2	Composable, Sound Transformations for Nested Recursion and Loops Kirshanthan Sundararajah (16647885) 26 July 2023 (has links) <p> </p> <p>Programs that use loops to operate over arrays and matrices are generally known as <em>regular programs</em>. These programs appear in critical applications such as image processing, differential equation solvers, and machine learning. Over the past few decades, extensive research has been done on composing, verifying, and applying scheduling transformations like loop interchange and loop tiling for regular programs. As a result, we have general frameworks such as the polyhedral model to handle transformations for loop-based programs. Similarly, programs that use recursion and loops to manipulate pointer-based data structures are known as <em>irregular programs</em>. Irregular programs also appear in essential applications such as scientific simulations, data mining, and graphics rendering. However, there is no analogous framework for recursive programs. In the last decade, although many scheduling transformations have been developed for irregular programs, they are ad-hoc in various aspects, such as being developed for a specific application and lacking portability. This dissertation examines principled ways to handle scheduling transformations for recursive programs through a unified framework resulting in performance enhancement. </p> <p>Finding principled approaches to optimize irregular programs at compile-time is a long-standing problem. We specifically focus on scheduling transformations that reorder a program’s operations to improve performance by enhancing locality and exploiting parallelism. In the first part of this dissertation, we present PolyRec, a unified general framework that can compose and apply scheduling transformations to nested recursive programs and reason about the correctness of composed transformations. PolyRec is a first-of-its-kind unified general transformation framework for irregular programs consisting of nested recursion and loops. It is built on solid theoretical foundations from the world of automata and transducers and provides a fundamentally novel way to think about recursive programs and scheduling transformations for them. The core idea is designing mechanisms to strike a balance between the expressivity in representing the set of dynamic instances of computations, transformations, and dependences and the decidability of checking the correctness of composed transformations. We use <em>multi-tape </em>automata and transducers to represent the set of dynamic instances of computations and transformations, respectively. These machines are similar yet more expressive than their classical single-tape counterparts. While in general decidable properties of classical machines are undecidable for multi-tape machines, we have proven that those properties are decidable for the class of machines we consider, and we present algorithms to verify these properties. Therefore these machines provide the building blocks to compose and verify scheduling transformations for nested recursion and loops. The crux of the PolyRec framework is its regular string-based representation of dynamic instances that allows to lexicographically order instances identically to their execution order. All the transformations considered in PolyRec require different ordering of these strings representable only with <em>additive </em>changes to the strings. </p> <p>Loop transformations such as <em>skewing </em>require performing arithmetic on the representation of dynamic instances. In the second part of this dissertation, we explore this space of transformations by introducing skewing to nested recursion. Skewing plays an essential role in producing easily parallelizable loop nests from seemingly difficult ones due to dependences carried across loops. The inclusion of skewing for nested recursion to PolyRec requires significant extensions to representing dynamic instances and transformations that facilitate <em>performing arithmetic using strings</em>. First, we prove that the machines that represent the transformations are still composable. Then we prove that the representation of dependences and the algorithm that checks the correctness of composed transformations hold with minimal changes. Our new extended framework is known as UniRec, since it resembles the unimodular transformations for perfectly nested loop nests, which consider any combination of the primary transformations interchange, reversal, and skewing. UniRec opens possibilities of producing newly composed transformations for nested recursion and loops and verifying their correctness. We claim that UniRec completely subsumes the unimodular framework for loop transformations since nested recursion is more general than loop nests. </p> High performance computing Performance evaluation Formal methods for software Programming languages Recursion Loops Scheduling Transformations Dependence Testing Locality Optimizations Parallelism Skewing Polyhedral Model

Search results

Improving OpenMP Productivity with Data Locality Optimizations and High-resolution Performance Analysis

Composable, Sound Transformations for Nested Recursion and Loops