1 |
Atomic block formation for explicit data graph execution architecturesMaher, Bertrand Allen 13 December 2010 (has links)
Limits on power consumption, complexity, and on-chip latency have
focused computer architects on power-efficient designs that exploit
parallelism. One approach divides programs into atomic blocks of
operations that execute semi-independently, which efficiently creates
a large window of potentially concurrent operations. This
dissertation studies the intertwined roles of the compiler,
architecture, and microarchitecture in achieving efficiency and high
performance with a block-atomic architecture.
For such an architecture to achieve high performance the compiler must
form blocks effectively. The compiler must create large blocks of
instructions to amortize the per-block overhead, but control flow and
content restrictions limit the compiler's options. Block formation
should consider factors such of frequency of execution, block size
such as selecting control-flow paths that are frequently executed, and
exploiting locality of computations to reduce communication overheads.
This dissertation determines what characteristics of programs
influence block formation and proposes techniques to generate
effective blocks. The first contribution is a method for solving
phase-ordering problems inherent to block formation, mitigating the
tension between block-enlarging optimizations---if-conversion, tail
duplication, loop unrolling, and loop peeling---as well as scalar
optimizations. Given these optimizations, analysis shows that the
remaining obstacles to creating larger blocks are inherent in the
control flow structure of applications, and furthermore that any fixed
block size entails a sizable amount of wasted space. To eliminate
this overhead, this dissertation proposes an architectural
implementation of variable-size blocks that allow the compiler to
dramatically improve block efficiency.
We use these mechanisms to develop policies for block formation that
achieve high performance on a range of applications and processor
configurations. We find that the best policies differ significantly
depending on the number of participating cores. Using machine
learning, we discover generalized policies for particular hardware
configurations and find that the best policy varies significantly
between applications and based on the number of parallel resources
available in the microarchitecture. These results show that effective
and efficient block-atomic execution is possible when the compiler and
microarchitecture are designed cooperatively. / text
|
Page generated in 0.0862 seconds