461 |
A Multiprocessor Architecture Using Modular Arithmetic for Very High Precision Computation
Wu, Henry M. 01 April 1989
We outline a multiprocessor architecture that uses modular arithmetic to implement numerical computation with 900 bits of intermediate precision. A proposed prototype, to be implemented with off-the-shelf parts, will perform high-precision arithmetic as fast as some workstations and minicomputers can perform IEEE double-precision arithmetic. We discuss how the structure of modular arithmetic conveniently maps into a simple, pipelined multiprocessor architecture. We present techniques we developed to overcome a few classical drawbacks of modular arithmetic. Our architecture is suitable for, and essential to, the study of chaotic dynamical systems.
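As a rough illustration of why modular arithmetic maps well onto independent processors, the sketch below represents an integer by its residues modulo several pairwise-coprime moduli, adds and multiplies channel by channel with no carries between channels, and recovers the result with the Chinese Remainder Theorem. The moduli and function names are assumptions chosen for illustration; they are not taken from the thesis, whose prototype targets roughly 900 bits of range.

```python
from math import prod

# Hypothetical pairwise-coprime moduli (251 is prime, 253 = 11*23,
# 255 = 3*5*17, 256 = 2**8). A real design would use many more channels.
MODULI = (251, 253, 255, 256)

def to_residues(x, moduli=MODULI):
    """Encode an integer as one residue per modulus (one per processor)."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    """Channel-wise addition; no carry ever crosses between channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication, equally independent per channel."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))

def from_residues(r, moduli=MODULI):
    """Chinese Remainder Theorem reconstruction, valid for 0 <= x < prod(moduli)."""
    M = prod(moduli)
    total = 0
    for ri, m in zip(r, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse of Mi modulo m (Python 3.8+).
        total += ri * Mi * pow(Mi, -1, m)
    return total % M

a, b = to_residues(12345), to_residues(6789)
print(from_residues(rns_mul(a, b)) == 12345 * 6789)
```

The reconstruction step is the expensive part of such a scheme, which is one reason conversion out of residue form is usually counted among the classical drawbacks of modular arithmetic.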
|
462 |
Software tools for modeling and simulation of on-chip communication architectures
Zhu, Xinping, January 1900
Thesis (Ph. D.)--Princeton University, 2005. / "June 2005." Description based on contents viewed Apr. 11, 2007; title from title screen. Includes bibliographical references (p. 135-147).
|
463 |
Modeling performance and power for energy-efficient GPGPU computing
Hong, Sunpyo 12 November 2012
The objective of the proposed research is to develop an analytical model that predicts performance and power for many-core architectures and to propose a mechanism that leverages the model to enable energy-efficient execution of an application.
The key insight of the model is to investigate and quantify the complex relationship between thread-level parallelism and memory-level parallelism for an application on a given many-core architecture. Two metrics are proposed: memory warp parallelism (MWP), which refers to the number of overlapping memory accesses per core, and computation warp parallelism (CWP), which characterizes an application type. Combining these metrics with the architectural and application parameters yields a prediction of overall application performance. The model uses statically available parameters such as instruction-mixture information and input-data size, and the prediction accuracy is 13.3% for the GPU-computing benchmarks.
Another important aspect of using many-core architectures is reducing peak power and achieving energy savings. Results obtained with the proposed integrated power and performance (IPP) framework show that different optimization points exist for GPU architectures depending on the application type. The work shows that activating fewer cores saves 10.99% of run-time energy consumption for the bandwidth-limited benchmarks, and energy savings of 25.8% are projected when power gating at the core level is employed.
Finally, the model is shifted to throughput using OpenCL to target a wider variety of processors. First, multiple outputs relating to performance are predicted, including upper-bound and lower-bound values. Second, by using the model parameters, an application can be assigned to a category, each with its own suggestions for improving performance and energy efficiency. Third, the accuracy of the bandwidth saturation point is significantly improved by considering independent memory accesses and updating the performance model. Furthermore, trade-off analysis using architectural and application parameters becomes straightforward, which provides more insight into improving energy efficiency.
In the future, a computer system will contain hundreds of heterogeneous cores. Hence, it is essential that a workload be scheduled to an efficient core or distributed across both types of cores. Preliminary work that uses the analytical model to schedule between CPU and GPU is demonstrated in the appendix. Since a profiling phase is not required, the kernel code can be transformed to run more efficiently on the specific architecture. As another extension of the work, the relationship between speed-up and energy efficiency is derived mathematically. Finally, future research ideas are presented on how programmers, compilers, and runtimes can use the model in future heterogeneous systems.
|
464 |
The "Mobius Cube" : an interconnection network for parallel computation
Larson, Shawn M. 26 November 1990
Graduation date: 1991
|
465 |
Cluster partitioning approaches to parallel Monte Carlo simulation on multiprocessors
Ranawake, Udaya A. 23 April 1992
We consider the parallelization of Monte Carlo algorithms for analyzing numerical
models of charge transport used in semiconductor device physics. Parallel
algorithms for the standard k-space Monte Carlo simulation of a three band model
of bulk GaAs on hypercube multicomputers are first presented. This Monte Carlo
model includes scattering due to polar-optical, intervalley, and acoustic phonons, as
well as electron-electron scattering. The k-space Monte Carlo program, excluding
electron-electron scattering, is then extended to simulate a semiconductor device
by the addition of the real space position of each simulated particle and the assignment
of particle charge, using a cloud-in-cell scheme, to solve Poisson's equation
with particle dynamics. Techniques for effectively partitioning this device so as
to balance the computational load while minimizing the communication overhead
are discussed. Approaches for improving the efficiency of the parallel algorithm,
either by dynamically balancing the load or by employing the usual techniques for
enhancing rare events in Monte Carlo simulations, are also considered. The parallel
algorithms were implemented on a 64-node NCUBE multiprocessor and test results
were generated to validate the parallel k-space, as well as the device simulation
programs. Timing measurements were also made to study the variation of speedups
as both the problem size and number of processors are varied.
The effective exploitation of the computational power of message passing
multiprocessors requires the efficient mapping of parallel programs onto processors
so as to balance the computational load while minimizing the communication overhead between processors. A lower bound for this communication volume when
mapping arbitrary task graphs onto distributed processor systems is derived. For
a K processor system this lower bound can be computed from the K (possibly)
largest eigenvalues of the adjacency matrix of the task graph and the eigenvalues
of the adjacency matrix of the processor graph. We also derive the eigenvalues of
the adjacency matrix of the processor graph for a hypercube and give test results
comparing the lower bound for the communication volume with the values given by
a heuristic algorithm for a number of task graphs. / Graduation date: 1992
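The hypercube spectrum mentioned above has a well-known closed form: the adjacency matrix of a d-dimensional hypercube has eigenvalues d - 2k with multiplicity C(d, k), for k = 0..d. The sketch below checks this using only the standard library, by confirming that each Walsh vector is an eigenvector. The function names are illustrative assumptions; the thesis's own derivation and its communication lower bound are not reproduced here.

```python
from collections import Counter
from math import comb

def hypercube_neighbors(x, d):
    """Nodes adjacent to x: those whose label differs from x in exactly one bit."""
    return [x ^ (1 << bit) for bit in range(d)]

def walsh_vector(s, d):
    """Vector with entry (-1)**popcount(x & s) at node x."""
    return [(-1) ** bin(x & s).count("1") for x in range(1 << d)]

def apply_adjacency(v, d):
    """Multiply by the adjacency matrix without building it explicitly."""
    return [sum(v[y] for y in hypercube_neighbors(x, d)) for x in range(len(v))]

def hypercube_spectrum(d):
    """Return {eigenvalue: multiplicity}, verifying each eigenpair on the way."""
    spectrum = Counter()
    for s in range(1 << d):
        v = walsh_vector(s, d)
        lam = d - 2 * bin(s).count("1")
        # A v = lam v must hold entry by entry.
        assert apply_adjacency(v, d) == [lam * c for c in v]
        spectrum[lam] += 1
    return spectrum

# d = 6 corresponds to a 64-node hypercube.
print(sorted(hypercube_spectrum(6).items(), reverse=True))
```

For d = 6 the eigenvalues are 6, 4, 2, 0, -2, -4, -6 with multiplicities 1, 6, 15, 20, 15, 6, 1.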
|
466 |
Scheduling system of affine recurrence equations by means of piecewise affine timing functions
Mui, Lap K. 05 March 1992
Many systematic methods exist for mapping algorithms to processor arrays. The
algorithm is usually specified as a set of recurrence equations, and the processor arrays
are synthesized by finding timing and allocation functions which transform index points
in the recurrences into points in a space-time domain.
The problem of scheduling (i.e. finding the timing function) of recurrence equations
has been studied by a number of researchers. Of particular interest here are Systems of
Affine Recurrence Equations (SAREs). The existing methods are limited to affine (or
linear) schedules over the entire domain of computation. For some algorithms, there are
points in the computation domain where the dependencies point in opposite directions,
and an affine schedule does not exist, although a valid Piecewise Affine Schedule (PAS)
can exist. The objective of this thesis is to examine these schedules and obtain a
systematic method for deriving such schedules for SAREs. PAS can be found by first
partitioning the computation domain and then obtaining a new SARE by renaming the
variables. By partitioning the computation domain, we can obtain additional parallelism
from the dependency graph, and find faster schedules over subspaces of the domain. In
this paper, we describe a procedure for partitioning the domain and generating a new
SARE by renaming the variables. Some heuristics are introduced for partitioning the
domain based on the properties of dependence vectors. After the partitioning and
renaming, an existing method (due to Mauras et al.) is applied to find the schedules.
The Toeplitz system and the Algebraic Path Problem are used as examples to illustrate the results. / Graduation date: 1992
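A toy case may help show why opposing dependencies rule out an affine schedule while a piecewise affine one still works. The sketch below assumes a hypothetical one-dimensional recurrence in which points left of a pivot depend on their right neighbour and points right of it depend on their left neighbour. It is not the thesis's partitioning procedure or the method of Mauras et al.; it only checks candidate timing functions against the dependencies.

```python
def dependencies(i, pivot):
    """Index points that must be computed before point i (toy recurrence)."""
    if i < pivot:
        return [i + 1]   # left half depends on its right neighbour
    if i > pivot:
        return [i - 1]   # right half depends on its left neighbour
    return []            # the pivot itself is an input

def is_valid_schedule(t, n, pivot):
    """A timing function is valid if every point runs strictly after its dependencies."""
    return all(t(i) > t(j) for i in range(n + 1) for j in dependencies(i, pivot))

def affine(a, b=0):
    """Single affine timing function t(i) = a*i + b over the whole domain."""
    return lambda i: a * i + b

def piecewise_affine(pivot):
    """One affine piece per sub-domain: t(i) = pivot - i on the left, i - pivot on the right."""
    return lambda i: pivot - i if i < pivot else i - pivot

n, pivot = 8, 4
print(any(is_valid_schedule(affine(a), n, pivot) for a in range(-5, 6)))  # no slope works
print(is_valid_schedule(piecewise_affine(pivot), n, pivot))               # piecewise works
```

A positive slope violates the left-half dependencies and a negative slope violates the right-half ones, so the domain has to be split at the pivot before an affine schedule exists on each piece.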
|
467 |
Overlay Architectures for FPGA-Based Software Packet Processing
Martin, Labrecque 16 June 2011
Packet processing is the enabling technology of networked information systems
such as the Internet and is usually performed with fixed-function custom-made
ASIC chips. As communication protocols evolve rapidly, there is increasing
interest in adapting features of the processing over time and, since software
is the preferred way of expressing complex computation, we are interested in
finding a platform to execute packet processing software with the best
possible throughput. Because FPGAs are widely used in network equipment and
they can implement processors, we are motivated to investigate executing
software directly on the FPGAs. Off-the-shelf soft processors on FPGA fabric
are currently geared towards performing embedded sequential tasks and, in
contrast, network processing is most often inherently parallel between packet
flows, if not between each individual packet.
Our goal is to allow multiple threads of execution in an FPGA to reach a
higher aggregate throughput than commercially available shared-memory soft
multi-processors via improvements to the underlying soft processor
architecture. We study a number of processor pipeline organizations to
identify which ones can scale to a larger number of execution threads and find
that tuning multithreaded pipelines can provide compact cores with high
throughput. We then perform a design space exploration of multicore soft
systems, compare single-threaded and multithreaded designs to identify
scalability limits and develop processor architectures allowing threads to
execute with as few architectural stalls as possible, in particular with
instruction replay and static hazard detection mechanisms. To further reduce
the wait times, we allow threads to speculatively execute by leveraging
transactional memory. Our multithreaded multiprocessor along with our
compilation and simulation framework makes the FPGA easy to use for an average
programmer who can write an application as a single thread of computation with
coarse-grained synchronization around shared data structures. Compared with
multithreaded processors using lock-based synchronization, we measure up to
57% additional throughput with the use of transactional-memory-based
synchronization. Given our applications, gigabit interfaces and 125 MHz system
clock rate, our results suggest that soft processors can process packets in
software at high throughput and low latency, while capitalizing on the FPGAs
already available in network equipment.
|
468 |
An instruction set simulator for the 8086 16-bit microprocessor
Mapes, Glenn 03 June 2011
The intent of this thesis is to show the usefulness of simulating an instruction set in software and to demonstrate the feasibility of doing so by providing the framework of a simulation program. The design of new computer architectures and computer-based control systems is a trial-and-error process. Normal design practice is to design and build a prototype of the new system and then evaluate the performance of the prototype. Designing complex systems in this manner is very time consuming and expensive; using a software program to simulate the operation of the new system can help solve certain design problems and shorten the development time and effort. The instruction set simulator executes a subset of the 8086 instruction set and contains routines that are useful in debugging the target software. The feasibility of implementing an instruction set simulator to solve certain design problems has been demonstrated by implementing the most commonly used op codes from the 8086 instruction set. / Ball State University, Muncie, IN 47306
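The core of an instruction set simulator is a fetch-decode-execute loop. The sketch below is a minimal one covering a handful of single-byte-opcode 8086 instructions (MOV r16, imm16; INC r16; DEC r16; ADD AX, imm16; NOP; HLT). It is an assumption-laden illustration, not the thesis's program: flags, segment registers, ModRM decoding, and memory operands are all omitted, and the `run` function and its `max_steps` parameter are hypothetical.

```python
# 16-bit register order used by the 8086 opcode encodings.
REG_NAMES = ("AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI")

def run(program, max_steps=1000):
    """Execute a byte sequence and return the final register values."""
    regs = dict.fromkeys(REG_NAMES, 0)
    ip = 0
    for _ in range(max_steps):
        opcode = program[ip]              # fetch
        ip += 1
        if opcode == 0xF4:                # HLT
            break
        elif opcode == 0x90:              # NOP
            pass
        elif 0xB8 <= opcode <= 0xBF:      # MOV r16, imm16 (little-endian immediate)
            imm = program[ip] | (program[ip + 1] << 8)
            ip += 2
            regs[REG_NAMES[opcode - 0xB8]] = imm
        elif 0x40 <= opcode <= 0x47:      # INC r16, wrapping at 16 bits
            name = REG_NAMES[opcode - 0x40]
            regs[name] = (regs[name] + 1) & 0xFFFF
        elif 0x48 <= opcode <= 0x4F:      # DEC r16, wrapping at 16 bits
            name = REG_NAMES[opcode - 0x48]
            regs[name] = (regs[name] - 1) & 0xFFFF
        elif opcode == 0x05:              # ADD AX, imm16
            imm = program[ip] | (program[ip + 1] << 8)
            ip += 2
            regs["AX"] = (regs["AX"] + imm) & 0xFFFF
        else:
            raise NotImplementedError(f"opcode {opcode:#04x} is not simulated")
    regs["IP"] = ip
    return regs

# MOV AX,1234h / ADD AX,1 / INC AX / MOV CX,FFFFh / INC CX / HLT
demo = bytes([0xB8, 0x34, 0x12, 0x05, 0x01, 0x00, 0x40,
              0xB9, 0xFF, 0xFF, 0x41, 0xF4])
print(run(demo))
```

Raising on unknown opcodes rather than skipping them makes gaps in the simulated subset visible while target software is being debugged.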
|
470 |
Optimizing VLIW architectures for multimedia applications
Salamí San Juan, Esther 01 June 2007
The growing interest in multimedia processing over the last decade is motivating processor designers to reconsider which execution paradigms are the most appropriate for general-purpose processors. At the same time, as the size of transistors decreases, power dissipation has become a relevant limitation to increases in the frequency of operation. Thus, the efficient exploitation of the different sources of parallelism is a key point to investigate in order to sustain the performance improvement rate of processors and meet the growing requirements of future multimedia applications. We believe that a promising option arises from the combination of the Very Long Instruction Word (VLIW) and vector processing paradigms together with other ways of exploiting coarser-grain parallelism, such as Chip MultiProcessing (CMP). As part of this thesis, we analyze the problem of memory disambiguation in multimedia applications, as it represents a serious restriction for exploiting Instruction Level Parallelism (ILP) in VLIW architectures. We state that the real handicap for memory disambiguation in multimedia is the extensive use of pointers and indirect references usually found in those codes, together with the limited static information available to the compiler on certain occasions. Based on the observation that the input and output multimedia streams commonly occupy disjoint memory regions, we propose and implement a memory disambiguation technique that dynamically analyzes the region domain of every load and store before entering a loop, evaluates whether or not the full loop is disambiguated, and executes the corresponding loop version. This mechanism does not require any additional hardware or instructions and has a negligible effect on compilation time and code size. The performance achieved is comparable to that of advanced interprocedural pointer analysis techniques, with considerably less software complexity. We also demonstrate that both techniques can be combined to improve performance.
In order to deal with the inherent Data Level Parallelism (DLP) of multimedia kernels without disrupting the existing core designs, major processor manufacturers have chosen to include MMX-like µSIMD extensions. By analyzing the scalability of the DLP and non-DLP regions of code separately in VLIW processors with µSIMD extensions, we observe that the performance of the overall application is dominated by the performance of the non-DLP regions, which in fact exhibit only modest amounts of ILP. As a result, the performance achieved by very wide issue configurations does not compensate for the related cost. To exploit the DLP of the vector regions in a more efficient way, we propose enhancing the µSIMD-VLIW core with conventional vector processing capabilities. The combination of conventional and sub-word level vector processing results in a 2-dimensional extension that combines the best of each one, including a reduction in the number of operations, lower fetch bandwidth requirements, simplicity of the control unit, power efficiency, scalability, and support for multimedia-specific features such as saturation or reduction. This enhancement has a minimal impact on the VLIW core and reaches more parallelism than wider issue µSIMD implementations at a lower cost. Similar proposals have been successfully evaluated for superscalar cores. In this thesis, we demonstrate that 2-dimensional Vector-µSIMD extensions are also effective with static scheduling, allowing for high-performance cost-effective implementations.
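The run-time memory disambiguation described in this abstract can be pictured as a range check placed in front of two versions of the same loop. The sketch below is a simplified illustration with one input and one output region, assuming a hypothetical saturating "brighten" kernel with a non-negative `delta`; the thesis analyzes the region of every load and store, and none of these function names come from it.

```python
def regions_disjoint(a_start, a_len, b_start, b_len):
    """True if [a_start, a_start + a_len) and [b_start, b_start + b_len) do not overlap."""
    return a_start + a_len <= b_start or b_start + b_len <= a_start

def brighten_in_order(mem, src, dst, n, delta):
    """Safe version: each load is followed by its store, in program order."""
    for i in range(n):
        mem[dst + i] = min(mem[src + i] + delta, 255)   # saturate at 255

def brighten_reordered(mem, src, dst, n, delta):
    """Aggressive version: all loads are hoisted above all stores, the kind of
    reordering a static scheduler may apply only when the regions cannot alias."""
    loaded = [mem[src + i] for i in range(n)]
    for i, value in enumerate(loaded):
        mem[dst + i] = min(value + delta, 255)

def brighten(mem, src, dst, n, delta):
    """Check the regions once before the loop, then run the matching version."""
    if regions_disjoint(src, n, dst, n):
        brighten_reordered(mem, src, dst, n, delta)
        return "reordered"
    brighten_in_order(mem, src, dst, n, delta)
    return "in_order"

frame = list(range(16))
print(brighten(frame, 0, 8, 4, 10), frame[8:12])
```

The check costs a few comparisons per loop entry rather than per iteration, which is consistent with the abstract's point that no extra hardware or instructions are needed.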
|