151.
PROTEUS, a microprogrammable, multiprocessor computer. Kesselman, Joseph Jay. January 1982.
Thesis (B.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1982. Microfiche copy available in Archives and Engineering. By Joseph Jay Kesselman Jr.
152.
A transparent and energy aware reconfigurable multiprocessor platform for efficient ILP and TLP exploitation. Rutzig, Mateus Beck. January 2012.
As the number of embedded applications increases, the current strategy of several companies is to launch a new platform within a short period to execute the application set more efficiently, with low energy consumption. However, each new platform deployment requires a new tool chain, with additional libraries, debuggers and compilers. This strategy implies high hardware redesign costs, breaks binary compatibility and imposes a high overhead on the software development process. Therefore, focusing on area savings, low energy consumption, maintenance of binary compatibility and, above all, improved software productivity, we propose the exploitation of Custom Reconfigurable Arrays for Multiprocessor Systems (CReAMS). CReAMS is composed of multiple adaptive reconfigurable systems that efficiently exploit Instruction- and Thread-Level Parallelism (ILP and TLP) at the hardware level, in a totally transparent fashion. Conceived as a homogeneous organization, CReAMS shows a 37% reduction in energy-delay product (EDP) compared to an ordinary multiprocessing platform of the same chip area. When processors with different ILP-exploitation capabilities are coupled on a single die, conceiving CReAMS as a heterogeneous organization, performance improvements of up to 57% and energy savings of up to 36% are shown in comparison with the homogeneous platform. In addition, the efficiency of the adaptability provided by CReAMS is demonstrated in a comparison against a multiprocessing system composed of 4-issue out-of-order SparcV8 processors: 28% performance improvement is shown under a power budget scenario.
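For readers unfamiliar with the metric, the energy-delay product arithmetic behind figures like the 37% above is easy to reproduce (a minimal sketch with invented factors, not numbers taken from the thesis):

```python
# Energy-delay product (EDP) = energy * execution time; lower is better.
# Hypothetical factors: a platform running 1.15x faster at 0.72x the
# energy of the baseline (invented for illustration).
baseline_energy, baseline_delay = 1.0, 1.0
energy, delay = 0.72, 1.0 / 1.15

edp_baseline = baseline_energy * baseline_delay
edp = energy * delay
reduction = 1.0 - edp / edp_baseline
print(f"EDP reduction: {reduction:.0%}")  # ~37%
```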
153.
Methodology for Accurate Speedup Prediction. Chittor, Aruna. 09 December 1994.
The effective use of computational resources requires a good understanding of parallel architectures and algorithms. The effect of the parallel architecture and also the parallel application on the performance of the parallel systems becomes more complex with increasing numbers of processors. We will address this issue in this thesis, and develop a methodology to predict the overall execution time of a parallel application as a function of the system and problem size by combining simple analysis with a few experimental results. We show that runtimes and speedup can be predicted more accurately by analyzing the functional forms of the sequential and parallel times of critical code segments of a parallel application that affect the speedup of a parallel program. We then combine the functional forms to model the runtime of a parallel application. A small set of experiments are sufficient to get a good estimate of the coefficients for the runtime models obtained. Speedup can then be derived for any case from the runtime model. We also analyze the effect of the 1/0 on runtimes in memory bounded parallel systems, and how speedup is affected by communication and 1/0. Throughout the thesis we use the bitonic merge sort as a typical realistic parallel application to illustrate our methodology. Several variations of the sorting algorithm (such as problem size greater than or equal to the system size, unlimited or limited buffer size) suitable for a wide range of problem sizes are implemented in two parallel environments and the speedups for them are measured and compared with different speedup predictions. We have conducted numerous experiments using Multi and PVM to empirically study speedup for the different realistic implementations of the bitonic merge sort. The results show how well the various models predicted speedup, and that our methodology can predict speedup accurately for a given parallel application. One interesting value from a speedup curve is the roll-off point - the system size beyond which speedup actually decreases when the number of processors is increased. Results show that simple theoretic models predicted roll-off point to be higher than the actual values, where as our methodology predicted it to be less than and closer to the actual values. The predictions by our methodology also compare well with the speedup estimates provided by the Multi tool.
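The core of the methodology (fit the functional forms of a runtime model from a few experiments, then derive speedup and the roll-off point analytically) can be sketched as follows; the model form and measurements below are illustrative assumptions, not the thesis's actual data:

```python
import numpy as np

# Assumed runtime model for a size-n sort on p processors:
#   T(n, p) = a * n*log2(n)/p  (parallel compute)  +  b * p  (communication)
def features(n, p):
    return np.array([n * np.log2(n) / p, p])

# A few (n, p, measured_time) experiments -- synthetic numbers for illustration.
runs = [(1 << 16, 2, 5.3), (1 << 16, 4, 2.9), (1 << 18, 4, 12.1), (1 << 18, 8, 6.6)]
X = np.array([features(n, p) for n, p, _ in runs])
t = np.array([time for _, _, time in runs])
a, b = np.linalg.lstsq(X, t, rcond=None)[0]  # least-squares coefficient fit

# Roll-off point: dT/dp = 0  =>  p* = sqrt(a * n*log2(n) / b)
n = 1 << 18
p_rolloff = np.sqrt(a * n * np.log2(n) / b)
# Speedup relative to the model's one-processor compute time.
speedup = lambda p: (a * n * np.log2(n)) / (a * n * np.log2(n) / p + b * p)
print(f"predicted roll-off near p = {p_rolloff:.1f}, "
      f"speedup there = {speedup(p_rolloff):.1f}")
```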
154.
ADAPT: architectural and design exploration for application specific instruction-set processor technologies. Shee, Seng Lin. Computer Science & Engineering, Faculty of Engineering, UNSW. January 2007.
This thesis presents design automation methodologies for extensible processor platforms in application-specific domains. The work first presents a single-processor approach to customization: a methodology that can rapidly create different processor configurations by removing unused instruction sets from the architecture. A profile-directed approach is used to identify frequently used instructions and to eliminate unused opcodes from the available instruction pool. A coprocessor approach is next explored to create an SoC (System-on-Chip) that speeds up the application while reducing energy consumption. Loops in applications are identified and accelerated by tightly coupling a coprocessor to an ASIP (Application Specific Instruction-set Processor). Latency hiding is used to exploit the parallelism provided by this architecture. A case study has been performed on a JPEG encoding algorithm, comparing two different coprocessor approaches: a high-level synthesis approach and our custom coprocessor approach. The thesis concludes by introducing a heterogeneous multiprocessor system that uses ASIPs as processing entities in a pipeline configuration. The problem of mapping each algorithmic stage of the system to an ASIP configuration is formulated. We propose an estimation technique to calculate runtimes of the configured multiprocessor system without running cycle-accurate simulations, which can take a significant amount of time. We present two heuristics to efficiently search the design space of a pipeline-based multi-ASIP system and compare the results against an exhaustive approach. With our first approach, we show that, on average, processor size can be reduced by 30% and energy consumption by 24%, while performance is improved by 24%. In the coprocessor approach, compared with the use of a main processor alone, a loop performance improvement of 2.57x is achieved using the custom coprocessor approach, against 1.58x for the high-level synthesis method and 1.33x for the customized-instruction approach. Energy savings are 57%, 28% and 19%, respectively. Our multiprocessor design provides a performance improvement of at least 4.03x for JPEG and 3.31x for MP3 over a single-processor design. The minimum cost obtained using our heuristics was within 0.43% and 0.29% of the optimum values for the JPEG and MP3 benchmarks, respectively.
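The runtime-estimation idea for the pipelined multi-ASIP system can be pictured with a simple bottleneck model (a sketch under assumed stage latencies; the thesis's estimator is presumably more detailed):

```python
# Per-stage latencies (cycles per data item) for one candidate assignment of
# algorithm stages to ASIP configurations -- invented numbers for illustration.
stage_cycles = [1200, 2100, 900, 1750]  # a 4-stage pipeline
n_items = 10_000

# Filling the pipeline costs the sum of stage latencies once; in steady
# state the slowest stage dominates: one item completes per max(stage_cycles).
fill = sum(stage_cycles)
steady_state = (n_items - 1) * max(stage_cycles)
estimated_cycles = fill + steady_state
print(f"estimated runtime: {estimated_cycles} cycles "
      f"(bottleneck stage: {max(stage_cycles)} cycles/item)")
```

An estimate like this lets a design-space heuristic rank candidate stage-to-ASIP mappings in microseconds rather than running a cycle-accurate simulation for each one.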
155.
Optimizing performance/watt of embedded SIMD multiprocessors through a priori application-guided power scheduling. Albright, Ryan K. 20 April 2012.
A method for improving performance/watt of an embedded single-instruction multiple-data (SIMD) architecture using application-guided a priori scheduling of hardware resources is presented. A multi-core architectural simulator is adopted that accurately estimates the power, performance, and utilization of the various processor components (logic, interconnect and memory). A greedy search is then performed on each algorithm block of a signal processing chain in order to schedule each component's throughput and power. The proposed software-directed hardware rebalancing, applied to a typical electroencephalography (EEG) filtering chain, is analyzed for two different SIMD architectures. The first, representing a super-threshold (super-V_th) processor, demonstrates a 51%-86% improvement in performance/watt at a 1%-10% throughput reduction using block-level or algorithm-level a priori scheduling. The second architecture, Synctium, a near-threshold (near-V_th) processor, demonstrates a 50%-99% performance/watt improvement across the same throughput-reduction range and optimization techniques. Graduation date: 2012.
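A greedy block-level search of the kind described might look like the following sketch. The component knobs, throughput/power factors, and the additive power model are all hypothetical, standing in for the simulator-driven search of the thesis:

```python
# Candidate settings per component: (throughput_factor, power_factor) pairs,
# invented for illustration. Factor 1.0 = component at full speed/power.
components = {
    "logic":        [(1.00, 1.00), (0.98, 0.80), (0.95, 0.65)],
    "interconnect": [(1.00, 1.00), (0.99, 0.85), (0.96, 0.70)],
    "memory":       [(1.00, 1.00), (0.97, 0.75)],
}

def schedule_block(max_throughput_loss=0.05):
    """Greedily lower one component setting at a time while the block's
    throughput reduction stays within the allowed budget."""
    choice = {name: 0 for name in components}
    tp = lambda c: min(components[n][i][0] for n, i in c.items())  # slowest part
    pw = lambda c: sum(components[n][i][1] for n, i in c.items())  # additive power
    improved = True
    while improved:
        improved = False
        best = None
        for name, opts in components.items():
            if choice[name] + 1 < len(opts):
                trial = dict(choice, **{name: choice[name] + 1})
                if 1.0 - tp(trial) <= max_throughput_loss:
                    score = tp(trial) / pw(trial)  # performance per watt
                    if best is None or score > best[0]:
                        best = (score, trial)
        if best and best[0] > tp(choice) / pw(choice):
            choice = best[1]
            improved = True
    return choice, tp(choice), pw(choice)

print(schedule_block())  # chosen settings, throughput factor, power factor
```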
156.
Software tools for modeling and simulation of on-chip communication architectures. Zhu, Xinping. 2005. (PDF)
Thesis (Ph.D.)--Princeton University, 2005. "June 2005." Description based on contents viewed Apr. 11, 2007; title from title screen. Includes bibliographical references (p. 135-147).
157.
Cluster partitioning approaches to parallel Monte Carlo simulation on multiprocessors. Ranawake, Udaya A. 23 April 1992.
We consider the parallelization of Monte Carlo algorithms for analyzing numerical models of charge transport used in semiconductor device physics. Parallel algorithms for the standard k-space Monte Carlo simulation of a three-band model of bulk GaAs on hypercube multicomputers are first presented. This Monte Carlo model includes scattering due to polar-optical, intervalley, and acoustic phonons, as well as electron-electron scattering. The k-space Monte Carlo program, excluding electron-electron scattering, is then extended to simulate a semiconductor device by adding the real-space position of each simulated particle and assigning particle charge, using a cloud-in-cell scheme, to solve Poisson's equation with particle dynamics. Techniques for effectively partitioning this device so as to balance the computational load while minimizing the communication overhead are discussed. Approaches for improving the efficiency of the parallel algorithm, either by dynamic load balancing or by employing the usual techniques for enhancing rare events in Monte Carlo simulations, are also considered. The parallel algorithms were implemented on a 64-node NCUBE multiprocessor, and test results were generated to validate the parallel k-space as well as the device simulation programs. Timing measurements were also made to study the variation of speedups as both the problem size and the number of processors are varied.
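The cloud-in-cell charge-assignment step mentioned above is compact enough to sketch. This is a 1-D illustration with assumed grid parameters; the actual device code operates on the real-space mesh of the simulated device:

```python
import numpy as np

def assign_charge_cic(positions, charge, n_cells, dx):
    """1-D cloud-in-cell: each particle's charge is split linearly between
    the two grid points bracketing it, giving a smoother charge density
    than nearest-grid-point assignment."""
    rho = np.zeros(n_cells)
    cell = np.floor(positions / dx).astype(int)          # left grid point
    frac = positions / dx - cell                         # fractional offset
    np.add.at(rho, cell, charge * (1.0 - frac))          # weight to left point
    np.add.at(rho, (cell + 1) % n_cells, charge * frac)  # weight to right point
    return rho / dx                                      # charge per unit length

# Example: 1000 unit charges on a 64-cell periodic grid with dx = 1.0.
rng = np.random.default_rng(0)
rho = assign_charge_cic(rng.uniform(0, 64, size=1000), 1.0, n_cells=64, dx=1.0)
print(rho.sum())  # total charge / dx = 1000.0
```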
The effective exploitation of the computational power of message-passing multiprocessors requires the efficient mapping of parallel programs onto processors so as to balance the computational load while minimizing the communication overhead between processors. A lower bound on this communication volume when mapping arbitrary task graphs onto distributed processor systems is derived. For a K-processor system this lower bound can be computed from the K (possibly) largest eigenvalues of the adjacency matrix of the task graph and the eigenvalues of the adjacency matrix of the processor graph. We also derive the eigenvalues of the adjacency matrix of the processor graph for a hypercube, and give test results comparing the lower bound on the communication volume with the values given by a heuristic algorithm for a number of task graphs. Graduation date: 1992.
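The hypercube part of that derivation can be checked numerically. The closed form used below (eigenvalues d - 2i with multiplicity C(d, i) for a d-dimensional hypercube) is the standard graph-spectrum result, stated here for illustration rather than quoted from the thesis:

```python
import numpy as np
from math import comb

d = 6          # 64-node hypercube, matching the NCUBE above
n = 1 << d

# Adjacency matrix: nodes u, v are neighbors iff they differ in exactly one bit.
A = np.zeros((n, n))
for u in range(n):
    for bit in range(d):
        A[u, u ^ (1 << bit)] = 1

eig = np.sort(np.linalg.eigvalsh(A))[::-1]

# Closed form: eigenvalue d - 2i with multiplicity C(d, i), for i = 0..d.
expected = np.sort(np.concatenate(
    [np.full(comb(d, i), d - 2 * i) for i in range(d + 1)]))[::-1]
print(np.allclose(eig, expected))  # True
```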
158.
Data Processing Techniques on Modern Hardware Architectures. Tsirogiannis, Dimitrios. 31 August 2011.
The last decade has been characterized by radical changes in the computing landscape. We have witnessed the advent of multi-core processors, flash-based storage systems, and the proliferation of scale-out architectures, such as map-reduce-based systems and massively parallel databases. Although data management systems have embraced modern hardware technologies to some extent, they have not realized their full potential.

The goal of this thesis is two-fold. Primarily, it demonstrates the staggering potential for performance improvement offered by modern hardware architectures and then proposes how data management systems must change in order to realize this potential. Additionally, this thesis demonstrates that utilizing modern hardware architectures is important both for performance and for energy efficiency. Towards this goal, we propose query processing and indexing techniques for chip multiprocessors and analyze the trade-offs of executing complex database queries on modern processor technologies. Subsequently, we propose query processing methods tailored to flash-based storage systems. Finally, we analyze the power consumption of database systems and reveal opportunities for improving their energy efficiency.
160.
Technology Impacts of CMOS Scaling on Microprocessor Core Design for Hard-Fault Tolerance in Single-Core Applications and Optimized Throughput in Throughput-Oriented Chip Multiprocessors. Bower, Fred. January 2010.
The continued march of technological progress, epitomized by Moore's Law, provides the microarchitect with increasing numbers of transistors to employ as feature geometries continue to shrink. Physical limitations impose new constraints upon designers in the areas of overall power and localized power density. Techniques to scale threshold and supply voltages to lower values in order to reduce the power consumption of a part have also run into physical limitations, exacerbating power and cooling problems in deep sub-micron CMOS process generations. Smaller device geometries are also subject to increased sensitivity to common failure modes, as well as to manufacturing process variability.

In the face of these added challenges, we observe a shift in the focus of the industry, away from building ever-larger single-core chips focused on reducing single-threaded latency, toward a design approach that employs multiple cores on a single chip to improve throughput. While the early multicore era utilized the existing single-core designs of the previous generation in small numbers, subsequent generations have introduced cores tailored to multicore use. These cores seek to achieve power-efficient throughput and have led to a new emphasis on throughput-oriented computing, particularly for Internet workloads, where the end-to-end computational task is dominated by long-latency network operations. The ubiquity of these workloads makes a compelling argument for throughput-oriented designs, but does not free the microarchitect fully from the latency demands of common workloads in the enterprise and desktop application spaces.

We believe that a continued need for both throughput-oriented and latency-sensitive processors will exist in coming generations of technology. We further opine that making effective use of the additional transistors that will be available may require different techniques for latency-sensitive designs than for throughput-oriented ones, since we may trade latency or throughput for the desired attribute of a core in each of the respective paradigms.

We make three major contributions with this thesis. Our first contribution is a fine-grained fault diagnosis and deconfiguration technique for array structures, such as the ROB, within the microprocessor core. We present and evaluate two variants of this technique. The first variant uses an existing fault detection and correction technique, whose scope is the processor core execution pipeline, to ensure correct processor operation. The second variant integrates fault detection and correction into the array structure itself to provide a self-contained, fine-grained fault detection, diagnosis, and repair technique.
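A toy model of the array deconfiguration idea (a hypothetical sketch, not the thesis's hardware mechanism): keep a fault map beside the array and let allocation skip entries that diagnosis has retired, trading a little capacity for continued correct operation.

```python
class DeconfigurableArray:
    """Toy reorder-buffer-like circular array with a per-entry fault map.
    Entries diagnosed as faulty are deconfigured: allocation simply skips
    them, so the structure keeps working at slightly reduced capacity."""
    def __init__(self, size):
        self.size = size
        self.faulty = [False] * size   # fault map filled in by diagnosis
        self.in_use = [False] * size
        self.head = 0

    def deconfigure(self, index):
        self.faulty[index] = True      # permanently retire a bad entry

    def allocate(self):
        for offset in range(self.size):
            i = (self.head + offset) % self.size
            if not self.faulty[i] and not self.in_use[i]:
                self.in_use[i] = True
                self.head = (i + 1) % self.size
                return i
        return None  # full; capacity shrinks as entries are retired

rob = DeconfigurableArray(8)
rob.deconfigure(3)                         # diagnosis flagged entry 3
print([rob.allocate() for _ in range(8)])  # entry 3 is never handed out
```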
In our second contribution, we develop a lightweight, fine-grained fault diagnosis mechanism for the processor core. In this work, we leverage the first contribution's methods to provide deconfiguration of faulty array elements. We additionally extend the scope of that work to include all pipeline circuitry from instruction issue to retirement.

In our third and final contribution, we focus on the data cache design of throughput-oriented cores. In this work, we study the demands of a throughput-oriented core running a representative workload, and then propose and evaluate an alternative data cache implementation that more closely matches the demands of the core. We then show that a better-matched cache design can be exploited to provide improved throughput under a fixed power budget.

Our results show that typical latency-sensitive cores have sufficient redundancy to make fine-grained hard-fault tolerance an affordable alternative for hardening complex designs. Our designs suffer little or no performance loss when no faults are present, and retain nearly the same performance characteristics in the presence of small numbers of hard faults in protected structures. In our study of the latency-sensitive core, we have shown that SRAM-based designs have low latencies that provide less benefit to a throughput-oriented core and workload than a better-fitted data cache composed of DRAM. The move from a high-power, fast technology to a lower-power, slower technology allows us to increase L1 data cache capacity, which is a net benefit for the throughput-oriented core.