461

A Multiprocessor Architecture Using Modular Arithmetic for Very High Precision Computation

Wu, Henry M. 01 April 1989 (has links)
We outline a multiprocessor architecture that uses modular arithmetic to implement numerical computation with 900 bits of intermediate precision. A proposed prototype, to be implemented with off-the-shelf parts, will perform high-precision arithmetic as fast as some workstations and minicomputers can perform IEEE double-precision arithmetic. We discuss how the structure of modular arithmetic conveniently maps onto a simple, pipelined multiprocessor architecture. We present techniques we developed to overcome a few classical drawbacks of modular arithmetic. Our architecture is well suited to, and essential for, the study of chaotic dynamical systems.
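
The abstract gives no implementation details; as a minimal, hypothetical illustration of the residue-number idea behind such a design (the moduli and code below are illustrative, not the prototype's actual parameters), a large integer can be held as a tuple of residues so that addition and multiplication proceed independently per modulus, with reconstruction via the Chinese Remainder Theorem:

```python
# Minimal residue-number-system (RNS) sketch: hypothetical moduli, not the
# prototype's actual parameters. Each value is a tuple of residues, so add
# and multiply are independent per modulus and could run on separate
# pipelined processors.
from math import prod

MODULI = (251, 241, 239, 233)           # pairwise coprime (illustrative only)
M = prod(MODULI)                        # dynamic range of the representation

def to_rns(x):
    return tuple(x % m for m in MODULI)

def add(a, b):
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def mul(a, b):
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(r):
    # Chinese Remainder Theorem reconstruction -- conversion back to binary,
    # comparison, and division are the classical drawbacks of modular
    # arithmetic that such a design must work around.
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        x += ri * Mi * pow(Mi, -1, m)
    return x % M

a, b = to_rns(123456), to_rns(7890)
assert from_rns(mul(a, b)) == (123456 * 7890) % M
```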
462

Software tools for modeling and simulation of on-chip communication architectures

Zhu, Xinping, January 1900 (has links) (PDF)
Thesis (Ph. D.)--Princeton University, 2005. / "June 2005." Description based on contents viewed Apr. 11, 2007; title from title screen. Includes bibliographical references (p. 135-147).
463

Modeling performance and power for energy-efficient GPGPU computing

Hong, Sunpyo 12 November 2012 (has links)
The objective of this research is to develop an analytical model that predicts performance and power for many-core architectures, and to propose a mechanism that leverages the model to enable energy-efficient execution of an application. The key insight of the model is to investigate and quantify the complex relationship between thread-level parallelism and memory-level parallelism for an application on a given many-core architecture. Two metrics are proposed: memory warp parallelism (MWP), the number of overlapping memory accesses per core, and computation warp parallelism (CWP), which characterizes the application type. Using these metrics together with architectural and application parameters, the model produces the overall application performance. The model relies on statically available parameters such as instruction-mix information and input-data size, and its prediction error is 13.3% for the GPU-computing benchmarks.

Another important aspect of using many-core architectures is reducing peak power and achieving energy savings. Using the proposed integrated power and performance (IPP) framework, the results show that different optimization points exist for a GPU architecture depending on the application type. By activating fewer cores, 10.99% of run-time energy consumption can be saved for the bandwidth-limited benchmarks, and a further 25.8% energy savings is projected when core-level power gating is employed.

Finally, the model is recast in terms of throughput and ported to OpenCL in order to target a wider variety of processors. First, multiple performance-related outputs are predicted, including upper-bound and lower-bound values. Second, using the model parameters, an application can be classified into one of several categories, each with its own suggestions for improving performance and energy efficiency. Third, the accuracy of the bandwidth-saturation point is significantly improved by considering independent memory accesses and updating the performance model. Furthermore, a trade-off analysis using architectural and application parameters becomes straightforward, providing additional insight for improving energy efficiency.

In the future, a computer system will contain hundreds of heterogeneous cores, so a workload must be scheduled onto an efficient core or distributed across both types of cores. Preliminary work that uses the analytical model to schedule between a CPU and a GPU is demonstrated in the appendix. Since no profiling phase is required, the kernel code can be transformed to run more efficiently on the specific architecture. As another extension, the relationship between speed-up and energy efficiency is derived mathematically. Finally, future research directions are presented for applying the model to programmers, compilers, and runtimes in future heterogeneous systems.
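
As a rough, simplified paraphrase of how MWP- and CWP-style metrics can drive a cycle estimate (the equations, parameters, and corner cases below are illustrative and are not the thesis's actual model):

```python
# Deliberately simplified sketch of an MWP/CWP-style case analysis; the
# thesis's actual equations, parameters, and corner cases differ, and all
# numbers in the example call are made up.
def estimate_exec_cycles(n_warps, comp_cycles, mem_insts, mem_lat, mwp, reps=1):
    """comp_cycles: computation cycles per warp; mem_insts: memory
    instructions per warp; mem_lat: latency of one memory access;
    mwp: how many memory requests can overlap per core."""
    mem_cycles = mem_lat * mem_insts                  # memory cycles per warp
    # CWP: how many warps' worth of computation fits under one memory wait.
    cwp = min(n_warps, (mem_cycles + comp_cycles) / comp_cycles)
    mwp = min(mwp, n_warps)
    if cwp >= mwp:
        # Memory-bound: only `mwp` requests overlap, so warps queue for memory.
        cycles = mem_cycles * n_warps / mwp + (comp_cycles / mem_insts) * (mwp - 1)
    else:
        # Compute-bound: memory latency is hidden behind other warps' work.
        cycles = mem_lat + comp_cycles * n_warps
    return cycles * reps

# e.g. a bandwidth/latency-limited kernel: many warps, little computation
print(estimate_exec_cycles(n_warps=24, comp_cycles=40, mem_insts=4, mem_lat=400, mwp=8))
```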
464

The "Mobius Cube" : an interconnection network for parallel computation

Larson, Shawn M. 26 November 1990 (has links)
Graduation date: 1991
465

Cluster partitioning approaches to parallel Monte Carlo simulation on multiprocessors

Ranawake, Udaya A. 23 April 1992 (has links)
We consider the parallelization of Monte Carlo algorithms for analyzing numerical models of charge transport used in semiconductor device physics. Parallel algorithms for the standard k-space Monte Carlo simulation of a three-band model of bulk GaAs on hypercube multicomputers are presented first. This Monte Carlo model includes scattering due to polar-optical, intervalley, and acoustic phonons, as well as electron-electron scattering. The k-space Monte Carlo program, excluding electron-electron scattering, is then extended to simulate a semiconductor device by adding the real-space position of each simulated particle and assigning particle charge, using a cloud-in-cell scheme, to solve Poisson's equation coupled with the particle dynamics. Techniques for effectively partitioning this device so as to balance the computational load while minimizing the communication overhead are discussed. Approaches for improving the efficiency of the parallel algorithm, either by dynamic load balancing or by employing the usual techniques for enhancing rare events in Monte Carlo simulations, are also considered. The parallel algorithms were implemented on a 64-node NCUBE multiprocessor, and test results were generated to validate both the parallel k-space and the device simulation programs. Timing measurements were also made to study the variation of speedups as both the problem size and the number of processors are varied.

The effective exploitation of the computational power of message-passing multiprocessors requires the efficient mapping of parallel programs onto processors so as to balance the computational load while minimizing the communication overhead between processors. A lower bound for this communication volume when mapping arbitrary task graphs onto distributed processor systems is derived. For a K-processor system this lower bound can be computed from the K (possibly) largest eigenvalues of the adjacency matrix of the task graph and the eigenvalues of the adjacency matrix of the processor graph. We also derive the eigenvalues of the adjacency matrix of the processor graph for a hypercube, and give test results comparing the lower bound for the communication volume with the values given by a heuristic algorithm for a number of task graphs. / Graduation date: 1992
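
The closed-form spectrum of a hypercube's adjacency matrix, one ingredient of that lower bound, can be checked numerically; the sketch below is only an illustration, not the thesis's derivation:

```python
# Numerical check of the closed-form spectrum of a hypercube's adjacency
# matrix: the d-dimensional hypercube has eigenvalues d - 2k with
# multiplicity C(d, k). Illustration only, not the thesis's derivation.
import numpy as np
from math import comb

def hypercube_adjacency(d):
    n = 1 << d
    a = np.zeros((n, n))
    for u in range(n):
        for bit in range(d):
            a[u, u ^ (1 << bit)] = 1          # neighbours differ in one bit
    return a

d = 6                                          # 64 nodes, matching the NCUBE used in the thesis
eig = np.sort(np.linalg.eigvalsh(hypercube_adjacency(d)))
closed_form = np.sort(np.repeat([d - 2 * k for k in range(d + 1)],
                                [comb(d, k) for k in range(d + 1)]))
assert np.allclose(eig, closed_form)
print(closed_form)
```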
466

Scheduling system of affine recurrence equations by means of piecewise affine timing functions

Mui, Lap K. 05 March 1992 (has links)
Many systematic methods exist for mapping algorithms to processor arrays. The algorithm is usually specified as a set of recurrence equations, and the processor arrays are synthesized by finding timing and allocation functions that transform index points in the recurrences into points in a space-time domain. The problem of scheduling (i.e. finding the timing function) of recurrence equations has been studied by a number of researchers. Of particular interest here are Systems of Affine Recurrence Equations (SAREs). The existing methods are limited to affine (or linear) schedules over the entire domain of computation. For some algorithms, there are points in the computation domain where the dependencies point in opposite directions, and an affine schedule does not exist, although a valid Piecewise Affine Schedule (PAS) can exist. The objective of this thesis is to examine these schedules and obtain a systematic method for deriving such schedules for SAREs.

A PAS can be found by first partitioning the computation domain and then obtaining a new SARE by renaming the variables. By partitioning the computation domain, we can obtain additional parallelism from the dependency graph and find faster schedules over subspaces of the domain. In this thesis, we describe a procedure for partitioning the domain and generating a new SARE by renaming the variables. Some heuristics are introduced for partitioning the domain based on the properties of dependence vectors. After the partitioning and renaming, an existing method (due to Mauras et al.) is applied to find the schedules. The Toeplitz system and the Algebraic Path Problem are used as examples to illustrate the results. / Graduation date: 1992
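
A minimal sketch of the underlying scheduling condition (restricted here to uniform dependence vectors, a simplification of the SAREs handled in the thesis) shows why opposite-direction dependencies rule out a single affine schedule while a partitioned domain still admits piecewise ones:

```python
# For a uniform dependence d (the computation at z depends on z - d), an
# affine schedule t(z) = lam . z + alpha is valid iff lam . d >= 1, so that
# t(z) >= t(z - d) + 1. Illustrative only; the thesis handles general affine
# dependencies and piecewise affine schedules.
import numpy as np

def is_valid_affine_schedule(lam, deps):
    lam = np.asarray(lam)
    return all(lam @ np.asarray(d) >= 1 for d in deps)

# Dependencies pointing in opposite directions: no single lam can satisfy
# both lam.d >= 1 and lam.(-d) >= 1 over the whole domain.
deps_whole = [(1, 0), (-1, 0)]
print(is_valid_affine_schedule((1, 1), deps_whole))   # False

# After partitioning (and renaming variables), each piece has consistent
# dependencies and admits its own affine schedule:
print(is_valid_affine_schedule((1, 1), [(1, 0)]))     # piece 1: True
print(is_valid_affine_schedule((-1, 1), [(-1, 0)]))   # piece 2: True
```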
467

Overlay Architectures for FPGA-Based Software Packet Processing

Labrecque, Martin 16 June 2011 (has links)
Packet processing is the enabling technology of networked information systems such as the Internet and is usually performed with fixed-function custom-made ASIC chips. As communication protocols evolve rapidly, there is increasing interest in adapting features of the processing over time, and since software is the preferred way of expressing complex computation, we are interested in finding a platform that executes packet processing software with the best possible throughput. Because FPGAs are widely used in network equipment and can implement processors, we are motivated to investigate executing software directly on the FPGAs. Off-the-shelf soft processors on FPGA fabric are currently geared towards performing embedded sequential tasks; in contrast, network processing is most often inherently parallel between packet flows, if not between individual packets.

Our goal is to allow multiple threads of execution in an FPGA to reach a higher aggregate throughput than commercially available shared-memory soft multiprocessors, via improvements to the underlying soft processor architecture. We study a number of processor pipeline organizations to identify which ones can scale to a larger number of execution threads and find that tuning multithreaded pipelines can provide compact cores with high throughput. We then perform a design space exploration of multicore soft systems, compare single-threaded and multithreaded designs to identify scalability limits, and develop processor architectures that allow threads to execute with as few architectural stalls as possible, in particular with instruction replay and static hazard detection mechanisms. To further reduce wait times, we allow threads to execute speculatively by leveraging transactional memory.

Our multithreaded multiprocessor, along with our compilation and simulation framework, makes the FPGA easy to use for an average programmer, who can write an application as a single thread of computation with coarse-grained synchronization around shared data structures. Compared with multithreaded processors using lock-based synchronization, we measure up to 57% additional throughput with transactional-memory-based synchronization. Given our applications, gigabit interfaces, and 125 MHz system clock rate, our results suggest that soft processors can process packets in software at high throughput and low latency, while capitalizing on the FPGAs already available in network equipment.
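
A hypothetical sketch of the programming model described above (illustrative names, not the thesis's actual framework): each packet handler is ordinary sequential code with one coarse-grained critical section around a shared flow table, which lock-based synchronization serializes and which transactional-memory support lets non-conflicting handlers execute speculatively in parallel:

```python
# Illustrative packet handler with coarse-grained synchronization around a
# shared data structure. Under locks, every handler serializes on the
# critical section; with transactional memory, only handlers that actually
# touch the same flow entry need to serialize.
import threading

flow_table = {}                      # shared data structure: per-flow packet counts
flow_table_lock = threading.Lock()

def forward(pkt):
    pass                             # placeholder for the egress path

def handle_packet(pkt):
    key = (pkt["src"], pkt["dst"])   # classify the packet into a flow
    with flow_table_lock:            # coarse-grained critical section
        flow_table[key] = flow_table.get(key, 0) + 1
    forward(pkt)

# One hardware thread per handler in the real system; plain OS threads
# stand in for them here.
pkts = [{"src": "10.0.0.1", "dst": "10.0.0.2"}] * 4
threads = [threading.Thread(target=handle_packet, args=(p,)) for p in pkts]
for t in threads: t.start()
for t in threads: t.join()
print(flow_table)
```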
468

An instruction set simulator for the 8086 16-bit microprocessor

Mapes, Glenn 03 June 2011 (has links)
The intent of this thesis is to show the usefulness of simulating an instruction set in software and to demonstrate the feasibility of doing so by providing the framework of a simulation program.

The design of new computer architectures and computer-based control systems is a trial-and-error process. Normal design practice is to design and build a prototype of the new system and then evaluate the performance of the prototype. Designing complex systems in this manner is very time consuming and expensive; using a software program to simulate the operation of the new system can help solve certain design problems and shorten the development time and effort.

The instruction set simulator executes a subset of the 8086 instruction set and contains routines that are useful in debugging the target software. The feasibility of implementing an instruction set simulator to solve certain design problems has been demonstrated by implementing the most commonly used opcodes from the 8086 instruction set.

Ball State University, Muncie, IN 47306
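
A toy illustration of the fetch-decode-execute structure such a simulator is built around (symbolic instructions rather than real 8086 opcode decoding, and not the thesis's actual framework):

```python
# Toy fetch-decode-execute loop for a few 8086-style register operations.
# Instructions are symbolic tuples, not binary 8086 encodings; a real
# simulator would decode opcodes from memory and add debugging routines.
regs = {"AX": 0, "BX": 0, "CX": 0, "DX": 0}
flags = {"ZF": 0}

def execute(program):
    ip = 0                                           # instruction pointer
    while ip < len(program):
        op, dst, src = program[ip]                   # fetch + (trivial) decode
        val = regs[src] if src in regs else src      # register or immediate
        if op == "MOV":
            regs[dst] = val
        elif op == "ADD":
            regs[dst] = (regs[dst] + val) & 0xFFFF   # 16-bit wraparound
        elif op == "SUB":
            regs[dst] = (regs[dst] - val) & 0xFFFF
        else:
            raise ValueError(f"unimplemented op {op}")
        flags["ZF"] = int(regs[dst] == 0)            # update zero flag
        ip += 1

execute([("MOV", "AX", 5), ("MOV", "BX", 3), ("ADD", "AX", "BX"), ("SUB", "AX", 8)])
print(regs, flags)                                   # AX == 0, ZF == 1
```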
470

Optimizing VLIW architectures for multimedia applications

Salamí San Juan, Esther 01 June 2007 (has links)
The growing interest that multimedia processing has experienced over the last decade is motivating processor designers to reconsider which execution paradigms are the most appropriate for general-purpose processors. On the other hand, as the size of transistors decreases, power dissipation has become a relevant limit on increases in the frequency of operation. Thus, the efficient exploitation of the different sources of parallelism is a key point to investigate in order to sustain the performance improvement rate of processors and face the growing requirements of future multimedia applications. We believe that a promising option arises from combining the Very Long Instruction Word (VLIW) and vector processing paradigms with other ways of exploiting coarser-grain parallelism, such as Chip MultiProcessing (CMP).

As part of this thesis, we analyze the problem of memory disambiguation in multimedia applications, as it represents a serious restriction on exploiting Instruction Level Parallelism (ILP) in VLIW architectures. We argue that the real obstacle to memory disambiguation in multimedia is the extensive use of pointers and indirect references usually found in those codes, together with the limited static information available to the compiler on certain occasions. Based on the observation that the input and output multimedia streams are commonly disjoint memory regions, we propose and implement a memory disambiguation technique that dynamically analyzes the region domain of every load and store before entering a loop, evaluates whether or not the full loop is disambiguated, and executes the corresponding loop version. This mechanism does not require any additional hardware or instructions and has a negligible effect on compilation time and code size. The performance achieved is comparable to that of advanced interprocedural pointer-analysis techniques, with considerably less software complexity. We also demonstrate that both techniques can be combined to improve performance.

In order to deal with the inherent Data Level Parallelism (DLP) of multimedia kernels without disrupting existing core designs, major processor manufacturers have chosen to include MMX-like µSIMD extensions. By analyzing the scalability of the DLP and non-DLP regions of code separately in VLIW processors with µSIMD extensions, we observe that the performance of the overall application is dominated by the performance of the non-DLP regions, which in fact exhibit only modest amounts of ILP. As a result, the performance achieved by very wide issue configurations does not compensate for the related cost. To exploit the DLP of the vector regions more efficiently, we propose enhancing the µSIMD-VLIW core with conventional vector processing capabilities. The combination of conventional and sub-word-level vector processing results in a 2-dimensional extension that combines the best of each, including a reduction in the number of operations, lower fetch-bandwidth requirements, simplicity of the control unit, power efficiency, scalability, and support for multimedia-specific features such as saturation or reduction. This enhancement has a minimal impact on the VLIW core and reaches more parallelism than wider-issue µSIMD implementations at a lower cost. Similar proposals have been successfully evaluated for superscalar cores. In this thesis, we demonstrate that 2-dimensional Vector-µSIMD extensions are also effective with static scheduling, allowing for high-performance, cost-effective implementations.
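
A hypothetical sketch of the run-time loop-versioning idea behind that disambiguation technique (the names and the region test are illustrative; the actual compiler pass derives the regions from the loop bounds and operates on compiled VLIW code):

```python
# Sketch of run-time loop versioning: before entering the loop, test whether
# the input and output regions can overlap, then pick the matching version.
def regions_disjoint(dst, d_lo, d_hi, src, s_lo, s_hi):
    # Same buffer with overlapping index ranges => possible dependence.
    return dst is not src or d_hi <= s_lo or s_hi <= d_lo

def smooth(dst, src, n):
    # dst[i] = average of src[i] and src[i+1], for i in [0, n)
    if regions_disjoint(dst, 0, n, src, 0, n + 1):
        # Disambiguated version: iterations are independent, so a compiler
        # could software-pipeline or vectorize this loop freely.
        dst[0:n] = [(src[i] + src[i + 1]) // 2 for i in range(n)]
    else:
        # Conservative version: honours any cross-iteration dependence.
        for i in range(n):
            dst[i] = (src[i] + src[i + 1]) // 2

a = [0, 2, 4, 6, 8, 10]
out = [0] * 5
smooth(out, a, 5)        # disjoint buffers -> disambiguated version
smooth(a, a, 5)          # in-place call   -> conservative version
print(out, a)
```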
