Global ETD Search

1	Design of an Asynchronous Ring Bus Architecture for Multi-Core Systems Lei, Kin-fong 18 August 2010 (has links) In multi-core systems, data transfer between cores is a major challenge. On-chip interconnect networks should offer low latency, high throughput, scalability, an effective routing or arbitration strategy, and low power consumption. This thesis proposes a 33-bit-wide asynchronous ring bus that adopts a dual-rail single-track data protocol, which yields asynchronous circuits that are both robust and fast. Because the circuits are asynchronous, transfer time varies with hop count: the shorter the distance, the faster the data is transferred. In a synchronous ring bus, by contrast, the bus frequency is limited by the latency of the longest hop count. The transmission time of the asynchronous circuits is not held up by the longest distance, even as the number of cores increases. To provide higher throughput, multiple cores can access the bus simultaneously and connect directly to one another. For bus arbitration, a distributed arbiter grants the right to use the bus and resolves collisions. Finally, system performance under different arbitration strategies is estimated in a TSMC 0.18 μm process. The transmission time over the shortest distance is approximately 1.5 ns, and among the arbitration strategies the longest-distance-first strategy performs best. On-Chip Interconnect Networks Asynchronous Ring Bus Multi-Core Systems
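The hop-count argument in this abstract can be illustrated with a small timing model. This is only a sketch under stated assumptions, not the thesis's circuit model: it assumes a unidirectional ring, assumes that the quoted 1.5 ns shortest-distance time corresponds to a single hop, and assumes that delay grows linearly with hop count. The function names are hypothetical.

```python
# Assumption: one hop costs about 1.5 ns (the abstract's shortest-distance
# figure) and delay scales linearly with the number of hops.
HOP_DELAY_NS = 1.5


def hop_count(src: int, dst: int, n_cores: int) -> int:
    """Hops from src to dst on a unidirectional ring (assumed direction)."""
    return (dst - src) % n_cores


def async_transfer_ns(src: int, dst: int, n_cores: int) -> float:
    """Asynchronous bus: the transfer takes only as long as its own path."""
    return hop_count(src, dst, n_cores) * HOP_DELAY_NS


def sync_transfer_ns(n_cores: int) -> float:
    """Synchronous bus: the clock period must cover the longest path,
    so every transfer pays for n_cores - 1 hops."""
    return (n_cores - 1) * HOP_DELAY_NS


if __name__ == "__main__":
    n = 8
    for dst in (1, 4, 7):
        print(dst, async_transfer_ns(0, dst, n), sync_transfer_ns(n))
```

Under these assumptions a neighbouring core is reached in one hop delay on the asynchronous ring, while the synchronous bound grows with the number of cores regardless of the actual distance.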
2	Asynchronous Ring Network Mechanism with A Fair Arbitration Strategy for Network on Chip Wong, Chen-Ang 14 August 2012 (has links) Multi-core systems are usually built from homogeneous or heterogeneous cores. Designing a better NoC (network on chip) requires considering the performance, scalability, hardware simplicity, and arbitration strategy of the on-chip network. The routers are designed as a circuit-switched network; the circuit switching uses asynchronous circuits and the routers have no queuing (buffering), so the design is simple and efficient to implement. A synchronous circuit relies on a clock source, but distributing a global clock causes problems such as power consumption, increased area, and clock skew. The design uses a ring topology with a multi-transaction bus architecture, which lets multiple packets access the bus at the same time and thereby achieves higher throughput. As the number of cores increases, a central arbiter circuit becomes more complex, so this thesis presents an SAP (self-adjusting priority) schedule that fairly adjusts the priority of each component by exchanging weights at the distributed arbiters. When several requests contend on the network, the winner, which holds the highest priority, exchanges its priority with the lowest priority among those requests. This principle guarantees that a winner has less opportunity to win the network the next time. Conversely, losers can obtain a higher priority than they originally had. The proposed scheme therefore not only offers a fair strategy but also simplifies hardware design. switch circuit multi-core systems arbitration strategy Arbiter distributed system
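The priority-exchange rule described in this abstract can be sketched in a few lines. This is a literal reading of the one-sentence description, not the thesis's hardware design: the function name, the dictionary of priorities, and the convention that a larger number means higher priority are all assumptions made for illustration.

```python
def sap_arbitrate(priorities: dict, requesters: list):
    """Grant the bus to the highest-priority requester, then swap its
    priority with the lowest priority among the contending requesters.

    Assumption: larger value = higher priority, and values are distinct.
    Returns the winner, or None when nobody requests the bus.
    """
    if not requesters:
        return None
    winner = max(requesters, key=lambda node: priorities[node])
    lowest = min(requesters, key=lambda node: priorities[node])
    # The exchange: the winner drops to the lowest contending priority,
    # and the lowest-priority contender takes over the winner's priority.
    priorities[winner], priorities[lowest] = (
        priorities[lowest],
        priorities[winner],
    )
    return winner


if __name__ == "__main__":
    prio = {"A": 3, "B": 2, "C": 1}
    # Two cores contend repeatedly; the grant alternates between them.
    print([sap_arbitrate(prio, ["A", "B"]) for _ in range(4)])
```

With two contenders the grant alternates between them, which matches the fairness claim. With three or more contenders this literal reading only swaps the top and bottom ranks, so the thesis may well handle that case differently.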
3	Performance Evaluation of Node.js on Multi-core Computing Systems Azmat, Janty January 2018 (has links) Because JavaScript code executed by the Node.js run-time environment runs in a single thread and does not fully utilize the power of multi-core systems, fairly new approaches attempt to address this limitation. Some of these approaches are considered well tested in public use and are widely used at the time of writing. The objectives of this study are to determine which of these approaches achieve better scalability with respect to the number of handled requests, and to what extent they utilize multi-core power compared to the raw Node.js environment with normal CPU scheduling. Node.js parallel computing multi-core systems Engineering and Technology Teknik och teknologier
4	Design of multi-core dataflow cryptprocessor Alzahrani, Ali Saeed 28 August 2018 (has links) Embedded multi-core systems are implemented as systems-on-chip that rely on packet store-and-forward networks-on-chip for communication. These systems use neither buses nor a global clock. Instead, routers move data between the cores, and each core uses its own local clock. This implies concurrent asynchronous computing. Implementing algorithms in such systems is greatly facilitated by dataflow concepts. In this work, we propose a methodology for implementing algorithms on dataflow platforms. The methodology can be applied to multi-threaded platforms, multi-core platforms, or a combination of the two. It is based on a novel dataflow graph representation of the algorithm. We applied the proposed methodology to obtain a novel dataflow multi-core computing model for the Secure Hash Algorithm 3 (SHA-3). The resulting hardware was implemented on an FPGA to verify the performance parameters. The proposed model of computation has advantages such as flexible I/O timing in terms of scheduling policy, execution of tasks as soon as possible, and self-timed, event-driven operation. In other words, I/O timing and correctness of algorithm evaluation are dissociated in this work. The main advantage of this proposal is the ability to dynamically obfuscate algorithm evaluation to thwart side-channel attacks without having to redesign the system. This has important implications for cryptographic applications. The dissertation also proposes four countermeasure techniques against side-channel attacks for SHA-3 hashing. The countermeasure techniques are based on choosing stochastic or deterministic input data scheduling strategies. Extensive simulations of the SHA-3 algorithm and the proposed countermeasures were performed using object-oriented MATLAB models to verify and validate the effectiveness of the techniques. The design immunity of the proposed countermeasures is assessed. / Graduate / 2020-11-19 Embedded multi-core systems object-oriented MATLAB models FPGA
5	Roko: Balancing Performance and Usability in Coarse-grain Parallelization Segulja, Cedomir 06 April 2010 (has links) We present Roko, a system that allows parallelization of sequential C code with modest user intervention. The user exposes parallelism at the function level by annotating the code with pragmas. Roko defines only two pragmas: the parallel pragma denotes function calls that will be executed asynchronously, and the exposed pragma describes the data usage of the marked function calls. Architecturally, Roko consists of three components: a compiler that analyzes pragmas, a software environment that spreads the execution over multiple processors, and hardware support that implements a novel synchronization scheme, versioning. We have designed, implemented and evaluated an FPGA-based prototype of Roko. Our experimental evaluation shows (i) that a few simple pragmas are all that is needed to expose parallelism in benchmark applications and (ii) that Roko can deliver good performance in terms of application speedup. Programming Model Parallelization Synchronization Concurrency Control Multi-core Systems FPGA Applications 0984
7	Microarchitecture and FPGA Implementation of the Multi-level Computing Architecture Capalija, Davor 30 July 2008 (has links) We design the microarchitecture of the Multi-Level Computing Architecture (MLCA), focusing on its Control Processor (CP). Designing the microarchitecture of the CP presents both opportunities and challenges that stem from the coarse granularity of the tasks and the large number of inputs and outputs for each task instruction. We therefore explore changes to standard superscalar microarchitectural techniques. We design the entire CP microarchitecture and implement it on an FPGA using SystemVerilog. We synthesize and evaluate the MLCA system based on a 4-processor shared-memory multiprocessor. The performance of realistic applications shows scalable speedups that are comparable to those obtained in simulation. We believe that our implementation achieves low complexity in terms of FPGA resource usage and operating frequency. In addition, we argue that our design methodology allows the CP to scale as the entire system grows. Computer architecture FPGA applications Microarchitecture Parallelism Embedded systems Multi-core systems 0984
9	Jack Rabbit : an effective Cell BE programming system for high performance parallelism Ellis, Apollo Isaac Orion 08 July 2011 (has links) The Cell processor is an example of the trade-offs made when designing a mass market power efficient multi-core machine, but the machine-exposing architecture and raw communication mechanisms of Cell are hard to manage for a programmer. Cell's design is simple and causes software complexity to go up in the areas of achieving low threading overhead, good bandwidth efficiency, and load balance. Several attempts have been made to produce efficient and effective programming systems for Cell, but the attempts have been too specialized and thus fall short. We present Jack Rabbit, an efficient thread pool work queue implementation, with load balancing mechanisms and double buffering. Our system incurs low threading overhead, gets good load balance, and achieves bandwidth efficiency. Our system represents a step towards an effective way to program Cell and any similar current or future processors. / text Cell processor Multi-core systems High performance computing Runtime Barnes Hut LU factorization Mandelbrot Double buffering Thread pool Work queue Load balance
