  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
101

Design of High-performance DMA Controller for Multi-core Platform

Wang, Tongtong January 2006 (has links)
The DMA (direct memory access) controller is a dedicated component in a DSP processor that offloads data transfers from the CPU and improves data-access efficiency in the microprocessor. This thesis describes the design and implementation of a DMA device for a microprocessor, developed in C++ with the SystemC libraries. The main topics covered in this report are the structure of a microprocessor with an embedded DMA controller, and the features of SystemC and the TLM library that are useful for system-level design. The thesis starts with an introduction to DMA theory and the structure of single-core and multicore microprocessors, then goes further into the DMA specification. The next chapter covers the implementation of the DMA controller and the surrounding microsystem, including an explanation of the SystemC methods used in the system design. Finally, the simulation results of the whole system are presented and analyzed, and the utilization of the DMA controller is discussed and calculated. With all these aspects covered, readers can readily understand DMA theory, the microarchitecture, and the fundamentals of SystemC.
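The core idea of the abstract — the CPU hands the controller a descriptor and the controller moves the data — can be sketched in a few lines. This is an illustrative model only (the thesis implements it in C++/SystemC at the transaction level); the descriptor fields and function names here are invented for the example.

```python
# Hypothetical sketch of a descriptor-driven DMA block transfer.
# The thesis's implementation is C++/SystemC; names here are illustrative.
from dataclasses import dataclass

@dataclass
class DmaDescriptor:
    src: int     # source start address (word index)
    dst: int     # destination start address (word index)
    length: int  # number of words to move

def dma_transfer(memory: list, desc: DmaDescriptor) -> int:
    """Move `length` words within a flat memory model; return words moved.
    The CPU only writes the descriptor; the controller does the copying."""
    for i in range(desc.length):
        memory[desc.dst + i] = memory[desc.src + i]
    return desc.length

mem = list(range(16))  # 16-word toy memory
moved = dma_transfer(mem, DmaDescriptor(src=0, dst=8, length=4))
print(moved, mem[8:12])  # -> 4 [0, 1, 2, 3]
```

In a real controller the transfer would proceed concurrently with CPU execution; the point of the sketch is that the CPU's involvement ends once the descriptor is written.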
102

E-AMOM: An Energy-Aware Modeling and Optimization Methodology for Scientific Applications on Multicore Systems

Lively, Charles 2012 May 1900 (has links)
Power consumption is an important constraint in achieving efficient execution on high-performance computing multicore systems. As the number of cores available on a chip continues to increase, the importance of power consumption will continue to grow. To achieve improved performance on multicore systems, scientific applications must use efficient methods for reducing power consumption and must be further refined to reduce execution time. In this dissertation, we introduce a performance modeling framework, E-AMOM, to enable improved execution of scientific applications on parallel multicore systems under a limited power budget. We develop models for each application based upon hardware performance counters. Our models use different performance counters for each application and for each performance component (runtime, system power consumption, CPU power consumption, and memory power consumption), selected via our performance-tuned principal component analysis method. Models developed through E-AMOM provide insight into the performance characteristics of each application that affect each component on a parallel multicore system. Our models are more than 92% accurate across both hybrid (MPI/OpenMP) and MPI implementations of six scientific applications. E-AMOM includes an optimization component that uses our models to employ run-time dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling to reduce the power consumption of the scientific applications. Further, we optimize our applications based upon insights provided by the performance models to reduce their runtime. Our methods and techniques save up to 18% in energy consumption for hybrid (MPI/OpenMP) and MPI scientific applications and reduce application runtime by up to 11% on parallel multicore systems.
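The optimization loop the abstract describes — predict power from counter-based models, then pick the DVFS setting that minimizes energy = power × runtime — can be sketched as follows. This is not the E-AMOM implementation: the counter names, linear-model coefficients, and per-frequency numbers are all invented for illustration.

```python
# Illustrative sketch (not E-AMOM itself): predict power at candidate DVFS
# frequencies with a toy linear counter model, then choose the frequency
# that minimizes predicted energy = runtime * power. All numbers invented.

def predict(counters, coeffs, intercept):
    """Linear model over a selected subset of hardware counters."""
    return intercept + sum(coeffs[k] * counters[k] for k in coeffs)

counters = {"cache_misses": 0.2, "instructions": 1.0}  # normalized rates

# Toy per-frequency models: runtime grows as frequency drops,
# while CPU power shrinks with lower frequency/voltage.
settings = {
    2.4: {"runtime": 10.0,
          "power": predict(counters, {"cache_misses": 20, "instructions": 60}, 30)},
    1.8: {"runtime": 12.0,
          "power": predict(counters, {"cache_misses": 18, "instructions": 40}, 25)},
    1.2: {"runtime": 16.0,
          "power": predict(counters, {"cache_misses": 15, "instructions": 25}, 20)},
}

best = min(settings, key=lambda f: settings[f]["runtime"] * settings[f]["power"])
print(best)
```

With these toy numbers the slowest setting wins on energy despite its longer runtime; in practice the trade-off depends on how memory-bound the application is, which is exactly what the counter-based models are meant to capture.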
103

Heterogeneity-awareness in multithreaded multicore processors

Acosta Ojeda, Carmelo Alexis 07 July 2009 (has links)
During the last decades, computer architecture has experienced a great series of revolutionary changes. The increasing transistor count on a single chip has led to some of the main milestones in the field, from the release of the first superscalar (1965) to state-of-the-art multithreaded multicore architectures such as the Intel Core i7 (2009). Moore's Law has continued for almost half a century and is not expected to stop for at least another decade, and perhaps much longer. Moore observed a trend in process technology advances: the number of transistors that can be placed inexpensively on an integrated circuit has increased exponentially, doubling approximately every two years. Nevertheless, having more transistors available cannot always be directly translated into having more performance. The complexity of state-of-the-art software has reached heights unthinkable in prior ages, both in the amount of computation and in the complexity involved. If we analyze this complexity in software deeply, we realize that software is composed of smaller execution processes that, although maintaining a certain spatial/temporal locality, exhibit inherently heterogeneous behavior. That is, during execution the hardware runs very different portions of software, with huge differences in behavior and hardware requirements. This heterogeneity in the behavior of software is not specific to the latest video game; it is inherent to software programming itself, since the very beginning of algorithmics. In this PhD dissertation we analyze in depth the heterogeneity inherent in software behavior. We identify the main issues and sources of this heterogeneity, which keep most state-of-the-art processor designs from reaching their maximum potential. Hence, the heterogeneity in software renders most current processors, commonly called general-purpose processors, overdesigned: they have many more hardware resources than are really needed to execute the software running on them. This fact would not represent a major problem if we were not concerned about the additional power consumed by software computation. The final goal of this PhD dissertation is to assign each portion of software exactly the amount of hardware resources really needed to fully exploit its potential, without consuming more energy than strictly necessary; that is, to obtain complexity-effective executions using the inherent heterogeneity in software behavior as a steering indicator. Thus, we start by analyzing in depth the heterogeneous behavior of software run on general-purpose processors, and then match it onto heterogeneously distributed hardware that explicitly exploits heterogeneous hardware requirements. Only by being heterogeneity-aware in software, and by appropriately matching this software heterogeneity onto hardware heterogeneity, can we effectively obtain better processor designs. The dissertation comprises four main contributions that cover both multithreaded single-core (hdSMT) and multicore (TCA algorithm, hTCA framework and MFLUSH) scenarios, each explained in its corresponding chapter of the dissertation. Overall, these contributions cover a significant range of the design space of heterogeneity-aware processors. Within this design space, we have focused on the state-of-the-art trend in processor design: multithreaded multicore (CMP+SMT) processors. We place special emphasis on the MPsim simulation tool, designed and developed specifically for this PhD dissertation. This tool has already gone beyond this dissertation, becoming a reference tool for an important group of researchers across the Computer Architecture Department (DAC) at the Polytechnic University of Catalonia (UPC), the Barcelona Supercomputing Center (BSC) and the University of Las Palmas de Gran Canaria (ULPGC).
104

Programming Models and Runtimes for Heterogeneous Systems

Grossman, Max 16 September 2013 (has links)
With the plateauing of processor frequencies and increase in energy consumption in computing, application developers are seeking new sources of performance acceleration. Heterogeneous platforms with multiple processor architectures offer one possible avenue to address these challenges. However, modern heterogeneous programming models tend to be either so low-level as to severely hinder programmer productivity, or so high-level as to limit optimization opportunities. The novel systems presented in this thesis strike a better balance between abstraction and transparency, enabling programmers to be productive and produce high-performance applications on heterogeneous platforms. This thesis starts by summarizing the strengths, weaknesses, and features of existing heterogeneous programming models. It then introduces and evaluates four novel heterogeneous programming models and runtime systems: JCUDA, CnC-CUDA, DyGR, and HadoopCL. We'll conclude by positioning the key contributions of each piece in this thesis relative to the state-of-the-art, and outline possible directions for future work.
105

Effectiveness of Tracing in a Multicore Environment

Sivakumar, Narendran, Sundar Rajan, Sriram January 2010 (has links)
Debugging in real time is imperative for telecommunication networks, with their ever-increasing size and complexity. In the event of an error or an unexpected occurrence, debugging the complex systems that control these networks becomes an insurmountable task. With the help of tracing, it is possible to capture a snapshot of a system at any given point in time. Tracing, in essence, captures the state of the system along with the programs currently running on it. LTTng is one such tool, developed to perform tracing in both the kernel space and the user space of an application. In this thesis, we evaluate the effectiveness of LTTng and its impact on the performance of the applications it traces. As part of this thesis we formulated a comprehensive load matrix to simulate the varying load demands of a telecommunication network. We also devised a detailed experimental methodology encompassing a collection of test suites used to determine the efficiency of various LTTng trace primitives. Our experiments also showed that LTTng's kernel tracing is more efficient than its user-space tracing, and that LTTng's user-space tracing has a performance impact of around three to five percent.
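The overhead methodology described above boils down to timing a workload with and without trace hooks and reporting the relative cost. The sketch below illustrates only that measurement pattern; the in-process "tracepoint" is a stand-in for LTTng instrumentation, not its API.

```python
# Minimal sketch of trace-overhead measurement: run a workload with and
# without a trace hook and report the relative cost. The list append is a
# stand-in for an LTTng tracepoint, purely for illustration.
import time

events = []

def traced_workload(n, trace=False):
    total = 0
    for i in range(n):
        total += i * i
        if trace:
            events.append(("iter", i))  # stand-in for a tracepoint
    return total

def overhead_pct(n=200_000):
    """Percentage slowdown of the traced run relative to the plain run."""
    t0 = time.perf_counter()
    traced_workload(n, trace=False)
    base = time.perf_counter() - t0
    t0 = time.perf_counter()
    traced_workload(n, trace=True)
    traced = time.perf_counter() - t0
    return 100.0 * (traced - base) / base
```

Real measurements would repeat each run many times under the load matrix's varying demand levels and report distributions, not a single number; timing noise dominates at small workload sizes.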
107

A Partitioning Approach for Parallel Simulation of AC-Radial Shipboard Power Systems

Uriarte, Fabian Marcel 2010 May 1900 (has links)
An approach to parallelizing the simulation of AC-radial Shipboard Power Systems (SPSs) using multicore computers is presented. Time-domain simulations of SPSs are notoriously slow, due principally to the number of components and the time variance of the component models. A common approach to reducing the simulation run-time of power systems is to formulate the electrical network equations using modified nodal analysis, use Bergeron's travelling-wave transmission line model to create subsystems, and parallelize the simulation on a distributed computer. In this work, an SPS was formulated using loop analysis, the subsystems were defined using a diakoptics-based approach, and the simulation was parallelized on a multicore computer. A program was developed in C# to conduct multithreaded parallel-sequential simulations of an SPS. The program first represents an SPS as a graph and then partitions the graph. Each graph partition represents an SPS subsystem and is computationally balanced using iterative refinement heuristics. Once balanced subsystems are obtained, each subsystem's electrical network equations are formulated using loop analysis. Each subsystem is solved by a unique thread, and each thread is manually assigned to a core of a multicore computer. To validate the partitioning approach, performance metrics were created to assess the speed gain and accuracy of the partitioned SPS simulations. The simulation parameters swept for these metrics were the number of partitions, the number of cores used, and the time-step increment. The results showed adequate speed gains with negligible error. Simulation speed gain increased as the number of partitions and cores was raised, reaching maximum speed gains of under 30x when using a quad-core computer. The results show that the speed gain is more sensitive to the number of partitions than it is to the number of cores.
While multicore computers are suitable for parallel-sequential SPS simulations, increasing the number of cores does not contribute to the speed gain as much as partitioning does. The simulation error increased with the simulation time step but did not influence the partitioned simulation results. The number of operations of protective devices was used to determine whether the error introduced by partitioning the SPS simulations produced inconsistent system behavior. It is shown, for the time-step sizes used, that protective devices did not operate inadvertently, which indicates that the errors did not alter RMS measurements and hence were not influential.
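The balancing step the abstract describes — assign weighted components to partitions, then refine iteratively — can be sketched with a greedy assignment plus one refinement pass. This illustrates the balancing idea only, not the thesis's diakoptics-based C# implementation; the component names and weights are invented.

```python
# Hedged sketch of computational balancing: components with weights are
# greedily assigned to k partitions, then one refinement pass moves nodes
# from the heaviest to the lightest bin when that shrinks the max load.
# Component names and weights are invented for the example.

def partition(weights, k):
    """Greedy longest-processing-time assignment into k bins."""
    bins = [[] for _ in range(k)]
    loads = [0.0] * k
    for node in sorted(weights, key=weights.get, reverse=True):
        i = loads.index(min(loads))
        bins[i].append(node)
        loads[i] += weights[node]
    return bins, loads

def refine(bins, loads, weights):
    """One refinement pass: move a node heaviest->lightest bin
    whenever the move reduces the maximum load."""
    hi, lo = loads.index(max(loads)), loads.index(min(loads))
    for node in list(bins[hi]):
        w = weights[node]
        if max(loads[hi] - w, loads[lo] + w) < loads[hi]:
            bins[hi].remove(node); bins[lo].append(node)
            loads[hi] -= w; loads[lo] += w
            hi, lo = loads.index(max(loads)), loads.index(min(loads))
    return bins, loads

weights = {"gen": 8.0, "bus": 3.0, "motor": 5.0, "load1": 2.0, "load2": 2.0}
bins, loads = partition(weights, 2)
bins, loads = refine(bins, loads, weights)
print(sorted(loads))  # -> [10.0, 10.0]
```

A production partitioner would also weigh the edges cut between subsystems (the interface variables exchanged per step), which is what makes diakoptics-style decomposition more involved than pure load balancing.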
108

Effects Of Parallel Programming Design Patterns On The Performance Of Multi-core Processor Based Real Time Embedded Systems

Kekec, Burak 01 June 2010 (has links) (PDF)
The increasing adoption of multi-core processors has led to their use in real-time embedded systems (RTES). This entails high performance requirements that may not be easily met when software development follows the traditional techniques long used for single-processor systems. In this study, parallel programming design patterns specially developed and reported in the literature will be used to improve RTES implementations on multi-core systems. Specific performance parameters will be selected for assessment, and the performance of traditionally developed software will be compared with that of software developed using parallel programming patterns.
109

Efficient shared cache management in multicore processors

Xie, Yuejian 20 May 2011 (has links)
In modern multicore processors, various resources (such as memory bandwidth and caches) are designed to be shared by concurrently running threads. While running multiple programs on a single chip at the same time is beneficial, contention for these shared resources can sometimes degrade system performance, and naive hard-partitioning between threads can result in low resource utilization. This research shows that simple and effective approaches to dynamically managing the shared cache are achievable. The contributions of this work are the following: (1) a technique for dynamic online classification of application memory-access behaviors to predict the usefulness of cache partitioning, and a simple shared-cache management approach based on the classification; (2) a cache pseudo-partitioning technique that manipulates insertion and promotion policies; (3) a scalable algorithm to quickly decide per-core cache allocations; (4) a pseudo-LRU cache partition approximation; (5) a dynamic shared-cache compression technique that considers different thread behaviors.
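Contribution (2) — steering capacity through insertion and promotion rather than hard partitions — can be sketched on a single LRU set: a thread over its target allocation inserts new lines near the LRU end, so its lines are the first evicted. The structure and quotas below are invented for illustration and are not the dissertation's mechanism in detail.

```python
# Illustrative pseudo-partitioning of one LRU cache set: hits promote to
# MRU as usual, but a thread over its target quota inserts new lines at
# the LRU end instead, so its excess lines are evicted first.
# Quotas and the set model are invented for the example.
from collections import OrderedDict

class PseudoPartitionedSet:
    def __init__(self, ways, quota):
        self.ways = ways
        self.quota = quota          # target lines per owning thread
        self.lines = OrderedDict()  # tag -> owner; MRU at the end

    def count(self, owner):
        return sum(1 for o in self.lines.values() if o == owner)

    def access(self, tag, owner):
        if tag in self.lines:                 # hit: promote to MRU
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.ways:      # miss: evict true LRU
            self.lines.popitem(last=False)
        self.lines[tag] = owner               # default MRU insertion
        if self.count(owner) > self.quota.get(owner, self.ways):
            self.lines.move_to_end(tag, last=False)  # demote to LRU end
        return False

s = PseudoPartitionedSet(ways=4, quota={"A": 3, "B": 1})
for tag in ["a1", "a2", "a3", "b1", "b2"]:
    s.access(tag, tag[0].upper())
print(list(s.lines))  # -> ['b2', 'a2', 'a3', 'b1']
```

After the run, thread B's second line (`b2`) sits at the LRU position despite being the most recent insertion, so B cannot grow past its quota under further misses, yet no way is ever statically reserved.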
110

Resource management for efficient single-ISA heterogeneous computing

Chen, Jian, doctor of electrical and computer engineering 11 July 2012 (has links)
Single-ISA heterogeneous multi-core processors (SHMP) have become increasingly important due to their potential to significantly improve the execution efficiency for diverse workloads and thereby alleviate the power density constraints in Chip Multiprocessors (CMP). The importance of SHMP is further underscored by the fact that manufacturing defects and process variation could also cause single-ISA heterogeneity in CMPs even though the CMP is originally designed as homogeneous. However, to fully exploit the execution efficiency that SHMP has to offer, programs have to be efficiently mapped/scheduled to the appropriate cores such that the hardware resources of the cores match the resource demands of the programs, which is challenging and remains an open problem. This dissertation presents a comprehensive set of off-line and on-line techniques that leverage analytical performance modeling to bridge the gap between the workload diversity and the hardware heterogeneity. For the off-line scenario, this dissertation presents an efficient resource demand analysis framework that can estimate the resource demands of a program based on the inherent characteristics of the program without using any detailed simulation. Based on the estimated resource demands, this dissertation further proposes a multi-dimensional program-core matching technique that projects program resource demands and core configurations to a unified multi-dimensional space, and uses the weighted Euclidean distance between these two to identify the matching program-core pair. This dissertation also presents a dynamic and predictive application scheduler for SHMPs. It uses a set of hardware-efficient online profilers and an analytical performance model to simultaneously predict the application’s performance on different cores. Based on the predicted performance, the scheduler identifies and enforces near-optimal application assignment for each scheduling interval without any trial runs or off-line profiling. 
Using only a few kilobytes of extra hardware, the proposed heterogeneity-aware scheduler improves the weighted speedup by 11.3% compared with the commodity OpenSolaris scheduler and by 6.8% compared with the best known research scheduler. Finally, this dissertation presents a predictive yet cost-effective mechanism to manage intra-core and/or inter-core resources in a dynamic SHMP. It also uses a set of hardware-efficient online profilers and an analytical performance model to predict the application's performance under different resource allocations. Based on the predicted performance, the resource allocator identifies and enforces near-optimal resource partitions for each epoch without any trial runs. The experimental results show that the proposed predictive resource management framework improves the weighted speedup of the CMP system by an average of 11.6% compared with the equal-partition scheme and by 9.3% compared with an existing reactive resource management scheme.
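The multi-dimensional matching technique described above — project program demands and core configurations into one normalized space and pick the core at the smallest weighted Euclidean distance — can be sketched directly. The dimensions, weights, and core configurations below are invented for illustration, not taken from the dissertation.

```python
# Sketch of program-core matching by weighted Euclidean distance in a
# normalized resource space. Dimensions, weights, and core configurations
# are invented for the example.
import math

def wdist(demand, core, weights):
    """Weighted Euclidean distance between a demand vector and a core."""
    return math.sqrt(sum(weights[d] * (demand[d] - core[d]) ** 2
                         for d in demand))

cores = {
    "big":    {"issue_width": 1.0, "cache": 1.0},
    "little": {"issue_width": 0.3, "cache": 0.4},
}
weights = {"issue_width": 1.0, "cache": 0.5}  # relative dimension importance

def best_core(demand):
    return min(cores, key=lambda c: wdist(demand, cores[c], weights))

print(best_core({"issue_width": 0.9, "cache": 0.8}))  # compute-hungry -> big
print(best_core({"issue_width": 0.2, "cache": 0.3}))  # modest demands -> little
```

The per-dimension weights are what make this more than nearest-neighbor matching: dimensions that the performance model finds more predictive of runtime can pull the assignment harder than the others.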
