Global ETD Search

1	Efficient use of Multi-core Technology in Interactive Desktop Applications Karlsson, Johan January 2015 (has links) The emergence of multi-core processors has successfully ended the era where applications could enjoy free and regular performance improvements without source code modifications. This thesis aims to gather experiences from the work of retrofitting parallelism into a desktop application originally written for sequential execution. The main contribution is the underlying theory and the performance evaluation, experiments and tests of the parallel software regions compared to its sequential counterparts. The feasibility is demonstrated as the theory is put into use when a complex commercially active desktop application is being rewritten to support parallelism. The thesis finds no simple guaranteed solution to the problem of making a serial application execute in parallel. However, experiments and tests proves that many of the evaluated methods offers tangible performance advantages compared to sequential execution. Multi-core processors parallelism
2	Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies Han, Wei January 2010 (has links) Broadband Wireless Access technologies have significant market potential, especially the WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high performance WiMAX solutions is forcing designers to seek help from multi-core processors that offer competitive advantages in terms of all performance metrics, such as speed, power and area. Through the provision of a degree of flexibility similar to that of a DSP and performance and power consumption advantages approaching that of an ASIC, coarse-grained dynamically reconfigurable processors are proving to be strong candidates for processing cores used in future high performance multi-core processor systems. This thesis investigates multi-core architectures with a newly emerging dynamically reconfigurable processor – RICA, targeting WiMAX physical layer applications. A novel master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC based simulator, called MRPSIM, is devised to model this multi-core architecture. This simulator provides fast simulation speed and timing accuracy, offers flexible architectural options to configure the multi-core architecture, and enables the analysis and investigation of multi-core architectures. Meanwhile a profiling-driven mapping methodology is developed to partition the WiMAX application into multiple tasks as well as schedule and map these tasks onto the multi-core architecture, aiming to reduce the overall system execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly integrated with the existing RICA tool flow. Based on the proposed master-slave multi-core architecture, a series of diverse homogeneous and heterogeneous multi-core solutions are designed for different fixed WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at relatively low area costs. Meanwhile a design space exploration methodology is developed to search the design space for multi-core systems to find suitable solutions under certain system constraints. Finally, laying a foundation for future multithreading exploration on the proposed multi-core architecture, this thesis investigates the porting of a real-time operating system – Micro C/OS-II to a single RICA processor. A multitasking version of WiMAX is implemented on a single RICA processor with the operating system support. 621.382 multi-core ; WiMAX ; reconfigurable
3	ESL Model of the Hyper-scalar Processor on a Chip Chen, Po-kai 20 August 2007 (has links) This paper proposed a scalable chip multiprocessor architecture, which is called Hyper-scalar combined with the concept of superscalar and multithreaded architecture; hence, this architecture can enhance single-threaded performance by using core group and also supports multithreaded applications. System programmer can dynamically allocate the core groups to accelerate a single thread by extended system instructions. In order to solve the data dependence between all issued instructions the virtual shared register file is proposed. This mechanism allows the data in local register files to be accessed from other cores through the data switching path hardware and the instructions are executed only when the operands are available. The instructions within a single-threaded application can be dispatched to variable cores without re-compilation. This execution paradigm accelerates the single-threaded performance more flexibly. In the case of simulation and experimental framework, the ESL Model written in SystemC, a modeling language based on C++ is to provide hardware-oriented simulation platform and the MediaBench suite is selected for the experiments. On average, the Hyper-scalar architecture can accelerate single-threaded performance by 30% to 110% using 2 ~ 8 cores. Data-Driven Multi-Core ESL
4	Performance Analysis of Graph Algorithms using Graphics Processing Units Weng, Hui-Ze 02 September 2010 (has links) The GPU significantly improves the computing power by increasing the number of cores in recent years. The design principle of GPU focuses on the parallism of data processing. Therefore, there is some limitation of GPU application for the better computing power. For example, the processing of highly dependent data could not be well-paralleled. Consequently, it could not take the advantage of the computing power improved by GPU. Most of researches in GPU have discussed the improvement of computing power. Therefore, we try to study the cost effectiveness by the comparison between GPU and Multi-Core CPU. By well-applying the different hardware architectures of GPU and Multi-Core CPU, we implement some typical algorithms, respectively, and show the experimental result. Furthermore, the analysis of cost effectiveness, including time and money spending, is also well discussed in this paper. GPU Parallel computing Multi-Core
5	Design of the Execution-driven Simulation Environment for Hyper-scalar Architecture Su, Ding-Siang 21 August 2008 (has links) As a result of the microprocessor system research and the development of VLSI manufacturing process technology, the recent trend and development in high performance computer have toward to multi-core architecture. However current multi-core architectures are designed by the symmetric multi processors (SMP) concept. In traditional SMP mechanism, there are only data link between processor cores. So a single thread only can be handled by a single core, it limits the usage rate in multi-core and performance can not increase. This paper proposed a scalable chip multiprocessor architecture, which is called Hyper-scalar. The Principal characteristic of the architecture is ¡§design the interconnect control mechanisms for instructions in the multi-core¡¨. Some single scalar processor cores in Hyper-scalar architecture can be dynamically grouped as an n-way superscalar accelerator to improve the instruction-level parallelism, which is called accelerator group. Hyper-scalar combines the advantages of superscalar and multithreaded architecture; Hence, this architecture can not only enhance single-threaded performance by using accelerate group but also supports multithreaded applications. The paper based on ARM instruction set, to analyze how to create the interactive control mechanisms for instruction in the multi-core, and how to enhance the performance of a single thread in the Hyper-scalar architecture. It can be divided into four parts: register flow, memory flow, instruction control flow, chop of multi cycle instruction. When instructions are issued into the processor, they must be attached dependence tags that can solve the dependence between all issued instructions. All instructions can exchange the data through the virtual shared register file (VSRF) mechanism, and all instructions are executed only when the operands are available. In the memory flow part: we solve the dependence problem with a simple technique¡Xto execute instruction in instruction order. In instruction control flow part: in order to improve performance, we perform speculation execution mechanism, so the instructions can out of order execution beyond the basic block. Finally because there are some multi cycle instructions in the ARM instruction set, in hyper-scalar framework can chop into many one cycle instructions to further enhance performance. The simulation Model is written by SystemC, a modeling language based on C++ is to provide hardware-oriented simulation platform and the MediaBench suite is selected for the experiments. On average, the Hyper-scalar architecture can accelerate single-threaded performance by 50% to 300% using 2 ~ 8 cores. Hyper-scalar multi-core architecture
6	ARCHITECTURE-AWARE MAPPING AND SCHEDULING OF MIXED-CRITICALITY APPLICATIONS ON MULTI-CORE PLATFORMS Vasu, Aishwarya 01 May 2018 (has links) (PDF) The desire to have enhanced and increased feature sets in embedded applications has contributed to a significant increase in the computational demands of such systems over the years. To support such demand and yet maintain reasonable power/energy budgets, the industry has begun a shift to multi-core architectures even in the embedded systems domain. Embedded real-time applications such as Avionics and Automotive systems are no exception to this trend. Such systems have strict certification requirements of subsets of their functionality, which result in strict temporal constraints on those subsets, while other subsets may have less strict requirements. Migrating such {\em mixed criticality} systems from single-core to multi-core platforms is challenging because application/component isolation and freedom from interference among them must be guaranteed. Safe and efficient, architecture-aware mapping and scheduling of system components (e.g., partitions, tasks, etc. as relevant to a particular domain) on the multiple cores is at the center of any scheme to migrate such systems from single-core to multi-core platforms. In this dissertation, we propose, develop and evaluate a unified framework to automate the mapping and scheduling process with the consideration of several architectural and application level requirements/constraints (e.g., communication and cache conflicts among system components, constraints prohibiting the allocation of certain system components on the same core, etc.) MIXED-CRITICALITY MULTI-CORE SCHEDULING
7	An asymmetric multi-core architecture for efficiently accelerating critical paths in multithreaded programs Suleman, Muhammad Aater 20 October 2010 (has links) Extracting high-performance from Chip Multiprocessors (CMPs) requires that the application be parallelized i.e., divided into threads which execute concurrently on multiple cores. To save programmer effort, difficult to parallelize program portions are often left as serial. We show that common serial portions, i.e., non-parallel kernels, critical sections, and limiter stages in a pipeline, become the critical path of the program when the number of cores increases, thereby limiting performance and scalability. We propose that instead of burdening the software programmers with the task of shortening the serial portions, we can accelerate the serial portions using hardware support. To this end, we propose the Asymmetric Chip-Multiprocessor (ACMP) paradigm which provides one (or few) fast core(s) for accelerated execution of the serial portions and multiple slow, small cores for high throughput on the parallel portions. We show a concrete example implementation of the ACMP which consists of one large, high-performance core and many small, power-efficient cores. We develop hardware/software mechanisms to accelerate the execution of serial portions using the ACMP, and further improve the ACMP by proposing mechanisms to tackle common overheads incurred by the ACMP. / text Multi-core Multithreading CMPs ACMP Heterogeneous cores
8	Profile-driven parallelisation of sequential programs Tournavitis, Georgios January 2011 (has links) Traditional parallelism detection in compilers is performed by means of static analysis and more specifically data and control dependence analysis. The information that is available at compile time, however, is inherently limited and therefore restricts the parallelisation opportunities. Furthermore, applications written in C – which represent the majority of today’s scientific, embedded and system software – utilise many lowlevel features and an intricate programming style that forces the compiler to even more conservative assumptions. Despite the numerous proposals to handle this uncertainty at compile time using speculative optimisation and parallelisation, the software industry still lacks any pragmatic approaches that extracts coarse-grain parallelism to exploit the multiple processing units of modern commodity hardware. This thesis introduces a novel approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C. We utilise profiling information to overcome the limitations of static data and control-flow analysis enabling more aggressive parallelisation. Profiling is performed using an instrumentation scheme operating at the Intermediate Representation (Ir) level of the compiler. In contrast to existing approaches that depend on low-level binary tools and debugging information, Ir-profiling provides precise and direct correlation of profiling information back to the Ir structures of the compiler. Additionally, our approach is orthogonal to existing automatic parallelisation approaches and additional fine-grain parallelism may be exploited. We demonstrate the applicability and versatility of the proposed methodology using two studies that target different forms of parallelism. First, we focus on the exploitation of loop-level parallelism that is abundant in many scientific and embedded applications. We evaluate our parallelisation strategy against the Nas and Spec Fp benchmarks and two different multi-core platforms (a shared-memory Intel Xeon Smp and a heterogeneous distributed-memory Ibm Cell blade). Empirical evaluation shows that our approach not only yields significant improvements when compared with state-of- the-art parallelising compilers, but comes close to and sometimes exceeds the performance of manually parallelised codes. On average, our methodology achieves 96% of the performance of the hand-tuned parallel benchmarks on the Intel Xeon platform, and a significant speedup for the Cell platform. The second study, addresses the problem of partially sequential loops, typically found in implementations of multimedia codecs. We develop a more powerful whole-program representation based on the Program Dependence Graph (Pdg) that supports profiling, partitioning and codegeneration for pipeline parallelism. In addition we demonstrate how this enhances conventional pipeline parallelisation by incorporating support for multi-level loops and pipeline stage replication in a uniform and automatic way. Experimental results using a set of complex multimedia and stream processing benchmarks confirm the effectiveness of the proposed methodology that yields speedups up to 4.7 on a eight-core Intel Xeon machine. 005.3
9	Resurshantering i Dual-core kluster Gustafsson, Johan, Lingbrand, Mikael January 2008 (has links) <p>Med den nya generationen processorer där vi har flera cpu-kärnor på ett chip, så ökas prestandan genom parallell exekvering. I denna rapport presenterar vi en omvärldsstudie om allmän multiprocessorteori där vi går igenom olika tekniker för både hårdvara och mjukvara. Vi har även utfört empiriska tester på ett datorkluster, där vi har testat de två olika programmen Fluent och CFX, som utför CFD beräkningar. För varje program så har tre modeller använts för simuleringar med varierande antal beräkningsnoder. Vi har undersökt vad som är mest lönsamt, att använda en eller båda CPU-kärnorna vid de olika simuleringarna. För att testa detta har vi kört simuleringar där vi har kört med en respektive två cpu-kärnor på beräkningsnoderna. Under simuleringarna har vi samlat in mätvärden som nätverk, minne och cpu-belastning för alla noder samt exekveringstider. Dessa värden har sedan sammanställts där vi ser att ju större en modell är desto mer lönar det sig att köra med en cpu-kärna. I endast ett av våra tester har det visat sig lönsamt att använda båda cpu-kärnorna. En formel har sedan utarbetats för att påvisa skillnaderna mellan olika antal processer med en respektive två cpu-kärnor per nod. Denna formel kan appliceras för att räkna ut den totala kostnaden per simulering med hjälp av årskostnaden för de noder och licenser som används.</p> Kluster Masternod Slavnod CPU-kärna Multi-core
10	Running stream-like programs on heterogeneous multi-core systems Carpenter, Paul 24 October 2011 (has links) All major semiconductor companies are now shipping multi-cores. Phones, PCs, laptops, and mobile internet devices will all require software that can make effective use of these cores. Writing high-performance parallel software is difficult, time-consuming and error prone, increasing both time-to-market and cost. Software outlives hardware; it typically takes longer to develop new software than hardware, and legacy software tends to survive for a long time, during which the number of cores per system will increase. Development and maintenance productivity will be improved if parallelism and technical details are managed by the machine, while the programmer reasons about the application as a whole. Parallel software should be written using domain-specific high-level languages or extensions. These languages reveal implicit parallelism, which would be obscured by a sequential language such as C. When memory allocation and program control are managed by the compiler, the program's structure and data layout can be safely and reliably modified by high-level compiler transformations. One important application domain contains so-called stream programs, which are structured as independent kernels interacting only through one-way channels, called streams. Stream programming is not applicable to all programs, but it arises naturally in audio and video encode and decode, 3D graphics, and digital signal processing. This representation enables high-level transformations, including kernel unrolling and kernel fusion. This thesis develops new compiler and run-time techniques for stream programming. The first part of the thesis is concerned with a statically scheduled stream compiler. It introduces a new static partitioning algorithm, which determines which kernels should be fused, in order to balance the loads on the processors and interconnects. A good partitioning algorithm is crucial if the compiler is to produce efficient code. The algorithm also takes account of downstream compiler passes---specifically software pipelining and buffer allocation---and it models the compiler's ability to fuse kernels. The latter is important because the compiler may not be able to fuse arbitrary collections of kernels. This thesis also introduces a static queue sizing algorithm. This algorithm is important when memory is distributed, especially when local stores are small. The algorithm takes account of latencies and variations in computation time, and is constrained by the sizes of the local memories. The second part of this thesis is concerned with dynamic scheduling of stream programs. First, it investigates the performance of known online, non-preemptive, non-clairvoyant dynamic schedulers. Second, it proposes two dynamic schedulers for stream programs. The first is specifically for one-dimensional stream programs. The second is more general: it does not need to be told the stream graph, but it has slightly larger overhead. This thesis also introduces some support tools related to stream programming. StarssCheck is a debugging tool, based on Valgrind, for the StarSs task-parallel programming language. It generates a warning whenever the program's behaviour contradicts a pragma annotation. Such behaviour could otherwise lead to exceptions or race conditions. StreamIt to OmpSs is a tool to convert a streaming program in the StreamIt language into a dynamically scheduled task based program using StarSs. / Totes les empreses de semiconductors produeixen actualment multi-cores. Mòbils,PCs, portàtils, i dispositius mòbils d’Internet necessitaran programari quefaci servir eficientment aquests cores. Escriure programari paral·lel d’altrendiment és difícil, laboriós i propens a errors, incrementant tant el tempsde llançament al mercat com el cost. El programari té una vida més llarga queel maquinari; típicament pren més temps desenvolupar nou programi que noumaquinari, i el programari ja existent pot perdurar molt temps, durant el qualel nombre de cores dels sistemes incrementarà. La productivitat dedesenvolupament i manteniment millorarà si el paral·lelisme i els detallstècnics són gestionats per la màquina, mentre el programador raona sobre elconjunt de l’aplicació.El programari paral·lel hauria de ser escrit en llenguatges específics deldomini. Aquests llenguatges extrauen paral·lelisme implícit, el qual és ocultatper un llenguatge seqüencial com C. Quan l’assignació de memòria i lesestructures de control són gestionades pel compilador, l’estructura iorganització de dades del programi poden ser modificades de manera segura ifiable per les transformacions d’alt nivell del compilador.Un dels dominis de l’aplicació importants és el que consta dels programes destream; aquest programes són estructurats com a nuclis independents queinteractuen només a través de canals d’un sol sentit, anomenats streams. Laprogramació de streams no és aplicable a tots els programes, però sorgeix deforma natural en la codificació i descodificació d’àudio i vídeo, gràfics 3D, iprocessament de senyals digitals. Aquesta representació permet transformacionsd’alt nivell, fins i tot descomposició i fusió de nucli.Aquesta tesi desenvolupa noves tècniques de compilació i sistemes en tempsd’execució per a programació de streams. La primera part d’aquesta tesi esfocalitza amb un compilador de streams de planificació estàtica. Presenta unnou algorisme de partició estàtica, que determina quins nuclis han de serfusionats, per tal d’equilibrar la càrrega en els processadors i en lesinterconnexions. Un bon algorisme de particionat és fonamental per tal de queel compilador produeixi codi eficient. L’algorisme també té en compte elspassos de compilació subseqüents---específicament software pipelining il’arranjament de buffers---i modela la capacitat del compilador per fusionarnuclis. Aquesta tesi també presenta un algorisme estàtic de redimensionament de cues.Aquest algorisme és important quan la memòria és distribuïda, especialment quanles memòries locals són petites. L’algorisme té en compte latències ivariacions en els temps de càlcul, i considera el límit imposat per la mida deles memòries locals.La segona part d’aquesta tesi es centralitza en la planificació dinàmica deprogrames de streams. En primer lloc, investiga el rendiment dels planificadorsdinàmics online, non-preemptive i non-clairvoyant. En segon lloc, proposa dosplanificadors dinàmics per programes de stream. El primer és específicament pera programes de streams unidimensionals. El segon és més general: no necessitael graf de streams, però els overheads són una mica més grans.Aquesta tesi també presenta un conjunt d’eines de suport relacionades amb laprogramació de streams. StarssCheck és una eina de depuració, que és basa enValgrind, per StarSs, un llenguatge de programació paral·lela basat en tasques.Aquesta eina genera un avís cada vegada que el comportament del programa estàen contradicció amb una anotació pragma. Aquest comportament d’una altra manerapodria causar excepcions o situacions de competició. StreamIt to OmpSs és unaeina per convertir un programa de streams codificat en el llenguatge StreamIt aun programa de tasques en StarSs planificat de forma dinàmica. OMPSS streaming multi-core heterigbeiys 004

Search results