Global ETD Search

11	Resurshantering i Dual-core kluster Gustafsson, Johan, Lingbrand, Mikael January 2008 (has links) Med den nya generationen processorer där vi har flera cpu-kärnor på ett chip, så ökas prestandan genom parallell exekvering. I denna rapport presenterar vi en omvärldsstudie om allmän multiprocessorteori där vi går igenom olika tekniker för både hårdvara och mjukvara. Vi har även utfört empiriska tester på ett datorkluster, där vi har testat de två olika programmen Fluent och CFX, som utför CFD beräkningar. För varje program så har tre modeller använts för simuleringar med varierande antal beräkningsnoder. Vi har undersökt vad som är mest lönsamt, att använda en eller båda CPU-kärnorna vid de olika simuleringarna. För att testa detta har vi kört simuleringar där vi har kört med en respektive två cpu-kärnor på beräkningsnoderna. Under simuleringarna har vi samlat in mätvärden som nätverk, minne och cpu-belastning för alla noder samt exekveringstider. Dessa värden har sedan sammanställts där vi ser att ju större en modell är desto mer lönar det sig att köra med en cpu-kärna. I endast ett av våra tester har det visat sig lönsamt att använda båda cpu-kärnorna. En formel har sedan utarbetats för att påvisa skillnaderna mellan olika antal processer med en respektive två cpu-kärnor per nod. Denna formel kan appliceras för att räkna ut den totala kostnaden per simulering med hjälp av årskostnaden för de noder och licenser som används. Kluster Masternod Slavnod CPU-kärna Multi-core
12	Design of The Rendezvous Mechanism In The Multi-Core AMBA System Chang, Mu-Chi 06 August 2008 (has links) In current chip multi-processors (CMPs), the on-chip network is a major factor affecting overall system performance. Different kinds of communication protocols vary from different communication architectures of current SOC designs. For example, the AMBA is master-slave architecture, which transacts and communicates the data of between the two CORE (Master) through the Memory (Slave). The architecture cost long time for load and store with memory. Hence, this paper design and implement a Rendezvous protocol on AMBA architecture, which is called Rendezvous of Advanced High performance Bus (RAHB), to let two processors can communicate with each other without memory reference overheads. The RAHB is compatible with the AHB architecture, and add Rendezvous communication protocol in the AMBA architecture to perform the direct transmission of data. Without referring the memory, the RAHB can improve the efficiency of communication in multi-core. For experimental evaluation, we evaluate the performance between RAHB and AHB, RAHB speedup (B/s) is average up to 50% for different data length and performance up 30% to 40% for executing test program. Inter-processor communication AMBA multi-core Rendezvous
13	JavaFlow : a Java DataFlow Machine Ascott, Robert John 10 February 2015 (has links) The JavaFlow, a Java DataFlow Machine is a machine design concept implementing a Java Virtual Machine aimed at addressing technology roadmap issues along with the ability to effectively utilize and manage very large numbers of processing cores. Specific design challenges addressed include: design complexity through a common set of repeatable structures; low power by featuring unused circuits and ability to power off sections of the chip; clock propagation and wire limits by using locality to bring data to processing elements and a Globally Asynchronous Locally Synchronous (GALS) design; and reliability by allowing portions of the design to be bypassed in case of failures. A Data Flow Architecture is used with multiple heterogeneous networks to connect processing elements capable of executing a single Java ByteCode instruction. Whole methods are cached in this DataFlow fabric, and the networks plus distributed intelligence are used for their management and execution. A mesh network is used for the DataFlow transfers; two ordered networks are used for management and control flow mapping; and multiple high speed rings are used to access the storage subsystem and a controlling General Purpose Processor (GPP). Analysis of benchmarks demonstrates the potential for this design concept. The design process was initiated by analyzing SPEC JVM benchmarks which identified a small number methods contributing to a significant percentage of the overall ByteCode operations. Additional analysis established static instruction mixes to prioritize the types of processing elements used in the DataFlow Fabric. The overall objective of the machine is to provide multi-threading performance for Java Methods deployed to this DataFlow fabric. With advances in technology it is envisioned that from 1,000 to 10,000 cores/instructions could be deployed and managed using this structure. This size of DataFlow fabric would allow all the key methods from the SPEC benchmarks to be resident. A baseline configuration is defined with a compressed dataflow structure and then compared to multiple configurations of instruction assignments and clock relationships. Using a series of methods from the SPEC benchmark running independently, IPC (Instructions per Cycle) performance of the sparsely populated heterogeneous structure is 40% of the baseline. The average ratio of instructions to required nodes is 3.5. Innovative solutions to the loading and management of Java methods along with the translation from control flow to DataFlow structure are demonstrated. / text DataFlow Multi-core Java Jvm Bytecode
14	Building a foundation for the future of software practices within the multi-core domain Berg, Celina 31 August 2011 (has links) Multi-core programming presents developers with a dramatic paradigm shift. Where the conceptual models of sequential programming largely supported the decoupling of source from underlying architecture, it is now unwise to develop new patterns, abstractions and parallel software in complete isolation from issues of modern hardware utilization. Challenging issues historically associated with complex systems code are now compounded within the parallel domain. These issues are manifested at all stages of software development including design, development, testing and maintenance. Programmers currently lack the essential tools to even partially automate reasoning techniques, resource utilization and system configuration management. Current trial and error strategies lack a systematic approach that will scale to growing multi-core and multi-processor environments. In fact, current algorithm and data layout conceptual models applied to design, implementation and pedagogy often conflict with effective parallelization strategies. This disertation calls for a rethinking, rebuilding and retooling of conceptual models, taking into account opportunities to introduce parallelism for multi-core architectures from the ground up. In order to establish new conceptual models, we must first 1) identify inherent complexities in multi-core development, 2) establish support strategies to make handling them more explicit and 3) evaluate the impact of these strategies in terms of proposed software development practices and tool support. / Graduate software engineering multi-core parallel programming
15	Power Aware WCET Analysis Bao, Wenlei 25 September 2014 (has links) No description available. Electrical Engineering power, WCET, multi-core
16	Parallelize streaming applications on Microgrid CPUs: A novel application on a scalable, multicore architecture. Mishra, Abhishek 29 September 2014 (has links) No description available. Computer Science multi-core, CMP, streaming applications
17	A hybrid MPI/OpenMP parallelization of the adaptive integral method for multi-core clusters Wei, Fangzhou 02 August 2011 (has links) A hybrid of message passing and shared memory techniques is presented for scalable parallelization of the adaptive integral method (AIM), an FFT based algorithm, on clusters of identical multi-core processors. The proposed hybrid MPI/OpenMP parallelization scheme is based on a nested one-dimensional (1-D) slab decomposition of the 3-D auxiliary uniform grid and the associated AIM calculations: If there are M processors and T cores per processor, the scheme (i) divides the uniform grid into M slabs and MT sub-slabs, (ii) assigns each slab/sub-slab and the associated operations to one of the processors/cores, and (iii) uses MPI for inter-processor data communication and OpenMP for intra-processor data exchange. The MPI/OpenMP parallel AIM is used to accelerate the MOM solution of combined-field integral equations pertinent to the analysis of scattering from perfectly conducting surfaces. The scalability and efficiency of the implementation are investigated theoretically and verified numerically by solving benchmark scattering problems on a (near) petaflop supercomputing cluster of quad-core processors. The timing and speedup results on up to 1024 processors show that the proposed hybrid MPI/OpenMP parallelization exhibits better strong scalability (fixed problem size speedup) compared to pure MPI parallelization when multiple cores are used on each processor. / text AIM CFIE MPI/OpenMP Multi-core processor cluster Multi-core processing Computer architecture
18	Performance Modeling of Multi-core Systems : Caches and Locks Pan, Xiaoyue January 2016 (has links) Performance is an important aspect of computer systems since it directly affects user experience. One way to analyze and predict performance is via performance modeling. In recent years, multi-core systems have made processors more powerful while keeping power consumption relatively low. However the complicated design of these systems makes it difficult to analyze performance. This thesis presents performance modeling techniques for cache performance and synchronization cost on multi-core systems. A cache can be designed in many ways with different configuration parameters including cache size, associativity and replacement policy. Understanding cache performance under different configurations is useful to explore the design choices. We propose a general modeling framework for estimating the cache miss ratio under different cache configurations, based on the reuse distance distribution. On multi-core systems, each core usually has a private cache. Keeping shared data in private caches coherent has an extra cost. We propose three models to estimate this cost, based on information that can be gathered when running the program on a single core. Locks are widely used as a synchronization primitive in multi-threaded programs on multi-core systems. While they are often necessary for protecting shared data, they also introduce lock contention, which causes performance issues. We present a model to predict how much contention a lock has on multi-core systems, based on information obtainable from profiling a run on a single core. If lock contention is shown to be a performance bottleneck, one of the ways to mitigate it is to use another lock implementation. However, it is costly to investigate if adopting another lock implementation would reduce lock contention since it requires reimplementation and measurement. We present a model for forecasting lock contention with another lock implementation without replacing the current lock implementation. performance modeling performance analysis multi-core cache lock
19	Static Execution Time Analysis of Parallel Systems Gustavsson, Andreas January 2016 (has links) The past trend of increasing processor throughput by increasing the clock frequency and the instruction level parallelism is no longer feasible due to extensive power consumption and heat dissipation. Therefore, the current trend in computer hardware design is to expose explicit parallelism to the software level. This is most often done using multiple, relatively slow and simple, processing cores situated on a single processor chip. The cores usually share some resources on the chip, such as some level of cache memory (which means that they also share the interconnect, e.g., a bus, to that memory and also all higher levels of memory). To fully exploit this type of parallel processor chip, programs running on it will have to be concurrent. Since multi-core processors are the new standard, even embedded real-time systems will (and some already do) incorporate this kind of processor and concurrent code. A real-time system is any system whose correctness is dependent both on its functional and temporal behavior. For some real-time systems, a failure to meet the temporal requirements can have catastrophic consequences. Therefore, it is crucial that methods to derive safe estimations on the timing properties of parallel computer systems are developed, if at all possible. This thesis presents a method to derive safe (lower and upper) bounds on the execution time of a given parallel system, thus showing that such methods must exist. The interface to the method is a small concurrent programming language, based on communicating and synchronizing threads, that is formally (syntactically and semantically) defined in the thesis. The method is based on abstract execution, which is itself based on abstract interpretation techniques that have been commonly used within the field of timing analysis of single-core computer systems, to derive safe timing bounds in an efficient (although, over-approximative) way. The thesis also proves the soundness of the presented method (i.e., that the estimated timing bounds are indeed safe) and evaluates a prototype implementation of it. / Den strategi som historiskt sett använts för att öka processorers prestanda (genom ökad klockfrekvens och ökad instruktionsnivåparallellism) är inte längre hållbar på grund av den ökade energikonsumtion som krävs. Därför är den nuvarande trenden inom processordesign att låta mjukvaran påverka det parallella exekveringsbeteendet. Detta görs vanligtvis genom att placera multipla processorkärnor på ett och samma processorchip. Kärnorna delar vanligtvis på några av processorchipets resurser, såsom cache-minne (och därmed också det nätverk, till exempel en buss, som ansluter kärnorna till detta minne, samt alla minnen på högre nivåer). För att utnyttja all den prestanda som denna typ av processorer erbjuder så måste mjukvaran som körs på dem kunna delas upp över de tillgängliga kärnorna. Eftersom flerkärniga processorer är standard idag så måste även realtidssystem baseras på dessa och den nämnda typen av kod. Ett realtidssystem är ett datorsystem som måste vara både funktionellt och tidsmässigt korrekt. För vissa typer av realtidssystem kan ett inkorrekt tidsmässigt beteende ha katastrofala följder. Därför är det ytterst viktigt att metoder för att analysera och beräkna säkra gränser för det tidsmässiga beteendet hos parallella datorsystem tas fram. Denna avhandling presenterar en metod för att beräkna säkra gränser för exekveringstiden hos ett givet parallellt system, och visar därmed att sådana metoder existerar. Gränssnittet till metoden är ett litet formellt definierat trådat programmeringsspråk där trådarna tillåts kommunicera och synkronisera med varandra. Metoden baseras på abstrakt exekvering för att effektivt beräkna de säkra (men ofta överskattade) gränserna för exekveringstiden. Abstrakt exekvering baseras i sin tur på abstrakta interpreteringstekniker som vida används inom tidsanalys av sekventiella datorsystem. Avhandlingen bevisar även korrektheten hos den presenterade metoden (det vill säga att de beräknade gränserna för det analyserade systemets exekveringstid är säkra) och utvärderar en prototypimplementation av den. / Worst-Case Execution Time Analysis of Parallel Systems / RALF3 - Software for Embedded High Performance Architectures WCET analysis parallel systems multi-core multicore threaded programming language
20	An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor Parker, Samuel J. January 2015 (has links) Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture, up to eight cores, when the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration). 005.2

Search results