Global ETD Search

21	Оптимизација CFD симулације на групама вишејезгарних хетерогених архитектура / Optimizacija CFD simulacije na grupama višejezgarnih heterogenih arhitektura / Optimization of CFD simulations on groups of many-core heterogeneous architectures Tekić Jelena 07 October 2019 (has links) <p>Предмет  истраживања  тезе  је  из области  паралелног  програмирања,<br />имплементација  CFD  (Computational Fluid  Dynamics)  методе  на  више<br />хетерогених  вишејезгарних  уређаја истовремено.  У  раду  је  приказано<br />неколико  алгоритама  чији  је  циљ убрзање  CFD  симулације  на персоналним  рачунарима.  Показано је  да  описано  решење  постиже задовољавајуће  перформансе  и  на HPC  уређајима  (Тесла  графичким картицама).  Направљена  је симулација  у  микросервис архитектури  која  је  портабилна  и флексибилна и додатно олакшава рад на персоналним рачунарима.</p> / <p>Predmet  istraživanja  teze  je  iz oblasti  paralelnog  programiranja,<br />implementacija  CFD  (Computational Fluid  Dynamics)  metode  na  više<br />heterogenih  višejezgarnih  uređaja istovremeno.  U  radu  je  prikazano<br />nekoliko  algoritama  čiji  je  cilj ubrzanje  CFD  simulacije  na personalnim  računarima.  Pokazano je  da  opisano  rešenje  postiže zadovoljavajuće  performanse  i  na HPC  uređajima  (Tesla  grafičkim karticama).  Napravljena  je simulacija  u  mikroservis arhitekturi  koja  je  portabilna  i fleksibilna i dodatno olakšava rad na personalnim računarima.</p> / <p>The  case  study  of  this  dissertation belongs  to  the  field  of  parallel programming,  the  implementation  of CFD  (Computational  Fluid  Dynamics) method  on  several  heterogeneous multiple  core  devices  simultaneously. The  paper  presents  several  algorithms aimed  at  accelerating  CFD  simulation on common computers. Also it has been shown  that  the  described  solution achieves  satisfactory  performance  on<br />HPC  devices  (Tesla  graphic  cards). Simulation is  created    in  micro-service architecture that is portable and flexible and  makes  it  easy  to  test  CFD<br />simulations on common computers.</p>
22	PERFORMANCE EVALUATION OF MEMORY AND COMPUTATIONALLY BOUND CHEMISTRY APPLICATIONS ON STREAMING GPGPUS AND MULTI-CORE X86 CPUS Weber III, Frederick E 01 May 2010 (has links) In recent years, multi-core processors have come to dominate the field in desktop and high performance computing. Graphics processors traditionally used in CAD, video games, and other 3-d applications, have become more programmable and are now suitable for general purpose computing. This thesis explores multi-core processors and GPU performance and limitations in two computational chemistry applications: a memory bound component of ab-initio modeling and a computationally bound Monte Carlo simulation. For the applications presented in this thesis, exploiting multiple processors is done using a variety of tools and languages including OpenMP and MKL. Brook+ and the Compute Abstraction Layer streaming environments are used to accelerate applications on AMD GPUs. This thesis gives qualitative assertions about these languages and tools regarding ease of use and optimization in addition to quantitative analyses of performance. GPUs can yield modest performance improvements with little effort in some applications and even larger speedups with simple optimizations. GPU multi-core Monte Carlo parallel computing Computer and Systems Architecture
23	A Multi-core processor for hard real-time systems Paolieri, Marco 04 November 2011 (has links) The increasing demand for new functionalities in current and future hard real-time embedded systems, like the ones deployed in automotive and avionics industries, is driving an increment in the performance required in current embedded processors. Multi-core processors represent a good design solution to cope with such higher performance requirements due to their better performance-per-watt ratio while maintaining the core design simple. Moreover, multi-cores also allow executing mixed-criticality level workloads composed of tasks with and without hard real-time requirements, maximizing the utilization of the hardware resources while guaranteeing low cost and low power consumption. Despite those benefits, current multi-core processors are less analyzable than single-core ones due to the interferences between different tasks when accessing hardware shared resources. As a result, estimating a meaningful Worst-Case Execution Time (WCET) estimation - i.e. to compute an upper bound of the application's execution time - becomes extremely difficult, if not even impossible, because the execution time of a task may change depending on the other threads running at the same time. This makes the WCET of a task dependent on the set of inter-task interferences introduced by the co-running tasks. Providing a WCET estimation independent from the other tasks (time composability property) is a key requirement in hard real-time systems. This thesis proposes a new multi-core processor design in which time composability is achieved, hence enabling the use of multi-cores in hard real-time systems. With our proposals the WCET estimation of a HRT is independent from the other co-running tasks. To that end, we design a multi-core processor in which the maximum delay a request from a Hard Real-time Task (HRT), accessing a hardware shared resource can suffer due to other tasks is bounded: our processor guarantees that a request to a shared resource cannot be delayed longer than a given Upper Bound Delay (UBD). In addition, the UBD allows identifying the impact that different processor configurations may have on the WCET by determining the sensitivity of a HRT to different resource allocations. This thesis proposes an off-line task allocation algorithm (called IA3: Interference-Aware Allocation Algorithm), that allocates tasks in a task set based on the HRT's sensitivity to different resource allocations. As a result the hardware shared resources used by HRTs are minimized, by allowing Non Hard Real-time Tasks (NHRTs) to use the rest of resources. Overall, our proposals provide analyzability for the HRTs allowing NHRTs to be executed into the same chip without any effect on the HRTs. The previous first two proposals of this thesis focused on supporting the execution of multi-programmed workloads with mixed-criticality levels (composed of HRTs and NHRTs). Higher performance could be achieved by implementing multi-threaded applications. As a first step towards supporting hard real-time parallel applications, this thesis proposes a new hardware/software approach to guarantee a predictable execution of software pipelined parallel programs. This thesis also investigates a solution to verify the timing correctness of HRTs without requiring any modification in the core design: we design a hardware unit which is interfaced with the processor and integrated into a functional-safety aware methodology. This unit monitors the execution time of a block of instructions and it detects if it exceeds the WCET. Concretely, we show how to handle timing faults on a real industrial automotive platform. / La creciente demanda de nuevas funcionalidades en los sistemas empotrados de tiempo real actuales y futuros en industrias como la automovilística y la de aviación, está impulsando un incremento en el rendimiento necesario en los actuales procesadores empotrados. Los procesadores multi-núcleo son una solución eficiente para obtener un mayor rendimiento ya que aumentan el rendimiento por vatio, manteniendo el diseño del núcleo simple. Por otra parte, los procesadores multi-núcleo también permiten ejecutar cargas de trabajo con niveles de tiempo real mixtas (formadas por tareas de tiempo real duro y laxo así como tareas sin requerimientos de tiempo real), maximizando así la utilización de los recursos de procesador y garantizando el bajo consumo de energía. Sin embargo, a pesar los beneficios mencionados anteriormente, los actuales procesadores multi-núcleo son menos analizables que los de un solo núcleo debido a las interferencias surgidas cuando múltiples tareas acceden simultáneamente a los recursos compartidos del procesador. Como resultado, la estimación del peor tiempo de ejecución (conocido como WCET) - es decir, una cota superior del tiempo de ejecución de la aplicación - se convierte en extremadamente difícil, si no imposible, porque el tiempo de ejecución de una tarea puede cambiar dependiendo de las otras tareas que se estén ejecutando concurrentemente. Determinar una estimación del WCET independiente de las otras tareas es un requisito clave en los sistemas empotrados de tiempo real duro. Esta tesis propone un nuevo diseño de procesador multi-núcleo en el que el tiempo de ejecución de las tareas se puede componer, lo que permitirá el uso de procesadores multi-núcleo en los sistemas de tiempo real duro. Para ello, diseñamos un procesador multi-núcleo en el que la máxima demora que puede sufrir una petición de una tarea de tiempo real duro (HRT) para acceder a un recurso hardware compartido debido a otras tareas está acotado, tiene un límite superior (UBD). Además, UBD permite identificar el impacto que las diferentes posibles configuraciones del procesador pueden tener en el WCET, mediante la determinación de la sensibilidad en la variación del tiempo de ejecución de diferentes reservas de recursos del procesador. Esta tesis propone un algoritmo estático de reserva de recursos (llamado IA3), que asigna tareas a núcleos en función de dicha sensibilidad. Como resultado los recursos compartidos del procesador usados por tareas HRT se reducen al mínimo, permitiendo que las tareas sin requerimiento de tiempo real (NHRTs) puedas beneficiarse del resto de recursos. Por lo tanto, las propuestas presentadas en esta tesis permiten el análisis del WCET para tareas HRT, permitiendo así mismo la ejecución de tareas NHRTs en el mismo procesador multi-núcleo, sin que estas tengan ningún efecto sobre las tareas HRT. Las propuestas presentadas anteriormente se centran en el soporte a la ejecución de múltiples cargas de trabajo con diferentes niveles de tiempo real (HRT y NHRTs). Sin embargo, un mayor rendimiento puede lograrse mediante la transformación una tarea en múltiples sub-tareas paralelas. Esta tesis propone una nueva técnica, con soporte del procesador y del sistema operativo, que garantiza una ejecución analizable del modelo de ejecución paralela software pipelining. Esta tesis también investiga una solución para verificar la corrección del WCET de HRT sin necesidad de ninguna modificación en el diseño de la base: un nuevo componente externo al procesador se conecta a este sin necesidad de modificarlo. Esta nueva unidad monitorea el tiempo de ejecución de un bloque de instrucciones y detecta si se excede el WCET. Esta unidad permite detectar fallos de sincronización en sistemas de computación utilizados en automóviles. Real-time Multi-core Processor Embedded systems Woet 004
24	Simplifying the Creation of Multi-core Processors: An Interconnection Architecture and Tool Framework Grossman, Samuel Robert January 2012 (has links) The contribution of this thesis is two-fold: an on-chip interconnection architecture designed specifically for multi-core processors and a tool framework that simplifies the process of designing a multi-core processor. Both contributions primarily target ASIC fabrication, though prototyping on an FPGA is also supported. SG-Multi, the on-chip interconnection architecture, distinguishes itself from other interconnection architectures by emphasizing universal adaptability; that is, a primary design goal is to ensure compatibility with industry-supplied cores originally intended for other architectures. This goal is achieved through the use of bus adapters and without introducing clock cycle latency. SG-Multi is a multi-bus architecture that uses slave-side arbitration and supports multiple simultaneous transactions between independent devices. All transactions are pipelined in two stages, an address phase and a data phase, and for improved performance slave devices must signal their status for a given clock cycle at the beginning of that cycle. SG-Multi Designer, the tool framework which builds systems that use SG-Multi, provides a higher level of abstraction compared to other competing system-building solutions; the set of components with which a designer must be concerned is much more limited, and low-level details such as hardware interface compatibility are removed from active consideration. Experimental results demonstrate that the hardware cost of using SG-Multi is reasonable compared to using a processor's native bus architecture, although the current implementation of arbitration is identifiable as an area for future improvement. It is also shown that SG-Multi is scalable; the reference systems grow linearly with respect to the number of cores when tested for ASIC fabrication and slightly sublinearly when tested for FPGA prototyping, and the maximum achievable clock frequency remains almost constant as the number of cores grows beyond four. Because the reference systems tested are an accurate reflection of the types of systems SG-Multi Designer produces, it is concluded that the abstraction model used by SG-Multi Designer does not over-simplify the design process in a way that causes excessive performance degradation or increased hardware resource consumption. multi-core EDA hardware Verilog ASIC FPGA Electrical and Computer Engineering
25	Design of an Asynchronous Ring Bus Architecture for Multi-Core Systems Lei, Kin-fong 18 August 2010 (has links) In the multi-core systems, the data transfer between cores becomes a major challenge. The on-chip interconnect networks should be low latency, high throughput, scalability, better router or arbitration strategy, and low power consumption. An asynchronous ring bus, which is 33 bit width, adopting dual-rail single-track data protocol is proposed in this thesis. It provides not only robust but also high-speed asynchronous circuits condition. Owing to asynchronous circuits design, there are different transfer times in different hop counts. The shorter the distance is, the faster the data can be transferred. Unlink the synchronous ring bus, the bus frequency must be limited by the longest hop count latency. On the other hand, the transmission time of asynchronous circuits will not be held up by the longest distance even though the number of core is increased. For providing higher throughput, multiple cores which are able to access the bus simultaneously make a direct connection between each other. In bus arbitration, distribution arbiter is adopted to arbitrate the right to use the bus and solve the collision. Finally, the system performance in different arbitration strategies has been estimated in TSMC 0.18£gm process in this thesis. The transmission time of the shortest distance is 1.5 ns approximately, and the longest distance first has a better performance in different arbitration strategies. On-Chip Interconnect Networks Asynchronous Ring Bus Multi-Core Systems
26	Asynchronous Ring Network Mechanism with A Fair Arbitration Strategy for Network on Chip Wong, Chen-Ang 14 August 2012 (has links) The multi-core systems are usually implemented on homogeneous or heterogeneous cores, in order to design the better NOC (network on chip), it must consider the performance, scalability, simplifies hardware design and arbitration strategy at the on chip network. The routers are designed with circuit-switched network, circuit switching is asynchronous circuits and routers have no queuing (buffering), therefore, it is simple and efficient in implementation. Synchronous circuit is network with a clock source, but the distributing global clock has many problems such as power consumption, increasing the area and Clock skew. Ring topology with multi-transaction bus architecture. It could make multiple packets to access the bus at the same time, so that the multi-transaction bus architecture is better to get more throughputs. When the number of cores increase, the central arbiter circuit is more complexity, this thesis presents an SAP (self-adjusting priority) schedule that can fairly adjust priorities of each component by appropriately exchanging weighting at distributed arbiter. When numerous requests encounter contention on a network, a winner owning the highest priority will exchange its priority with the lowest priority of these requests. This principle guarantees that winners will decreased the opportunity of incurring network at the next time. In opposition, these losers can obtain the higher priority than that of the original. Therefore, the proposed scheme not only offers fair strategy, but also simplifies hardware design. switch circuit multi-core systems arbitration strategy Arbiter distributed system
27	PERFORMANCE EVALUATION OF MEMORY AND COMPUTATIONALLY BOUND CHEMISTRY APPLICATIONS ON STREAMING GPGPUS AND MULTI-CORE X86 CPUS Weber III, Frederick E 01 May 2010 (has links) In recent years, multi-core processors have come to dominate the field in desktop and high performance computing. Graphics processors traditionally used in CAD, video games, and other 3-d applications, have become more programmable and are now suitable for general purpose computing. This thesis explores multi-core processors and GPU performance and limitations in two computational chemistry applications: a memory bound component of ab-initio modeling and a computationally bound Monte Carlo simulation. For the applications presented in this thesis, exploiting multiple processors is done using a variety of tools and languages including OpenMP and MKL. Brook+ and the Compute Abstraction Layer streaming environments are used to accelerate applications on AMD GPUs. This thesis gives qualitative assertions about these languages and tools regarding ease of use and optimization in addition to quantitative analyses of performance. GPUs can yield modest performance improvements with little effort in some applications and even larger speedups with simple optimizations. GPU multi-core Monte Carlo parallel computing Computer and Systems Architecture
28	Static Timing Analysis of Parallel Systems Using Abstract Execution Gustavsson, Andreas January 2014 (has links) The Power Wall has stopped the past trend of increasing processor throughput by increasing the clock frequency and the instruction level parallelism.Therefore, the current trend in computer hardware design is to expose explicit parallelism to the software level.This is most often done using multiple processing cores situated on a single processor chip.The cores usually share some resources on the chip, such as some level of cache memory (which means that they also share the interconnect, e.g. a bus, to that memory and also all higher levels of memory), and to fully exploit this type of parallel processor chip, programs running on it will have to be concurrent.Since multi-core processors are the new standard, even embedded real-time systems will (and some already do) incorporate this kind of processor and concurrent code. A real-time system is any system whose correctness is dependent both on its functional and temporal output. For some real-time systems, a failure to meet the temporal requirements can have catastrophic consequences. Therefore, it is of utmost importance that methods to analyze and derive safe estimations on the timing properties of parallel computer systems are developed. This thesis presents an analysis that derives safe (lower and upper) bounds on the execution time of a given parallel system.The interface to the analysis is a small concurrent programming language, based on communicating and synchronizing threads, that is formally (syntactically and semantically) defined in the thesis.The analysis is based on abstract execution, which is itself based on abstract interpretation techniques that have been commonly used within the field of timing analysis of single-core computer systems, to derive safe timing bounds in an efficient (although, over-approximative) way.Basically, abstract execution simulates the execution of several real executions of the analyzed program in one go.The thesis also proves the soundness of the presented analysis (i.e. that the estimated timing bounds are indeed safe) and includes some examples, each showing different features or characteristics of the analysis. / Worst-Case Execution Time Analysis of Parallel Systems / RALF3 - Software for Embedded High Performance Architectures WCET analysis parallel systems multi-core multicore threaded programming language
29	Accelerating Mixed-Abstraction SystemC Models on Multi-Core CPUs and GPUs Kaushik, Anirudh Mohan January 2014 (has links) Functional verification is a critical part in the hardware design process cycle, and it contributes for nearly two-thirds of the overall development time. With increasing complexity of hardware designs and shrinking time-to-market constraints, the time and resources spent on functional verification has increased considerably. To mitigate the increasing cost of functional verification, research and academia have been engaged in proposing techniques for improving the simulation of hardware designs, which is a key technique used in the functional verification process. However, the proposed techniques for accelerating the simulation of hardware designs do not leverage the performance benefits offered by multiprocessors/multi-core and heterogeneous processors available today. With the growing ubiquity of powerful heterogeneous computing systems, which integrate multi-processor/multi-core systems with heterogeneous processors such as GPUs, it is important to utilize these computing systems to address the functional verification bottleneck. In this thesis, I propose a technique for accelerating SystemC simulations across multi-core CPUs and GPUs. In particular, I focus on accelerating simulation of SystemC models that are described at both the Register-Transfer Level (RTL) and Transaction Level (TL) abstractions. The main contributions of this thesis are: 1.) a methodology for accelerating the simulation of mixed abstraction SystemC models defined at the RTL and TL abstractions on multi-core CPUs and GPUs and 2.) An open-source static framework for parsing, analyzing, and performing source-to-source translation of identified portions of a SystemC model for execution on multi-core CPUs and GPUs.
30	Simplifying the Creation of Multi-core Processors: An Interconnection Architecture and Tool Framework Grossman, Samuel Robert January 2012 (has links) The contribution of this thesis is two-fold: an on-chip interconnection architecture designed specifically for multi-core processors and a tool framework that simplifies the process of designing a multi-core processor. Both contributions primarily target ASIC fabrication, though prototyping on an FPGA is also supported. SG-Multi, the on-chip interconnection architecture, distinguishes itself from other interconnection architectures by emphasizing universal adaptability; that is, a primary design goal is to ensure compatibility with industry-supplied cores originally intended for other architectures. This goal is achieved through the use of bus adapters and without introducing clock cycle latency. SG-Multi is a multi-bus architecture that uses slave-side arbitration and supports multiple simultaneous transactions between independent devices. All transactions are pipelined in two stages, an address phase and a data phase, and for improved performance slave devices must signal their status for a given clock cycle at the beginning of that cycle. SG-Multi Designer, the tool framework which builds systems that use SG-Multi, provides a higher level of abstraction compared to other competing system-building solutions; the set of components with which a designer must be concerned is much more limited, and low-level details such as hardware interface compatibility are removed from active consideration. Experimental results demonstrate that the hardware cost of using SG-Multi is reasonable compared to using a processor's native bus architecture, although the current implementation of arbitration is identifiable as an area for future improvement. It is also shown that SG-Multi is scalable; the reference systems grow linearly with respect to the number of cores when tested for ASIC fabrication and slightly sublinearly when tested for FPGA prototyping, and the maximum achievable clock frequency remains almost constant as the number of cores grows beyond four. Because the reference systems tested are an accurate reflection of the types of systems SG-Multi Designer produces, it is concluded that the abstraction model used by SG-Multi Designer does not over-simplify the design process in a way that causes excessive performance degradation or increased hardware resource consumption. multi-core EDA hardware Verilog ASIC FPGA Electrical and Computer Engineering

Search results