21 |
Leveraging Processor-diversity For Improved Performance In Heterogeneous-ISA Systems
Pang, Yihan 05 November 2019
The purpose of this thesis is to investigate the effectiveness of executing High Performance Computing (HPC) workloads on multiprocessors with heterogeneous Instruction Set Architecture (ISA) cores. ISA-heterogeneity in processor designs gives researchers a unique dimension along which to explore performance benefits through diversity in design choices. Additionally, each application runs best on one processor in a given group of processors (we call this property processor-preference), and processor-preference is strongly affected by processor design choices. Thus, a system with heterogeneous-ISA cores offers an intriguing design perspective: packing heterogeneous-ISA cores that complement each other under dynamic workloads into the same processor or system. To exploit the potential of this idea, this thesis considers dynamically migrating applications with different processor-preferences across ISA-different cores. With SIMD instructions getting more attention from chip designers, this thesis also presents the modifications that a general compiler/run-time infrastructure needs in order to transform the dynamic program state of SIMD regions at run-time from one ISA format to another for cross-ISA migration and execution. Lastly, this thesis presents a processor-preference-aware scheduling policy that makes dynamic cross-ISA migration decisions that improve overall system throughput compared to homogeneous-ISA systems. This thesis prototypes a heterogeneous-ISA system using an Intel Xeon Gold 5118 x86-64 server and a Cavium ThunderX ARMv8 server and evaluates the effectiveness of our infrastructure and scheduling policy. Our results reveal that processor-preference-aware heterogeneous-ISA systems with cross-ISA execution migration capability can yield throughput gains of up to 36% compared to traditional homogeneous-ISA systems. / Master of Science / The author of this thesis has a family full of non-engineers.
To persuade family members that the work of this thesis is meaningful (that is, that the author is not procrastinating in school), the author decided to draw an analogy between processors and cars.
Suppose that, in an alternative universe, cars (systems) can be powered by engines (processors) that use one of two different fuel sources (ISAs): either gasoline or electricity (single-ISA processors), but not both (heterogeneous-ISA). Car manufacturers (chip designers) can build engines with different design choices (processors with varying design options): engines combined with turbochargers for gasoline-powered cars, or high-performance batteries combined with energy-efficient batteries for electric-powered cars (added extended instruction sets, CPU designs that target vastly different use cases, etc.). However, each design choice only improves performance for engines of one specific fuel source. For example, having battery alternatives has no performance impact on gasoline-powered engines. As time passes, car manufacturers have exhausted their options for drastically improving existing engine designs (limited performance gains in recent chips).
To tackle this problem, in this thesis, the author first examined how cars are used: driving on the road (running applications). The author's study found that no single engine is suitable for all routes (no single processor is good for all workloads), and that cars powered by engines with different fuel sources showed a significant diversity in performance (application performance varies drastically between systems with processors built on different ISAs). Gasoline-powered cars perform well on high-speed roads, whereas electric-powered cars perform well on low-speed roads. Unfortunately, in real life, a person's commute (a workload of applications) consists of a mixture of high-speed and low-speed roads, and one cannot know beforehand the exact percentage of each kind of road they will travel (the exact application composition of a workload). Therefore, it is challenging for a person to select the right car for the upcoming commute (choose the right system for a workload).
This thesis tries to solve this commuting problem by building a car that has multiple engines fitted to suit different road needs (systems with processors that have vastly different use cases). This thesis looks at one particular dimension: combining engines powered by different fuels in the same car (a system with heterogeneous-ISA processors). The author believes that adding diversity in fuel-powered engine selection provides an exciting dimension in car design choices (adding ISA-heterogeneity in processors provides a unique dimension in system design). Thus, this thesis focuses on estimating a theoretical multi-fuel car's performance by combining two cars powered by different fuels into a single mega-car using a framework (Popcorn Linux). This framework allows the mega-car to be driven by a combined fuel source, with fuel intake transferring freely between fuel sources (cross-ISA migration and execution) based on road conditions (the application encountered). Based on the evaluation of this new prototype, the author finds that in a real-life scenario (a workload with a mixed combination of applications), cars with engines for multiple fuel sources perform better than two single-fuel cars (systems with heterogeneous-ISA processors perform better than systems with homogeneous-ISA processors). The author hopes that this study can help lay the foundation for the future development of hybrid cars (systems with heterogeneous ISAs in the same processor) and, for now, encourage modifying existing cars into mega-cars with multiple engines suited to different road needs for improved commute performance.
Ultimately, this thesis is not about cars. The author hopes that by explaining the research in this thesis through cars, general audiences can understand what this work investigates and what solution it provides. In this work, we investigate the potential of a system with heterogeneous-ISA processors. This thesis prototypes one such system and finds, over a series of experimental evaluations, that heterogeneous-ISA systems have performance benefits over traditional homogeneous-ISA systems.
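A toy sketch of what a processor-preference-aware placement decision could look like. This is not the scheduling policy from the thesis: the `Job` class, the per-ISA runtime estimates, and the greedy rule are all hypothetical, and the sketch ignores migration cost and program-state transformation entirely.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    runtime: dict  # hypothetical per-ISA runtime estimates in seconds


def preferred_isa(job):
    """Processor-preference: the ISA with the lowest estimated runtime."""
    return min(job.runtime, key=job.runtime.get)


def place(jobs, isas=("x86-64", "armv8")):
    """Greedily put each job on the ISA where it would finish earliest."""
    busy_until = {isa: 0.0 for isa in isas}
    placement = {}
    # Handle the longest jobs first so that short ones can fill the gaps.
    for job in sorted(jobs, key=lambda j: min(j.runtime.values()), reverse=True):
        best = min(isas, key=lambda isa: busy_until[isa] + job.runtime[isa])
        busy_until[best] += job.runtime[best]
        placement[job.name] = best
    return placement, max(busy_until.values())
```

With two jobs that prefer different ISAs, each lands on its preferred machine and the two run side by side, which is the intuition behind the throughput gain reported in the abstract.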
|
22 |
SIMD-Swift: Improving Performance of Swift Fault Detection
Oleksenko, Oleksii 20 January 2016
The general tendency in modern hardware is an increase in fault rates, caused by decreased operating voltages and feature sizes. Previously, hardware faults were mainly addressed only in high-availability enterprise servers and in safety-critical applications, such as those in the transport or aerospace domains. These fields generally have very tight requirements, but also higher budgets. However, as fault rates increase, fault tolerance solutions are starting to be required in applications with much smaller profit margins as well. This brings to the fore the idea of software-implemented hardware fault tolerance, that is, the ability to detect and tolerate hardware faults using software-based techniques on commodity CPUs, which provides resilience almost for free. Current solutions, however, are lacking in performance, even though they show quite good fault tolerance results.
This thesis explores the idea of using Single Instruction Multiple Data (SIMD) technology to execute all of a program's operations on two copies of the same data. This idea is based on the observation that SIMD is ubiquitous in modern CPUs and is usually an underutilized resource. It allows us to detect bit-flips in hardware by a simple comparison of the two copies, under the assumption that only one copy is affected by a fault.
We implemented this idea as a source-to-source compiler which hardens a program at the source code level. The evaluation of our several implementations shows that the approach is beneficial for applications dominated by arithmetic or logical operations, whereas those with more control-flow or memory operations actually perform better with regular instruction replication. For example, we achieved a performance overhead of only 15% on the Fast Fourier Transformation benchmark, which is dominated by arithmetic instructions, while the memory-access-dominated Dijkstra algorithm showed a high overhead of 200%.
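The duplicate-and-compare idea can be illustrated with a small emulation. This is only a sketch: a 64-bit integer stands in for a SIMD register holding two 32-bit lanes, and the helper names (`duplicate`, `lane_add`, `check`, `flip_bit`) are hypothetical, not taken from the thesis or from any SIMD library.

```python
MASK32 = 0xFFFFFFFF


def duplicate(x):
    """Place the same 32-bit value in both lanes of a 64-bit word."""
    x &= MASK32
    return (x << 32) | x


def lane_add(a, b):
    """Add the two 32-bit lanes independently (no carry between lanes)."""
    lo = ((a & MASK32) + (b & MASK32)) & MASK32
    hi = (((a >> 32) & MASK32) + ((b >> 32) & MASK32)) & MASK32
    return (hi << 32) | lo


def check(v):
    """Return the value if both lanes agree, otherwise report a fault."""
    lo, hi = v & MASK32, (v >> 32) & MASK32
    if lo != hi:
        raise RuntimeError("fault detected: lanes differ")
    return lo


def flip_bit(v, bit):
    """Simulate a single bit-flip in the packed word."""
    return v ^ (1 << bit)
```

A fault that hits only one lane makes the two copies disagree, so `check` raises; a fault that hit both lanes identically would go unnoticed, which is why the abstract states the single-copy assumption.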
|
23 |
Parallell beräkning av omslutande volymer / Parallel Computation of Bounding Volumes
Winberg, Olov, Karlsson, Mattias January 2010
This paper presents techniques for speeding up commonly used algorithms for bounding volume (BV) computation, such as the AABB, sphere and k-DOP. By exploiting the possibilities of parallelism in modern processors, the result exceeds the expected theoretical result. The methods focus on data-level parallelism (DLP) using Intel's SSE instructions, for operations on 4 parallel independent single-precision floating-point values, with a theoretical speed-up factor of 4 on data throughput. Still, speed-up factors between 7 and 9 are shown in the computation of AABBs and k-DOPs. For the computation of tight-fitting spheres, the speed-up factor halts at approximately 4 due to a limiting data dependency. In addition, further parallelization by multithreading the algorithms on multi-core CPUs shows speed-up factors of 14 on 2 cores, reaching 25 on 4 cores, compared to non-parallel algorithms.
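The 4-wide data-level parallelism described above can be sketched as follows. This is a plain-Python emulation, not the authors' SSE code: each list comprehension stands in for one packed min or max instruction over four lanes, and the function names and the structure-of-arrays layout are assumptions made for illustration.

```python
def min_max_lanes(values, width=4):
    """Min and max of one coordinate array, processed `width` values at a time."""
    if not values or len(values) % width != 0:
        raise ValueError("length must be a positive multiple of the lane width")
    lo = list(values[:width])
    hi = list(values[:width])
    for i in range(width, len(values), width):
        chunk = values[i:i + width]
        lo = [min(a, b) for a, b in zip(lo, chunk)]  # one packed-min step
        hi = [max(a, b) for a, b in zip(hi, chunk)]  # one packed-max step
    return min(lo), max(hi)  # final horizontal reduction across the lanes


def aabb(points):
    """Axis-aligned bounding box of (x, y, z) points as (min_corner, max_corner)."""
    xs, ys, zs = zip(*points)  # structure-of-arrays layout, one array per axis
    bounds = [min_max_lanes(axis) for axis in (xs, ys, zs)]
    return tuple(b[0] for b in bounds), tuple(b[1] for b in bounds)
```

The running minima and maxima of the four lanes never depend on each other inside the loop, which is what makes the AABB a good fit for packed instructions; only the last reduction step is sequential.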
|
25 |
Adaptation du calcul de la Transformée de Fourier Rapide sur une architecture mixte CPU/GPU intégrée / Adaptation of the Fast Fourier Transform processing on hybride integrated CPU/GPU architecture
Bergach, Mohamed Amine 02 October 2015
Intel Core multicore architectures (Ivy Bridge, Haswell, ...) contain both general-purpose CPU cores (4) and dedicated GPU cores embedded on the same chip (16 and 40 respectively). As part of the activity of Kontron (the company partially funding this CIFRE scholarship), an important objective is to efficiently compute arrays and sequences of fast Fourier transforms (FFT) on this architecture, such as those found in radar applications. While native (but proprietary) libraries exist from Intel for the CPU, nothing comparable is currently available for the GPU part. The aim of the thesis was therefore to define an efficient placement of FFT modules, studying at a theoretical level the optimal way to group the computing stages of such an FFT according to data locality on a single computing core. This choice should allow efficient processing, by matching the available memory size to the size of the required data. The multiplicity of cores then remains exploitable to compute several FFTs in parallel, without interference (except for possible bus contention between the CPU and the GPU). We have achieved significant results, both in the implementation of an FFT (1024 points) on a SIMD CPU core, expressed in C, and in the implementation of an FFT of the same size on a SIMT GPU core, expressed in OpenCL. In addition, our results make it possible to define rules to automatically synthesize such solutions, based solely on the size of the FFT (more specifically its number of stages) and the size of the local memory of a given computing core. The performance obtained is better than that of the native Intel library for the CPU, and demonstrates a significant gain in power consumption on the GPU. All these points are detailed in the thesis document. These results should lead to exploitation within Kontron.
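The dependence on the number of stages and on local memory size can be made concrete with a small planning sketch. It is not the synthesis rule from the thesis; it only relies on the general property that g consecutive radix-2 stages can be computed on independent sub-blocks of 2^g points. The function name and the `local_mem_points` parameter (how many complex points fit in one core's local memory) are hypothetical.

```python
def plan_stage_groups(n_points, local_mem_points):
    """Split the radix-2 stages of an FFT into groups that fit in local memory.

    Returns a list with the number of stages in each group.
    """
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")
    if local_mem_points < 2:
        raise ValueError("local memory must hold at least two points")
    stages = n_points.bit_length() - 1  # log2(n_points)
    per_group = min(stages, local_mem_points.bit_length() - 1)  # floor(log2)
    groups = []
    done = 0
    while done < stages:
        g = min(per_group, stages - done)
        groups.append(g)
        done += g
    return groups
```

For the 1024-point case mentioned in the abstract there are 10 radix-2 stages; under these assumptions a local memory of 64 points would give two groups of 6 and 4 stages, meaning the data makes two round trips through global memory instead of ten.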
|
26 |
SIMD-Swift: Improving Performance of Swift Fault Detection
Oleksenko, Oleksii 02 December 2015
The general tendency in modern hardware is an increase in fault rates, caused by decreased operating voltages and feature sizes. Previously, hardware faults were mainly addressed only in high-availability enterprise servers and in safety-critical applications, such as those in the transport or aerospace domains. These fields generally have very tight requirements, but also higher budgets. However, as fault rates increase, fault tolerance solutions are starting to be required in applications with much smaller profit margins as well. This brings to the fore the idea of software-implemented hardware fault tolerance, that is, the ability to detect and tolerate hardware faults using software-based techniques on commodity CPUs, which provides resilience almost for free. Current solutions, however, are lacking in performance, even though they show quite good fault tolerance results.
This thesis explores the idea of using Single Instruction Multiple Data (SIMD) technology to execute all of a program's operations on two copies of the same data. This idea is based on the observation that SIMD is ubiquitous in modern CPUs and is usually an underutilized resource. It allows us to detect bit-flips in hardware by a simple comparison of the two copies, under the assumption that only one copy is affected by a fault.
We implemented this idea as a source-to-source compiler which hardens a program at the source code level. The evaluation of our several implementations shows that the approach is beneficial for applications dominated by arithmetic or logical operations, whereas those with more control-flow or memory operations actually perform better with regular instruction replication. For example, we achieved a performance overhead of only 15% on the Fast Fourier Transformation benchmark, which is dominated by arithmetic instructions, while the memory-access-dominated Dijkstra algorithm showed a high overhead of 200%.
|
27 |
Modeling and algorithm adaptation for a novel parallel DSP processor / Modellering och algorithm-anpassning för en ny parallell DSP-processor
Kraigher, Olof, Olsson, Johan January 2009
The P3RMA (Programmable, Parallel, and Predictable Random Memory Access) processor, currently being developed at Linköping University, Sweden, is an attempt to solve the problems of parallel computing by using a parallel memory subsystem and separating the complexity of address computations from the complexity of data computations. It targets embedded, low-power, low-cost computing for mobile phones, handsets and base stations, among many others. By studying the radix-2 FFT using the P3RMA concept, we have shown that even algorithms with a complex addressing pattern can be adapted to fully utilize a parallel datapath while requiring only simple additional addressing hardware. By supporting this algorithm with a SIMT instruction, almost 100% utilization of the datapath can be achieved. A simulator framework for this processor has been proposed and implemented. The simulator has a very flexible structure, featuring modular addition of new instructions and configurable hardware parameters. It might be used by hardware and firmware developers in the future.
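To show why the radix-2 FFT counts as having a complex addressing pattern, the following sketch lists which element pairs each stage combines in the textbook in-place, decimation-in-time formulation with bit-reversed input. It says nothing about how P3RMA generates these addresses in hardware; the function names are hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (input reordering step)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def butterfly_pairs(n, stage):
    """Index pairs combined by butterflies in `stage` (1-based) of an n-point FFT."""
    half = 1 << (stage - 1)  # distance between the two inputs of a butterfly
    span = half * 2          # size of each independent sub-transform
    return [(k + j, k + j + half)
            for k in range(0, n, span)
            for j in range(half)]
```

The distance between paired elements doubles at every stage, so a fixed mapping of data to parallel memory banks that is conflict-free for one stage is not automatically conflict-free for the next.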
|
28 |
Implementation of Action Recognition Algorithm on Multiple-Streaming Multimedia Unit
Lin, Tzu-chun 03 August 2010
Action recognition has developed rapidly and is broadly applied in several sectors, from homeland security, personal property protection and home care to smart environments and motion-sensing games.
This paper analyzes an action recognition algorithm for embedded systems and finds that many of its blocks can be executed in parallel to compute more efficiently. It then implements the action recognition algorithm on the Multiple-Streaming Multimedia Unit (MSMU). MSMU is an MMX-like SIMD architecture with SIMD operations and data storage. By introducing the concept of multiple streaming, MSMU can adjust the number of parallel data streams dynamically by switching the instruction mode. Using mode switching and newly added transfer instructions for 2D image processing, we study the benefit of instruction mode switching.
By comparing the 128-bit SSE architecture and the MSMU architecture on a practical example, we highlight the problems faced when exploiting subword parallelism and bring out the advantages of multistreaming.
On the algorithm side, we study slicing the minimum element and using bitwise operations to improve efficiency. Compared to the embedded SIMD architecture "WMMX", MSMU achieves a 3.49× overall speedup.
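Subword parallelism in general can be illustrated with a classic bit trick that adds four 8-bit lanes packed into one 32-bit word using only ordinary integer and bitwise operations. This is a generic sketch and not an MSMU, SSE or WMMX instruction; the names are hypothetical.

```python
LOW7 = 0x7F7F7F7F   # low seven bits of every byte lane
HIGH1 = 0x80808080  # top bit of every byte lane


def packed_add_u8(a, b):
    """Add four 8-bit lanes packed in 32-bit words, wrapping within each lane."""
    low = (a & LOW7) + (b & LOW7)  # carries cannot leave a lane here
    return (low ^ ((a ^ b) & HIGH1)) & 0xFFFFFFFF


def scalar_add_u8(a, b):
    """Reference version: unpack the lanes, add them one by one, repack."""
    out = 0
    for shift in (0, 8, 16, 24):
        lane = (((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)) & 0xFF
        out |= lane << shift
    return out
```

The masking is needed only because a plain 32-bit add would let carries spill from one pixel into its neighbour; dedicated SIMD hardware cuts the carry chain at the lane boundaries instead.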
|
29 |
MP3 Decoding Software Implementation for a DSP-enhanced Microcontroller
Chen, Shi-Wei 09 January 2004
Multimedia workloads have always played an important role in embedded applications. Products are diverse: mobile phones of various models, MP3 players that are light and convenient to carry, and PDAs that are popular with workers. We use them all the time in daily life, so these kinds of products are usually not high-priced. If their design and production costs are lower than those of competitors, they can earn profits in this competitive market. Among the many multimedia applications, MP3, the most popular, is our research target.
Multimedia audio applications are usually designed around either a high-performance CPU or a general-purpose processor combined with a DSP. Such designs do satisfy the performance demands of multimedia applications, but they also increase the system hardware cost. This is not the best solution for embedded products, where low cost matters more than high performance.
Therefore, this thesis focuses on MP3 algorithm optimization. We analyzed the MP3 decoder algorithms and identified the key operations. We then use the SIMD operations of ME-MCU, a low-cost multimedia processor developed in our lab, to accelerate processing. As a result, the MP3 decoding operations can be completed without a powerful CPU or a DSP. Through optimizing the MP3 algorithm, I also hope to provide some suggestions for modifying ME-MCU so that multimedia applications fit it better.
|
30 |
Stereoseende i realtid / Real-time Stereo Vision
Arvidsson, Lars January 2007
In this thesis, two real-time stereo methods have been implemented and evaluated. The first is based on block matching and the second on local phase. The goal was to be able to run the algorithms in real time and to examine which one works best. The block matching method performed better than the phase-based method, both in speed and accuracy. SIMD (Single Instruction Multiple Data) operations on the processor were used, giving a speed-up by a factor of two.
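A minimal sketch of block matching on a single scanline, assuming rectified images and a sum-of-absolute-differences (SAD) cost. The abstract does not say which cost function or search range the thesis uses, so the `block` and `max_disp` parameters and the function names are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equally long pixel blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))


def disparity_at(left, right, x, block=3, max_disp=4):
    """Disparity of pixel `x` in the left scanline.

    The block centred on `x` in the left row is compared with blocks shifted
    by d = 0..max_disp pixels to the left in the right row; the shift with
    the lowest SAD cost wins. Assumes the block around `x` lies inside the row.
    """
    half = block // 2
    reference = left[x - half:x + half + 1]
    best_d, best_cost = 0, None
    for d in range(max_disp + 1):
        start = x - half - d
        if start < 0:  # candidate block would leave the image
            break
        cost = sad(reference, right[start:start + block])
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

The inner SAD loop applies the same subtract, absolute-value and accumulate steps to neighbouring pixels, which is the kind of work SIMD operations can process several pixels at a time.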
|