1

Efficient use of Multi-core Technology in Interactive Desktop Applications

Karlsson, Johan January 2015 (has links)
The emergence of multi-core processors has ended the era in which applications enjoyed free and regular performance improvements without source code modifications. This thesis gathers experiences from retrofitting parallelism into a desktop application originally written for sequential execution. The main contributions are the underlying theory and the performance evaluation, experiments, and tests comparing the parallelized software regions with their sequential counterparts. Feasibility is demonstrated by putting the theory into practice in the rewrite of a complex, commercially active desktop application to support parallelism. The thesis finds no simple, guaranteed solution to the problem of making a serial application execute in parallel. However, experiments and tests prove that many of the evaluated methods offer tangible performance advantages over sequential execution.
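The kind of retrofit the thesis studies can be pictured with a small sketch (not drawn from the thesis itself): a profiled hot loop in a formerly sequential region is split across hardware threads. The `process` function is a hypothetical stand-in for the real per-item work.

```cpp
#include <algorithm>
#include <future>
#include <thread>
#include <vector>

// Hypothetical per-item work; stands in for a hot loop body found by profiling.
double process(double x) { return x * x; }

// Split an embarrassingly parallel loop across the available hardware threads.
double parallel_sum(const std::vector<double>& items) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (items.size() + n - 1) / n;
    std::vector<std::future<double>> parts;
    for (std::size_t begin = 0; begin < items.size(); begin += chunk) {
        std::size_t end = std::min(begin + chunk, items.size());
        parts.push_back(std::async(std::launch::async, [&items, begin, end] {
            double s = 0.0;
            for (std::size_t i = begin; i < end; ++i) s += process(items[i]);
            return s;
        }));
    }
    double total = 0.0;
    for (auto& f : parts) total += f.get();  // join workers, combine results
    return total;
}
```

The pattern only pays off when the loop iterations are independent, which is exactly the property that must be established region by region when retrofitting a serial application.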
2

DESIGN OF PRIORITIZED LRU CIRCUITS FOR CACHE OF MULTI-CORE REAL-TIME SYSTEMS

Gopalakrishnan, Lavanya 01 August 2011 (has links)
With the advancement of technology, multi-core processors with shared caches have come into use in real-time applications. In such systems, some cores run real-time applications while others run non-critical applications without strict deadlines. Because the cache is shared among the cores, predicting the execution time of the real-time applications becomes difficult. To address this problem, a cache memory with a prioritized replacement policy is proposed. Most prior work has been carried out in high-level hardware designs and software-based, application-level designs; to the best of my knowledge, no low-level hardware implementation of a cache with prioritized replacement circuits has been designed. My thesis focuses on designing an LRU replacement circuit that is prioritized based on the application the processor is running. Real-time applications acquire priority over other applications in using the cache memory, which promotes seamless execution of the real-time application, supports execution-time predictability, and in turn helps improve the potential of multi-core computing for real-time systems. The speed, size, and power overheads are analyzed by placing the N-way set-associative LRU circuit in a 128KB cache designed in 65nm CMOS technology.
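The thesis implements the policy as a low-level circuit; the following is only a behavioral software model of one plausible reading of prioritized LRU, in which lines filled by a real-time application are protected from eviction by non-critical requests. Field and structure names are illustrative.

```cpp
#include <cstdint>
#include <vector>

// Software model of one set of an N-way cache under a prioritized LRU policy.
struct Line {
    uint64_t tag = 0;
    bool valid = false;
    bool critical = false;   // was this line filled by a real-time application?
    unsigned age = 0;        // larger = less recently used
};

struct Set {
    std::vector<Line> ways;
    explicit Set(unsigned n) : ways(n) {}

    // Pick a victim for a request with the given priority.
    Line* victim(bool critical_request) {
        Line* best = nullptr;
        for (auto& w : ways) {
            if (!w.valid) return &w;                       // free way first
            if (!critical_request && w.critical) continue; // protected line
            if (!best || w.age > best->age) best = &w;
        }
        // Non-critical request but every way is critical: fall back to plain LRU.
        if (!best)
            for (auto& w : ways)
                if (!best || w.age > best->age) best = &w;
        return best;
    }

    void touch(Line* hit) {  // standard LRU aging on an access
        for (auto& w : ways) if (w.valid) ++w.age;
        hit->age = 0;
    }
};
```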
3

High Performance and Scalable MPI Intra-node Communication Middleware for Multi-core Clusters

Chai, Lei 27 August 2009 (has links)
No description available.
4

Programming models and execution models for parallel and hybrid architectures. Application to physics simulations.

Ospici, Matthieu 03 July 2013 (has links)
We focus in this thesis on large hybrid parallel architectures, that is, parallel architectures combining general-purpose processors (e.g. Intel Xeon) and accelerators (e.g. Nvidia GPUs). Exploiting these hybrid clusters efficiently for high-performance computing is at the heart of our work. The heterogeneity of computing resources within hybrid clusters raises many issues when large existing scientific applications are to run on them efficiently. Two main issues are addressed: the sharing of accelerators among MPI applications, and the programming and concurrent execution of code on CPUs and accelerators. Hybrid architectures are very heterogeneous: depending on the architecture, the ratio between the number of accelerators and the number of CPU cores varies widely. We therefore first propose a notion of accelerator virtualization, which gives applications the illusion that they can use a number of accelerators unrelated to the number of physical accelerators present in the hardware. An execution model based on sharing the accelerators is put in place and exposes a more homogeneous hybrid architecture to applications. We also propose extensions to MPI + threads programming models to address the problem of concurrent execution on CPUs and accelerators. For this, we propose a model based on two types of threads, CPU threads and accelerator threads, which makes it possible to set up hybrid computations that exploit CPUs and accelerators simultaneously. In both cases, deploying and executing code on the hybrid resources is crucial. We therefore propose two software libraries, S_GPU 1 and S_GPU 2, whose role is to deploy and execute computations on the hybrid hardware: S_GPU 1 handles virtualization, and S_GPU 2 handles concurrent CPU-accelerator execution. To observe the deployment and execution of code on complex GPU-based architectures, we integrated tracing mechanisms that allow the behavior of programs using our libraries to be analyzed. Our proposals were validated on two large scientific applications: BigDFT (ab initio simulation) and SPECFEM3D (simulation of seismic waves), which we adapted to use S_GPU 1 (for BigDFT) and S_GPU 2 (for SPECFEM3D).
5

Energy-aware Thread and Data Management in Heterogeneous Multi-Core, Multi-Memory Systems

Su, Chun-Yi 03 February 2015 (has links)
By 2004, microprocessor design had shifted to multicore scaling, increasing the number of cores per die in each generation, as the primary strategy for improving performance. These multicore processors are typically equipped with multiple memory subsystems to improve data throughput. In addition, such systems employ heterogeneous processors, such as GPUs, and heterogeneous memories, such as non-volatile memory, to improve performance, capacity, and energy efficiency. With the growing volume of hardware resources and the system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research on improving performance and energy efficiency in heterogeneous multi-core, multi-memory systems focused on tuning a single primitive, or at best a few primitives, in the system. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. Moreover, the shift from simple, homogeneous systems to heterogeneous multicore, multi-memory systems requires an in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to control available resources efficiently has become a daunting challenge, and automated resource management remains a dark art, since the tradeoffs among programmability, energy, and performance are insufficiently understood. In this dissertation, I develop theories, models, and resource management techniques that enable energy-efficient execution of parallel applications through thread and data management in heterogeneous multi-core, multi-memory systems. I study the effect of dynamic concurrency throttling on the performance and energy of multi-core, non-uniform memory access (NUMA) systems. I use critical path analysis to quantify memory contention in the NUMA memory system and to determine thread mappings, and I implement a runtime system that combines concurrency throttling with a novel thread mapping algorithm to manage thread resources and improve energy-efficient execution in multi-core NUMA systems. I also propose an analytical model, based on queuing theory, that captures the important factors in multi-core, multi-memory systems and quantifies the tradeoff between performance and energy. The model considers these factors holistically, providing a general view of performance and energy consumption in contemporary systems. Finally, I focus on resource management for future heterogeneous memory systems, which may combine two heterogeneous memories to scale out memory capacity while keeping power use reasonable. I present a new memory controller design that combines the best aspects of two baseline heterogeneous page-management policies, migrating data between the two memories so as to optimize performance and energy. / Ph. D.
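Dynamic concurrency throttling can be sketched as a search over thread counts for a parallel region, scored here by energy-delay product. The `read_energy` stub stands in for a platform counter such as RAPL, and the exhaustive probe is illustrative rather than the dissertation's algorithm.

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Placeholder for a platform energy counter (e.g. RAPL); returns joules so far.
// A real implementation would read /sys/class/powercap or model-specific registers.
double read_energy() { return 0.0; }  // stub: assumes an external counter exists

// Probe a parallel region at several thread counts and keep the count with the
// lowest energy-delay product; that count is then used for later instances.
int best_thread_count(const std::function<void(int)>& region,
                      const std::vector<int>& candidates) {
    int best = candidates.front();
    double best_edp = 1e300;
    for (int t : candidates) {
        double e0 = read_energy();
        auto t0 = std::chrono::steady_clock::now();
        region(t);  // run the region with t threads (e.g. via a thread pool)
        double secs = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - t0).count();
        double edp = (read_energy() - e0) * secs;  // energy x delay
        if (edp < best_edp) { best_edp = edp; best = t; }
    }
    return best;
}
```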
6

Extension of the SkePU Skeleton Programming Framework for Multi-core CPU and Multi-GPU Systems for MPI-based Clusters

Mangaraj, Swadhin K January 2013 (has links)
SkePU (Skeleton Programming Framework for Multi-core CPU and Multi-GPU Systems) is a parallel computing framework developed by Johan Enmyren and Christoph Kessler at Linköpings Universitet. This C++ template library provides a simple and unified interface for specifying data-parallel computations with the help of skeletons, and it targets multiple backends: a sequential CPU, parallel CPUs using MPI and OpenMP, or GPUs using CUDA and OpenCL. SkePU comprises seven data-parallel skeletons and one task-parallel skeleton, and these skeletons use two types of containers, vector and matrix, to model real-life parallel applications. In this thesis, we extend the SkePU framework's matrix container (which stores 2-D data values) so that the existing skeletons can be used efficiently to develop parallel scientific applications on large-scale clusters using MPI. The work focuses on distributing the matrix among the participating processes, which, after receiving their shares of the data, execute the application in parallel, and it covers all seven data-parallel skeletons. Each skeleton has been tested with a small application program. In addition to measuring the performance improvement in application execution time, we carried out a communication cost analysis for all skeletons with MPI using the LogGP model. To evaluate and test the operational efficiency of the extension, we considered a PDE solver application, through which we demonstrate the performance gain and scalability of the extended framework. The performance improvement was greater when computation dominated the memory and I/O operations. The results show that the extension is a viable approach for implementing real-life parallel applications on large-scale clusters.
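A rough sketch of the row-wise distribution described above, written against plain MPI rather than SkePU's internals: the root scatters blocks of rows, each rank applies the map skeleton's user function locally, and the results are gathered back. The squaring user function is an arbitrary example.

```cpp
#include <mpi.h>
#include <vector>

// Distribute a rows x cols matrix by rows, apply an element-wise "map"
// locally on each rank, and gather the results on the root. Uneven row
// counts are handled with MPI_Scatterv / MPI_Gatherv.
void map_rows(std::vector<double>& matrix, int rows, int cols) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<int> counts(size), displs(size);
    for (int r = 0, off = 0; r < size; ++r) {
        int my_rows = rows / size + (r < rows % size ? 1 : 0);
        counts[r] = my_rows * cols;   // elements, not rows
        displs[r] = off;
        off += counts[r];
    }
    std::vector<double> local(counts[rank]);
    MPI_Scatterv(matrix.data(), counts.data(), displs.data(), MPI_DOUBLE,
                 local.data(), counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (double& x : local) x = x * x;  // the map skeleton's user function

    MPI_Gatherv(local.data(), counts[rank], MPI_DOUBLE,
                matrix.data(), counts.data(), displs.data(), MPI_DOUBLE,
                0, MPI_COMM_WORLD);
}
```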
7

Hypervisor control of COTS multi-core processors to enforce determinism in future avionics equipment

Jean, Xavier 18 June 2015 (has links)
In this thesis we focus on mastering COTS multi-core processors so that they can be used in avionics equipment, which has hard real-time requirements. The objective is to enable the application of known worst-case execution time (WCET) evaluation methods to a set of tasks representative of avionics applications. At runtime, tasks executing on different cores simultaneously access hardware resources that are shared among the cores, in particular the main memory. Some accesses may therefore be stalled; these stalls are called interferences. Interferences can have a large impact on the execution time of embedded software, and on a COTS processor, which is bought off the shelf and targets markets much larger than avionics, this impact is not bounded. We seek to guarantee the absence of interferences through software means, since COTS processors do not provide adequate mechanisms at the hardware level. We extend deterministic-software concepts from the state of the art to make them compatible with the reuse of legacy software. To this end, we introduce the notion of control software: a functionally neutral element, replicated on every core, that controls the times at which the cores access the shared resources so as to provide temporal isolation between those accesses. We formalize and study in this thesis the feasibility of control software on a COTS processor, and its efficiency with respect to avionics applications.
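One concrete way to picture the temporal isolation that control software enforces is a TDMA-style slot scheme, sketched below as an assumption about the flavor of the mechanism rather than the thesis's actual design; the slot length and identifiers are invented for illustration.

```cpp
#include <chrono>
#include <thread>

// Each core runs the same neutral control layer and may only touch shared
// memory during its own time slot, so accesses from different cores never
// overlap in time. kSlot is an illustrative slot length.
constexpr auto kSlot = std::chrono::microseconds(100);

void wait_for_my_slot(int core_id, int num_cores) {
    using clock = std::chrono::steady_clock;
    for (;;) {
        auto elapsed = clock::now().time_since_epoch();
        auto slot_index = elapsed / kSlot;                 // global slot counter
        if (slot_index % num_cores == core_id) return;     // our turn
        std::this_thread::yield();                         // wait for the next slot
    }
}

void controlled_memory_phase(int core_id, int num_cores,
                             void (*do_accesses)()) {
    wait_for_my_slot(core_id, num_cores);
    do_accesses();  // all loads/stores to shared memory happen inside the slot
}
```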
8

Parallel Viterbi Search For Continuous Speech Recognition On A Multi-Core Architecture

Parihar, Naveen 11 December 2009 (has links)
State-of-the-art speech recognition systems can perform simple tasks in real time on most computers when the tasks take place in controlled, noise-free environments. However, current algorithms and processors are not yet powerful enough for real-time, large-vocabulary conversational speech recognition in noisy, real-world environments. Parallel processing can improve the real-time performance of speech recognition systems and increase their applicability, and developing an effective approach to parallelization is especially important given the recent trend toward multi-core processor design. In this dissertation, we introduce methods for parallelizing a single-pass, across-word, n-gram, lexical-tree-based Viterbi recognizer, the most popular architecture for Viterbi-based large-vocabulary continuous speech recognition. We parallelize two open-source implementations of such a recognizer, one developed at Mississippi State University and the other at the Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen in Germany. We describe three parallelization methods. The first, parallel fast likelihood computation, parallelizes likelihood computations by decomposing mixtures among CPU cores, so that each core computes the likelihood of the set of mixtures allocated to it. The second, lexical-tree division, parallelizes the search-management component of the recognizer by dividing the lexical tree among the cores. The third, lexical-tree copies decomposition, instead dynamically distributes the active lexical-tree copies among the cores. All three methods were tested on two and four cores of an Intel Core 2 Quad processor and significantly improved real-time performance. Several challenges in parallelizing a lexical-tree-based Viterbi speech recognizer are also identified and discussed.
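A sketch of the first method, parallel fast likelihood computation, assuming diagonal-covariance Gaussian mixtures as is common in HMM-based recognizers. The strided split of mixtures among threads and all names are illustrative, and normalization constants are assumed folded into the mixture log-weights.

```cpp
#include <cmath>
#include <thread>
#include <utility>
#include <vector>

struct Gaussian { std::vector<double> mean, inv_var; double log_weight; };

// Numerically stable log(exp(a) + exp(b)).
double log_add(double a, double b) {
    if (a < b) std::swap(a, b);
    return a + std::log1p(std::exp(b - a));
}

// Each core computes partial log-likelihoods for its share of the mixtures;
// the partials are then combined into the state's output likelihood.
double state_likelihood(const std::vector<Gaussian>& mix,
                        const std::vector<double>& obs, unsigned cores) {
    std::vector<double> partial(cores, -1e300);
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < cores; ++c)
        workers.emplace_back([&, c] {
            for (std::size_t m = c; m < mix.size(); m += cores) {  // strided split
                double ll = mix[m].log_weight;  // normalization folded in here
                for (std::size_t d = 0; d < obs.size(); ++d) {
                    double diff = obs[d] - mix[m].mean[d];
                    ll -= 0.5 * diff * diff * mix[m].inv_var[d];
                }
                partial[c] = log_add(partial[c], ll);
            }
        });
    for (auto& w : workers) w.join();
    double total = -1e300;
    for (double p : partial) total = log_add(total, p);
    return total;
}
```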
9

New Techniques for Building Timing-Predictable Embedded Systems

Guan, Nan January 2013 (has links)
Embedded systems are becoming ubiquitous in our daily lives. Because they interact closely with the physical world, embedded systems are typically subject to timing constraints, and at design time it must be ensured that the run-time behaviors of such systems satisfy the pre-specified timing constraints under all circumstances. In this thesis, we develop techniques to address the timing analysis problems brought about by the increasing complexity of the underlying hardware and software, at different levels of abstraction in embedded systems design. On the program level, we develop quantitative analysis techniques to predict cache hit/miss behavior for tight WCET estimation, and we study two commonly used replacement policies, MRU and FIFO, which cannot be analyzed adequately with the state-of-the-art qualitative cache analysis method. Our quantitative approach greatly improves the precision of WCET estimation and discloses interesting predictability properties of these replacement policies that are concealed in the qualitative analysis framework. On the component level, we address the challenges raised by multi-core computing and investigate several fundamental problems in multiprocessor scheduling. In global scheduling, we propose an analysis method that rules out a large share of impossible system behaviors for better analysis precision, and we establish conditions that guarantee the bounded responsiveness of computing tasks. In partitioned scheduling, we close a long-standing open problem by generalizing Liu and Layland's famous utilization bound from uniprocessor real-time scheduling to multiprocessor systems. We also propose cache partitioning for multi-core systems to avoid contention on shared caches, and we solve the underlying schedulability analysis problem. On the system level, we present techniques that improve the Real-Time Calculus (RTC) analysis framework in both efficiency and precision. First, we developed Finitary Real-Time Calculus to solve the scalability problem of the original RTC, which stems from period explosion; the key idea is to maintain and operate on only a limited prefix of each curve, the part relevant to the final results, throughout the analysis procedure. We further improve the analysis precision of EDF components in RTC by precisely bounding the response time of each computation request.
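The bound being generalized is concrete: n periodic tasks are schedulable under rate-monotonic scheduling on one processor if their total utilization does not exceed n(2^{1/n} - 1). The sketch below applies that classic uniprocessor test per core inside a simple first-fit partitioner, illustrating the partitioned-scheduling setting rather than the thesis's own multiprocessor bound.

```cpp
#include <cmath>
#include <vector>

// Liu and Layland's sufficient test for rate-monotonic scheduling:
// total utilization <= n * (2^(1/n) - 1) for n tasks on one processor.
bool ll_schedulable(const std::vector<double>& utils) {
    double total = 0.0;
    for (double u : utils) total += u;
    double n = static_cast<double>(utils.size());
    return total <= n * (std::pow(2.0, 1.0 / n) - 1.0);
}

// First-fit partitioning heuristic: place each task on the first core whose
// task set still passes the per-core Liu-Layland test after the addition.
bool first_fit_partition(const std::vector<double>& tasks, int cores) {
    std::vector<std::vector<double>> bins(cores);
    for (double u : tasks) {
        bool placed = false;
        for (auto& bin : bins) {
            bin.push_back(u);
            if (ll_schedulable(bin)) { placed = true; break; }
            bin.pop_back();  // would violate the bound on this core
        }
        if (!placed) return false;  // no core can accept the task
    }
    return true;
}
```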
10

Large Scale Graph Processing in a Distributed Environment

Upadhyay, Nitesh January 2017 (has links) (PDF)
Graph algorithms are used ubiquitously across domains. They exhibit parallelism that can be exploited on parallel architectures such as multi-core processors and accelerators. However, real-world graphs are massive and cannot fit into the memory of a single machine; such large graphs are partitioned and processed in a distributed cluster environment consisting of multiple GPUs and CPUs. Existing frameworks that facilitate large-scale graph processing on distributed clusters each have their own style of programming and require extensive involvement by the user in communication and synchronization, so adapting to these frameworks is an overhead for the programmer. Furthermore, these frameworks target only CPU clusters and lack the ability to harness GPU architectures. We provide a back-end framework to the graph domain-specific language Falcon for large-scale graph processing on CPU and GPU clusters. The motivation behind choosing this DSL as a front end is its shared-memory-based imperative programmability. Our framework generates Giraph code for CPU clusters; Giraph code runs on a Hadoop cluster and is known for scalable, fault-tolerant graph processing. For GPU clusters, our framework applies a set of optimizations to reduce computation and communication latency and generates efficient CUDA code coupled with MPI. Experimental evaluations show the scalability and performance of our framework for both CPU and GPU clusters, and the performance of the generated code is comparable to manual implementations of various algorithms in distributed environments.
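Giraph implements Pregel's vertex-centric, bulk-synchronous model, which the generated code follows. The sketch below mimics one such computation, single-source shortest paths, in plain single-machine C++ to show the superstep structure; it is illustrative and not output of the Falcon back end.

```cpp
#include <limits>
#include <vector>

struct Edge { int to; int weight; };

// One iteration of the while loop corresponds to one superstep: each vertex
// combines its incoming messages, updates its value if they improve it, and
// sends messages along its out-edges; the swap acts as the superstep barrier.
void sssp(const std::vector<std::vector<Edge>>& graph, int source) {
    const int INF = std::numeric_limits<int>::max();
    std::vector<int> dist(graph.size(), INF);
    std::vector<int> inbox(graph.size(), INF);
    inbox[source] = 0;
    bool active = true;
    while (active) {
        active = false;
        std::vector<int> outbox(graph.size(), INF);
        for (std::size_t v = 0; v < graph.size(); ++v) {
            if (inbox[v] >= dist[v]) continue;   // no improving message
            dist[v] = inbox[v];                  // the vertex "compute" step
            active = true;
            for (const Edge& e : graph[v])       // message passing to neighbors
                outbox[e.to] = std::min(outbox[e.to], dist[v] + e.weight);
        }
        inbox.swap(outbox);                      // barrier between supersteps
    }
}
```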
