1 |
Current heterogeneous CPU-GPU architectures integrate general-purpose CPUs and highly thread-parallel GPUs (Graphics Processing Units) on the same die. This dissertation focuses on improving the energy efficiency and performance of heterogeneous CPU-GPU systems.
Leakage energy has become an increasingly large fraction of total energy consumption, so reducing it is important for improving overall energy efficiency. Caches occupy a large on-chip area, which makes them good targets for leakage energy reduction. For the CPU cache, we study how to reduce cache leakage energy efficiently in a hybrid SPM (Scratch-Pad Memory) and cache architecture. For the GPU cache, the access pattern differs from that of the CPU cache: it usually shows little locality and a high miss rate. In addition, the GPU can hide memory latency more effectively through multi-threading. For these reasons, we find that the cache lines of GPU data caches can be placed into low-power mode more aggressively than in traditional leakage management for CPU caches, which saves more leakage energy without significant performance degradation.
Contention between the CPU and the GPU for shared resources, such as the last-level cache (LLC), the interconnection network, and DRAM, may degrade both CPU and GPU performance. We propose a simple yet effective probability-based method to control the LLC replacement policy, reducing the CPU’s inter-core conflict misses caused by the GPU without significantly impacting GPU performance. In addition, we develop two strategies that combine the probability-based method for the LLC with an existing technique for the interconnection network, virtual channel partition (VCP), to further improve CPU performance.
Breadth-first search (BFS), a basis for graph search and a core building block for many higher-level graph analysis applications, is a typical example of a parallel computation that is inefficient on GPU architectures. In a graph, a small portion of the nodes may have a large number of neighbors, which leads to irregular tasks on GPUs. These irregularities limit the parallelism of BFS on GPUs. Unlike previous work, which addresses the irregularity through fine-grained task management, we propose Virtual-BFS (VBFS), which virtually changes the graph itself. By adding virtual vertices, the high-degree nodes in the graph are divided into groups that have an equal number of neighbors, which increases parallelism so that more GPU threads can work concurrently. This approach preserves correctness and can significantly improve both performance and energy efficiency on GPUs.
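The vertex-splitting idea can be illustrated with a minimal sequential sketch. It is not the dissertation's GPU implementation: the names `split_vertices`, `bfs_levels`, and `max_deg` are hypothetical, and the sketch assumes each virtual vertex holds at most `max_deg` neighbors (the last group of a vertex may be smaller). Each entry of `work` stands for one bounded-size task that a GPU thread could process.

```python
def split_vertices(adj, max_deg):
    """Split every adjacency list into virtual vertices of at most max_deg neighbors.

    Returns (chunks, first): chunks[c] is (original vertex, neighbor slice) and
    the virtual vertices of u are chunks[first[u]:first[u + 1]].
    """
    chunks, first = [], []
    for u, nbrs in enumerate(adj):
        first.append(len(chunks))
        # max(..., 1) keeps one (empty) virtual vertex for isolated vertices.
        for i in range(0, max(len(nbrs), 1), max_deg):
            chunks.append((u, nbrs[i:i + max_deg]))
    first.append(len(chunks))
    return chunks, first


def bfs_levels(adj, source, max_deg=2):
    """Level-synchronous BFS whose work items are virtual vertices."""
    chunks, first = split_vertices(adj, max_deg)
    level = [-1] * len(adj)
    level[source] = 0
    frontier, depth = [source], 0
    while frontier:
        # Expand the frontier into virtual vertices: every work item now has
        # at most max_deg neighbors, so the tasks are roughly equal in size.
        work = [c for u in frontier for c in range(first[u], first[u + 1])]
        nxt = []
        for c in work:
            for v in chunks[c][1]:
                if level[v] == -1:
                    level[v] = depth + 1
                    nxt.append(v)
        frontier, depth = nxt, depth + 1
    return level


# Hypothetical graph: vertex 0 is a high-degree hub.
graph = [[1, 2, 3, 4, 5], [0], [0], [0], [0, 5], [0, 4, 6], [5]]
print(bfs_levels(graph, source=0))
```

Because the virtual vertices only regroup existing edges, the BFS levels are the same as on the original graph; only the granularity of the work changes.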
2 |
Resource Optimized Scheduling For Enhanced Power Efficiency And Throughput On Chip Multi Processor Platforms
Kundan, Shivam, 01 May 2024
The parallel nature of process execution on Chip Multi-Processors (CMPs) has boosted application performance far beyond the capabilities of earlier single-core designs. Generally, CMPs offer improved performance by integrating multiple simpler cores onto a single die, where the cores share certain computing resources such as last-level caches, data buses, and main memory. This preserves architectural simplicity while also boosting performance for multi-threaded applications. However, a major trade-off of this approach is that concurrently executing applications suffer performance degradation if their collective resource requirements exceed the total resources available to the system. If dynamic resource allocation is not carefully considered, the potential performance gain from having multiple cores may be outweighed by the losses due to contention for shared resources. Additionally, CMPs with built-in dynamic voltage-frequency scaling (DVFS) mechanisms may try to compensate for the performance bottleneck by scaling to higher clock frequencies. When the degradation is caused by shared-resource contention, this does not necessarily improve performance but does incur a significant power penalty, because dynamic power grows quadratically with voltage (P_dynamic ∝ V^2 * f).
This dissertation presents novel methodologies for balancing the competing requirements of high performance, fairness of execution, and enforcement of priority, while also ensuring the overall power efficiency of CMPs. Specifically, we (1) analyze the problem of resource interference during concurrent process execution and propose two fine-grained scheduling methodologies for improving overall performance and fairness, (2) develop an approach for enforcing priority (i.e., minimum performance) for specific processes while avoiding resource starvation for others, and (3) present a machine-learning approach for maximizing the power efficiency (performance-per-Watt) of CMPs by estimating a workload's performance and power consumption limits at different clock frequencies.
As modern computing workloads become increasingly dynamic, and computers themselves become increasingly ubiquitous, the problem of finding the ideal balance between performance and power consumption of CMPs is of particular relevance today, especially given the unprecedented proliferation of embedded devices for use in the Internet of Things, edge computing, smart wearables, and even exotic experiments such as space probes composed entirely of a CMP, sensors, and an antenna ("space chips"). Additionally, reducing power consumption while maintaining constant performance can contribute to addressing the growing problem of dark silicon.
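The relation P_dynamic ∝ V^2 * f can be made concrete with a small worked example. The operating point (+10% voltage, +20% frequency) and the 5% speedup of a contention-bound workload are hypothetical numbers chosen for illustration, not measurements from the dissertation, and the function names are likewise assumptions.

```python
def relative_dynamic_power(v_scale, f_scale):
    """Dynamic power relative to a baseline, using P_dynamic ∝ V^2 * f."""
    return v_scale ** 2 * f_scale


def relative_perf_per_watt(speedup, v_scale, f_scale):
    """Performance-per-Watt relative to the baseline operating point."""
    return speedup / relative_dynamic_power(v_scale, f_scale)


# Hypothetical DVFS step: voltage +10%, frequency +20%.
power = relative_dynamic_power(1.10, 1.20)

# Hypothetical contention-bound workload: the higher clock yields only a 5%
# speedup because the cores still wait on the shared cache and memory.
efficiency = relative_perf_per_watt(1.05, 1.10, 1.20)

print(f"power x{power:.2f}, perf/W x{efficiency:.2f}")
```

Under these assumed numbers, dynamic power rises by about 45% while performance-per-Watt falls to roughly 72% of the baseline, which is the kind of penalty the abstract describes.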
3 |
Analyse temporelle des systèmes temps-réels sur architectures pluri-coeurs / Many-Core Timing Analysis of Real-Time Systems
Rihani, Hamza, 01 December 2017
Predictability is an important aspect of critical real-time systems. Guaranteeing the functionality of these systems requires taking timing constraints into account. Traditional single-core architectures are no longer sufficient to meet the growing performance needs of these systems. New multi-core architectures are designed to offer more performance but introduce other challenges. In this thesis, we address the problem of access to shared resources in a multi-core environment.
The first part of this work proposes an approach that models programs with satisfiability modulo theories (SMT) formulas. An SMT solver is used to find an execution path that maximizes the execution time. The shared resource considered is a bus using a time-division multiple access (TDMA) policy. We explain how the semantics of the analyzed program and the shared bus can be modeled in SMT. Experimental results show better precision compared with simple, pessimistic approaches.
In the second part, we propose a response time analysis of synchronous data-flow programs running on a many-core processor. Our approach computes the set of execution start dates and response times while respecting the dependency constraints between tasks. This work is applied to the industrial Kalray MPPA-256 many-core processor. We propose a mathematical model of the bus arbiter implemented on the processor. In addition, the analysis of interference on the bus is refined by taking into account: (i) the response times and start dates of concurrent tasks, (ii) the execution model, (iii) the memory banks, (iv) the pipelining of memory accesses. The experimental evaluation is carried out on randomly generated examples and on a flight controller case study.
/ Predictability is of paramount importance in real-time and safety-critical systems, where non-functional properties, such as the timing behavior, have a high impact on the system's correctness. As many safety-critical systems have a growing performance demand, classical architectures, such as single-cores, are no longer sufficient. One increasingly popular solution is the use of multi-core systems, even in the real-time domain. Recent many-core architectures, such as the Kalray MPPA, were designed to take advantage of the performance benefits of a multi-core architecture while offering a certain degree of predictability. It is still hard, however, to predict the execution time because of interference on shared resources (e.g., bus, memory, etc.).
To tackle this challenge, Time Division Multiple Access (TDMA) buses are often advocated. In the first part of this thesis, we are interested in the timing analysis of accesses to shared resources in such environments. Our approach uses Satisfiability Modulo Theories (SMT) to encode the semantics and the execution time of the analyzed program. To estimate the delays of shared resource accesses, we propose an SMT model of a shared TDMA bus. An SMT solver is used to find a solution that corresponds to the execution path with the maximal execution time. Using examples, we show how the worst-case execution time estimation is enhanced by combining the semantics and the shared bus analysis in SMT.
In the second part, we introduce a response time analysis technique for Synchronous Data Flow programs. These are mapped to multiple parallel dependent tasks running on a compute cluster of the Kalray MPPA-256 many-core processor. The analysis we devise computes a set of response times and release dates that respect the constraints in the task dependency graph. We derive a mathematical model of the multi-level bus arbitration policy used by the MPPA. Further, we refine the analysis to account for (i) release dates and response times of co-runners, (ii) task execution models, (iii) use of memory banks, (iv) pipelining of memory accesses. Further improvements to the precision of the analysis were achieved by considering only accesses that block the emitting core in the interference analysis. Our experimental evaluation focuses on randomly generated benchmarks and an avionics case study.
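To illustrate why a TDMA bus makes access delays predictable, the following sketch computes how long a core waits before a bus access is granted. It is a simplified model under stated assumptions, not the thesis's SMT encoding or the MPPA arbiter: slots are equal in length, core `i` owns the i-th slot of every period, an access must fit entirely inside the owner's slot, and time is measured in integer cycles. The function name and parameters are hypothetical.

```python
def tdma_wait(t, core, n_cores, slot_len, access_len):
    """Cycles a core waits before an access issued at time t is granted.

    Assumes equal slots, core i owning slot i of each period, and an access
    that must complete within the owner's slot.
    """
    assert 0 < access_len <= slot_len
    period = n_cores * slot_len
    # Position of t relative to the start of this core's own slot.
    offset = (t - core * slot_len) % period
    if offset + access_len <= slot_len:
        return 0                 # the access still fits in the current slot
    return period - offset       # wait for the start of the next own slot


# Hypothetical bus: 4 cores, 10-cycle slots, 3-cycle accesses.
period = 4 * 10
worst = max(tdma_wait(t, 0, 4, 10, 3) for t in range(period))
print(worst)
```

Because the wait depends only on the issue time modulo the period, the worst case is bounded regardless of what the other cores do; a timing analysis can then tighten the bound by reasoning about when accesses can actually occur along each execution path.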
4 |
Evaluation analytique du temps de réponse des systèmes de commande en réseau en utilisant l’algèbre (max,+) / Networked automation systems response time evaluation using (Max,+) algebra
Addad, Boussad, 01 July 2011
Networked automation systems (NAS) are increasingly widespread in industry. They offer many advantages in terms of cost, flexibility, maintenance, etc. However, introducing a network, which by nature is made up of shared resources, considerably affects the timing performance of control systems. A control signal, for example, reaches its destination only after a certain delay. To ensure that this delay stays below a given safety threshold, or that other real-time constraints of these systems are met, a prior evaluation is needed before an NAS is put into service. In our research, we focus on the reactivity of client/server NAS and evaluate their response time.
Our contribution is to adopt an analytical approach based on (Max,+) algebra and to remedy the problems of existing methods, such as the combinatorial explosion of formal verification or the non-exhaustiveness of simulation-based approaches. After modeling client/server NAS with Timed Event Graphs and representing their dynamics with linear (Max,+) equations, we obtain formulas for directly computing the response time. More precisely, we adopt a deterministic analysis to compute the minimum and maximum bounds of the response time, then a stochastic analysis to compute its distribution function. In addition, we take into account all the elementary delays that make up the response time, including the end-to-end delays due solely to crossing the communication network. Since the network is naturally composed of shared resources, which makes classical (Max,+) models unusable, we introduce a new modeling approach based on the (Max,+) formalism that accounts for the concept of conflict, or shared resource.
An Ethernet network is considered as an example for evaluating these end-to-end delays. Moreover, this new (Max,+) method is fairly generic and remains applicable to many systems involving shared resources, beyond communication networks alone. Finally, to check the validity of the results obtained, in particular the formula for the maximum bound of the response time, a campaign of experimental measurements is carried out on a dedicated platform. Different configurations and traffic conditions in an Ethernet network are considered.
/ Networked automation systems (NAS) are increasingly used in industry, given the advantages they provide, such as flexibility, low cost, and ease of maintenance. However, a communication network in an NAS inherently involves shared resources and therefore significantly affects timing performance. For instance, a control signal reaches its destination (an actuator) only after a nonzero delay. To guarantee that such a delay stays below a given threshold, or that other timing constraints are respected, an a priori evaluation is necessary before operating the NAS. In our research, we are interested in the reactivity of client/server NAS and the evaluation of their response time.
Our contribution is a (Max,+) algebra-based analytic approach that addresses problems faced by existing methods, such as the state explosion of model checking or the non-exhaustiveness of simulation. After building Timed Event Graph (TEG) models of the NAS and their linear (Max,+) state representation, we obtain formulae that directly give the NAS response times. More precisely, a deterministic analysis yields formulae for the bounds of the response time, and a stochastic analysis yields formulae for the probability density of the response time. Moreover, we take into account every elementary delay involved in the response time, including the end-to-end delays due exclusively to crossing the communication network. Because the network is made up of shared resources, which makes TEGs and (Max,+) algebra impossible to use directly, we introduce a novel approach to model the communication network. This approach gives rise to a new class of Petri nets, called Conflicting Timed Event Graphs (CTEG), which enables us to handle shared resources. We also represent the CTEG dynamics using recurrent (Max,+) equations and thereby calculate the end-to-end delays. An Ethernet-based network is studied as an example of this approach. The field of application of this approach extends well beyond communication networks, and it can be applied to other systems involving shared resources.
Finally, to validate the results of our research and the related hypotheses, especially the formula for the maximal bound of the response time, we carry out extensive experimental measurements on a lab facility. We compare the measurements with the formula predictions and check their agreement under different conditions.
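For readers unfamiliar with (Max,+) algebra, the sketch below shows the kind of linear recurrence that describes a Timed Event Graph: x(k) = A ⊗ x(k-1), where ⊗ replaces multiplication with addition and addition with max. The matrix `A` and the initial dates are hypothetical values chosen for illustration, not a model from the thesis, and the sketch does not cover the conflicting (CTEG) extension.

```python
NEG_INF = float("-inf")  # the (Max,+) "zero": no dependency between transitions


def mp_matvec(A, x):
    """(Max,+) matrix-vector product: y[i] = max_j (A[i][j] + x[j])."""
    return [max(a + xj for a, xj in zip(row, x)) for row in A]


def mp_iterate(A, x0, steps):
    """Firing dates x(0), x(1), ..., x(steps) under x(k) = A ⊗ x(k-1)."""
    states = [list(x0)]
    for _ in range(steps):
        states.append(mp_matvec(A, states[-1]))
    return states


# Hypothetical 2-transition event graph: A[i][j] is the delay between the
# (k-1)-th firing of transition j and the k-th firing of transition i.
A = [[3, 7],
     [2, 4]]
for k, x in enumerate(mp_iterate(A, [0, 0], 3)):
    print(k, x)
```

Each firing date is the latest of its prerequisites plus the associated delay, which is why synchronization without conflicts is linear in this algebra; shared resources introduce a choice between competitors, which is what the plain recurrence cannot express.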