1 |
Performance evaluation of Cache-Based Multi-Core Architectures with Networks-on-Chip
Rajkumar, Robin Kingsley, 01 December 2012
Multi-core architectures are the future of high-performance computing and are already omnipresent: what was a vision some twenty years ago is now a reality, with most personal computers and laptops running on multiple cores. However, as the number of cores continues to scale, serious throughput and performance issues will arise from the network topologies used to connect the cores. Among the network topologies under consideration for modern multi-core systems, the 'Mesh' topology is widely used. In terms of performance, the 'Point to Point' topology would outperform all other topologies, such as Crossbar, Mesh and Torus. The 'Point to Point' topology does, however, incur additional expense, because more links are needed to connect each core to every other core in the network. This implementation cost is the reason it is not preferred in industry for general-use systems. For research purposes, though, it serves as the best alternative to the 'Mesh' for higher speed in computer systems. The characteristics of the tasks executing on the cores will also have a significant impact on topology performance. So, with the scaling of multi-cores from 10 to 1000 cores per chip and more, selecting the right network topology is important. Another interesting factor to consider is the effect of the cache on these multi-core systems with respect to each of these topologies. Cache coherency is and will be a major cause of throughput decrease as cores scale. In our work, we use the Modified-Exclusive-Shared-Invalid (MESI) cache coherency protocol for all of the network topologies considered. In this thesis, we investigate the effect of varying cache parameters, such as the sizes of the L1 instruction cache, L1 data cache and L2 cache and their respective associativities, on each network topology.
Various combinations of all these four parameters were considered as we ran experiments. We use the gem5 Computer Architecture Simulator for running our experiments with 4-core models. For benchmark purposes, we use the SPLASH-2 set of 'High Performance Computing' benchmarks. A benchmark is assigned to each core. We also observe the effects of running benchmarks with similar characteristics on all cores versus running a set of different benchmarks, while keeping all other parameters constant. Through our results, we attempt to give researchers and the industry at large a better view of the advantages and disadvantages of, and the relationship between, multi-cores, the cache and network topologies for multi-core systems.
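For readers unfamiliar with the MESI protocol named in this abstract, the following is a minimal sketch of the textbook per-line state transitions. It is not taken from the thesis or from gem5; the names `State` and `next_state` and the event strings are hypothetical and chosen only for illustration.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def next_state(state, event, others_have_copy=False):
    """Return the next MESI state of one cache line in one cache.

    event is 'local_read', 'local_write', 'remote_read' or 'remote_write'.
    others_have_copy only matters for a local read that misses (state I).
    """
    if event == "local_read":
        if state is State.INVALID:
            # A miss fills the line as Shared if another cache holds it,
            # otherwise as Exclusive.
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state  # a read hit leaves M, E and S unchanged
    if event == "local_write":
        # Writing always ends in Modified; copies elsewhere get invalidated.
        return State.MODIFIED
    if event == "remote_read":
        # A valid line is downgraded to Shared (M is written back first).
        return State.INVALID if state is State.INVALID else State.SHARED
    if event == "remote_write":
        return State.INVALID
    raise ValueError(f"unknown event: {event}")


if __name__ == "__main__":
    line = State.INVALID
    for ev in ("local_read", "local_write", "remote_read", "remote_write"):
        line = next_state(line, ev)
        print(ev, "->", line.value)
```

Each remote event that downgrades or invalidates a line implies coherence traffic over the interconnect, which is one reason cache parameters and network topology interact in studies such as this one.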
|
2 |
Modifying Instruction Sets In The Gem5 Simulator To Support Fault Tolerant Designs
Zhang, Chuan, 23 November 2015
Traditional fault-tolerant techniques such as hardware or time redundancy incur high overhead and are inefficient for checking arithmetic operations. Our objective is to study an alternative approach: adding new instructions that check arithmetic operations. These checking instructions either rely on error-detecting codes or calculate approximate results and, consequently, consume much less execution time. To evaluate the effectiveness of such an approach, we wish to modify several benchmarks to use checking instructions and run simulation experiments to measure their execution time and memory usage. However, the checking instructions are not included in the instruction set and, as a result, are not supported by current architecture simulators. Therefore, another objective of this thesis is to develop a method for inserting new instructions into the Gem5 simulator and cross compiler. The insertion process is integrated into a software tool called Gtool. Gtool can add an error-checking capability to C programs by using the new instructions.
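As an illustration of how an error-detecting code can check an arithmetic result more cheaply than recomputing it, here is a minimal sketch of a residue check. The abstract does not say which codes the thesis uses, so the choice of a mod-3 residue code and the function names below are assumptions made only for this example.

```python
MODULUS = 3  # assumed check modulus; small moduli keep the check cheap


def residue_check_add(a, b, result, modulus=MODULUS):
    """Return True if result is consistent with a + b under the residue code."""
    return (a % modulus + b % modulus) % modulus == result % modulus


def residue_check_mul(a, b, result, modulus=MODULUS):
    """Return True if result is consistent with a * b under the residue code."""
    return ((a % modulus) * (b % modulus)) % modulus == result % modulus


if __name__ == "__main__":
    print(residue_check_add(17, 25, 42))   # correct sum passes the check
    print(residue_check_add(17, 25, 43))   # off-by-one error is detected
    print(residue_check_mul(17, 25, 425))  # correct product passes the check
```

A residue check of this kind cannot detect an error whose magnitude is a multiple of the modulus (for example, a sum that is off by exactly 3), which is the usual trade-off between the cost of a check and its coverage.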
|
3 |
Dataset for Machine Learning Based Cache Timing Attacks and Mitigation
Kalidasan, Vishnu Kumar, 05 June 2024
Cache side-channel attacks have evolved alongside increasingly complex microprocessor architectural designs. The attacks and their prevention mechanisms, such as cache partitioning, OS kernel isolation, and various hardware/operating system enhancements, have similarly progressed. Nonetheless, side-channel attacks necessitate effective and efficient prevention mechanisms or alterations to hardware architecture. Recently, machine learning (ML) has emerged as a method for detecting and defending against such attacks. However, the effectiveness of machine learning relies on the dataset it is trained on. The datasets available for training these ML models today are not large enough to make model performance robust and consistent. This thesis aims to enhance the ML method for exploring various cache side-channel attacks and defenses by offering a more reasonable and potentially realistic dataset to distinguish between the attacker and the victim process. The dataset is gathered through a computer system simulation model and is subsequently used to train both the attacker and detector agents of the model. Different ways to collect datasets using the system simulation are explored. A new dataset for training and detecting cache side-channel attacks is also explored and methodized. Lastly, the effectiveness of the dataset is studied by training a Flush+Reload attacker and a detector model and examining their performance. / Master of Science / Imagine a spy trying to steal secret information from a computer by listening to its clicks and whirs. That's kind of what a side-channel attack is. The computer uses a special memory called a cache to speed things up, but attackers can spy on this cache to learn bits and pieces of what the computer is working on. Numerous ways to mitigate such attacks have been proposed, but they were either costly to implement in terms of resources or imposed a large performance penalty on the computer. New types of attacks are also being researched and discovered.
More recently, machine learning (ML) models have been used for detecting or defending against cache side-channel attacks.
Currently, the training ground truth, or input dataset, for the ML models is not large enough to make model performance robust and consistent. This thesis project aims to enhance the ML approach for exploring and detecting existing and unknown cache side-channel attacks by offering a more reasonable and potentially realistic training ground (dataset). The dataset is gathered through a computer system simulation model and is subsequently used to train the ML models. Different ways to collect datasets using the computer system simulation are explored. A new dataset for training and detecting cache side-channel attacks is also explored and methodized. Lastly, the effectiveness of the dataset is studied by training a Flush+Reload attacker and examining its performance.
|
4 |
Understanding Multicore Performance: Efficient Memory System Modeling and Simulation
Sandberg, Andreas, January 2014
To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimizations where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. This requires that applications are simulated multiple times with different interleaving to estimate their performance distribution, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually solved by only simulating a relatively small number of instructions near the start of an application, with the risk of reporting unrepresentative results. In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles, that can be measured on existing hardware, can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleaving without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. 
By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution. / CoDeR-MP / UPMARC
|
5 |
Speculative Interference: A Modern Spectre Attack / Spekulativ Interferens: En Modern Spectre-attack
Borg, Isak, January 2021
Since the Spectre family of attacks was made public in January 2018, researchers, manufacturers and interested individuals have experimented extensively with creating defences against it. There has also been a great deal of research aimed at circumventing these defences and at finding alternative side channels and mechanisms for performing Spectre-type attacks. This thesis implements and demonstrates a proof of concept of one of these newfound attacks, known as a speculative interference attack. This is done in a simulated environment, which to our knowledge had not been done before at the time of writing. After the 'basic' version of a Spectre attack has been explained, the thesis explains how the more advanced interference attack works and how it is implemented in the simulated environment. Finally, the results obtained with the attack are presented, which should convince the reader of the relevance and possibilities of the attack.
|
6 |
Evaluating Gem5 and QEMU Virtual Platforms for ARM Multicore Architectures
Fuentes Morales, Jose Luis Bismarck, January 2016
Accurate virtual platforms allow for crucial, early, and inexpensive assessments of the viability and hardware constraints of software/hardware applications. The growth of multicore architectures, both in number of cores and in relevance to the industry, in turn demands faster and more efficient virtual platforms that make the benefits of single-core simulation and emulation available to their multicore successors whilst keeping accuracy, development cost, time, and efficiency at acceptable levels. The goal of this thesis is to find optimal virtual platforms for hardware design space exploration of multi-core architectures running filtering functions, in particular a discrete signal filtering Matlab algorithm used for oil surveying applications running on an ARM Cortex-A53 quad-core CPU. In addition to the filtering algorithm, the PARSEC benchmark suite was used to test platform compliance under workloads with diverse characteristics. After a review of multiple virtual platforms, the gem5 simulator and the QEMU emulator were chosen for testing because of their ubiquity, prominence and flexibility. A Raspberry Pi Model B was used as a reference to measure how closely these tools can model a commonly used embedded platform. The results show that each of the virtual platforms is best suited to different scenarios. The QEMU emulator with KVM support yielded the best performance, although it requires access to a host with the same architecture as the target and does not guarantee timing accuracy. The most accurate setup was the gem5 simulator using a simplified cache system and a detailed out-of-order ARM CPU model.
|
7 |
Génération dynamique de code pour l'optimisation énergétique / Online Auto-Tuning for Performance and Energy through Micro-Architecture Dependent Code Generation
Endo, Fernando Akira, 18 September 2015
In computing systems, energy consumption has become the main factor limiting the performance growth experienced in previous decades. Consequently, computer architecture and software development paradigms will have to change if we want to avoid a performance stagnation in the coming decades. In this new scenario, new architectural and micro-architectural designs can offer the possibility of increasing the energy efficiency of hardware, thanks to hardware specialization such as heterogeneous configurations of cores, new computing units and accelerators. On the other hand, with this new trend, software development must cope with the lack of performance portability across ever-changing hardware and with the increasing gap between the performance that programmers can extract and the maximum achievable performance of the hardware. To address this issue, this thesis contributes a methodology and proof of concept of a run-time auto-tuning framework for embedded systems. The proposed framework can both adapt code to a micro-architecture unknown prior to compilation and explore auto-tuning possibilities that are input-dependent. In order to study the capability of the proposed approach to adapt code to different micro-architectural configurations, I developed a simulation framework of heterogeneous in-order and out-of-order ARM cores, based on the gem5 and McPAT simulators. Validation experiments demonstrated average absolute timing errors of around 7 % when compared to real ARM Cortex-A8 and A9 processors, and relative energy/performance estimations within 6 % for the Dhrystone 2.1 benchmark when compared to Cortex-A7 and A15 (big.LITTLE) CPUs. The timing validation results show that gem5 is considerably more accurate than comparable existing simulators, whose average errors exceed 15 %. An important component of the run-time auto-tuning framework is a run-time code generation tool called deGoal. It defines a low-level dynamic DSL for computing kernels. During this thesis, I ported deGoal to the ARM Thumb-2 ISA and added new features for run-time auto-tuning. A preliminary validation on ARM processors showed that deGoal can on average generate machine code of equivalent or higher quality compared to reference programs written in C, and even compared to manually vectorized code. The methodology and proof of concept of run-time auto-tuning in embedded processors were developed around two kernel-based applications, extracted from the PARSEC 3.0 suite and its hand-vectorized version PARVEC. In the favorable application, average speedups of 1.26 and 1.38 were obtained on real and simulated cores, respectively, going up to 1.79 and 2.53 (all run-time overheads included). I also demonstrated through simulation that run-time auto-tuning of SIMD instructions for in-order cores can outperform the reference vectorized code run on similar out-of-order cores, with an average speedup of 1.03 and an energy efficiency improvement of 39 %. The unfavorable application was chosen to show that the proposed approach has negligible overhead when better kernel versions cannot be found. When both applications run on real hardware, the run-time auto-tuning performance is on average only 6 % away from the performance obtained by the best statically found kernel implementations.
|