71 |
Avaliação do compartilhamento das memórias cache no desempenho de arquiteturas multi-core / Performance evaluation of shared cache memory for multi-core architecturesAlves, Marco Antonio Zanata January 2009 (has links)
No atual contexto de inovações em multi-core, em que as novas tecnologias de integração estão fornecendo um número crescente de transistores por chip, o estudo de técnicas de aumento de vazão de dados é de suma importância para os atuais e futuros processadores multi-core e many-core. Com a contínua demanda por desempenho computacional, as memórias cache vêm sendo largamente adotadas nos diversos tipos de projetos arquiteturais de computadores. Os atuais processadores disponíveis no mercado apontam na direção do uso de memórias cache L2 compartilhadas. No entanto, ainda não está claro quais os ganhos e custos inerentes desses modelos de compartilhamento da memória cache. Assim, nota-se a importância de estudos que abordem os diversos aspectos do compartilhamento de memória cache em processadores com múltiplos núcleos. Portanto, essa dissertação visa avaliar diferentes compartilhamentos de memória cache, modelando e aplicando cargas de trabalho sobre as diferentes organizações, a fim de obter resultados significativos sobre o desempenho e a influência do compartilhamento da memória cache em processadores multi-core. Para isso, foram avaliados diversos compartilhamentos de memória cache, utilizando técnicas tradicionais de aumento de desempenho, como aumento da associatividade, maior tamanho de linha, maior tamanho de memória cache e também aumento no número de níveis de memória cache, investigando a correlação entre essas arquiteturas de memória cache e os diversos tipos de aplicações da carga de trabalho. Os resultados mostram a importância da integração entre os projetos de arquitetura de memória cache e o projeto físico da memória, a fim de obter o melhor equilíbrio entre tempo de acesso à memória cache e redução de faltas de dados. Nota-se nos resultados, dentro do espaço de projeto avaliado, que devido às limitações físicas e de desempenho, as organizações 1Core/L2 e 2Cores/L2, com tamanho total igual a 32 MB (bancos de 2 MB compartilhados), tamanho de linha igual a 128 bytes, representam uma boa escolha de implementação física em sistemas de propósito geral, obtendo um bom desempenho em todas aplicações avaliadas sem grandes sobrecustos de ocupação de área e consumo de energia. Além disso, como conclusão desta dissertação, mostra-se que, para as atuais e futuras tecnologias de integração, as tradicionais técnicas de ganho de desempenho obtidas com modificações na memória cache, como aumento do tamanho das memórias, incremento da associatividade, maiores tamanhos da linha, etc. não devem apresentar ganhos reais de desempenho caso o acréscimo de latência gerado por essas técnicas não seja reduzido, a fim de equilibrar entre a redução na taxa de faltas de dados e o tempo de acesso aos dados. / In the current context of innovations in multi-core processors, where the new integration technologies are providing an increasing number of transistors inside chip, the study of techniques for increasing data throughput has great importance for the current and future multi-core and many-core processors. With the continuous demand for performance, the cache memories have been widely adopted in various types of architectural designs of computers. Nowadays, processors on the market point out for the use of shared L2 cache memory. However, it is not clear the gains and costs of these shared cache memory models. Thus, studies that address different aspects of shared cache memory have great importance in context of multi-core processors. Therefore, this dissertation aims to evaluate different shared cache memory, modeling and applying workloads on different organizations in order to obtain significant results from the performance and the influence of the shared cache memory multi-core processors. Thus, several types of shared cache memory were evaluated using traditional techniques to increase performance, such as increasing the associativity, larger line size, larger cache memory and also the increase on the cache memory hierarchy, investigating the correlation between the cache memory architecture and the workload applications. The results show the importance of integration between cache memory architecture project and memory physical design in order to obtain the best trade-off between cache memory access time and cache misses. According to the results, within evaluations, due to physical limitations and performance, organizations 1Core/L2 and 2Cores/L2 with total cache size equal to 32MB, using banks of 2 MB, line size equal to 128 bytes, represent a good choice for physical implementation in general purpose systems, obtaining a good performance in all evaluated applications without major extra costs of area occupation and power consumption. Furthermore, as a conclusion in this dissertation is shown that, for current and future integration technologies, traditional techniques for performance gain obtained with changes in the cache memory such as, increase of the memory size, increasing the associativity, larger line sizes etc.. should not lead to real performance gains if the additional latency generated by these techniques was not treated, in order to balance between the reduction of cache miss rate and the data access time.
|
72 |
Co-projeto de hardware e software de um escalonador de processos para arquiteturas multicore heterogêneas baseadas em computação reconfigurável / Hardware and software co-design of a process scheduler for heterogeneous multicore architectures based on reconfigurable computingMaikon Adiles Fernandez Bueno 05 November 2013 (has links)
As arquiteturas multiprocessadas heterogêneas têm como objetivo principal a extração de maior desempenho da execução dos processos, por meio da utilização de núcleos apropriados às suas demandas. No entanto, a extração de maior desempenho é dependente de um mecanismo eficiente de escalonamento, capaz de identificar as demandas dos processos em tempo real e, a partir delas, designar o processador mais adequado, de acordo com seus recursos. Este trabalho tem como objetivo propor e implementar o modelo de um escalonador para arquiteturas multiprocessadas heterogêneas, baseado em software e hardware, aplicado ao sistema operacional Linux e ao processador SPARC Leon3, como prova de conceito. Nesse sentido, foram implementados monitores de desempenho dentro dos processadores, os quais identificam as demandas dos processos em tempo real. Para cada processo, sua demanda é projetada para os demais processadores da arquitetura e em seguida é realizado um balanceamento visando maximizar o desempenho total do sistema, distribuindo os processos entre processadores, de modo a diminuir o tempo total de processamento de todos os processos. O algoritmo de maximização Hungarian, utilizado no balanceamento do escalonador, foi desenvolvido em hardware, proporcionando paralelismo e maior desempenho na execução do algoritmo. O escalonador foi validado por meio da execução paralela de diversos benchmarks, resultando na diminuição dos tempos de execução em relação ao escalonador sem suporte à heterogeneidade / Heterogeneous multiprocessor architectures have as main objective the extraction of higher performance from processes through the use of appropriate cores to their demands. However, the extraction of higher performance is dependent on an efficient scheduling mechanism, able to identify in real-time the demands of processes and to designate the most appropriate processor according to their resources. This work aims at design and implementations of a model of a scheduler for heterogeneous multiprocessor architectures based on software and hardware, applied to the Linux operating system and the SPARC Leon3 processor as proof of concept. In this sense, performance monitors have been implemented within the processors, which in real-time identifies the demands of processes. For each process, its demand is projected for the other processors in the architecture and then it is performed a balancing to maximize the total system performance by distributing processes among processors. The Hungarian maximization algorithm, used in balancing scheduler was developed in hardware, providing greater parallelism and performance in the execution of the algorithm. The scheduler has been validated through the parallel execution of several benchmarks, resulting in decreased execution times compared to the scheduler without the heterogeneity support
|
73 |
Modelica PARallel benchmark suite (MPAR) - a test suite for evaluating the performance of parallel simulations of Modelica modelsHemmati Moghadam, Afshin January 2011 (has links)
Using the object-oriented, equation-based modeling language Modelica, it is possible to model and simulate computationally intensive models. To reduce the simulation time, a desirable approach is to perform the simulations on parallel multi-core platforms. For this purpose, several works have been carried out so far, the most recent one includes language enhancements with explicit parallel programing language constructs in the algorithmic parts of the Modelica language. This extension automatically generates parallel simulation code for execution on OpenCL-enabled platforms, and it has been implemented in the open-source OpenModelica environment. However, to ensure that this extension as well as future developments regarding parallel simulations of Modelica models are feasible, performing a systematic benchmarking with respect to a set of appropriate Modelica models is essential, which is the main focus of study in this thesis. In this thesis a benchmark test suite containing computationally intensive Modelica models which are relevant for parallel simulations is presented. The suite is used in this thesis as a means for evaluating the feasibility and performance measurements of the generated OpenCL code when using the new Modelica language extension. In addition, several considerations and suggestions on how the modeler can efficiently parallelize sequential models to achieve better performance on OpenCL-enabled GPUs and multi-coreCPUs are also given. The measurements have been done for both sequential and parallel implementations of the benchmark suite using the generated code from the OpenModelica compiler on different hardware configurations including single and multi-core CPUs as well as GPUs. The gained results in this thesis show that simulating Modelica models using OpenCL as a target language is very feasible. In addition, it is concluded that for models with large data sizes and great level of parallelism, it is possible to achieve considerable speedup on GPUs compared to single and multi-core CPUs.
|
74 |
REAL-TIME CHALLENGES OF VEHICULAR EMBEDDED SYSTEMS ON MULTI-CORE - A MAPPING STUDYIyer, Shankar Vanchesan January 2017 (has links)
The increasing complexity of vehicular embedded systems has encouraged researchers and practitioners to adopt model-driven engineering in the development of these systems. In particular, several modelling languages have been introduced for representing the vehicular software architecture and its quality attributes. Current trend in the automotive domain is to shift from single-core architectures to multi-core ones in the attempt of providing the computational power required from the next generation of vehicles, particularly autonomous ones. On the one hand, multi-core architectures introduce new real-time challenges in the development of these systems like core-interdependency. On the other hand, it is pivotal that modelling languages continue to be effective in representing the new vehicular architectures together with their novel concerns, to continue to benefit from model-based methodologies. In this thesis, we present a systematic mapping study focusing i) on the real-time challenges introduced by the adoption of multi-core architectures and ii) on the extent of the modelling support for the resolution of these challenge, in the automotive domain.
|
75 |
On the Design of Real-Time Systems on Multi-Core Platforms under UncertaintyWANG, TIANYI 26 June 2015 (has links)
Real-time systems are computing systems that demand the assurance of not only the logical correctness of computational results but also the timing of these results. To ensure timing constraints, traditional real-time system designs usually adopt a worst-case based deterministic approach. However, such an approach is becoming out of sync with the continuous evolution of IC technology and increased complexity of real-time applications. As IC technology continues to evolve into the deep sub-micron domain, process variation causes processor performance to vary from die to die, chip to chip, and even core to core. The extensive resource sharing on multi-core platforms also significantly increases the uncertainty when executing real-time tasks. The traditional approach can only lead to extremely pessimistic, and thus, unpractical design of real-time systems.
Our research seeks to address the uncertainty problem when designing real-time systems on multi-core platforms. We first attacked the uncertainty problem caused by process variation. We proposed a virtualization framework and developed techniques to optimize the system's performance under process variation. We further studied the problem on peak temperature minimization for real-time applications on multi-core platforms. Three heuristics were developed to reduce the peak temperature for real-time systems. Next, we sought to address the uncertainty problem in real-time task execution times by developing statistical real-time scheduling techniques. We studied the problem of fixed-priority real-time scheduling of implicit periodic tasks with probabilistic execution times on multi-core platforms. We further extended our research for tasks with explicit deadlines. We introduced the concept of harmonic to a more general task set, i.e. tasks with explicit deadlines, and developed new task partitioning techniques. Throughout our research, we have conducted extensive simulations to study the effectiveness and efficiency of our developed techniques.
The increasing process variation and the ever-increasing scale and complexity of real-time systems both demand a paradigm shift in the design of real-time applications. Effectively dealing with the uncertainty in design of real-time applications is a challenging but also critical problem. Our research is such an effort in this endeavor, and we conclude this dissertation with discussions of potential future work.
|
76 |
Design and implementation of massively parallel fine-grained processor arraysWalsh, Declan January 2015 (has links)
This thesis investigates the use of massively parallel fine-grained processor arrays to increase computational performance. As processors move towards multi-core processing, more energy-efficient processors can be designed by increasing the number of processor cores on a single chip rather than increasing the clock frequency of a single processor. This can be done by making processor cores less complex, but increasing the number of processor cores on a chip. Using this philosophy, a processor core can be reduced in complexity, area, and speed to form a very small processor which can still perform basic arithmetic operations. Due to the small area occupation this can be multiplied and scaled to form a large scale parallel processor array to offer a significant performance. Following this design methodology, two fine-grained parallel processor arrays are designed which aim to achieve a small area occupation with each individual processor so that a larger array can be implemented over a given area. To demonstrate scalability and performance, SIMD parallel processor array is designed for implementation on an FPGA where each processor can be implemented using four ‘slices’ of a Xilinx FPGA. With such small area utilization, a large fine-grained processor can be implemented on these FPGAs. A 32 × 32 processor array is implemented and fast processing demonstrated using image processing tasks. An event-driven MIMD parallel processor array is also designed which occupies a small amount of area and can be scaled up to form much larger arrays. The event-driven approach allows the processor to enter an idle mode when no events are occurring local to the processor, reducing power consumption. The processor can switch to operational mode when events are detected. The processor core is designed with a multi-bit data path and ALU and contains its own instruction memory making the array a multi-core processor array. With area occupation of primary concern, the processor is relatively simple and connects with its four nearest direct neighbours. A small 8 × 8 prototype chip is implemented in a 65 nm CMOS technology process which can operate at a clock frequency of 80 MHz and offer a peak performance of 5.12 GOPS which can be scaled up to larger arrays. An application of the event-driven processor array is demonstrated using a simulation model of the processor. An event-driven algorithm is demonstrated to perform distributed control of distributed manipulator simulator by separating objects based on their physical properties.
|
77 |
The Development of Hardware Multi-core Test-bed on Field Programmable Gate ArrayShivashanker, Mohan 24 March 2011 (has links)
The goal of this project is to develop a flexible multi-core hardware test-bed on field programmable gate array (FPGA) that can be used to effectively validate the theoretical research on multi-core computing, especially for the power/thermal aware computing. Based on a commercial FPGA test platform, i.e. Xilinx Virtex5 XUPV5 LX110T, we develop a homogeneous multi-core test-bed with four software cores, each of which can dynamically adjust its performance using software. We also enhance the operating system support for this test platform with the development of hardware and software primitives that are useful in dealing with inter-process communication, synchronization, and scheduling for processes on multiple cores. An application based on matrix addition and multiplication on multi-core is implemented to validate the applicability of the test bed.
|
78 |
Minimizing the unpredictability that real-time tasks suffer due to inter-core cache interference.Satka, Zenepe, Hodžić, Hena January 2020 (has links)
Since different companies are introducing new capabilities and features on their products, the demand for computing power and performance in real-time systems is increasing. To achieve higher performance, processor's manufactures have introduced multi-core platforms. These platforms provide the possibility of executing different tasks in parallel on multiple cores. Since tasks share the same cache level, they face some interference that affects their timing predictability. This thesis is divided into two parts. The first part presents a survey on the existing solutions that others have introduced to solve the problem of cache interference that tasks face on multi-core platforms. The second part's focus is on one of the hardware-based techniques introduced by Intel Cooperation to achieve timing predictability of real-time tasks. This technique is called Cache Allocation Technology (CAT) and the main idea of it is to divide last level cache on some partitions called classes of services that will be reserved to specific cores. Since tasks of one core can only access the assigned partition of it, cache interference will be decreased and a better real-time tasks' performance will be achieved. In the end to evaluate CAT efficiency an experiment is conducted with different test cases and the obtained results show a significant difference on real-time tasks' performance when CAT is applied.
|
79 |
Reliabilibity Aware Thermal Management of Real-time Multi-core SystemsXu, Shikang 18 March 2015 (has links)
Continued scaling of CMOS technology has led to increasing working temperature of VLSI circuits. High temperature brings a greater probability of permanent errors (failure) in VLSI circuits, which is a critical threat for real-time systems. As the multi-core architecture is gaining in popularity, this research proposes an adaptive workload assignment approach for multi-core real-time systems to balance thermal stress among cores. While previously developed scheduling algorithms use temperature as the criterion, the proposed algorithm uses reliability of each core in the system to dynamically assign tasks to cores. The simulation results show that the proposed algorithm gains as large as 10% benefit in system reliability compared with commonly used static assignment while algorithms using temperature as criterion gain 4%. The reliability difference between cores, which indicates the imbalance of thermal stress on each core, is as large as 25 times smaller when proposed algorithm is applied.
|
80 |
Approches de parallélisation automatique et d'ordonnancement pour la co-simulation de modèles numériques sur processeurs multi-coeurs / Automatic parallelization and scheduling approaches for co-simulation of numerical models on multi-core processorsSaidi, Salah Eddine 18 April 2018 (has links)
Lors de la conception de systèmes cyber-physiques, des modèles issus de différents environnements de modélisation doivent être intégrés afin de simuler l'ensemble du système et estimer ses performances. Si certaines parties du système sont disponibles, il est possible de connecter ces parties à la simulation dans une approche Hardware-in-the-Loop (HiL). La simulation doit alors être effectuée en temps réel où les modèles réagissent périodiquement aux composants réels. En utilisant des modèles complexes, il devient difficile d'assurer une exécution rapide ou en temps réel sans utiliser des architectures multiprocesseurs. FMI (Functional Mocked-up Interface), un standard pour l'échange de modèles et la co-simulation, offre de nouvelles possibilités d'exécution multi-cœurs des modèles. L'un des objectifs de cette thèse est de permettre l'extraction du parallélisme potentiel dans une co-simulation multi-rate. Nous nous appuyons sur l'approche RCOSIM qui permet la parallélisation de modèles FMI. Des améliorations sont proposées dans le but de surmonter les limitations de RCOSIM. Nous proposons de nouveaux algorithmes pour permettre la prise en charge de modèles multi-rate. Les améliorations permettent de gérer des contraintes spécifiques telles que l'exclusion mutuelle et les contraintes temps réel. Nous proposons des algorithmes pour l'ordonnancement des co-simulations, en tenant compte de différentes contraintes. Ces algorithmes visent à accélérer la co-simulation ou assurer son exécution temps réel dans une approche HiL. Les solutions proposées sont testées sur des co-simulations synthétiques et validées sur un cas industriel. / When designing cyber-physical systems, engineers have to integrate models from different modeling environments in order to simulate the whole system and estimate its global performances. If some parts of the system are available, it is possible to connect these parts to the simulation in a Hardware-in-the-Loop (HiL) approach. In this case, the simulation has to be performed in real-time where models periodically react to the real components. The increase of requirements on the simulation accuracy and its validity domain requires more complex models. Using such models, it becomes hard to ensure fast or real-time execution without using multiprocessor architectures. FMI (Functional Mocked-up Interface), a standard for model exchange and co-simulation, offers new opportunities for multi-core execution of models. One goal of this thesis is the extraction of potential parallelism in a set of interconnected multi-rate models. We build on the RCOSIM approach which allows the parallelization of FMI models. In the first part of the thesis, improvements have been proposed to overcome the limitations of RCOSIM. We propose new algorithms in order to allow handling multi-rate models and schedule them on multi-core processors. The improvements allow handling specific constraints such as mutual exclusion and real-time constraints. Second, we propose algorithms for the allocation and scheduling of co-simulations, taking into account different constraints. These algorithms aim at accelerating the execution of the co-simulation or ensuring its real-time execution in a HiL approach. The proposed solutions have been tested on synthetic co-simulations and validated against an industrial use case.
|
Page generated in 0.0418 seconds