Spelling suggestions: "subject:"manycore"" "subject:"manycores""
1 |
Contributions au co-design de noyaux irréguliers sur architectures manycore : cas du remaillage anisotrope multi-échelle en mécanique des fluides numérique / A co-design approach of irregular kernels on manycore architectures : case of multi-scale anisotropic remeshing in computational fluid dynamicsRakotoarivelo, Hoby 06 July 2018 (has links)
La simulation numérique d'écoulements complexes telles que les turbulences ou la propagation d'ondes de choc implique un temps de calcul conséquent pour une précision industrielle acceptable. Pour accélérer ces simulations, deux recours peuvent être combinés : l'adaptation de maillages afin de réduire le nombre de points d'une part, et le parallélisme pour absorber la charge de calcul d'autre part. Néanmoins réaliser un portage efficient des noyaux adaptatifs sur des architectures massivement parallèles n'est pas triviale. Non seulement chaque tâche relative à un voisinage local du domaine doit être propagée, mais le fait de traiter une tâche peut générer d'autres tâches potentiellement conflictuelles. De plus, les tâches en question sont caractérisées par une faible intensité arithmétique ainsi qu'une faible réutilisation de données déjà accédées. Par ailleurs, l'avènement de nouveaux types de processeurs dans le paysage du calcul haute performance implique un certain nombre de contraintes algorithmiques. Dans un contexte de réduction de la consommation électrique, ils sont caractérisés par de multiples cores faiblement cadencés et une hiérarchie mémoire profonde impliquant un coût élevé et asymétrique des accès-mémoire. Ainsi maintenir un rendement optimal des cores implique d'exposer un parallélisme très fin et élevé d'une part, ainsi qu'un fort taux de réutilisation de données en cache d'autre part. Ainsi la vraie question est de savoir comment structurer ces noyaux data-driven et data-intensive de manière à respecter ces contraintes ?Dans ce travail, nous proposons une approche qui concilie les contraintes de localité et de convergence en termes d'erreur et qualité de mailles. Plus qu'une parallélisation, elle s'appuie une re-conception des noyaux guidée par les contraintes hardware en préservant leur précision. Plus précisément, nous proposons des noyaux locality-aware pour l'adaptation anisotrope de variétés différentielles triangulées, ainsi qu'une parallélisation lock-free et massivement multithread de noyaux irréguliers. Bien que complémentaires, ces deux axes proviennent de thèmes de recherche distinctes mêlant informatique et mathématiques appliquées. Ici, nous visons à montrer que nos stratégies proposées sont au niveau de l'état de l'art pour ces deux axes. / Numerical simulations of complex flows such as turbulence or shockwave propagation often require a huge computational time to achieve an industrial accuracy level. To speedup these simulations, two alternatives may be combined : mesh adaptation to reduce the number of required points on one hand, and parallel processing to absorb the computation workload on the other hand. However efficiently porting adaptive kernels on massively parallel architectures is far from being trivial. Indeed each task related to a local vicintiy need to be propagated, and it may induce new conflictual tasks though. Furthermore, these tasks are characterized by a low arithmetic intensity and a low reuse rate of already cached data. Besides, new kind of accelerators have arised in high performance computing landscape, involving a number of algorithmic constraints. In a context of electrical power consumption reduction, they are characterized by numerous underclocked cores and a deep hierarchy memory involving asymmetric expensive memory accesses. Therefore, kernels must expose a high degree of concurrency and high cached-data reuse rate to maintain an optimal core efficiency. The real issue is how to structure these data-driven and data-intensive kernels to match these constraints ?In this work, we provide an approach which conciliates both locality constraints and convergence in terms of mesh error and quality. More than a parallelization, it relies on redesign of kernels guided by hardware constraints while preserving accuracy. In fact, we devise a set of locality-aware kernels for anisotropic adaptation of triangulated differential manifold, as well as a lock-free and massively multithread parallelization of irregular kernels. Although being complementary, those axes come from distinct research themes mixing informatics and applied mathematics. Here, we aim to show that our devised schemes are as efficient as the state-of-the-art for both axes.
|
2 |
Application-aware Performance Optimization for Software Managed Manycore ArchitecturesJanuary 2019 (has links)
abstract: One of the main goals of computer architecture design is to improve performance without much increase in the power consumption. It cannot be achieved by adding increasingly complex intelligent schemes in the hardware, since they will become increasingly less power-efficient. Therefore, parallelism comes up as the solution. In fact, the irrevocable trend of computer design in near future is still to keep increasing the number of cores while reducing the operating frequency. However, it is not easy to scale number of cores. One important challenge is that existing cores consume too much power. Another challenge is that cache-based memory hierarchy poses a serious limitation due to the rapidly increasing demand of area and power for coherence maintenance.
In this dissertation, opportunities to resolve the aforementioned issues were explored in two aspects.
Firstly, the possibility of removing hardware cache altogether, and replacing it with scratchpad memory with software management was explored. Scratchpad memory consumes much less power than caches. However, as data management logic is completely shifted to Software, how to reduce software overhead is challenging. This thesis presents techniques to manage scratchpad memory judiciously by exploiting application semantics and knowledge of data access patterns, thereby enabling optimization of data movement across the memory hierarchy. Experimental results show that the optimization was able to reduce stack data management overhead by 13X, produce better code mapping in more than 80% of the case, and improve performance by 83% in heap management.
Secondly, the possibility of using software branch hinting to replace hardware branch prediction to completely eliminate power consumption on corresponding hardware components was explored. As branch predictor is removed from hardware, software logic is responsible for reducing branch penalty. Techniques to minimize the branch penalty by optimizing branch hint placement were proposed, which can reduce branch penalty by 35.4% over the state-of-the-art. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2019
|
3 |
Architectural Design Space Exploration of Heterogeneous ManycoresXypolitidis, Benard, Shabani, Rudin January 2015 (has links)
Exploring the benefits of heterogeneous architectures is becoming more desirable dueto migration from single core to manycore architectural systems. A fast way to explorethe heterogeneity is through an architectural design space exploration (ADSE) tool,which gives the designer the option to explore design alternatives before the actualimplementation. Heracles Designer is an ADSE tool which allows the user to modifylarge aspects of the architecture. At present, Heracles Designer is equipped with asingle type of processing core, a MIPS CPU.We have extended the Heracles System in order to enable the system to model het-erogeneity. Our system is called the Heterogeneous Heracles System (HHS), where adifferent type of processing core, the OpenRISC CPU, is interfaced into the HeraclesSystem. Test programs are executed on both the MIPS and OpenRISC CPUs, whichhave provided promising results. In order to provide the designer with the option tomodify the system architecture without changing the source code, a GUI named AD-SET was created. ADSET provides the designer with the ability to modify the coresettings, memory system configuration and network topology configuration.In the HHS the MIPS core can only execute basic instructions, while the OpenRISCcan execute more advanced instructions, giving a designer the option to explore theeffects of heterogeneity based on the big little architectural concept. The results of ourwork provides an infrastructure on how to integrate different types of processing coresinto the HHS.
|
4 |
Improving Performance and Quality-of-Service through the Task-Parallel Model : Optimizations and Future Directions for OpenMPPodobas, Artur January 2015 (has links)
With the failure of Dennard's scaling, which stated that shrinking transistors will be more power-efficient, computer hardware has today become very divergent. Initially the change only concerned the number of processor on a chip (multicores), but has today further escalated into complex heterogeneous system with non-intuitive properties -- properties that can improve performance and power consumption but also strain the programmer expected to develop on them. Answering these challenges is the OpenMP task-parallel model -- a programming model that simplifies writing parallel software. Our focus in the thesis has been to explore performance and quality-of-service directions of the OpenMP task-parallel model, particularly by taking architectural features into account. The first question tackled is: what capabilities does existing state of the art runtime-systems have and how do they perform? We empirically evaluated the performance of several modern task-parallel runtime-systems. Performance and power-consumption was measured through the use of benchmarks and we show that the two primary causes for bottlenecks in modern runtime-systems lies in either the task management overheads or how tasks are being distributed across processors. Next, we consider quality-of-service improvements in task-parallel runtime-systems. Striving to improve execution performance, current state of the art runtime-systems seldom take dynamic architectural features such as temperature into account when deciding how work should be distributed across the processors, which can lead to overheating. We developed and evaluated two strategies for thermal-awareness in task-parallel runtime-systems. The first improves performance when the computer system is constrained by temperature while the second strategy strives to reduce temperature while meeting soft real-time objectives. We end the thesis by focusing on performance. Here we introduce our original contribution called BLYSK -- a prototype OpenMP framework created exclusively for performance research. We found that overheads in current runtime-systems can be expensive, which often lead to performance degradation. We introduce a novel way of preserving task-graphs throughout application runs: task-graphs are recorded, identified and optimized the first time an OpenMP application is executed and are later re-used in following executions, removing unnecessary overheads. Our proposed solution can nearly double the performance compared with other state of the art runtime-systems. Performance can also be improved through heterogeneity. Today, manufacturers are placing processors with different capabilities on the same chip. Because they are different, their power-consuming characteristics and performance differ. Heterogeneity adds another dimension to the multiprocessing problem: how should work be distributed across the heterogeneous processors?We evaluated the performance of existing, homogeneous scheduling algorithms and found them to be an ill-match for heterogeneous systems. We proposed a novel scheduling algorithm that dynamically adjusts itself to the heterogeneous system in order to improve performance. The thesis ends with a high-level synthesis approach to improve performance in task-parallel applications. Rather than limiting ourselves to off-the-shelf processors -- which often contains a large amount of unused logic -- our approach is to automatically generate the processors ourselves. Our method allows us to generate application-specific hardware from the OpenMP task-parallel source code. Evaluated using FPGAs, the performance of our System-on-Chips outperformed other soft-cores such as the NiosII processor and were also comparable in performance with modern state of the art processors such as the Xeon PHI and the AMD Opteron. / <p>QC 20151016</p>
|
5 |
Exploiting tightly-coupled coresBates, Daniel January 2014 (has links)
As we move steadily through the multicore era, and the number of processing cores on each chip continues to rise, parallel computation becomes increasingly important. However, parallelising an application is often difficult because of dependencies between different regions of code which require cores to communicate. Communication is usually slow compared to computation, and so restricts the opportunities for profitable parallelisation. In this work, I explore the opportunities provided when communication between cores has a very low latency and low energy cost. I observe that there are many different ways in which multiple cores can be used to execute a program, allowing more parallelism to be exploited in more situations, and also providing energy savings in some cases. Individual cores can be made very simple and efficient because they do not need to exploit parallelism internally. The communication patterns between cores can be updated frequently to reflect the parallelism available at the time, allowing better utilisation than specialised hardware which is used infrequently. In this dissertation I introduce Loki: a homogeneous, tiled architecture made up of many simple, tightly-coupled cores. I demonstrate the benefits in both performance and energy consumption which can be achieved with this arrangement and observe that it is also likely to have lower design and validation costs and be easier to optimise. I then determine exactly where the performance bottlenecks of the design are, and where the energy is consumed, and look into some more-advanced optimisations which can make parallelism even more profitable.
|
6 |
Advanced System-Scale and Chip-Scale Interconnection Networks for Ultrascale SystemsShalf, John Marshall 18 January 2011 (has links)
The path towards realizing next-generation petascale and exascale computing is increasingly dependent on building supercomputers with unprecedented numbers of processors. Given the rise of multicore processors, the number of network endpoints both on-chip and off-chip is growing exponentially, with systems in 2018 anticipated to contain thousands of processing elements on-chip and billions of processing elements system-wide. To prevent the interconnect from dominating the overall cost of future systems, there is a critical need for scalable interconnects that capture the communication requirements of target ultrascale applications. It is therefore essential to understand high-end application communication characteristics across a broad spectrum of computational methods, and utilize that insight to tailor interconnect designs to the specific requirements of the underlying codes. This work makes several unique contributions towards attaining that goal. First, the communication traces for a number of high-end application communication requirements, whose computational methods include: finite-difference, lattice-Boltzmann, particle-in-cell, sparse linear algebra, particle mesh ewald, and FFT-based solvers.
This thesis presents an introduction to the fit-tree approach for designing network infrastructure that is tailored to application requirements. A fit-tree minimizes the component count of an interconnect without impacting application performance compared to a fully connected network. The last section introduces a methodology for reconfigurable networks to implement fit-tree solutions called Hybrid Flexibly Assignable Switch Topology (HFAST). HFAST uses both passive (circuit) and active (packet) commodity switch components in a unique way to dynamically reconfigure interconnect wiring to suit the topological requirements of scientific applications. Overall the exploration points to several promising directions for practically addressing both the on-chip and off-chip interconnect requirements of future ultrascale systems. / Master of Science
|
7 |
Algoritmo de prefetching de dados temporizado para sistemas multiprocessadores baseados em NOCSILVEIRA, Maria Cireno Ribeiro 09 March 2015 (has links)
Submitted by Fabio Sobreira Campos da Costa (fabio.sobreira@ufpe.br) on 2016-03-15T13:58:26Z
No. of bitstreams: 2
license_rdf: 1232 bytes, checksum: 66e71c371cc565284e70f40736c94386 (MD5)
UFPE-MEI 2015-078 - Maria Cireno Ribeiro Silveira.pdf: 4578273 bytes, checksum: 1c434494e0c03cb02156a37ebfd1c7da (MD5) / Made available in DSpace on 2016-03-15T13:58:26Z (GMT). No. of bitstreams: 2
license_rdf: 1232 bytes, checksum: 66e71c371cc565284e70f40736c94386 (MD5)
UFPE-MEI 2015-078 - Maria Cireno Ribeiro Silveira.pdf: 4578273 bytes, checksum: 1c434494e0c03cb02156a37ebfd1c7da (MD5)
Previous issue date: 2015-03-09 / O prefetching é uma técnica considerada e ciente para mitigar um problema já conhecido
em sistemas computacionais: a diferença entre o desempenho do processador e do acesso
à memória. O objetivo do prefetching é aproximar o dado do processador retirando-o da
memória e carregando na cache local. Uma vez que o dado seja requisitado pelo processador,
ele já estará disponível na cache, reduzindo a taxa de perdas e a penalidade do
sistema. Para sistemas multiprocessadores baseados em NoCs a e ciência do prefetching
é ainda mais crítica em relação ao desempenho, uma vez que o tempo de acesso ao dado
varia dependendo da distância entre processador e memória e do tráfego da rede.
Este trabalho propõe um algoritmo de prefetching de dados temporizado, que tem
como objetivo minimizar a penalidade dos núcleos através uma solução de prefetching
baseada em predição de tempo para sistemas multiprocessadores baseados em NoC. O
algoritmo utiliza um processo pró-ativo iniciado pelo servidor para realizar requisições
de prefetching baseado no histórico de perdas de cache e informações da NoC. Nos experimentos
realizados para 16 núcleos, o algoritmo proposto reduziu a penalidade dos
processadores em 53,6% em comparação com o prefetching baseado em eventos (faltas na
cache), sendo a maior redução de 29% da penalidade. / The prefetching technique is an e ective approach to mitigate a well-known problem in
multi-core processors: the gap between computing and data access performance. The
goal of prefetching is to approximate data to the CPU by retrieving the data from the
memory and loading it in the cache. When the data is requested by the CPU, it is already
available in the cache, reducing the miss rate and penalty. In multiprocessor NoC-based
systems the prefetching e ciency is even more critical to system performance, since the
access time depends of the distance between the requesting processor and the memory
and also of the network tra c.
This work proposes a temporized data prefetching algorithm that aims to minimize
the penalty of the cores through one prefetching solution based on time prediction for
multiprocessor NoC-based systems. The algorithm utilizes a proactive process initiated by
the server to request prefetching data based on cache miss history and NoC's information.
In the experiments for 16 cores, the proposed algorithm has successfully reduced the
processors penalty in 53,6% compared to the event-based prefetching and the best case
was a penalty reduction of 29%.
|
8 |
Protocoles scalables de cohérence des caches pour processeurs manycore à espace d'adressage partagé visant la basse consommation. / Scalable cache coherence protocols for energy-efficient shared memory manycore processorsLiu, Hao 27 January 2016 (has links)
L'architecture TSAR (Tera-Scale ARchitecture) développée conjointement par BULL, le Lip6 et le CEA-LETI est une architecture manycore CC-NUMA extensible jusqu'à 1024 cœurs. Le protocole de cohérence de cache DHCCP dans l'architecture TSAR repose sur le principe du répertoire global distribué en utilisant la stratégie d'écriture simultanée afin de passer à l'échelle, mais cette scalabilité a un coût énergétique important que nous cherchons à réduire. Actuellement, les plus grosses entreprises dans le domaine des semi-conducteurs, comme Intel ou AMD, utilisent les protocoles MESI ou MOESI dans leurs processeurs multicoeurs. Ces types de protocoles utilisent la stratégie d'écriture différée pour réduire la forte consommation énergétique due aux écritures. Mais la complexité d'implémentation et la forte augmentation de ce trafic de cohérence quand le nombre de processeurs augmente limite le passage à l'échelle de ces protocoles au-delà de quelques dizaines de coeurs. Dans cette thèse, nous proposons un nouveau protocole de cohérence de cache utilisant une méthode hybride pour traiter les écritures dans le cache L1 privé : pour les lignes non partagées, le contrôleur de cache L1 utilise la stratégie d'écriture différée, de façon à modifier les lignes localement. Pour les lignes partagées, le contrôleur de cache L1 utilise la stratégie d'écriture immédiate pour éviter l'état de propriété exclusive sur ces lignes partagées. Cette méthode, appelée RWT pour Released Write Through, passe non seulement à l'échelle, mais réduit aussi significativement la consommation énergétique liée aux écritures. Nous avons aussi optimisé la solution actuelle pour gérer la cohérence des TLBs dans l'architecture TSAR, en termes de performance et de consommation énergétique. Enfin, nous introduisons dans cette thèse un nouveau petit cache, appelé micro-cache, entre le coeur et le cache L1, afin de réduire le nombre d'accès au cache d'instructions. / The TSAR architecture (Tera-Scale ARchitecture) developed jointly by Lip6 Bull and CEA-LETI is a CC-NUMA manycore architecture which is scalable up to 1024 cores. The DHCCP cache coherence protocol in the TSAR architecture is a global directory protocol using the write-through policy in the L1 cache for scalability purpose, but this write policy causes a high power consumption which we want to reduce. Currently the biggest semiconductors companies, such as Intel or AMD, use the MESI MOESI protocols in their multi-core processors. These protocols use the write-back policy to reduce the high power consumption due to writes. However, the complexity of implementation and the sharp increase in the coherence traffic when the number of processors increases limits the scalability of these protocols beyond a few dozen cores. In this thesis, we propose a new cache coherence protocol using a hybrid method to process write requests in the L1 private cache : for exclusive lines, the L1 cache controller chooses the write-back policy in order to modify locally the lines as well as eliminate the write traffic for exclusive lines. For shared lines, the L1 cache controller uses the write-through policy to simplify the protocol and in order to guarantee the scalability. We also optimized the current solution for the TLB coherence problem in the TSAR architecture. The new method which is called CC-TLB not only improves the performance, but also reduces the energy consumption. Finally, this thesis introduces a new micro cache between the core and the L1 cache, which allows to reduce the number of accesses to the instruction cache, in order to save energy.
|
9 |
Exécution sécurisée de plusieurs machines virtuelles sur une plateforme Manycore / Executing secured virtual machines within a Manycore architectureDévigne, Clément 06 July 2017 (has links)
Les architectures manycore, qui comprennent un grand nombre de cœurs, sont un moyen de répondre à l'augmentation continue de la quantité de données numériques à traiter par les infrastructures proposant des services de cloud computing. Ces données, qui peuvent concerner des entreprises aussi bien que des particuliers, sont sensibles par nature, et c'est pourquoi la problématique d'isolation est primordiale. Or, depuis le début du développement du cloud computing, des techniques de virtualisation sont de plus en plus utilisées pour permettre à différents utilisateurs de partager physiquement les mêmes ressources matérielles. Cela est d'autant plus vrai pour les architectures manycore, et il revient donc en partie aux architectures de garantir la confidentialité et l'intégrité des données des logiciels s'exécutant sur la plateforme. Nous proposons dans cette thèse un environnement de virtualisation sécurisé pour une architecture manycore. Notre mécanisme s'appuie sur des composants matériels et un logiciel hyperviseur pour isoler plusieurs systèmes d'exploitation s'exécutant sur la même architecture. L'hyperviseur est en charge de l'allocation des ressources pour les systèmes d'exploitation virtualisés, mais ne possède pas de droit d'accès sur les ressources allouées à ces systèmes. Ainsi, une faille de sécurité dans l'hyperviseur ne met pas en péril la confidentialité ou l'intégrité des données des systèmes virtualisés. Notre solution est évaluée en utilisant un prototype virtuel précis au cycle et a été implémentée dans une architecture manycore à mémoire partagée cohérente. Nos évaluations portent sur le surcoût matériel et sur la dégradation en performance induits par nos mécanismes. Enfin, nous analysons la sécurité apportée par notre solution. / Manycore architectures, which comprise a lot of cores, are a way to answer the always growing demand for digital data processing, especially in a context of cloud computing infrastructures. These data, which can belong to companies as well as private individuals, are sensitive by nature, and this is why the isolation problematic is primordial. Yet, since the beginning of cloud computing, virtualization techniques are more and more used to allow different users to physically share the same hardware resources. This is all the more true for manycore architectures, and it partially comes down to the architectures to guarantee that data integrity and confidentiality are preserved for the software it executes. We propose in this thesis a secured virtualization environment for a manycore architecture. Our mechanism relies on hardware components and a hypervisor software to isolate several operating systems running on the same architecture. The hypervisor is in charge of allocating resources for the virtualized operating systems, but does not have the right to access the resources allocated to these systems. Thus, a security flaw in the hypervisor does not imperil data confidentiality and integrity of the virtualized systems. Our solution is evaluated on a cycle-accurate virtual prototype and has been implemented in a coherent shared memory manycore architecture. Our evaluations target the hardware and performance overheads added by our mechanisms. Finally, we analyze the security provided by our solution.
|
10 |
Étude de transformations et d’optimisations de code parallèle statique ou dynamique pour architecture "many-core" / Study of transformations and static or dynamic parallel code optimization for manycore architectureGallet, Camille 13 October 2016 (has links)
L’évolution des supercalculateurs, de leur origine dans les années 60 jusqu’à nos jours, a fait face à 3 révolutions : (i) l’arrivée des transistors pour remplacer les triodes, (ii) l’apparition des calculs vectoriels, et (iii) l’organisation en grappe (clusters). Ces derniers se composent actuellement de processeurs standards qui ont profité de l’accroissement de leur puissance de calcul via une augmentation de la fréquence, la multiplication des cœurs sur la puce et l’élargissement des unités de calcul (jeu d’instructions SIMD). Un exemple récent comportant un grand nombre de cœurs et des unités vectorielles larges (512 bits) est le co-proceseur Intel Xeon Phi. Pour maximiser les performances de calcul sur ces puces en exploitant aux mieux ces instructions SIMD, il est nécessaire de réorganiser le corps des nids de boucles en tenant compte des aspects irréguliers (flot de contrôle et flot de données). Dans ce but, cette thèse propose d’étendre la transformation nommée Deep Jam pour extraire de la régularité d’un code irrégulier et ainsi faciliter la vectorisation. Ce document présente notre extension et son application sur une mini-application d’hydrodynamique multi-matériaux HydroMM. Ces travaux montrent ainsi qu’il est possible d’obtenir un gain de performances significatif sur des codes irréguliers. / Since the 60s to the present, the evolution of supercomputers faced three revolutions : (i) the arrival of the transistors to replace triodes, (ii) the appearance of the vector calculations, and (iii) the clusters. These currently consist of standards processors that have benefited of increased computing power via an increase in the frequency, the proliferation of cores on the chip and expansion of computing units (SIMD instruction set). A recent example involving a large number of cores and vector units wide (512-bit) is the co-proceseur Intel Xeon Phi. To maximize computing performance on these chips by better exploiting these SIMD instructions, it is necessary to reorganize the body of the loop nests taking into account irregular aspects (control flow and data flow). To this end, this thesis proposes to extend the transformation named Deep Jam to extract the regularity of an irregular code and facilitate vectorization. This thesis presents our extension and application of a multi-material hydrodynamic mini-application, HydroMM. Thus, these studies show that it is possible to achieve a significant performance gain on uneven codes.
|
Page generated in 0.0594 seconds