Global ETD Search

61	High-Performance Network-on-Chip Design for Many-Core Processors Wang, Boqian January 2020 (has links) With the development of on-chip manufacturing technologies and the requirements of high-performance computing, the core count is growing quickly in Chip Multi/Many-core Processors (CMPs) and Multiprocessor System-on-Chip (MPSoC) to support larger scale parallel execution. Network-on-Chip (NoC) has become the de facto solution for CMPs and MPSoCs in addressing the communication challenge. In the thesis, we tackle a few key problems facing high-performance NoC designs. For general-purpose CMPs, we encompass a full system perspective to design high-performance NoC for multi-threaded programs. By exploring the cache coherence under the whole system scenario, we present a smart communication service called Advance Virtual Channel Reservation (AVCR) to provide a highway to target packets, which can greatly reduce their contention delay in NoC. AVCR takes advantage of the fact that we can know or predict the destination of some packets ahead of their arrival at the Network Interface (NI). Exploiting the time interval before a packet is ready, AVCR establishes an end-to-end highway from the source NI to the destination NI. This highway is built up by reserving the Virtual Channel (VC) resources ahead of the target packet transmission and offering priority service to flits in the reserved VC in the wormhole router, which can avoid the target packets’ VC allocation and switch arbitration delay. Besides, we also propose an admission control method in NoC with a centralized Artificial Neural Network (ANN) admission controller, which can improve system performance by predicting the most appropriate injection rate of each node using the network performance information. In the online control process, a data preprocessing unit is applied to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and broadcasts it to each node where the admission control will be applied. For application-specific MPSoCs, we focus on developing high-performance NoC and NI compatible with the common AMBA AXI4 interconnect protocol. To offer the possibility of utilizing the AXI4 based processors and peripherals in the on-chip network based system, we propose a whole system architecture solution to make the AXI4 protocol compatible with the NoC based communication interconnect in the many-core system. Due to possible out-of-order transmission in the NoC interconnect, which conflicts with the ordering requirements specified by the AXI4 protocol, in the first place, we especially focus on the design of the transaction ordering units, realizing a high-performance and low cost solution to the ordering requirements. The microarchitectures and the functionalities of the transaction ordering units are also described and explained in detail for ease of implementation. Then, we focus on the NI and the Quality of Service (QoS) support in NoC. In our design, the NI is proposed to make the NoC architecture independent from the AXI4 protocol via message format conversion between the AXI4 signal format and the packet format, offering high flexibility to the NoC design. The NoC based communication architecture is designed to support high-performance multiple QoS schemes. The NoC system contains Time Division Multiplexing (TDM) and VC subnetworks to apply multiple QoS schemes to AXI4 signals with different QoS tags and the NI is responsible for traffic distribution between two subnetworks. Besides, a QoS inheritance mechanism is applied in the slave-side NI to support QoS during packets’ round-trip transfer in NoC. / Med utvecklingen av tillverkningsteknologi av on-chip och kraven på högpresterande da-toranläggning växer kärnantalet snabbt i Chip Multi/Many-core Processors (CMPs) ochMultiprocessor Systems-on-Chip (MPSoCs) för att stödja större parallellkörning. Network-on-Chip (NoC) har blivit den de facto lösningen för CMP:er och MPSoC:er för att mötakommunikationsutmaningen. I uppsatsen tar vi upp några viktiga problem med hög-presterande NoC-konstruktioner.Allmänna CMP:er omfattas ett fullständigt systemperspektiv för att design högprester-ande NoC för flertrådad program. Genom att utforska cachekoherensen under hela system-scenariot presenterar vi en smart kommunikationstjänst, AVCR (Advance Virtual ChannelReservation) för att tillhandahålla en motorväg till målpaket, vilket i hög grad kan min-ska deras förseningar i NoC. AVCR utnyttjar det faktum att vi kan veta eller förutsägadestinationen för vissa paket före deras ankomst till nätverksgränssnittet (Network inter-face, NI). Genom att utnyttja tidsintervallet innan ett paket är klart, etablerar AVCRen ände till ände motorväg från källan NI till destinationen NI. Denna motorväg byggsupp genom att reservera virtuell kanal (Virtual Channel, VC) resurser före målpaket-söverföringen och erbjuda prioriterade tjänster till flisar i den reserverade VC i wormholerouter. Dessutom föreslår vi också en tillträdeskontrollmetod i NoC med en centraliseradartificiellt neuronät (Artificial Neural Network, ANN) tillträdeskontroll, som kan förbättrasystemets prestanda genom att förutsäga den mest lämpliga injektionshastigheten för varjenod via nätverksprestationsinformationen. I onlinekontrollprocessen används en förbehan-dlingsenhet på data för att förenkla ANN-arkitekturen och göra förutsägningsresultatenmer korrekta. Baserat på den förbehandlade informationen bestämmer ANN-prediktornkontrollstrategin och sänder den till varje nod där tillträdeskontrollen kommer att tilläm-pas.För applikationsspecifika MPSoC:er fokuserar vi på att utveckla högpresterande NoCoch NI kompatibla med det gemensamma AMBA AXI4 protokoll. För att erbjuda möj-ligheten att använda AXI4-baserade processorer och kringutrustning i det on-chip baseradenätverkssystemet föreslår vi en hel systemarkitekturlösning för att göra AXI4 protokolletkompatibelt med den NoC-baserade kommunikation i det multikärnsystemet. På grundav den out-of-order överföring i NoC, som strider mot ordningskraven som anges i AXI4-protokollet, fokuserar vi i första hand på utformningen av transaktionsordningsenheterna,för att förverkliga en hög prestanda och låg kostnad-lösning på ordningskraven. Sedanfokuserar vi på NI och Quality of Service (QoS)-stödet i NoC. I vår design föreslås NI attgöra NoC-arkitekturen oberoende av AXI4-protokollet via meddelandeformatkonverteringmellan AXI4 signalformatet och paketformatet, vilket erbjuder NoC-designen hög flexi-bilitet. Den NoC-baserade kommunikationsarkitekturen är utformad för att stödja fleraQoS-schema med hög prestanda. NoC-systemet innehåller Time-Division Multiplexing(TDM) och VC-subnät för att tillämpa flera QoS-scheman på AXI4-signaler med olikaQoS-taggar och NI ansvarar för trafikdistribution mellan två subnät. Dessutom tillämpasen QoS-arvsmekanism i slav-sidan NI för att stödja QoS under paketets tur-returöverföringiNoC / <p>QC 20201008</p> Network-on-Chip Chip Multi/Many-core Processors Multiprocessor System-on-Chip High-Performance Computing Cache Coherence Virtual Channel Reservation Admission Control Artificial Neural Network AXI4 Quality of Service Network-on-Chip Chip Multi/Many-core Processors Multiprocessor Sys-tem on a Chip High-Performance Computing Cache Coherence Virtual Channel Reser-vation Admission Control Artificial Neural Network AXI4 Quality of Servic Computer Engineering Datorteknik Computer Systems Datorsystem
62	AGILER: An Adaptive Heterogeneous Tile-Based Many-Core Architecture for RISC-V Processors Kamaleldin, Ahmed, Göhringer, Diana 31 May 2024 (has links) Tile-based many-core architectures are extensively used in modern system-on-chip designs to achieve scalable computing performance with adequate energy efficiency. Heterogeneity is the key element to boost computing performance and keep energy consumption under certain limits for several application domains. However, the steady increase of using many custom heterogeneous tiles leads to an expansion in design and integration cost with limited tiles re-usability. The recent widespread of open-source RISC-V ISA provides the potential to develop modular compute units that can be used for many application domains with high reduction in non-recurring engineering costs. The motivation of this work is to bring design modularity and adaptability features for heterogeneous tile-based many-core architectures by increasing their flexibility to realize different many-core configurations with less design time and costs. In this work, AGILER is proposed as an adaptive tile-base many-core architecture for heterogeneous RISC-V based processors. The proposed architecture consists of modular and adaptable heterogeneous multi-/single-core compute tiles that supports 32-/64-bit RISC-V ISAs with different memory hierarchies. Inter-tile communication is developed based on a scalable network-on-chip architecture to achieve a high degree of system scalability. AGILER supports run-time adaptation through a custom internal reconfiguration manager for dynamic and partial reconfiguration over Xilinx FPGAs. Evaluation results demonstrate that the proposed architecture features a scalable computing performance up to 685 MOPS for 8 x 32-bit tiles and 316 MOPS for 8 x 64-bit tiles with a scalable memory bandwidth up to 7.4 GB/s. AGILER is evaluated on Xilinx Virtex UltrascaleC FPGA with a maximum reconfiguration time of 38.1 ms for a single compute tile. info:eu-repo/classification/ddc/004 ddc:004 info:eu-repo/classification/ddc/621.3 ddc:621.3
63	Analyse temporelle des systèmes temps-réels sur architectures pluri-coeurs / Many-Core Timing Analysis of Real-Time Systems Rihani, Hamza 01 December 2017 (has links) La prédictibilité est un aspect important des systèmes temps-réel critiques. Garantir la fonctionnalité de ces systèmespasse par la prise en compte des contraintes temporelles. Les architectures mono-cœurs traditionnelles ne sont plussuffisantes pour répondre aux besoins croissants en performance de ces systèmes. De nouvelles architectures multi-cœurssont conçues pour offrir plus de performance mais introduisent d'autres défis. Dans cette thèse, nous nous intéressonsau problème d’accès aux ressources partagées dans un environnement multi-cœur.La première partie de ce travail propose une approche qui considère la modélisation de programme avec des formules desatisfiabilité modulo des théories (SMT). On utilise un solveur SMT pour trouverun chemin d’exécution qui maximise le temps d’exécution. On considère comme ressource partagée un bus utilisant unepolitique d’accès multiple à répartition dans le temps (TDMA). On explique comment la sémantique du programme analyséet le bus partagé peuvent être modélisés en SMT. Les résultats expérimentaux montrent une meilleure précision encomparaison à des approches simples et pessimistes.Dans la deuxième partie, nous proposons une analyse de temps de réponse de programmes à flot de données synchroness'exécutant sur un processeur pluri-cœur. Notre approche calcule l'ensemble des dates de début d'exécution et des tempsde réponse en respectant la contrainte de dépendance entre les tâches. Ce travail est appliqué au processeur pluri-cœurindustriel Kalray MPPA-256. Nous proposons un modèle mathématique de l'arbitre de bus implémenté sur le processeur. Deplus, l'analyse de l'interférence sur le bus est raffinée en prenant en compte : (i) les temps de réponseet les dates de début des tâches concurrentes, (ii) le modèle d'exécution, (iii) les bancsmémoires, (iv) le pipeline des accès à la mémoire. L'évaluation expérimentale est réalisé sur desexemples générés aléatoirement et sur un cas d'étude d'un contrôleur de vol. / Predictability is of paramount importance in real-time and safety-critical systems, where non-functional properties --such as the timing behavior -- have high impact on the system's correctness. As many safety-critical systems have agrowing performance demand, classical architectures, such as single-cores, are not sufficient anymore. One increasinglypopular solution is the use of multi-core systems, even in the real-time domain. Recent many-core architectures, such asthe Kalray MPPA, were designed to take advantage of the performance benefits of a multi-core architecture whileoffering certain predictability. It is still hard, however, to predict the execution time due to interferences on sharedresources (e.g., bus, memory, etc.).To tackle this challenge, Time Division Multiple Access (TDMA) buses are often advocated. In the first part of thisthesis, we are interested in the timing analysis of accesses to shared resources in such environments. Our approach usesSatisfiability Modulo Theory (SMT) to encode the semantics and the execution time of the analyzed program. To estimatethe delays of shared resource accesses, we propose an SMT model of a shared TDMA bus. An SMT-solver is used to find asolution that corresponds to the execution path with the maximal execution time. Using examples, we show how theworst-case execution time estimation is enhanced by combining the semantics and the shared bus analysis in SMT.In the second part, we introduce a response time analysis technique for Synchronous Data Flow programs. These are mappedto multiple parallel dependent tasks running on a compute cluster of the Kalray MPPA-256 many-core processor. Theanalysis we devise computes a set of response times and release dates that respect the constraints in the taskdependency graph. We derive a mathematical model of the multi-level bus arbitration policy used by the MPPA. Further,we refine the analysis to account for (i) release dates and response times of co-runners, (ii)task execution models, (iii) use of memory banks, (iv) memory accesses pipelining. Furtherimprovements to the precision of the analysis were achieved by considering only accesses that block the emitting core inthe interference analysis. Our experimental evaluation focuses on randomly generated benchmarks and an avionics casestudy. Interférences sur ressources partagées Processeurs pluri-Cœurs Temps d’exécution pire-Cas Temps de réponse Analyse temporelle Système temps-Réel Shared resource interference Many-Core processors Worst-Case execution time Response time Timing analysis Real-Time systems 004
64	Akcelerace algoritmů na architektuře Larrabee / Algorithm Acceleration on Larrabee Platform Veselý, Ivo January 2010 (has links) Intel Larrabee is one of the first of fully programmable graphical architectures. Thesis describes this many-core architecture by hardware implementation and programmer's model point of view. Larrabee bets on many complete in-order cores, built over x86 instruction set. Cores contains four hardware threads, each with it's own register file, and new vector processing unit. Vector processing unit together with instruction set extension rapidly increases system performance. New cache modes helps to increase throughput even when irregular data structures. This architecture is not focused only on computer graphics nor image processing, but all parallel tasks. Second part of this text deals with hologram synthesis. Specifically, it brings two new methods for patch of point light sources generation with concrete radiation.
65	Réconcilier performance et prédictibilité sur un many-coeur en utilisant des techniques d'ordonnancement hors-ligne / Reconciling performance and predictability on a noc-based mpsoc using off-line scheduling techniques Fakhfakh, Manel 27 June 2014 (has links) Les réseaux-sur-puces (NoCs) utilisés dans les architectures multiprocesseurs-sur-puces posent des défis importants aux approches d'ordonnancement temps réel en ligne (dynamique) et hors-ligne (statique). Un NoC contient un grand nombre de points de contention potentiels, a une capacité de bufferisation limitée et le contrôle réseau fonctionne à l'échelle de petits paquets de données. Par conséquent, l'allocation efficace de ressources nécessite l'utilisation des algorithmes da faible complexité sur des modèles de matériel avec un niveau de détail sans précédent dans l'ordonnancement temps réel. Nous considérons dans cette thèse une approche d'ordonnancement statique sur des architectures massivement parallèles (Massively parallel processor arrays ou MPPAs) caractérisées par un grand nombre (quelques centaines) de c¿urs de calculs. Nous identifions les mécanismes matériels facilitant l'analyse temporelle et l'allocation efficace de ressources dans les MPPAs existants. Nous déterminons que le NoC devrait permettre l'ordonnancement hors-ligne de communications, d'une manière synchronisée avec l'ordonnancement de calculs sur les processeurs. Au niveau logiciel, nous proposons une nouvelle méthode d'allocation et d'ordonnancement capable de synthétiser des ordonnancements globaux de calculs et de communications couvrants toutes les ressources d'exécution, de communication et de la mémoire d'un MPPA. Afin de permettre une utilisation efficace de ressources du matériel, notre méthode prend en compte les spécificités architecturales d'un MPPA et implémente des techniques d'ordonnancement avancées comme la préemption pré-calculée de transmissions de données. Nous avons évalué n / On-chip networks (NoCs) used in multiprocessor systems-on-chips (MPSoCs) pose significant challenges to both on-line (dynamic) and off-line (static) real-time scheduling approaches. They have large numbers of potential contention points, have limited internal buffering capabilities, and network control operates at the scale of small data packets. Therefore, efficient resource allocation requires scalable algorithms working on hardware models with a level of detail that is unprecedented in real-time scheduling. We consider in this thesis a static scheduling approach, and we target massively parallel processor arrays (MPPAs), which are MPSoCs with large numbers (hundreds) of processing cores. We first identify and compare the hardware mechanisms supporting precise timing analysis and efficient resource allocation in existing MPPA platforms. We determine that the NoC should ideally provide the means of enforcing a global communications schedule that is computed off-line (before execution) and which is synchronized with the scheduling of computations on processors. On the software side, we propose a novel allocation and scheduling method capable of synthesizing such global computation and communication schedules covering all the execution, communication, and memory resources in an MPPA. To allow an efficient use of the hardware resources, our method takes into account the specificities of MPPA hardware and implements advanced scheduling techniques such as pre-computed preemption of data transmissions. We evaluate our technique by mapping two signal processing applications, for which we obtain good latency, throughput, and resource use figures. Multiprocesseurs-sur-puce (many-coeur) Réseau-sur-puce (NoC) Ordonnancement hors-ligne Ordonnancement temps réel Chip-multiprocessor (Many-core) On-chip network (NoC) Off-line scheduling Real-time scheduling 004.2
66	Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems Hashmi, Jahanzeb Maqbool 17 September 2020 (has links) No description available. Computer Science

Page generated in 0.0412 seconds