Global ETD Search

111	Improved Heuristics for Partitioned Multiprocessor Scheduling Based on Rate-Monotonic Small-Tasks Müller, Dirk, Werner, Matthias 01 November 2012 (has links) Partitioned preemptive EDF scheduling is very similar to bin packing, but there is a subtle difference. Estimating the probability of schedulability under a given total utilization has been studied empirically before. Here, we show an approach for closed-form formulae for the problem, starting with n = 3 tasks on m = 2 processors. info:eu-repo/classification/ddc/004 ddc:004 Scheduling; Heuristik
112	Dedicated Hardware Context-Switch Services for Real-Time Multiprocessor Systems Allard, Yannick 07 November 2017 (has links) (PDF) Computers are widely present in our daily life and are used in critical applic-ations like cars, planes, pacemakers. Those real-time systems are nowadaysbased on processors which have an increasing complexity and have specifichardware services designed to reduce task preemption and migration over-heads. However using those services can add unpredictable overheads whenthe system has to switch from one task to another in some cases.This document screens existing solutions used in commonly availableprocessors to ease preemption and migration to highlight their strengths andweaknesses. A new hardware service is proposed to speed up task switchingat the L1 cache level, to reduce context switch overheads and to improvesystem predictability.The solution presented is based on stacking several identical cachememories at the L1 level. Each layer is able to save and restore its completestate independently to/from the main memory. One layer can be used forthe active task running on the processor while another layers can be restoredor saved concurrently. The active task can remain in execution until thepreempting task is ready in another layer after restoration from the mainmemory. The context switch between tasks can then be performed in avery short time by switching to the other layer which is now ready to runthe preempting task. Furthermore, the task will be resumed with the exactL1 cache memory state as saved earlier after the previous preemption. Theprevious task state can be sent back to the main memory for future use.Using this mechanism can lead to minimise the time required for migrationsand preemptions and consequently lower overheads and limit cache missesdue to preemptions and usually considered in the cache migration andpreemption delays. Isolation between tasks is also provided as they areexecuted from a dedicated layer.Both uniprocessor and multiprocessor designs are presented along withimplications on the real-time theory induced by the use of this hardware ser-vice. An implementation of the system is characterized and results show im-provements to the maximum and average execution time of a set of varioustasks: When the same size is used for the baseline cache and HwCS layers,94% of the tasks have a better execution time (up to 67%) and 80% have a bet-ter Worst Case Execution Time (WCET). 80% of the tasks are more predictableand the remaining 20% still have a better execution time. When we split thebaseline cache size among layers of the HwCS, measurements show that 75%of the tasks have a better execution time (up to 67%) leading to 50% of thetasks having a better WCET. Only 6% of the tasks suffer from worse executiontime and worse predictability while 75% of the tasks remain more predictablewhen using the HwCS compared to the baseline cache. / Les ordinateurs ont envahi notre quotidien et sont de plus en plus souventutilisés pour remplir des missions critiques. Ces systèmes temps réel sontbasés sur des processeurs dont la complexité augmente sans cesse. Des ser-vices matériels spécifiques permettent de réduire les coûts de préemption etmigration. Malheureusement, ces services ajoutent des temps morts lorsquele système doit passer d’une tâche à une autre.Ce document expose les solutions actuelles utilisées dans les processeurscourants pour mettre en lumière leurs qualités et défauts. Un nouveau ser-vice matériel (HwCS) est proposé afin d’accélérer le changement de tâches aupremier niveau de mémoire (L1) et de réduire ainsi les temps morts dus auxchangements de contextes tout en améliorant la prédictibilité du système.Bien que cette thèse se concentre sur le cache L1, le concept développépeut également s’appliquer aux autres niveaux de mémoire ainsi qu’àtout bloc dépendant du contexte. La solution présentée se base sur unempilement de caches identiques au premier niveau. Chaque couche del’empilement est capable de sauvegarder ou recharger son état vers/depuisla mémoire principale du système en toute autonomie. Une couche peutêtre utilisée par la tâche active pendant qu’une autre peut sauvegarder ourestaurer l’état d’une autre tâche. La tâche active peut ainsi poursuivre sonexécution en attendant que la tâche suivante soit rechargée. Le changementde contexte entre la tâche active et la tâche suivante peut alors avoir lieu enun temps très court. De plus, la tâche reprendra son exécution sur un cacheL1 dont l’état sera identique à celui au moment où elle a été interrompueprécédemment. L’état du cache de la tâche désormais inactive peut êtresauvegardé dans la mémoire principale en vue d’une utilisation ultérieure.Ce mécanisme permet de réduire au strict minimum le temps de calculperdu à cause des préemptions et migrations, les temps de sauvegarde et derechargement de la L1 n’ayant plus d’influence sur l’exécution des tâches. Deplus, chaque niveau étant dédié à une tâche, les interférences entre tâchessont réduites.Les propriétés ainsi que les implications sur les aspects temps réelsthéoriques sont présentées pour des systèmes mono et multiprocesseurs.Une implémentation d’un système uniprocesseur incluant ce servicematériel et sa caractérisation par rapport à l’exécution d’un set de tâchessont également présentées ainsi que les bénéfices apportés par le HwCS:Lorsque les couches du HwCS ont la même taille que le cache de base, 94%des tâches ont un meilleur temps d’exécution (jusqu’à 67%) et 80% ont unmeilleur pire temps d’exécution (WCET). 80% des tâches deviennent plusprédictibles et les 20% restants bénéficient néanmoins d’un meilleur WCET.Toutefois, si la taille du cache est partagée entre les couches du HwCS, lesmesures montrent que 75% des tâches ont un meilleur temps d’exécution,impliquant un meilleur WCET pour la moitié des tâches du système. Seule-ment 6% des tâches voient leur WCET augmenter et leur prédictibilitédiminuer tandis que 75% des tâches améliorent leur prédictibilité grâce auHwCS. / Doctorat en Sciences de l'ingénieur et technologie / info:eu-repo/semantics/nonPublished Technologie informatique hardware Analyse de systèmes informatiques Multiprocessor systems hardware context-switch real-time systems dedicated hardware services
113	ERIS live: A NUMA-aware in-memory storage engine for tera-scale multiprocessor systems Kiefer, Tim, Kissinger, Thomas, Schlegel, Benjamin, Habich, Dirk, Molka, Daniel, Lehner, Wolfgang 12 August 2022 (has links) The ever-growing demand for more computing power forces hardware vendors to put an increasing number of multiprocessors into a single server system, which usually exhibits a non-uniform memory access (NUMA). In-memory database systems running on NUMA platforms face several issues such as the increased latency and the decreased bandwidth when accessing remote main memory. To cope with these NUMA-related issues, a DBMS has to allow flexible data partitioning and data placement at runtime. In this demonstration, we present ERIS, our NUMA-aware in-memory storage engine. ERIS uses an adaptive partitioning approach that exploits the topology of the underlying NUMA platform and significantly reduces NUMA-related issues. We demonstrate throughput numbers and hardware performance counter evaluations of ERIS and a NUMA-unaware index for different workloads and configurations. All experiments are conducted on a standard server system as well as on a system consisting of 64 multiprocessors, 512 cores, and 8 TBs main memory. info:eu-repo/classification/ddc/004 ddc:004
114	Reactive and Proactive Fault-Tolerant Network-on-Chip Architectures using Machine Learning DiTomaso, Dominic F. January 2015 (has links) No description available. Computer Engineering Electrical Engineering Network-on-Chip Fault-tolerance Machine Learning Proactive Fault-tolerant technique Chip Multiprocessor Error Prediction
115	Using maximal feasible subset of constraints to accelerate a logic-based Benders decomposition scheme for a multiprocessor scheduling problem Grgic, Alexander, Andersson, Filip January 2022 (has links) Logic-based Benders decomposition (LBBD) is a strategy for solving discrete optimisation problems. In LBBD, the optimisation problem is divided into a master problem and a subproblem and each part is solved separately. LBBD methods that combine mixed-integer programming and constraint programming have been successfully applied to solve large-scale scheduling and resource allocation problems. Such combinations typically solve an assignment-type master problem and a scheduling-type subproblem. However, a challenge with LBBD methods that have feasibility subproblems are that they do not provide a feasible solution until an optimal solution is found. In this thesis, we show that feasible solutions can be obtained by finding and combining feasible parts of an infeasible master problem assignment. We use these insights to develop an acceleration technique for LBBD that solves a series of subproblems, according to algorithms for constructing a maximal feasible subset of constraints (MaFS). Using a multiprocessor scheduling problem as a benchmark, we study the computational impact from using this technique. We evaluate three variants of LBBD schemes. The first uses MaFS, the second uses irreducible subset of constraints (IIS) and the third combines MaFS with IIS. Computational tests were performed on an instance set of multiprocessor scheduling problems. In total, 83 instances were tested, and their number of tasks varied between 2794 and 10,661. The results showed that when applying our acceleration technique in the decomposition scheme, the pessimistic bounds were strong, but the convergence was slow. The decomposition scheme combining our acceleration technique with the acceleration technique using IIS showed potential to accelerate the method. logic-based Benders decomposition maximal feasible subset of constraints multiprocessor scheduling Computational Mathematics Beräkningsmatematik
116	Hardware-Software Co-Design for Sensor Nodes in Wireless Networks Zhang, Jingyao 11 June 2013 (has links) Simulators are important tools for analyzing and evaluating different design options for wireless sensor networks (sensornets) and hence, have been intensively studied in the past decades. However, existing simulators only support evaluations of protocols and software aspects of sensornet design. They cannot accurately capture the significant impacts of various hardware designs on sensornet performance. As a result, the performance/energy benefits of customized hardware designs are difficult to be evaluated in sensornet research. To fill in this technical void, in first section, we describe the design and implementation of SUNSHINE, a scalable hardware-software emulator for sensornet applications. SUNSHINE is the first sensornet simulator that effectively supports joint evaluation and design of sensor hardware and software performance in a networked context. SUNSHINE captures the performance of network protocols, software and hardware up to cycle-level accuracy through its seamless integration of three existing sensornet simulators: a network simulator TOSSIM, an instruction-set simulator SimulAVR and a hardware simulator GEZEL. SUNSHINE solves several sensornet simulation challenges, including data exchanges and time synchronization across different simulation domains and simulation accuracy levels. SUNSHINE also provides hardware specification scheme for simulating flexible and customized hardware designs. Several experiments are given to illustrate SUNSHINE's simulation capability. Evaluation results are provided to demonstrate that SUNSHINE is an efficient tool for software-hardware co-design in sensornet research. Even though SUNSHINE can simulate flexible sensor nodes (nodes contain FPGA chips as coprocessors) in wireless networks, it does not estimate power/energy consumption of sensor nodes. So far, no simulators have been developed to evaluate the performance of such flexible nodes in wireless networks. In second section, we present PowerSUNSHINE, a power- and energy-estimation tool that fills the void. PowerSUNSHINE is the first scalable power/energy estimation tool for WSNs that provides an accurate prediction for both fixed and flexible sensor nodes. In the section, we first describe requirements and challenges of building PowerSUNSHINE. Then, we present power/energy models for both fixed and flexible sensor nodes. Two testbeds, a MicaZ platform and a flexible node consisting of a microcontroller, a radio and a FPGA based co-processor, are provided to demonstrate the simulation fidelity of PowerSUNSHINE. We also discuss several evaluation results based on simulation and testbeds to show that PowerSUNSHINE is a scalable simulation tool that provides accurate estimation of power/energy consumption for both fixed and flexible sensor nodes. Since the main components of sensor nodes include a microcontroller and a wireless transceiver (radio), their real-time performance may be a bottleneck when executing computation-intensive tasks in sensor networks. A coprocessor can alleviate the burden of microcontroller from multiple tasks and hence decrease the probability of dropping packets from wireless channel. Even though adding a coprocessor would gain benefits for sensor networks, designing applications for sensor nodes with coprocessors from scratch is challenging due to the consideration of design details in multiple domains, including software, hardware, and network. To solve this problem, we propose a hardware-software co-design framework for network applications that contain multiprocessor sensor nodes. The framework includes a three-layered architecture for multiprocessor sensor nodes and application interfaces under the framework. The layered architecture is to make the design of multiprocessor nodes' applications flexible and efficient. The application interfaces under the framework are implemented for deploying reliable applications of multiprocessor sensor nodes. Resource sharing technique is provided to make processor, coprocessor and radio work coordinately via communication bus. Several testbeds containing multiprocessor sensor nodes are deployed to evaluate the effectiveness of our framework. Network experiments are executed in SUNSHINE emulator to demonstrate the benefits of using multiprocessor sensor nodes in many network scenarios. / Ph. D. Sensor networks multiprocessor sensor node Field programmable gate arrays simulator hardware-software co-design power/energy estimation testbeds
117	Um ambiente de execução para suporte à programação paralela com variáveis compartilhadas em sistemas distribuídos heterogêneos. / A runtime system for parallel programing with shared memory paradigm over a heterogeneus distributed systems. Craveiro, Gisele da Silva 31 October 2003 (has links) O avanço na tecnologia de hardware está permitindo que máquinas SMP de 2 a 8 processadores estejam disponíveis a um custo cada vez menor, possibilitando que a incorporação de tais máquinas em aglomerados de PC's ou até mesmo a composição de um aglomerado de SMP's sejam alternativas cada vez mais viáveis para computação de alto desempenho. O grande desafio é extrair o potencial que tal conjunto de máquinas oferece. Uma alternativa é usar um paradigma híbrido de programação para aproveitar a arquitetura de memória compartilhada através de multihreadeing e utilizar o modelo de troca de mensagens para comunicação entre os nós. Contudo, essa estratégia impõe uma tarefa árdua e pouco produtiva para o programador da aplicação. Este trabalho apresenta o sistema CPAR- Cluster que oferece uma abstração de memória compartilhada no topo de um aglomerado formado por nós mono e multiprocessadores. O sistema é implementado no nível de biblioteca e não faz uso de recursos especiais tais como hardware especializado ou alteração na camada de sistema operacional. Serão apresentados os modelos, estratégias, questões de implementação e os resultados obtidos através de testes realizados com a ferramenta e que apresentaram comportamento esperado. / The advance in hardware technologies is making small configuration SMP machines (from 2 to 8 processors) available at a low cost. For this reason, the inclusion of an SMP node into a cluster of PCs or even clusters of SMPs are becoming viable alternatives for high performance computing. The challenge is the exploitation of the computational resources that these platforms provide. A Hybrid programming paradigm which uses shared memory architecture through multihreading and also message passing model for inter node communication is an alternative. However, programming in such paradigm is very hard. This thesis presents CPAR- Cluster, a runtime system, that provides shared memory abstraction on top of a cluster composed by mono and multiprocessor nodes. Its implementation is at the library level and doesn't require special resources such as particular hardware or operating system moditfications. Models, strategies, implementation aspects and results will be presented. cluter of mono and multiprocessor nodes CPAR distributed shared memory heterogeneous distributed system parallel programing programação paralela CPAR sistema distribuído heterogêneo
118	Performance and energy efficiency via an adaptive MorphCore architecture Khubaib 09 July 2014 (has links) The level of Thread-Level Parallelism (TLP), Instruction-Level Parallelism (ILP), and Memory-Level Parallelism (MLP) varies across programs and across program phases. Hence, every program requires different underlying core microarchitecture resources for high performance and/or energy efficiency. Current core microarchitectures are inefficient because they are fixed at design time and do not adapt to variable TLP, ILP, or MLP. I show that if a core microarchitecture can adapt to the variation in TLP, ILP, and MLP, significantly higher performance and/or energy efficiency can be achieved. I propose MorphCore, a low-overhead adaptive microarchitecture built from a traditional OOO core with small changes. MorphCore adapts to TLP by operating in two modes: (a) as a wide-width large-OOO-window core when TLP is low and ILP is high, and (b) as a high-performance low-energy highly-threaded in-order SMT core when TLP is high. MorphCore adapts to ILP and MLP by varying the superscalar width and the out-of-order (OOO) window size by operating in four modes: (1) as a wide-width large-OOO-window core, 2) as a wide-width medium-OOO-window core, 3) as a medium-width large-OOO-window core, and 4) as a medium-width medium-OOO-window core. My evaluation with single-thread and multi-thread benchmarks shows that when highest single-thread performance is desired, MorphCore achieves performance similar to a traditional out-of-order core. When energy efficiency is desired on single-thread programs, MorphCore reduces energy by up to 15% (on average 8%) over an out-of-order core. When high multi-thread performance is desired, MorphCore increases performance by 21% and reduces energy consumption by 20% over an out-of-order core. Thus, for multi-thread programs, MorphCore's energy efficiency is similar to highly-threaded throughput-optimized small and medium core architectures, and its performance is two-thirds of their potential. / text Computer architecture Microprocessor Thread-level parallelism Instruction-level parallelism Memory-level parallelism Adaptive microprocessor Energy efficiency High performance Power efficiency Chip-multiprocessor Heterogeneous computing
119	Trumpųjų bangų sklidimo modelis daugiaprocesorinėje aplinkoje / Development of the model of short wave propagation by using multi-processor environment Mickus, Mykolas 04 November 2013 (has links) Tampriosios bangos (arba akustinės ar bet kokios kitos bangos) sklidimo tyrimai yra svarbūs tokiose srityse kaip seismologija arba neardantis medžiagos testavimas. Tamprioje srityje šis reiškinys aprašomas tampriosios bangos dinamine diferencialine lygtimi. Tačiau šios lygties sprendimas naudojant tokius skaitinius metodus kaip baigtiniai elementai reikalauja sritį padalinti į milijonus elementų. Naujų skaičiavimo technologijų kaip bendros paskirties grafiniai procesoriai (GPU) atsiradimas skaičiavimų laiką leidžia ženkliai sumažinti, tačiau algoritmai turi būti specialiai pritaikomi. Todėl šiame darbe koncentruojamasi į trumpos tampriosios bangos baigtinių elementų modelio sukūrimą ir algoritmų tobulinimą naudojant GPU bei pagrindinį procesorių (CPU). Lygties integravimui buvo pasirinktas centrinių skirtumų metodo (CSM) schema. Ši integravimo schema buvo modifikuota taip, kad būtų galima išskirti tris integravimo algoritmo etapus: išorinės jėgos įvertinimas, elementų deformacijos sąlygotų jėgų įvertinimas bei magų poslinkių, greičių ir jėgų perskaičiavimas. Remiantis strategija pasiūlyta [1] šaltinyje, buvo sukurti lygiagretūs algoritmai 2 ir 3 etapo skaičiavimams atlikti. Toliau antrojo etapo algoritmas buvo optimizuotas 2 kartus. Pirmiausia buvo atsisakyta elementų mazgų indeksų masyvo: tai skaičiavimo laiką sumažino 20%. Po to algoritmas buvo modifikuotas taip, kad elementus būtų galima apdoroti blokais kaip siūloma [12] ir [22] šaltiniuose. Skaičiavimo laiką tai leido... [toliau žr. visą tekstą] / Understanding elastic wave (or acoustic or any other type of wave for that matter) phenomenon is of great importance in areas such as seismology or non destructive testing (NDT). This phenomenon in case of elastic environment is described by dynamic elastic differential equations. However, computational models like finite element method consumes huge amounts of computational power as even for relatively small problems require dividing area of interest into millions of elements. In the advent of general purpose GPU computing new opportunities for speeding up computations as well as challenges for developing high performance algorithms suited for new kinds of processors arise. Therefore this work concentrates on developing a finite element based short elastic wave propagation model on GPU as well as CPU. Central difference explicit wave equation integration scheme has been chosen. It then was slightly modified in order to separate integration algorithm into three phases: external force evaluation, evaluation of forces that occur due to stresses of elements and recalculation of node shifts, speeds and forces. A parallel algorithm has been developed for executing third and seconds phases, based on strategy suggested in [1]. Then the algorithm of the second phase has been optimized 2 times: at first the array of element node indices was eliminated yielding 20% performance boost, then modifications have been made to process elements in blocks by using strategy described at [22]... [to full text] Informatics Baigtinių elementų metodas Trumpoji tamprioji banga Daugiaprocesorinė aplinka OpenCL GPGPU Finite element method Short elastic wave Multiprocessor environment OpenCL GPGPU
120	New Techniques for Building Timing-Predictable Embedded Systems Guan, Nan January 2013 (has links) Embedded systems are becoming ubiquitous in our daily life. Due to close interaction with physical world, embedded systems are typically subject to timing constraints. At design time, it must be ensured that the run-time behaviors of such systems satisfy the pre-specified timing constraints under any circumstance. In this thesis, we develop techniques to address the timing analysis problems brought by the increasing complexity of underlying hardware and software on different levels of abstraction in embedded systems design. On the program level, we develop quantitative analysis techniques to predict the cache hit/miss behaviors for tight WCET estimation, and study two commonly used replacement policies, MRU and FIFO, which cannot be analyzed adequately using the state-of-the-art qualitative cache analysis method. Our quantitative approach greatly improves the precision of WCET estimation and discloses interesting predictability properties of these replacement policies, which are concealed in the qualitative analysis framework. On the component level, we address the challenges raised by multi-core computing. Several fundamental problems in multiprocessor scheduling are investigated. In global scheduling, we propose an analysis method to rule out a great part of impossible system behaviors for better analysis precision, and establish conditions to guarantee the bounded responsiveness of computing tasks. In partitioned scheduling, we close a long standing open problem to generalize the famous Liu and Layland's utilization bound in uniprocessor real-time scheduling to multiprocessor systems. We also propose to use cache partitioning for multi-core systems to avoid contentions on shared caches, and solve the underlying schedulability analysis problem. On the system level, we present techniques to improve the Real-Time Calculus (RTC) analysis framework in both efficiency and precision. First, we have developed Finitary Real-Time Calculus to solve the scalability problem of the original RTC due to period explosion. The key idea is to only maintain and operate on a limited prefix of each curve that is relevant to the final results during the whole analysis procedure. We further improve the analysis precision of EDF components in RTC, by precisely bounding the response time of each computation request. Real-time systems WCET analysis cache analysis abstract interpretation multiprocessor scheduling fixed-priority scheduling EDF multi-core processors response time analysis utilization bound real-time calculus scalability

Search results