Global ETD Search

1	Profiling Concurrent Programs Using Hardware Counters Lessard, Josh January 2005 (has links) Concurrency is a programming tool that is widely used in applications. Concurrent user-level threads can be used to structure the execution of a program in a uniprocessor environment and/or speed up its execution in a multiprocessor setting. Unfortunately, threads may interact with each other in unpredictable ways, often leading to performance problems that are nonexistent in the sequential domain. <br /><br /> A profiler can be used to help locate performance problems in sequential and concurrent programs. A profiler is a tool that monitors, analyzes, and visualizes the execution performance of a program to help users verify its expected behaviour, and locate its bottlenecks and hotspots. One of the important tools a profiler has at its disposal is a set of hardware counters, which are specialized CPU registers that count the occurrences of hardware events as a program executes. Hardware-event counts provide extremely precise insight into the execution behaviour of a program, and can be used to pinpoint portions of code where performance is suboptimal. <br /><br /> This thesis describes the design and implementation of <em>µ</em>Profiler, which is a profiler for sequential and concurrent programs written in a concurrent dialect of the C++ programming language called <em>??</em>C++. <em>??</em>C++ offers user-level concurrency in a uniprocessor or multiprocessor shared-memory environment. A new architecture-abstraction layer is developed, which allows <em>??</em>Profiler to access hardware counters on multiple CPU types. As well, two new profiling metrics are presented, which use the architecture-abstraction layer to gather hardware-event counts for <em>??</em>C++ programs. These metrics offer performance information about <em>??</em>C++ programs that is unavailable by any other means. Computer Science concurrent profiling hardware counters
2	Profiling Concurrent Programs Using Hardware Counters Lessard, Josh January 2005 (has links) Concurrency is a programming tool that is widely used in applications. Concurrent user-level threads can be used to structure the execution of a program in a uniprocessor environment and/or speed up its execution in a multiprocessor setting. Unfortunately, threads may interact with each other in unpredictable ways, often leading to performance problems that are nonexistent in the sequential domain. <br /><br /> A profiler can be used to help locate performance problems in sequential and concurrent programs. A profiler is a tool that monitors, analyzes, and visualizes the execution performance of a program to help users verify its expected behaviour, and locate its bottlenecks and hotspots. One of the important tools a profiler has at its disposal is a set of hardware counters, which are specialized CPU registers that count the occurrences of hardware events as a program executes. Hardware-event counts provide extremely precise insight into the execution behaviour of a program, and can be used to pinpoint portions of code where performance is suboptimal. <br /><br /> This thesis describes the design and implementation of <em>µ</em>Profiler, which is a profiler for sequential and concurrent programs written in a concurrent dialect of the C++ programming language called <em>µ</em>C++. <em>µ</em>C++ offers user-level concurrency in a uniprocessor or multiprocessor shared-memory environment. A new architecture-abstraction layer is developed, which allows <em>µ</em>Profiler to access hardware counters on multiple CPU types. As well, two new profiling metrics are presented, which use the architecture-abstraction layer to gather hardware-event counts for <em>µ</em>C++ programs. These metrics offer performance information about <em>µ</em>C++ programs that is unavailable by any other means. Computer Science concurrent profiling hardware counters
3	Controlling execution time variability using COTS for Safety-critical systems / Contrôler la variabilité du temps d’exécution en utilisant COTS pour les systèmes Safety-critical Bin, Jingyi 10 July 2014 (has links) Au cours de la dernière décennie, le domaine safety-critical s’appuie sur les Commercial Off-The-Shelf (COTS) architectures de mono-coeur malgré leur variabilité du temps d'exécution inhérent. Aujourd'hui, l'industrie safety-critical envisage la possibilité d'utilisation des COTS de multi-coeur en tenant compte de la demande croissante de performance. Cependant, le passage de mono-coeur à multi-coeur aggrave le problème de variabilité du temps d'exécution dû à la contention de ressources partagées. Les techniques standard pour gérer cette variabilité comme sur-approvisionnement de ressources ne peuvent pas être appliquées à multi-coeur en considérant que les safety-marges compenseront la plupart voire tout le gain de performance donné par les multi-coeurs. Une solution possible serait de capturer le comportement des mécanismes de contention potentielle sur les ressources partagées relativement à chaque application co-fonctionnant sur le système. Malheureusement, les caractéristiques sur les mécanismes de contention ne sont pas généralement clairement documentées. Dans la thèse, nous introduisons les techniques de mesure basées sur un ensemble de stressing benchmarks et les hardware monitors à caractériser 1) l'architecture en identifiant les ressources partagées et en étudiant leur mécanisme de contention. 2) les applications en étudiant comment elles se comportent relativement aux ressources partagées. Sur la base de ces informations, nous proposons une technique à estimer le WCET d'une application dans un co-running contexte prédéterminé en simulant le pire cas des contentions sur les ressources partagées produites par co-runners de l'application. / While relying during the last decade on single-core Commercial Off-The-Shelf (COTS) architectures despite their inherent runtime variability, the safety critical industry is now considering a shift to multi-core COTS in order to match the increasing performance requirement. However, the shift to multi-core COTS worsens the runtime variability issue due to the contention on shared hardware resources. Standard techniques to handle this variability such as resource over-provisioning cannot be applied to multi-cores as additional safety margins will offset most if not all the multi-core performance gains. A possible solution would be to capture the behavior of potential contention mechanisms on shared hardware resources relatively to each application co-running on the system. However, the features on contention mechanisms are usually very poorly documented. In this thesis, we introduce measurement techniques based on a set of dedicated stressing benchmarks and architecture hardware monitors to characterize (1) the architecture, by identifying the shared hardware resources and revealing their associated contention mechanisms. (2) the applications, by learning how they behave relatively to shared resources. Based on such information, we propose a technique to estimate the WCET of an application in a pre-determined co-running context by simulating the worst case contention on shared resources produced by the application's co-runners. Safety-critical Multi-coeur WCET Compteurs de performance Safety-critical Multi-core WCET Hardware counters
4	Evaluation of cache memory configurations with performance monitoring in embedded real-time automotive systems : Determining performance characteristics of cache memory with hardware counters and software profiling. / Utvärdning av cacheminnekonfigurationer med prestandamätning i realtidsstyrda fordonssystem : Bestämning av prestandaegenskaper i cacheminnen med hårdvaruräknare och mjukvaruprofilering Westman, Andreas January 2022 (has links) Modern day automotive systems are highly dependent on real-time software control to manage the powertrain and high-level features, such as cruise control. The computational power available has increased tremendously from decades of microcontroller and hardware development on such platforms. In contrast, the access times to the memory are still substantial, creating a significant bottleneck in the system. Therefore, small cache memories are used to reduce access times and improve performance. With significantly smaller but faster memory, the configuration and behaviour of the cache play an important role and are also highly dependent on the platform. Several of the configurations have an impact on the platform behaviour not only in terms of execution time, but also in multithreaded coherency, robustness, security, and internal bus usage. To distinguish performance differences and cache behaviour between configurations, hardware counters and low-level processor events such as bus usage, line fills, reads, and writes are monitored in conjunction with task load profiling. This proves to be an effective measurement method for use in a real-time embedded automotive system to provide both average and worstcase scenarios. In addition, the collected results are used to suggest improvements to the configuration of the platform used for measurements. For example, no major performance benefits were measured from excluding certain parts of the memory to increase hit rate. Less robust write-policies copy-back proved to be more efficient and could be used in combination with error correction to increase security. Memory coherency in multithreaded execution also proved to be inefficient and a major source to increased miss-rate due to snooping. / Moderna fordonssystem är idag mycket beroende av realtidsmjukvara för att effektivt kontrollera både drivlina och med användarfunktioner som till exempel farthållare. Beräkningskraften tillgänglig på de mikrokontroller som används har ökat kraftigt från årtionden av utveckling. Åtkomsttiden mellan processorn och minnet är däremot fortfarande stor och skapar en stor flaskhals i systemet. För att minska åtkomsttiden används cacheminnen med mycket hög prestanda och begränsad minnesmängd. Med väsentligt mindre och snabbare cacheminnen krävs optimerade konfigurationer för att utnyttja minnet effektivt, vilket kan vara svårt då användningen och prestandan är varierande för olika system. Fler cachekonfigurationer påverkar systemet i mer än bara exekveringstid utan och i minnessynkronisering, tillförlitlighet, säkerhet och intern bussanvändning. För att särskilja olika prestandaegenskaper mellan olika konfigurationer används hårdvaruräknare och processorhändelser som bussanvändning, radändringar, läsningar och skrivningar i kombination med profilering av processoranvändning. Det visar sig vara en effektiv metod för att utvärdera olika scenarion som bästa-, sämsta-, och medelfall i realtidssystem i fordon. Utöver det, används resultaten för att föreslå nya konfigurationsförbättringar på plattformen som användes. Några exempel på detta är hur försök till att förbättra minnesträffar i cacheminnet genom att exkludera vissa typer av minnessektioner inte gav någon prestandaförbättring. Mindre tillförlitliga skrivmetoder som copy-back visade sig vara mer effektiva och kunde användas i kombination med feldetektering för att förbättra säkerheten. cache memory real-time hardware counters automotive cacheminne realtid hårdvaruräkare fordon Computer and Information Sciences Data- och informationsvetenskap
5	Controlling execution time variability using COTS for Safety-critical systems Bin, Jingyi 10 July 2014 (has links) (PDF) While relying during the last decade on single-core Commercial Off-The-Shelf (COTS) architectures despite their inherent runtime variability, the safety critical industry is now considering a shift to multi-core COTS in order to match the increasing performance requirement. However, the shift to multi-core COTS worsens the runtime variability issue due to the contention on shared hardware resources. Standard techniques to handle this variability such as resource over-provisioning cannot be applied to multi-cores as additional safety margins will offset most if not all the multi-core performance gains. A possible solution would be to capture the behavior of potential contention mechanisms on shared hardware resources relatively to each application co-running on the system. However, the features on contention mechanisms are usually very poorly documented. In this thesis, we introduce measurement techniques based on a set of dedicated stressing benchmarks and architecture hardware monitors to characterize (1) the architecture, by identifying the shared hardware resources and revealing their associated contention mechanisms. (2) the applications, by learning how they behave relatively to shared resources. Based on such information, we propose a technique to estimate the WCET of an application in a pre-determined co-running context by simulating the worst case contention on shared resources produced by the application's co-runners. Safety-critical Multi-core WCET Hardware counters
6	Généralisation de l’analyse de performance décrémentale vers l’analyse différentielle / Generalization of the decremental performance analysis to differential analysis Bendifallah, Zakaria 17 September 2015 (has links) Une des étapes les plus cruciales dans le processus d’analyse des performances d’une application est la détection des goulets d’étranglement. Un goulet étant tout évènement qui contribue à l’allongement temps d’exécution, la détection de ses causes est importante pour les développeurs d’applications afin de comprendre les défauts de conception et de génération de code. Cependant, la détection de goulets devient un art difficile. Dans le passé, des techniques qui reposaient sur le comptage du nombre d’évènements, arrivaient facilement à trouver les goulets. Maintenant, la complexité accrue des micro-architectures modernes et l’introduction de plusieurs niveaux de parallélisme ont rendu ces techniques beaucoup moins efficaces. Par conséquent, il y a un réel besoin de réflexion sur de nouvelles approches.Notre travail porte sur le développement d’outils d’évaluation de performance des boucles de calculs issues d’applications scientifiques. Nous travaillons sur Decan, un outil d’analyse de performance qui présente une approche intéressante et prometteuse appelée l’Analyse Décrémentale. Decan repose sur l’idée d’effectuer des changements contrôlés sur les boucles du programme et de comparer la version obtenue (appelée variante) avec la version originale, permettant ainsi de détecter la présence ou pas de goulets d’étranglement.Tout d’abord, nous avons enrichi Decan avec de nouvelles variantes, que nous avons conçues, testées et validées. Ces variantes sont, par la suite, intégrées dans une analyse de performance poussée appelée l’Analyse Différentielle. Nous avons intégré l’outil et l’analyse dans une méthodologie d’analyse de performance plus globale appelée Pamda.Nous décrirons aussi les différents apports à l’outil Decan. Sont particulièrement détaillées les techniques de préservation des structures de contrôle du programme,ainsi que l’ajout du support pour les programmes parallèles.Finalement, nous effectuons une étude statistique qui permet de vérifier la possibilité d’utiliser des compteurs d’évènements, autres que le temps d’exécution, comme métriques de comparaison entre les variantes Decan / A crucial step in the process of application performance analysis is the accurate detection of program bottlenecks. A bottleneck is any event which contributes to extend the execution time. Determining their cause is important for application developpers as it enable them to detect code design and generation flaws.Bottleneck detection is becoming a difficult art. Techniques such as event counts,which succeeded to find bottlenecks easily in the past, became less efficient because of the increasing complexity of modern micro-processors, and because of the introduction of parallelism at several levels. Consequently, a real need for new analysis approaches is present in order to face these challenges.Our work focuses on performance analysis and bottleneck detection of computeintensive loops in scientific applications. We work on Decan, a performance analysis and bottleneck detection tool, which offers an interesting and promising approach called Decremental Analysis. The tool, which operates at binary level, is based on the idea of performing controlled modifications on the instructions of a loop, and comparing the new version (called variant) to the original one. The goal is to assess the cost of specific events, and thus the existence or not of bottlenecks.Our first contribution, consists of extending Decan with new variants that we designed, tested and validated. Based on these variants, we developed analysis methods which we used to characterize hot loops and find their bottlenecks. Welater, integrated the tool into a performance analysis methodology (Pamda) which coordinates several analysis tools in order to achieve a more efficient application performance analysis.Second, we introduce several improvements on the Decan tool. Techniquesdeveloped to preserve the control flow of the modified programs, allowed to use thetool on real applications instead of extracted kernels. Support for parallel programs(thread and process based) was also added. Finally, our tool primarily relying on execution time as the main concern for its analysis process, we study the opportunity of also using other hardware generated events, through a study of their stability, precision and overhead Analyse de performances Réécriture binaire Analyse dynamique Analyse statique du code Optimization de code Parallélisme Accès mémoire Performance analysis Binary rewriting Dynamic code analysis Static code analysis Code optimization Parallelism Memory accesses Hardware counters

1

Page generated in 0.0654 seconds