11 |
Adaptação dinâmica do número de threads em aplicações paralelas OpenMP para otimizar EDP em sistemas embarcados / Dynamic Adaptation of the Number of Threads for OpenMP Applications in Embedded Systems to Optimize EDP. Schwarzrock, Janaina. January 2018.
Parallel applications usually run with the maximum number of hardware threads available in the system in order to maximize performance. However, this approach is not always the best choice for energy efficiency and, in some cases, may even degrade performance. This work therefore applies dynamic adaptation of the number of threads to optimize the Energy-Delay Product (EDP) of OpenMP parallel applications running on embedded systems. Unlike previous solutions, which focus on General Purpose Processors (GPPs), it takes into account the intrinsic characteristics of embedded systems, which usually have fewer cores and differ significantly from GPPs in microarchitecture and memory hierarchy. Through experiments on a real embedded system with an octa-core processor, the work shows that adapting the number of threads at runtime saves, on average, 15.35% of energy with only a 3.41% performance loss, improving EDP by 12.47% over the default configuration (the maximum number of threads available in the system). In the best case, the dynamic adaptation saved 26.97% in energy while increasing performance by 25.74%, resulting in a 45.77% improvement in EDP.
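To make the optimization target concrete, the sketch below (not taken from the thesis; the energy counter and the kernel are placeholders for whatever the board and application provide) measures EDP = energy * delay for each thread count of an OpenMP kernel and keeps the best one. The thesis performs the adaptation online with a smarter search; this is only the exhaustive, offline version of the idea.

```c
#include <omp.h>
#include <stdio.h>

/* Placeholders: the energy counter and the kernel under study are board- and
 * application-specific, so both names here are purely illustrative. */
extern double read_energy_joules(void);
extern void   parallel_kernel(void);

/* Try every thread count and keep the one minimizing EDP = energy * delay. */
int best_thread_count(int max_threads)
{
    int best = max_threads;
    double best_edp = -1.0;
    for (int t = 1; t <= max_threads; t++) {
        omp_set_num_threads(t);
        double e0 = read_energy_joules();
        double t0 = omp_get_wtime();
        parallel_kernel();                        /* OpenMP region under adaptation */
        double energy = read_energy_joules() - e0;
        double delay  = omp_get_wtime() - t0;
        double edp    = energy * delay;
        printf("%2d threads: E=%.2f J  T=%.3f s  EDP=%.3f\n", t, energy, delay, edp);
        if (best_edp < 0.0 || edp < best_edp) { best_edp = edp; best = t; }
    }
    return best;
}
```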
|
12 |
[en] INTEGRATION OF DYNAMIC WORKLOAD GENERATION SUPPORT INTO THE SAMBA FRAMEWORK / [pt] INTEGRAÇÃO DE SUPORTE PARA GERAÇÃO DE CARGA DINÂMICA AO AMBIENTE DE DESENVOLVIMENTO SAMBA. SERGIO MATEO BADIOLA. 25 October 2005.
In his doctoral thesis, Alexandre Plastino presents SAMBA, a framework for the development of SPMD (Single Program, Multiple Data) parallel applications that can generate different versions of a parallel application by incorporating different load balancing algorithms from an internal library. This dissertation presents a dynamic workload generation tool, integrated into SAMBA, that makes it possible to create, at execution time, different external workload profiles to be applied to the parallel application under study. The goal is to let a parallel application developer select the most appropriate load balancing algorithm based on its performance under varying external workload conditions. To validate the integration of the tool into SAMBA, results were obtained from the execution of two distinct SPMD applications.
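As a rough illustration of what an external workload profile means in practice (this is not the SAMBA tool itself, just a minimal stand-alone sketch), the program below occupies the CPU with a configurable duty cycle per phase, so a parallel application running alongside it experiences a varying amount of external load.

```c
#include <time.h>
#include <stdio.h>

/* One phase of a synthetic external load profile: occupy the CPU with the
 * given duty cycle (0.0 - 1.0) for the given number of seconds. */
struct phase { double duty; double seconds; };

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Alternate short busy and idle slices so average CPU usage tracks `duty`. */
static void run_phase(struct phase p)
{
    const double slice = 0.01;                    /* 10 ms control period */
    double end = now() + p.seconds;
    while (now() < end) {
        double busy_until = now() + slice * p.duty;
        while (now() < busy_until) { /* spin: consumes CPU */ }
        struct timespec idle = { 0, (long)(slice * (1.0 - p.duty) * 1e9) };
        nanosleep(&idle, NULL);
    }
}

int main(void)
{
    /* Example profile: light, heavy, then medium external load. */
    struct phase profile[] = { {0.2, 5}, {0.9, 5}, {0.5, 5} };
    for (unsigned i = 0; i < sizeof profile / sizeof profile[0]; i++)
        run_phase(profile[i]);
    return 0;
}
```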
|
15 |
Performance variation considered helpful / Les variations de performance considérées utiles. Mosli Bouksiaa, Mohamed Said. 26 April 2018.
Understanding the performance of a multi-threaded application is difficult. Threads interfere when they access the same resource, which slows their execution down. Unfortunately, current profiling tools focus on identifying the causes of interference, not its effects. The developer thus cannot tell whether optimizing the interference reported by a profiling tool will lead to better performance. In this thesis, we propose to complete the profiling toolbox with an effect-oriented profiling tool able to indicate how much interference impacts performance, regardless of the cause of that interference. In an evaluation of 27 applications, we show that our tool successfully identifies 12 performance bottlenecks caused by 6 different kinds of interference.
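A hand-rolled example of the effect-oriented idea (not the thesis's tool): instead of merely reporting that a lock is contended, measure how much of each thread's wall-clock time is actually lost waiting for it.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4
#define ITERS    100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double wait_time[NTHREADS];
static long shared_counter;

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *worker(void *arg)
{
    int id = *(int *)arg;
    for (int i = 0; i < ITERS; i++) {
        double t0 = now();
        pthread_mutex_lock(&lock);      /* time spent here is the interference effect */
        wait_time[id] += now() - t0;
        shared_counter++;               /* the protected "work" */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    int ids[NTHREADS];
    double start = now();
    for (int i = 0; i < NTHREADS; i++) { ids[i] = i; pthread_create(&th[i], NULL, worker, &ids[i]); }
    for (int i = 0; i < NTHREADS; i++) pthread_join(th[i], NULL);
    double total = now() - start;
    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d: %.1f%% of wall time blocked on the lock\n",
               i, 100.0 * wait_time[i] / total);
    return 0;
}
```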
|
16 |
A Domain Specific Language Based Approach for Generating Deadlock-Free Parallel Load Scheduling Protocols for Distributed Systems. Adhikari, Pooja. 11 May 2013.
In this dissertation, the concept of using a domain specific language to develop error-free parallel asynchronous load scheduling protocols for distributed systems is studied. The motivation of this study is rooted in addressing the high cost of verifying parallel asynchronous load scheduling protocols. Asynchronous parallel applications are prone to subtle bugs such as deadlocks and race conditions because of the possibility of non-determinism. Due to this non-deterministic behavior, traditional testing methods are less effective at finding software faults. One approach that can eliminate these software bugs is to employ model checking techniques that can verify that non-determinism will not cause software faults in parallel programs. Unfortunately, model checking requires the development of a verification model of a program in a separate verification language, which can be an error-prone procedure and may not properly represent the semantics of the original system. The model checking approach can provide true positive results only if the semantics of the implementation code and the verification model are represented under a single framework, such that the verification model closely reflects the implementation and the automation of the verification process is natural. In this dissertation, a domain specific language based verification framework is developed to design parallel load scheduling protocols and automatically verify their behavioral properties through model checking. A specification language, LBDSL, is introduced that facilitates the development of parallel load scheduling protocols. The LBDSL verification framework uses model checking techniques to verify the asynchronous behavior of the protocol, and it allows the same protocol specification to be used for both verification and code generation. The support for automatic verification during protocol development reduces the verification cost after development. The applicability of the LBDSL verification framework is illustrated through case studies on three different types of load scheduling protocols. The study shows that the LBDSL-based verification approach removes the need to debug deadlocks and race bugs, which has the potential to significantly lower software development costs.
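The kind of nondeterministic fault the dissertation targets is easy to state but hard to test for. The standard-MPI sketch below (illustrative only; LBDSL's own notation is not reproduced here) runs fine for small messages, because most MPI libraries buffer them eagerly, but deadlocks once the payload exceeds the eager threshold and both blocking sends wait for a matching receive.

```c
#include <mpi.h>
#include <stdlib.h>

/* Both ranks send first, then receive.  Small messages are buffered eagerly
 * and the exchange "works"; above the eager threshold both MPI_Send calls
 * block waiting for a matching receive and the program deadlocks.  The bug
 * only shows up for some message sizes and some MPI configurations, exactly
 * the nondeterministic behavior that model checking is meant to catch. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, n = 1 << 20;                 /* large payload: likely rendezvous mode */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;                   /* assumes exactly 2 ranks */
    int *out = malloc(n * sizeof(int));
    int *in  = malloc(n * sizeof(int));
    MPI_Send(out, n, MPI_INT, peer, 0, MPI_COMM_WORLD);                    /* may block */
    MPI_Recv(in,  n, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    free(out);
    free(in);
    MPI_Finalize();
    return 0;
}
```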
|
17 |
High Performance Applications for the Single-Chip Message-Passing Parallel Computer. Dickenson, William Wesley. 05 May 2004.
Computer architects continue to push the limits of modern microprocessors. Using techniques such as out-of-order execution, branch prediction, and dynamic scheduling, designers have found ways to speed up execution. However, growing architectural complexity has led to unsustainable development and testing times. Shrinking feature sizes are increasing wire resistance and signal propagation delays, thereby limiting a design's scalability. Indeed, exploiting instruction-level parallelism (ILP) within applications is reaching a point of diminishing returns.
One approach to the aforementioned challenges is the Single-Chip Message-Passing (SCMP) Parallel Computer, developed at Virginia Tech. SCMP is a unique, tiled architecture aimed at thread-level parallelism (TLP). Identical cores are replicated across the chip, and global wire traces have been eliminated. The nodes are connected via a 2-D grid network and each contains a local memory bank.
This thesis presents the design and analysis of three high-performance applications for SCMP. The results show that the architecture is a formidable competitor to several current systems. / Master of Science
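A hedged illustration (not from the thesis) of why placement matters on such a tiled design: with dimension-ordered routing on a 2-D grid, the hop count between two row-major-numbered nodes is the Manhattan distance between their tiles, so co-locating communicating threads on neighbouring tiles directly reduces message latency.

```c
#include <stdlib.h>

/* Nodes on a width x height mesh are numbered row-major.  With dimension-
 * ordered (XY) routing, a message from src to dst crosses |dx| + |dy| links.
 * Example: mesh_hops(0, 15, 4) == 6 on a 4x4 grid. */
static int mesh_hops(int src, int dst, int width)
{
    int dx = abs(src % width - dst % width);
    int dy = abs(src / width - dst / width);
    return dx + dy;
}
```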
|
18 |
Agrégation spatiotemporelle pour la visualisation de traces d'exécution / Spatiotemporal Aggregation for Execution Trace Visualization. Dosimont, Damien. 10 June 2015.
Trace visualization techniques are commonly used by developers to understand, debug, and optimize their applications. Most analysis tools rely on spatiotemporal representations, composed of a time axis and a representation of the resources involved in the application's execution, which link the dynamics of the application to its structure or topology. However, these representations suffer from scalability issues: faced with traces of several gigabytes containing more than a million events, they cannot provide an overview, because of screen-size constraints, the performance required for efficient interaction, and the analyst's perceptive and cognitive limitations. Such an overview is nevertheless necessary as an entry point to the analysis, as recommended by Shneiderman's mantra ("Overview first, zoom and filter, then details-on-demand"), a guideline that helps design a visual analysis method. To face this situation, this thesis elaborates two scalable, visualization-based analysis methods, one temporal and one spatiotemporal. They integrate all the steps of Shneiderman's mantra, in particular by providing the analyst with a synthetic overview of the trace. Both methods are built on an aggregation technique that reduces the complexity of the representation while preserving as much information as possible; complexity reduction and information loss are both expressed using measures from information theory. The parts of the system to aggregate are chosen by satisfying a trade-off between these two measures, whose respective weights are adjusted by the analyst to select a level of detail. Resolving this trade-off exposes the heterogeneity in the behavior of the entities that compose the analyzed system, which helps to detect anomalies in embedded multimedia applications and in parallel applications running on a computing grid. These techniques are implemented in Ocelotl, an analysis tool developed during this thesis and designed to handle traces of up to several billion events. Ocelotl also offers effective interactions that fit a top-down analysis strategy, such as synchronizing the aggregated views with more detailed representations, in order to track down the sources of anomalies.
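One plausible way to write the trade-off the abstract describes (a reconstruction for illustration, not a formula quoted from the thesis): among the candidate partitions P of the system's parts, pick the one that maximizes a user-weighted compromise between complexity reduction and information loss.

```latex
% p in [0,1] is the analyst's level-of-detail knob (assumes amsmath).
\[
  P^{\star}(p) \;=\; \operatorname*{arg\,max}_{P}\;
      p \,\mathrm{gain}(P) \;-\; (1 - p)\,\mathrm{loss}(P)
\]
% gain(P):  reduction in representation complexity (e.g. how many fewer
%           blocks the aggregated view contains);
% loss(P):  information destroyed by aggregating, e.g. the Kullback--Leibler
%           divergence between the aggregated view and the original data.
% p -> 1 yields a coarse overview, p -> 0 preserves detail; sweeping p gives
% the "level of detail" adjustment mentioned in the abstract.
```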
|
19 |
Analyse des synchronisations dans un programme parallèle ordonnancé par vol de travail. Applications à la génération déterministe de nombres pseudo-aléatoires. / Analysis of Synchronizations in Greedy-Scheduled Executions - Application to Efficient Generation of Pseudorandom Numbers in Parallel. Mor, Stefano Drimon Kurz. 26 October 2015.
We present two contributions to the field of parallel programming. The first contribution is theoretical: we introduce SIPS analysis, a novel approach to estimating the number of synchronization operations performed during the work-stealing execution of a parallel algorithm. Based on the concept of logical clocks, it allows us, on the one hand, to derive new bounds on the expected number of synchronizations and, on the other hand, to design more efficient parallel programs through dynamic adaptation of the granularity. The second contribution is pragmatic: we present an efficient parallelization strategy for deterministic pseudorandom number generation, independent of the number of concurrent processes participating in the computation. As an alternative to using one sequential generator per process, we introduce a generic API called Par-R, which is designed and analyzed using SIPS. Its main characteristic is the use of a sequential generator that can "jump ahead" directly from one number to another at an arbitrary distance within the pseudorandom sequence. Thanks to the SIPS analysis, we show that, in expectation, during a work-stealing execution of a highly parallel program (whose depth, or critical path, is small compared to the work, or total number of operations), these jump-ahead operations are rare. Par-R is compared with the parallel pseudorandom number generator DotMix, written for Cilk Plus, a C/C++ extension for work-stealing parallel programming. The theoretical overhead of Par-R compares favorably to DotMix's overhead, which is confirmed experimentally; moreover, being generic, Par-R is independent of the underlying sequential generator.
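To see why jumping ahead can be cheap, here is a minimal sketch (illustrative only; Par-R is generic and DotMix works differently) for a 64-bit linear congruential generator: advancing n steps is the n-th power of an affine map, computable in O(log n) by square-and-multiply, so a stealing thread can reposition its stream without replaying it.

```c
#include <stdint.h>

/* The LCG step x -> a*x + c (mod 2^64), represented as an affine map. */
typedef struct { uint64_t a, c; } affine_t;

/* Composition of two affine maps: (f o g)(x) = f.a*(g.a*x + g.c) + f.c. */
static affine_t compose(affine_t f, affine_t g)
{
    affine_t r = { f.a * g.a, f.a * g.c + f.c };   /* wrap-around gives mod 2^64 */
    return r;
}

/* Jump n steps ahead from state x in O(log n) instead of O(n). */
uint64_t lcg_jump(uint64_t x, uint64_t a, uint64_t c, uint64_t n)
{
    affine_t acc  = { 1, 0 };                      /* identity map */
    affine_t step = { a, c };                      /* one generator step */
    while (n) {
        if (n & 1)
            acc = compose(step, acc);
        step = compose(step, step);                /* square: step^(2k) */
        n >>= 1;
    }
    return acc.a * x + acc.c;                      /* apply step^n to x */
}
```

For instance, lcg_jump(x, a, c, 1) returns a*x + c, and lcg_jump(x, a, c, 2) returns a*(a*x + c) + c, matching two sequential steps.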
|
20 |
Optimisation des transferts de données sur systèmes multiprocesseurs sur puce / Optimizing Data Transfers for Multiprocessor Systems on Chips. Saidi, Selma. 24 October 2012.
Multiprocessor systems on chip (MPSoC), such as the CELL processor or the more recent Platform 2012, are heterogeneous multi-core architectures with a powerful host processor and a computation fabric consisting of several smaller cores whose intended role is to act as a general-purpose programmable accelerator. Computation-intensive (and parallelizable) parts of an application initially intended to be executed by the host processor are therefore offloaded to the multi-core fabric for execution. These parts of the application are often data intensive, operating on large arrays of data initially stored in a remote off-chip memory whose access time is roughly 100 times slower than that of the cores' local memories. Accessing data in the off-chip memory then becomes a major performance bottleneck. A key characteristic of these platforms is a software-controlled local memory rather than a hidden cache mechanism: data movements in the memory hierarchy, typically performed using a DMA (Direct Memory Access) engine, are explicitly managed by software. In this thesis, we attempt to optimize such data transfers in order to reduce or hide the off-chip memory latency.
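A common way to hide off-chip latency with a software-managed local memory is double buffering: while the core processes chunk i in one local buffer, the DMA engine fills the other buffer with chunk i+1. The sketch below is generic; dma_get_async and dma_wait are hypothetical placeholder names, not the API of any particular SDK, backed here by a synchronous memcpy fallback so the code is self-contained.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical asynchronous-copy primitives (placeholder names).  The
 * fallback definitions simply copy synchronously so the sketch compiles. */
static void dma_get_async(void *dst, const void *src, size_t bytes, int tag)
{
    (void)tag;
    memcpy(dst, src, bytes);
}
static void dma_wait(int tag) { (void)tag; }

#define CHUNK 1024

/* Double buffering: overlap the transfer of chunk i+1 with the processing
 * of chunk i, so the cores rarely stall on the slow off-chip memory. */
void scale_array(const float *off_chip, float *out, size_t n, float k)
{
    static float buf[2][CHUNK];
    size_t chunks = n / CHUNK;
    if (chunks == 0)
        return;
    dma_get_async(buf[0], off_chip, CHUNK * sizeof(float), 0);   /* prefetch chunk 0 */
    for (size_t i = 0; i < chunks; i++) {
        int cur = (int)(i & 1), nxt = cur ^ 1;
        if (i + 1 < chunks)                                       /* prefetch chunk i+1 */
            dma_get_async(buf[nxt], off_chip + (i + 1) * CHUNK,
                          CHUNK * sizeof(float), nxt);
        dma_wait(cur);                                            /* chunk i is now local */
        for (size_t j = 0; j < CHUNK; j++)
            out[i * CHUNK + j] = k * buf[cur][j];
        /* write-back kept synchronous for brevity; it could also use DMA */
    }
}
```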
|