131 |
Insightful Performance Analysis of Many-Task Runtimes through Tool-Runtime Integration. Chaimov, Nicholas, 06 September 2017.
Future supercomputers will require application developers to expose far more parallelism than current applications do. To help developers structure their applications so that this is possible, new programming models and libraries, the many-task runtimes, are emerging to allow the expression of orders of magnitude more parallelism than existing models.
This dissertation describes the challenges that these emerging many-task runtimes pose for performance analysis, and proposes deep integration between runtimes and performance tools as a means of producing correct, insightful, and actionable performance results. I show how tool-runtime integration can be used to aid programmer understanding of performance characteristics and to provide online performance feedback to the runtime for Unified Parallel C (UPC), High Performance ParalleX (HPX), Apache Spark, the Open Community Runtime, and the OpenMP runtime.
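As a concrete taste of what tool-runtime integration looks like for one of these runtimes, the sketch below uses the OpenMP tools interface (OMPT, standardized in OpenMP 5.0). It is a minimal sketch, assuming an OMPT-capable OpenMP runtime; the tool is built as a shared library and announced to the runtime through the OMP_TOOL_LIBRARIES environment variable, and it registers only a single parallel-region callback:

    /* Minimal OMPT tool sketch, assuming an OpenMP 5.0 runtime with OMPT
       support. Build as a shared library, load via OMP_TOOL_LIBRARIES. */
    #include <stdio.h>
    #include <omp-tools.h>

    static void on_parallel_begin(ompt_data_t *encountering_task_data,
                                  const ompt_frame_t *encountering_task_frame,
                                  ompt_data_t *parallel_data,
                                  unsigned int requested_parallelism,
                                  int flags, const void *codeptr_ra) {
        printf("parallel region begins, %u threads requested\n",
               requested_parallelism);
    }

    static int tool_initialize(ompt_function_lookup_t lookup,
                               int initial_device_num,
                               ompt_data_t *tool_data) {
        /* Ask the runtime for its callback-registration entry point. */
        ompt_set_callback_t set_callback =
            (ompt_set_callback_t) lookup("ompt_set_callback");
        set_callback(ompt_callback_parallel_begin,
                     (ompt_callback_t) on_parallel_begin);
        return 1;  /* nonzero keeps the tool active */
    }

    static void tool_finalize(ompt_data_t *tool_data) {}

    /* The OpenMP runtime looks this symbol up at startup. */
    ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                              const char *runtime_version) {
        static ompt_start_tool_result_t result = {
            &tool_initialize, &tool_finalize, {0}
        };
        return &result;
    }

The design point this illustrates is the one the dissertation argues for: the tool does not sample the application from outside but is called by the runtime itself at semantically meaningful events.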
|
132 |
MPI Performance Engineering with the MPI Tools Information Interface. Ramesh, Srinivasan, 06 September 2018.
The desire for high performance on scalable parallel systems is increasing the complexity of MPI implementations and the need to tune them. The MPI Tools Information Interface (MPI_T), introduced in the MPI 3.0 standard, gives performance tools and external software an opportunity to introspect and understand MPI runtime behavior at a deeper level and to detect scalability issues. The interface also provides a mechanism to fine-tune the performance of the MPI library dynamically at runtime.
This thesis describes the motivation, design, and challenges involved in developing an MPI performance engineering infrastructure using MPI_T for two performance toolkits: the TAU Performance System and Caliper. I validate the design of the infrastructure for TAU by developing optimizations for production and synthetic applications, and I show that the MPI_T runtime introspection mechanism in Caliper enables a meaningful analysis of performance data.
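For concreteness, a minimal sketch of the introspection side of MPI_T is shown below. It only enumerates the control variables exposed by whatever MPI library it is linked against (variable names and counts differ across implementations); this is the kind of raw information a toolkit like TAU or Caliper builds on:

    /* Minimal MPI_T sketch: list the control variables an MPI library
       exposes. Compile and run like any MPI program. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, ncvars;

        /* MPI_T may be initialized before (and independently of) MPI_Init. */
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        MPI_T_cvar_get_num(&ncvars);
        for (int i = 0; i < ncvars; i++) {
            char name[256], desc[256];
            int namelen = sizeof name, desclen = sizeof desc;
            int verbosity, binding, scope;
            MPI_Datatype datatype;
            MPI_T_enum enumtype;

            MPI_T_cvar_get_info(i, name, &namelen, &verbosity, &datatype,
                                &enumtype, desc, &desclen, &binding, &scope);
            printf("cvar %3d: %s - %s\n", i, name, desc);
        }

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }

Performance variables (pvars) follow the same enumeration pattern but are read through explicit sessions and handles; that is the mechanism runtime introspection relies on.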
This thesis includes previously published co-authored material.
|
133 |
Anahy-DVM: um módulo para escalonamento distribuído / Anahy-DVM: a module for distributed scheduling. Cardozo Junior, Marcelo Augusto, 14 March 2006.
The use of computer clusters for high-performance computing has been increasing. However, programming this class of architecture is not trivial: besides developing the application and detecting and expressing the concurrency within it, the programmer is also responsible for implementing the scheduling of the application so that it effectively uses the parallelism of the cluster. Tools exist that propose solutions to these problems; the Anahy programming tool is one of them. This work implements a module for Anahy that provides support for execution in distributed-memory environments. To this end, the Anahy execution core is extended so that the module can access the data structures required to distribute the computational load, and a communication mechanism is developed among the nodes of the cluster so that they can exchange the information needed for the computation to progress. Finally, the module is evaluated using a synthetic application, through which its usability is analyzed.
|
134 |
Exploring the dynamic radio sky with many-core high-performance computing. Malenta, Mateusz, January 2018.
As new radio telescopes and processing facilities are built, the amount of data that has to be processed grows continuously. This poses significant challenges, especially when real-time processing is required, which is important for surveys looking for poorly understood objects, such as Fast Radio Bursts (FRBs), where quick detection and localisation can enable rapid follow-up observations at different frequencies. With data rates increasing all the time, new processing techniques using the newest hardware, such as GPUs, have to be developed. A new pipeline, called PAFINDER, has been developed to process data taken with a phased array feed, which can generate up to 36 beams on the sky with data rates of 25 GBps per beam. With the majority of the work done on GPUs, the pipeline reaches real-time performance when generating filterbank files used for offline processing. Full real-time processing, including single-pulse searches, has also been implemented and has been shown to perform well under favourable conditions. The pipeline was successfully used to record and process observations of RRAT J1819-1458 and of positions on the sky where three FRBs had been observed previously, including the repeating FRB 121102. Detailed examination of the J1819-1458 single-pulse detections revealed a complex emission environment, with pulses coming from three different rotation-phase bands and a number of multi-component emissions. No new FRBs and no repeated bursts from FRB 121102 were detected. The GMRT High Resolution Southern Sky survey observes the sky at high galactic latitudes, searching for new pulsars and FRBs; 127 hours of data were searched for new bursts with the help of a new pipeline developed for this survey. No new FRBs were found, possibly as a result of RFI pollution that was not fully removed despite new mitigation techniques being developed and combined with existing solutions. Using the best estimates of the total amount of data processed correctly, obtained with new single-pulse simulation software, no detections were found, consistent with the expected rates for standard-candle FRBs with a flat or positive spectrum.
|
135 |
Study of load distribution measures for high-performance applications / Estudos de medidas de distribuição de carga para aplicação de alto desempenho. Rodrigues, Flavio Alles, January 2016.
Load balance is essential for parallel applications to perform adequately. As parallel systems grow, the cost of a poor load distribution increases in tandem. However, the dynamic behavior that the computational load exhibits in certain applications can induce disparities in the load assigned to each resource. The process of repeatedly redistributing load as execution progresses is therefore critical for solving large-scale problems with such characteristics, and measures that quantify load distribution are an important aspect of this procedure. For these reasons, metrics commonly used as load-distribution indicators in parallel applications are investigated in this study. Given that load balancing is a dynamic and recurrent process, the investigation examines how these metrics quantify load distribution at regular intervals during the execution of a parallel application. Six metrics are evaluated: percent imbalance, imbalance percentage, imbalance time, standard deviation, skewness, and kurtosis. The analysis reveals the virtues and deficiencies of each metric, as well as the differences between them as descriptors of load distribution in parallel applications. To our knowledge, such an investigation is unprecedented in the literature.
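For reference, the first three metrics are commonly defined in the load-balance literature as follows; this is a sketch of the usual definitions, and the thesis may normalize differently. With $L_i$ the load of resource $i$ among $n$ resources, $L_{\max} = \max_i L_i$ and $\bar{L} = \frac{1}{n}\sum_i L_i$:

\[
\text{percent imbalance} = \left(\frac{L_{\max}}{\bar{L}} - 1\right) \times 100,
\qquad
\text{imbalance time} = L_{\max} - \bar{L},
\qquad
\text{imbalance percentage} = \frac{L_{\max} - \bar{L}}{L_{\max}} \cdot \frac{n}{n-1} \times 100.
\]

Skewness and kurtosis are the standard third and fourth standardized moments of the distribution of the $L_i$, describing its asymmetry and tail weight respectively.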
|
136 |
Contribution à la parallélisation et au passage à l'échelle du code FLUSEPA / Contributions to the parallelization and the scalability of the FLUSEPA code. Couteyen Carpaye, Jean Marie, 19 September 2016.
Satellites provide services such as communication, navigation, and observation. They are put into orbit by launchers whose design is one of the main activities of Airbus Defence and Space. Relying on experiments alone is not practical: wind tunnels cannot reproduce every critical situation a launcher faces during its mission, so numerical simulation is essential for the space industry. Ever more faithful simulations require ever more powerful supercomputers, but these machines are growing in complexity, and existing codes must be adapted to exploit their full potential; relying on abstraction layers now seems essential to ensure good performance portability. Airbus Defence and Space has developed the FLUSEPA code for more than 20 years to compute unsteady phenomena such as take-off blast waves and stage separations. The aerodynamic solver is based on a finite-volume formulation and an explicit temporal adaptive integration technique; bodies in relative motion are taken into account through several overlapping meshes combined by intersection. This thesis concerns the parallelization of the FLUSEPA code. At the start of the thesis, the only parallel version available used shared memory through OpenMP. A first distributed-memory version, relying on MPI and OpenMP, was developed, and its performance gains were evaluated on two industrial test cases. A task-based demonstrator of the aerodynamic solver, running on top of a runtime system, was also realized.
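The abstract does not name the runtime system underneath the task-based demonstrator, so the following is only a hedged sketch of the programming style it implies, written with OpenMP task dependences; update_block, cells, and halo are hypothetical names standing in for per-block solver data:

    /* Sketch of a task-based block-update loop; compile with: gcc -fopenmp */
    #include <stdio.h>

    #define NBLOCKS 8

    typedef struct { double value; } block_t;  /* hypothetical per-block state */

    /* Placeholder for a real finite-volume update of one mesh block. */
    static void update_block(block_t *cell, const block_t *halo) {
        cell->value += 0.5 * halo->value + 1.0;
    }

    int main(void) {
        block_t cells[NBLOCKS] = {{0.0}}, halo[NBLOCKS] = {{2.0}};

        #pragma omp parallel
        #pragma omp single
        for (int b = 0; b < NBLOCKS; b++) {
            /* depend clauses declare what each task reads and writes,
               so the runtime may run independent block updates concurrently. */
            #pragma omp task depend(in: halo[b]) depend(inout: cells[b])
            update_block(&cells[b], &halo[b]);
        }   /* the parallel region's implicit barrier waits for all tasks */

        for (int b = 0; b < NBLOCKS; b++)
            printf("block %d: %.2f\n", b, cells[b].value);
        return 0;
    }

The appeal of this style for a solver like FLUSEPA is that scheduling decisions move from the application into the runtime, which can overlap independent block updates with communication.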
|
137 |
Hybrid MPI - uma implementação MPI para ambientes distribuídos híbridos / Hybrid MPI - an MPI implementation for hybrid distributed systems. Massetto, Francisco Isidro, 04 October 2007.
The development of high-performance applications continues to grow. So does the diversity of machine architectures, including uniprocessor and multiprocessor machines, clusters with or without a front-end node, and a variety of operating systems and MPI implementations. In this scenario, libraries that integrate different MPI implementations, operating systems, and machine architectures are needed. This thesis introduces HyMPI, an MPI implementation aimed at integrating, in a single distributed high-performance environment, nodes with different architectures, clusters with or without a front-end machine, operating systems, and MPI implementations. HyMPI offers a set of primitives compatible with the MPI specification, including point-to-point communication, collective operations, startup and termination, and other utility functions.
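Because HyMPI keeps the interface of the MPI specification, application code looks like ordinary MPI. A minimal exchange such as the one below (standard MPI calls only, nothing HyMPI-specific; at least two ranks assumed) is the kind of program it aims to run unchanged across heterogeneous nodes:

    /* Minimal point-to-point and collective example; run with >= 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        /* Collectives span the whole system, whatever mix of nodes it holds. */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }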
|
138 |
Code profiling and optimization in transactional memory systems / Profiling e otimização de código em sistemas de memória transacional. Cordeiro, Silvio Ricardo, January 2014.
Transactional Memory has shown itself to be a promising paradigm for implementing shared-memory concurrent applications that eschew a lock-based model of data synchronization. Rather than conditioning exclusive access on the value of a lock shared across concurrent threads, Transactional Memory attempts to execute critical sections optimistically, rolling back modifications in the event of a data-access conflict. However, while the lock-based approach has acquired a significant body of debugging, profiling, and automated optimization tools (as one of the oldest and most researched synchronization techniques), the field of Transactional Memory is still comparatively recent, and programmers are usually tasked with unguided manual tuning of their transactional applications when facing efficiency problems. We propose a system in which code profiling in a simulated hardware implementation of Transactional Memory is used to characterize a transactional application, forming the basis for automated tuning of the underlying speculative system for efficient execution of that particular application. We also propose a profile-guided approach to thread scheduling in a software-based implementation of Transactional Memory, using collected data to predict the likelihood of conflicts and to determine which thread to schedule based on this prediction. We present the results achieved under both designs.
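The thesis targets its own simulated and software TM systems; purely as a hedged illustration of the optimistic programming model itself, GCC's transactional-memory extension (enabled with -fgnu-tm) marks a critical section as a transaction that the TM runtime retries on conflict:

    /* Compile with: gcc -fgnu-tm tm_sketch.c -o tm_sketch */
    #include <stdio.h>

    static long account_a = 100, account_b = 0;

    /* The block runs optimistically; on a conflicting access by another
       thread, the TM runtime rolls the modifications back and retries. */
    static void transfer(long amount) {
        __transaction_atomic {
            account_a -= amount;
            account_b += amount;
        }
    }

    int main(void) {
        transfer(25);
        printf("a = %ld, b = %ld\n", account_a, account_b);
        return 0;
    }

Profiling such a system amounts to recording, per transaction, how often blocks like this commit versus abort; that abort rate is exactly the signal the proposed tuning and scheduling decisions consume.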
|
139 |
Dynamic superscalar grid for technical debt reduction. Killian, Rudi, January 2018.
Thesis (MTech (Information Technology)), Cape Peninsula University of Technology, 2018. / Organizations and private individuals look to technology advancements to increase their ability to make informed decisions, with technology adoption motivated by the need to generate value. The technology currently heralded as the future platform for value addition is popularly termed cloud computing. The move to cloud computing, however, may conceivably accelerate the obsolescence cycle for currently retained Information Technology (IT) assets, where obsolescence means the inability to repurpose or scale an information system resource for needed functionality. The incapacity to reconfigure, grow, or shrink an IT asset, be it hardware or software, is a well-known narrative of technical debt. Emergent technical debt appears all but inevitable in light of Moore's Law, as technology must inexorably advance; of more imminent concern, however, is that major accelerating factors of technical debt are non-holistic conceptualization and design conventions. Should the management of IT assets fail to address technical debt continually, the technology platform would predictably require replacement; the unrealized value, the functional and fiscal loss, and the resultant e-waste generated by technical debt are decidedly unattractive. Historically, the cloud milieu evolved from the grid and clustering paradigms, which allowed information sourcing across multiple, often dispersed computing platforms. Parallel operations in distributed environments are inherently value-adding, as they enable more effective use of resources and more efficient data handling. The predominant information-processing solutions that implement parallel operations in distributed environments are abstracted constructs styled as High Performance Computing (HPC) or High Throughput Computing (HTC). Regardless of the underlying distributed environment, the archetypes of HPC and HTC differ radically in standard implementation; their foremost contrasting factors, parallelism granularity, failover, and locality in data handling, have recently been the subject of academic discourse on a possible fusion of the two technologies. In this research, we identify probable platforms of future technical debt and recommend redeployment alternatives in the form of scalable grids, aligned with the contemporary nature of individual information-processing needs. The potential of grids as efficient and effective information-sourcing solutions across geographically dispersed heterogeneous systems is envisioned to reduce or delay aspects of technical debt. As part of an experimental investigation to test the plausibility of these concepts, artefacts are designed to generically implement HPC and HTC; the design features exposed by the experimental artefacts could provide insights towards an amalgamation of HPC and HTC.
|
140 |
Performance evaluation of code optimizations in FPGA accelerators. Leite, Gustavo, January 2019.
Advisor: Alexandro José Baldassin. / With the ever-increasing power wall in microprocessor design, scientists and engineers have shifted their attention to heterogeneous architectures, in which several classes of devices are used for different kinds of computation. Among them are FPGAs (Field-Programmable Gate Arrays), whose hardware can be reconfigured after manufacturing. These devices can offer performance comparable to CPUs while consuming only a fraction of the energy, and their use has proliferated in recent years, with adoption expected to keep growing. Still, programming FPGAs and engineering programs for performance remain hard. This work presents a compilation of the most prominent code transformations for optimizing code aimed at FPGAs, and evaluates the performance of programs running on FPGAs. More specifically, a subset of the code transformations is applied to an OpenCL kernel and execution times are measured on an Intel® FPGA. The results show that, without applying these transformations before execution, performance is poor and the devices are underutilized.
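The abstract does not say which transformations were measured; as one hedged example of the kind of transformation involved, Intel's FPGA OpenCL compiler honors #pragma unroll, which replicates a loop body in hardware to deepen the pipeline. The kernel below is a hypothetical single-work-item reduction written in OpenCL C, not the thesis's benchmark:

    // Hypothetical single-work-item OpenCL kernel for an Intel FPGA.
    // #pragma unroll 4 asks the offline compiler to process four elements
    // per pipeline iteration, trading chip area for throughput.
    __kernel void sum_reduce(__global const float *restrict in,
                             __global float *restrict out,
                             const int n)
    {
        float acc = 0.0f;
        #pragma unroll 4
        for (int i = 0; i < n; i++)
            acc += in[i];
        out[0] = acc;
    }

Unrolling is typically combined with other transformations from the compilation presented in the work, such as memory coalescing and the use of restrict-qualified pointers shown above, which helps the compiler prove the absence of aliasing.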
|