291

Herramientas para el soporte de análisis de rendimiento / Tools to Support Performance Analysis

More, Andres 07 October 2013 (has links)
This document describes research carried out as the final work for the Specialization in High-Performance Computing taught at the Facultad de Informática of the Universidad Nacional de La Plata. The research topic is methods and tools for analyzing the behavior of high-performance applications. The work contributes a summary of performance-analysis theory together with a survey of the support tools available at the time. It also proposes a process for performance analysis, illustrating its application on a set of non-trivial compute kernels. After introducing the terminology and theoretical foundations of quantitative performance analysis, it details the experience of using tools to determine where optimization efforts should be focused. The work summarizes the path any researcher must follow when exploring the alternatives for performance analysis, including the selection of support tools and the definition of a systematic optimization procedure.
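A first step in a systematic procedure like the one described above is simply measuring where time goes before reaching for a full profiler. The following minimal sketch is illustrative only (the kernel name and problem sizes are assumptions, not taken from the thesis); it times a candidate hotspot with std::chrono:

#include <chrono>
#include <cstdio>
#include <vector>

// Toy compute kernel standing in for a real hotspot.
void stencil_sweep(std::vector<double>& a) {
    for (std::size_t i = 1; i + 1 < a.size(); ++i)
        a[i] = 0.5 * (a[i - 1] + a[i + 1]);
}

int main() {
    std::vector<double> a(1 << 20, 1.0);
    const int iterations = 100;
    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) stencil_sweep(a);
    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double> dt = t1 - t0;
    std::printf("stencil_sweep: %.3f s total, %.3f ms per sweep\n",
                dt.count(), 1e3 * dt.count() / iterations);
    return 0;
}

Wall-clock timings like this only locate candidates; dedicated support tools are then needed to attribute the cost to specific code paths.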
292

Extension of the SkePU Skeleton Programming Framework for Multi-core CPU and Multi-GPU Systems for MPI-based Clusters

Mangaraj, Swadhin K January 2013 (has links)
SkePU (Skeleton Programming Framework for Multi-core CPU and Multi-GPU Systems) is a parallel computing framework developed by Johan Enmyren and Christoph Kessler at Linköpings Universitet. This C++ template library provides a simple and unified interface for specifying data-parallel computations with the help of skeletons and targets multiple backends, e.g. a sequential CPU, parallel CPUs using MPI and OpenMP, or GPUs using CUDA and OpenCL. SkePU comprises seven data-parallel skeletons and one task-parallel skeleton, and these skeletons use two types of containers, vector and matrix, to model real-life parallel applications. In this thesis, we extend the SkePU framework's matrix container (which stores 2-D data values) so that the existing skeletons can be used efficiently to develop parallel scientific applications on large-scale clusters using MPI. The work focuses on distributing the matrix among the participating processes, which, after receiving their share of the data, execute the application in parallel. It covers all seven data-parallel skeletons, and each skeleton has been tested with a small application program. In addition to measuring the performance improvement in the application programs' execution times, we have also carried out a communication cost analysis for all skeletons with MPI using the LogGP model. To evaluate and test the operational efficiency of the extension, we considered a PDE solver application. Through this application, we demonstrate the performance gain and scalability of the extended framework. The performance improvement was greater when the computational load dominated the memory I/O operations. The results show that the extension is a viable approach for implementing real-life parallel applications on large-scale clusters.
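The core idea of such an extension, distributing a matrix container by row blocks and letting each process run a skeleton over its share, can be sketched with plain MPI as follows. This is not the SkePU API, only an illustration of the distribution pattern; the matrix sizes and the element-wise operation are assumptions:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int rows = 8, cols = 4;                 // illustrative sizes
    std::vector<double> matrix;                   // full matrix on root only
    if (rank == 0) matrix.assign(rows * cols, 1.0);

    // Element counts and displacements for a (possibly uneven) row-block split.
    std::vector<int> counts(size), displs(size);
    for (int r = 0, off = 0; r < size; ++r) {
        int nrows = rows / size + (r < rows % size ? 1 : 0);
        counts[r] = nrows * cols;
        displs[r] = off;
        off += counts[r];
    }

    std::vector<double> local(counts[rank]);
    MPI_Scatterv(matrix.data(), counts.data(), displs.data(), MPI_DOUBLE,
                 local.data(), counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (double& x : local) x = 2.0 * x + 1.0;    // element-wise "map" step

    MPI_Gatherv(local.data(), counts[rank], MPI_DOUBLE, matrix.data(),
                counts.data(), displs.data(), MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

A row-block layout keeps each process's share contiguous, which simplifies collective distribution of the container's data.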
293

Résolution de systèmes linéaires et non linéaires creux sur grappes de GPUs / Solving Sparse Linear and Nonlinear Systems on GPU Clusters

Ziane Khodja, Lilia 07 June 2013 (has links) (PDF)
In recent years, clusters equipped with GPUs have become very attractive tools for high-performance parallel computing. In this thesis, we designed parallel iterative algorithms for solving very large sparse linear and nonlinear systems on GPU clusters. We first focused on solving sparse linear systems with the CG and GMRES iterative methods. The experiments showed that a GPU cluster outperforms its CPU counterpart when solving very large linear systems. We then implemented synchronous and asynchronous parallel algorithms of the Richardson and block-relaxation iterative methods for solving sparse nonlinear systems. We found that the best solutions developed for CPUs are not necessarily well suited to GPUs. Indeed, the simulations carried out on a GPU cluster showed that the Richardson algorithms are far more efficient than the block-relaxation ones. Moreover, they also showed that the computing power of GPUs reduces the ratio of execution time to communication time, which favors the use of asynchronous algorithms on GPU clusters. Finally, we addressed geographically distant clusters for solving sparse linear systems. In that context, we used a two-level multisplitting method with parallel GMRES adapted to GPU clusters: it uses synchronous iterations to solve the linear subsystems locally and asynchronous iterations to solve the linear system as a whole.
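The ratio between computation and communication discussed above already shows up in the structure of a distributed CG iteration: the local vector work can run on a GPU, but each iteration still needs global reductions. The sketch below (plain MPI on the CPU, toy data; not the thesis implementation) shows those reductions:

#include <mpi.h>
#include <numeric>
#include <vector>

// Global dot product: local partial sums combined with a blocking reduction.
double global_dot(const std::vector<double>& x, const std::vector<double>& y) {
    double local = std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}

// One CG update on the local pieces of x and r, given p and A*p:
//   alpha = (r,r) / (p,Ap);  x += alpha*p;  r -= alpha*Ap;
void cg_step(std::vector<double>& x, std::vector<double>& r,
             const std::vector<double>& p, const std::vector<double>& Ap) {
    double alpha = global_dot(r, r) / global_dot(p, Ap);
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += alpha * p[i];
        r[i] -= alpha * Ap[i];
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    // Toy local data; in a real solver these come from the distributed matrix.
    std::vector<double> x(100, 0.0), r(100, 1.0), p(100, 1.0), Ap(100, 2.0);
    cg_step(x, r, p, Ap);
    MPI_Finalize();
    return 0;
}

Asynchronous variants, like those studied in the thesis, avoid waiting on such global synchronization points between iterations.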
294

Software Engineering Best Practices for Parallel Computing Development

Patney, Vikas January 2010 (has links)
In today's computer age, numerical simulations are replacing traditional laboratory experiments. Researchers around the world are using advanced computer software and multiprocessor computer technology to perform experiments and analyse the simulation results to advance their respective endeavours. With a wide variety of tools and technologies available, choosing appropriate methodologies for developing simulation software can be a tedious and time-consuming task for a non-computer-science researcher. The research in this thesis addresses the use of the Message Passing Interface (MPI) with object-oriented programming techniques, discusses methodologies suitable for scientific computing, and proposes a customized software engineering development model.
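One way MPI and object-oriented techniques are commonly combined is to wrap the MPI environment in a small RAII class, so initialization and cleanup follow object lifetime. The sketch below is illustrative (the class name is an assumption, not taken from the thesis):

#include <mpi.h>
#include <cstdio>

class MpiEnvironment {
public:
    MpiEnvironment(int& argc, char**& argv) { MPI_Init(&argc, &argv); }
    ~MpiEnvironment() { MPI_Finalize(); }
    int rank() const { int r; MPI_Comm_rank(MPI_COMM_WORLD, &r); return r; }
    int size() const { int s; MPI_Comm_size(MPI_COMM_WORLD, &s); return s; }
};

int main(int argc, char** argv) {
    MpiEnvironment env(argc, argv);          // MPI_Finalize runs automatically
    std::printf("rank %d of %d\n", env.rank(), env.size());
    return 0;
}

Tying MPI_Finalize to a destructor removes one common source of errors in simulation codes with many exit paths.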
295

Performance Evaluation and Prediction of Parallel Applications

Markomanolis, Georgios 20 January 2014 (has links) (PDF)
Analyzing and understanding the performance behavior of parallel applications on various compute infrastructures is a long-standing concern in the High Performance Computing community. When the targeted execution environments are not available, simulation is a reasonable approach to obtain objective performance indicators and explore various "what-if?" scenarios. In this work we present a framework for the off-line simulation of MPI applications. The main originality of our work with regard to the literature is to rely on time-independent execution traces. This allows for extreme scalability, as heterogeneous and distributed resources can be used to acquire a trace. We propose a format where, for each event that occurs during the execution of an application, we log the volume of instructions for a computation phase or the bytes and the type of a communication. To acquire time-independent traces of the execution of MPI applications, we have to instrument them to log the required data. There exist many profiling tools which can instrument an application. We propose a scoring system that reflects our framework's specific requirements and evaluate the most well-known open-source profiling tools against it. Furthermore we introduce an original tool called Minimal Instrumentation that was designed to fulfill the requirements of our framework. We study different instrumentation methods and we also investigate several acquisition strategies. We detail the tools that extract the time-independent traces from the instrumentation traces of some well-known profiling tools. Finally we evaluate the whole acquisition procedure and we present the acquisition of large-scale instances. We describe in detail the procedure to provide a realistic simulated platform file to our trace replay tool, taking into consideration the topology of the real platform and the calibration procedure with regard to the application that is going to be simulated. Moreover we present the implemented trace replay tools that we used during this work. We show that our simulator can predict the performance of some MPI benchmarks with less than 11% relative error between the real execution and the simulation for the cases where there is no performance issue. Finally, we identify the reasons for the performance issues and propose solutions.
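The kind of event the proposed format records, such as how many bytes a communication moves independently of when it happened, can be captured through the standard PMPI profiling interface. The sketch below illustrates that mechanism only; it is not the thesis's Minimal Instrumentation tool, the log format is an assumption, and an MPI-3 implementation is assumed. Linked into an MPI application, it intercepts MPI_Send and writes one record per call:

#include <mpi.h>
#include <cstdio>

// Interposed MPI_Send: record the event, then forward to the real call.
extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm) {
    int type_size = 0, rank = 0;
    PMPI_Type_size(datatype, &type_size);
    PMPI_Comm_rank(comm, &rank);
    std::fprintf(stderr, "rank %d send %d bytes to %d\n",
                 rank, count * type_size, dest);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

A real acquisition tool would buffer such records per process and also count the instructions executed in the compute phases between communication events.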
296

Parallel Processing of Three-Dimensional Navier-Stokes Equations for Compressible Flows

Sisman, Cagri Tahsin 01 September 2005 (has links) (PDF)
The aim of this study is to develop a code capable of solving three-dimensional compressible flows which are viscous and turbulent, and to parallelize this code. The purpose of the parallelization is to obtain computational efficiency in time, enabling the solution of complex flow problems in reasonable computational times. In the first part of the study, the development of a three-dimensional Navier-Stokes solver for turbulent flows, the first step is to develop a two-dimensional Euler code using the Roe flux difference splitting method. This is followed by the addition of subprograms for the calculation of viscous fluxes. The third step is the implementation of the Baldwin-Lomax turbulence model in the code. Finally, the Euler code is generalized to three dimensions. At every step, the code is validated by comparing numerical results with theoretical, experimental or other numerical results, and adequate consistency between these results is obtained. In the second part, the parallelization of the developed code, the two-dimensional code is parallelized using the Message Passing Interface (MPI), and important improvements in computational times are obtained.
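Parallelizing a structured-grid solver of this kind typically means decomposing the domain and exchanging ghost (halo) cells between neighbouring ranks every iteration. The sketch below shows that exchange for a 1-D decomposition with MPI_Sendrecv; it is a generic illustration, not the thesis code, and the array sizes are assumptions:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 16;                          // local cells, illustrative
    std::vector<double> u(n + 2, rank);        // u[0] and u[n+1] are ghost cells
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // Send first interior cell left, receive right ghost from the right.
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // Send last interior cell right, receive left ghost from the left.
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}

For a flux-difference-splitting scheme, the ghost values feed the face-flux evaluation at subdomain boundaries.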
297

Muse: a parallel agent-based simulation environment

Gebre, Meseret Redae. January 2009 (has links)
Thesis (M.C.S.)--Miami University, Dept. of Computer Science and Systems Analysis, 2009. / Title from first page of PDF document. Includes bibliographical references (p. 72-75).
298

Non-oscillatory forward-in-time method for incompressible flows

Cao, Zhixin January 2018 (has links)
This research extends the capabilities of Non-oscillatory Forward-in-Time (NFT) solvers operating on unstructured meshes to allow for accurate simulation of incompressible turbulent flows. This is achieved through the development of Large Eddy Simulation (LES) and Detached Eddy Simulation (DES) turbulent flow methodologies and the development of a parallel option of the flow solver. The effective use of LES and DES requires the development of a subgrid-scale model. Several subgrid-scale models are implemented and studied, and their efficacy is assessed. The NFT solvers employed in this work are based on the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), which facilitates a novel implicit Large Eddy Simulation (ILES) approach to treating turbulence. The flexibility and robustness of the new NFT MPDATA solver are studied and successfully validated using well-established benchmarks, concentrating on the flow past a sphere. The flow statistics from the solutions are compared against existing experimental and numerical data and fully confirm the validity of the approach. The parallel implementation of the flow solver is also documented and verified, showing a substantial speedup of computations. The proposed method lays foundations for further studies and developments, especially for exploring the potential of MPDATA in the context of ILES and the associated treatment of boundary conditions at solid boundaries.
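MPDATA builds on the first-order donor-cell (upwind) step and then applies corrective antidiffusive passes. As orientation only, here is a 1-D, constant-velocity sketch of the donor-cell pass with periodic boundaries; it is an assumption-laden illustration, not the unstructured-mesh solver developed in the thesis:

#include <vector>

// One donor-cell (first-order upwind) step for a scalar field psi,
// with constant Courant number c (stable for |c| <= 1), periodic domain.
std::vector<double> donor_cell_step(const std::vector<double>& psi, double c) {
    const std::size_t n = psi.size();
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t im = (i + n - 1) % n, ip = (i + 1) % n;
        // Upwind fluxes through the left and right cell faces (c may be +/-).
        double flux_right = (c > 0.0) ? c * psi[i]  : c * psi[ip];
        double flux_left  = (c > 0.0) ? c * psi[im] : c * psi[i];
        out[i] = psi[i] - (flux_right - flux_left);
    }
    return out;
}

MPDATA's subsequent passes re-apply the same operator with antidiffusive pseudo-velocities to recover second-order accuracy while keeping the scheme sign-preserving.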
299

Paralelização do algoritmo DIANA com OpenMP e MPI / Parallelization of the DIANA algorithm with OpenMP and MPI

Ribeiro, Hethini do Nascimento 31 August 2018 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / Earlier in this decade there were about 5 billion phones in use generating data. This global production increased by approximately 40% per year at the beginning of the last decade. These large datasets that can be captured, communicated, aggregated, stored and analyzed, also called Big Data, are posing inevitable challenges in many areas, in particular in the field of Machine Learning. Machine Learning algorithms are able to extract useful information from these large data repositories, and for this reason their study is becoming increasingly important. The programs that perform this task can be called classification and clustering algorithms. These applications are computationally expensive. To cite some examples of this cost, the Quality Threshold Clustering algorithm has, in the worst case, complexity O(n⁵); the hierarchical algorithms AGNES and DIANA, in turn, are O(n²) and O(2ⁿ) respectively. There is thus a great challenge in processing large amounts of data in a realistic period of time, which encourages the development of parallel algorithms that scale with the volume of data. The objective of this work is to present the parallelization of the DIANA divisive hierarchical algorithm. The algorithm was implemented in MPI and OpenMP and runs up to three times faster than the single-processor version, showing that although distributed-memory environments require synchronization and message exchange, for a sufficient degree of parallelism this kind of optimization is advantageous for this algorithm.
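The expensive part such a parallelization targets is the repeated computation of each object's average dissimilarity to the rest of its cluster, an O(n²) scan per split. A minimal OpenMP sketch of that step is shown below; it is illustrative only, with toy data, Euclidean dissimilarity and no MPI layer, and is not the thesis implementation:

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1000, d = 8;                       // illustrative sizes
    std::vector<double> x(n * d);
    for (int i = 0; i < n * d; ++i) x[i] = std::sin(0.01 * i);  // toy data

    // Average dissimilarity of each object to all others (the DIANA hotspot).
    std::vector<double> avg(n, 0.0);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j) {
            double dist2 = 0.0;
            for (int k = 0; k < d; ++k) {
                double diff = x[i * d + k] - x[j * d + k];
                dist2 += diff * diff;
            }
            sum += std::sqrt(dist2);                 // Euclidean dissimilarity
        }
        avg[i] = sum / (n - 1);
    }

    // The object with the largest average dissimilarity starts the splinter group.
    int splinter = 0;
    for (int i = 1; i < n; ++i)
        if (avg[i] > avg[splinter]) splinter = i;
    std::printf("splinter object: %d (avg dissimilarity %.4f)\n",
                splinter, avg[splinter]);
    return 0;
}

Compile with -fopenmp (or the equivalent flag); without it the pragma is ignored and the loop simply runs sequentially.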
300

Communication Reducing Approaches and Shared-Memory Optimizations for the Hierarchical Fast Multipole Method on Distributed and Many-core Systems

Abduljabbar, Mustafa 06 December 2018 (has links)
We present algorithms and implementations that overcome obstacles in the migration of the Fast Multipole Method (FMM), one of the most important algorithms in computational science and engineering, to exascale computing. Emerging architectural approaches to exascale computing are all characterized by data movement rates that are slow relative to the demand of aggregate floating point capability, resulting in performance that is bandwidth limited. Practical parallel applications of FMM are impeded in their scaling by irregularity of domains and dominance of collective tree communication, which is known not to scale well. We introduce novel ideas that improve partitioning of the N-body problem with boundary distribution through a sampling-based mechanism that hybridizes two well-known partitioning techniques, Hashed Octree (HOT) and Orthogonal Recursive Bisection (ORB). To reduce communication cost, we employ two methodologies. First, we directly utilize features available in parallel runtime systems to enable asynchronous computing and overlap it with communication. Second, we present Hierarchical Sparse Data Exchange (HSDX), a new all-to-all algorithm that inherently relieves communication by relaying sparse data in a few steps of neighbor exchanges. HSDX exhibits superior scalability and improves relative performance compared to the default MPI alltoall and other relevant literature implementations. We test this algorithm alongside others on a Cray XC40 tightly coupled with the Aries network and on Intel Many Integrated Core Architecture (MIC) represented by Intel Knights Corner (KNC) and Intel Knights Landing (KNL) as modern shared-memory CPU environments. Tests include comparisons of thoroughly tuned handwritten versus auto-vectorization of FMM Particle-to-Particle (P2P) and Multipole-to-Local (M2L) kernels. Scalability of task-based parallelism is assessed with FMM’s tree traversal kernel using different threading libraries. The MIC tests show large performance gains after adopting the prescribed techniques, which are inevitable in a world that is moving towards many-core parallelism.
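Hashed Octree (HOT) partitioning, one half of the hybrid partitioning described above, orders bodies along a space-filling curve by interleaving the bits of their quantized coordinates. The sketch below computes such a Morton (Z-order) key; it is a standard bit-interleaving construction given here for illustration (21 bits per coordinate, positions assumed normalized to [0,1)), not code from the dissertation:

#include <cstdint>
#include <cstdio>

// Spread the low 21 bits of v so that two zero bits separate consecutive bits.
static std::uint64_t spread_bits(std::uint64_t v) {
    v &= 0x1fffff;                               // keep 21 bits
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v <<  8) & 0x100f00f00f00f00fULL;
    v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
    v = (v | v <<  2) & 0x1249249249249249ULL;
    return v;
}

// Morton key of a point with coordinates in [0,1), quantized to 21 bits each.
std::uint64_t morton_key(double x, double y, double z) {
    auto q = [](double c) { return static_cast<std::uint64_t>(c * 2097152.0); };
    return spread_bits(q(x)) | (spread_bits(q(y)) << 1) | (spread_bits(q(z)) << 2);
}

int main() {
    std::printf("key of (0.5, 0.25, 0.75) = %llu\n",
                static_cast<unsigned long long>(morton_key(0.5, 0.25, 0.75)));
    return 0;
}

Sorting bodies by these keys groups spatially nearby bodies into contiguous ranges, which can then be cut into per-process partitions.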
