421 |
Ancillarity-Sufficiency Interweaving Strategy (ASIS) for Boosting MCMC Estimation of Stochastic Volatility Models. Kastner, Gregor; Frühwirth-Schnatter, Sylvia. 01 1900 (has links) (PDF)
Bayesian inference for stochastic volatility models using MCMC methods highly depends on actual parameter values in terms of sampling efficiency. While draws from the posterior utilizing the standard centered parameterization break down when the volatility of volatility parameter in the latent state equation is small, non-centered versions of the model show deficiencies for highly persistent latent variable series. The novel approach of ancillarity-sufficiency interweaving has recently been shown to aid in overcoming these issues for a broad class of multilevel models. In this paper, we demonstrate how such an interweaving strategy can be applied to stochastic volatility models in order to greatly improve sampling efficiency for all parameters and throughout the entire parameter range. Moreover, this method of "combining best of different worlds" allows for inference for parameter constellations that have previously been infeasible to estimate without the need to select a particular parameterization beforehand. / Series: Research Report Series / Department of Statistics and Mathematics
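As a concrete illustration of the interweaving idea summarized above, here is a minimal Python sketch (assuming NumPy) that applies ancillarity-sufficiency interweaving to a toy Gaussian hierarchical model; the model, the conjugate updates, and all numerical settings are illustrative assumptions, not the authors' stochastic volatility sampler.

    import numpy as np

    # Toy model: y_i ~ N(theta_i, 1), theta_i ~ N(mu, tau2), flat prior on mu.
    # The centered parameterization keeps theta (sufficient augmentation); the
    # non-centered one keeps theta_tilde = theta - mu (ancillary augmentation).
    # ASIS interweaves one mu-update in each parameterization per sweep.
    rng = np.random.default_rng(0)
    n, tau2 = 50, 0.1
    theta_true = rng.normal(2.0, np.sqrt(tau2), n)
    y = rng.normal(theta_true, 1.0)

    mu, draws = 0.0, []
    for it in range(5000):
        # 1) Centered: theta_i | y_i, mu (conjugate normal update).
        post_var = 1.0 / (1.0 + 1.0 / tau2)
        theta = rng.normal(post_var * (y + mu / tau2), np.sqrt(post_var))
        # 2) Centered: mu | theta ~ N(mean(theta), tau2 / n).
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / n))
        # 3) Switch to the ancillary (non-centered) parameterization.
        theta_tilde = theta - mu
        # 4) Non-centered: mu | theta_tilde, y ~ N(mean(y - theta_tilde), 1 / n).
        mu = rng.normal((y - theta_tilde).mean(), np.sqrt(1.0 / n))
        # 5) Map back to the centered parameterization.
        theta = theta_tilde + mu
        draws.append(mu)

    print("posterior mean of mu:", np.mean(draws[1000:]))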
|
422 |
Métaheuristiques pour l'optimisation combinatoire sur processeurs graphiques (GPU) / Metaheuristics for combinatorial optimization on Graphics Processing Unit (GPU). Delevacq, Audrey. 04 February 2013 (has links)
Plusieurs problèmes d'optimisation combinatoire sont dits NP-difficiles et ne peuvent être résolus de façon optimale par des algorithmes exacts. Les métaheuristiques ont prouvé qu'elles pouvaient être efficaces pour résoudre un grand nombre de ces problèmes en leur trouvant des solutions approchées en un temps raisonnable. Cependant, face à des instances de grande taille, elles ont besoin d'un temps de calcul et d'une quantité d'espace mémoire considérables pour être performantes dans l'exploration de l'espace de recherche. Par conséquent, l'intérêt voué à leur déploiement sur des architectures de calcul haute performance a augmenté durant ces dernières années. Les approches de parallélisation existantes suivent généralement les paradigmes de passage de messages ou de mémoire partagée qui conviennent aux architectures traditionnelles à base de microprocesseurs, aussi appelés CPU (Central Processing Unit). Cependant, la recherche évolue très rapidement dans le domaine du parallélisme et de nouvelles architectures émergent, notamment les accélérateurs matériels qui permettent de décharger le CPU de certaines de ses tâches. Parmi ceux-ci, les processeurs graphiques ou GPU (Graphics Processing Units) présentent une architecture massivement parallèle possédant un grand potentiel mais aussi de nouvelles difficultés d'algorithmique et de programmation. En effet, les modèles de parallélisation de métaheuristiques existants sont généralement inadaptés aux environnements de calcul de type GPU. Certains travaux ont d'ailleurs abordé ce sujet sans toutefois y apporter une vision globale et fondamentale. L'objectif général de cette thèse est de proposer un cadre de référence permettant l'implémentation efficace des métaheuristiques sur des architectures parallèles basées sur les GPU. Elle débute par un état de l'art décrivant les travaux existants sur la parallélisation GPU des métaheuristiques et les classifications générales des métaheuristiques parallèles. Une taxonomie originale est ensuite proposée afin de classifier les implémentations recensées et de formaliser les stratégies de parallélisation sur GPU dans un cadre méthodologique cohérent. Cette thèse vise également à valider cette taxonomie en exploitant ses principales composantes pour proposer des stratégies de parallélisation originales spécifiquement adaptées aux architectures GPU. Plusieurs implémentations performantes basées sur les métaheuristiques d'Optimisation par Colonie de Fourmis et de Recherche Locale Itérée sont ainsi proposées pour la résolution du problème du Voyageur de Commerce. Une étude expérimentale structurée et minutieuse est réalisée afin d'évaluer et de comparer la performance des approches autant au niveau de la qualité des solutions trouvées que de la réduction du temps de calcul. / Several combinatorial optimization problems are NP-hard and can only be solved optimally by exact algorithms for small instances. Metaheuristics have proved to be effective in solving many of these problems by finding approximate solutions in a reasonable time. However, dealing with large instances, they may require considerable computation time and amount of memory space to be efficient in the exploration of the search space. Therefore, the interest devoted to their deployment on high performance computing architectures has increased over the past years.
Existing parallelization approaches generally follow the message-passing and shared-memory computing paradigms which are suitable for traditional architectures based on microprocessors, also called CPU (Central Processing Unit). However, research in the field of parallel computing is rapidly evolving and new architectures emerge, including hardware accelerators which offload some tasks from the CPU. Among them, graphics processors or GPUs (Graphics Processing Units) have a massively parallel architecture with great potential but also imply new algorithmic and programming challenges. In fact, existing parallelization models of metaheuristics are generally unsuited to computing environments like GPUs. Some works have tackled this subject, but without providing a comprehensive and fundamental view of it. The general purpose of this thesis is to propose a framework for the effective implementation of metaheuristics on parallel architectures based on GPUs. It begins with a state of the art describing existing works on GPU parallelization of metaheuristics and general classifications of parallel metaheuristics. An original taxonomy is then designed to classify identified implementations and to formalize GPU parallelization strategies in a coherent methodological framework. This thesis also aims to validate this taxonomy by exploiting its main components to propose original parallelization strategies specifically tailored to GPU architectures. Several effective implementations based on Ant Colony Optimization and Iterated Local Search metaheuristics are thus proposed for solving the Travelling Salesman Problem. A structured and thorough experimental study is conducted to evaluate and compare the performance of these approaches on criteria related to solution quality and computing time reduction.
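To make the metaheuristics concrete, the following is a small CPU-only Python sketch of Iterated Local Search with a 2-opt neighbourhood for the Travelling Salesman Problem; the instance, perturbation scheme, and parameters are assumptions, and the GPU parallelization strategies that are the subject of the thesis are not reproduced here.

    import numpy as np

    def tour_length(tour, dist):
        return dist[tour, np.roll(tour, -1)].sum()

    def two_opt(tour, dist):
        # Local search: keep reversing segments while the tour gets shorter.
        tour, n, improved = tour.copy(), len(tour), True
        while improved:
            improved = False
            for i in range(1, n - 1):
                for j in range(i + 1, n):
                    a, b = tour[i - 1], tour[i]
                    c, d = tour[j], tour[(j + 1) % n]
                    if dist[a, c] + dist[b, d] < dist[a, b] + dist[c, d] - 1e-9:
                        tour[i:j + 1] = tour[i:j + 1][::-1]
                        improved = True
        return tour

    def iterated_local_search(dist, iters=30, seed=0):
        rng = np.random.default_rng(seed)
        best = two_opt(np.arange(len(dist)), dist)
        for _ in range(iters):
            cand = best.copy()
            i, j = sorted(rng.choice(len(dist), 2, replace=False))
            cand[i:j] = rng.permutation(cand[i:j])   # perturbation step
            cand = two_opt(cand, dist)
            if tour_length(cand, dist) < tour_length(best, dist):
                best = cand
        return best

    pts = np.random.default_rng(1).random((60, 2))   # random Euclidean instance
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    best = iterated_local_search(dist)
    print("tour length:", tour_length(best, dist))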
|
423 |
[en] EFFICIENT FLUID SIMULATION IN THE PARAMETRIC SPACE OF THREE-DIMENSIONAL STRUCTURED GRIDS / [pt] SIMULAÇÃO EFICIENTE DE FLUIDOS NO ESPAÇO PARAMÉTRICO DE MALHAS ESTRUTURADAS TRIDIMENSIONAIS. VITOR BARATA RIBEIRO BLANCO BARROSO. 13 January 2017 (has links)
[pt] Fluidos são extremamente comuns em nosso mundo e têm papel central em muitos fenômenos naturais. A compreensão de seu comportamento tem importância fundamental em uma vasta gama de aplicações e diversas áreas de pesquisa, da análise de fluxo sanguíneo até o transporte de petróleo, da exploração do fluxo de um rio até a previsão de maremotos, tempestades e furacões. Na simulação de fluidos, a abordagem conhecida como Euleriana é capaz de gerar resultados bastante corretos e precisos, mas as computações envolvidas podem se tornar excessivamente custosas quando há a necessidade de tratar fronteiras curvas e obstáculos com formas complexas. Este trabalho aborda esse problema e apresenta uma técnica Euleriana rápida e direta para simular o escoamento de fluidos em grades estruturadas parametrizadas tridimensionais. O principal objetivo do método é tratar de forma correta e eficiente as interações de fluidos com fronteiras curvas, incluindo paredes externas e obstáculos internos. Para isso, são utilizadas matrizes Jacobianas por célula para relacionar as derivadas de campos escalares e vetoriais nos espaços do mundo e paramétrico, o que permite a resolução das equações de Navier-Stokes diretamente no segundo, onde a discretização do domínio torna-se simplesmente uma grade uniforme. O trabalho parte de um simulador baseado em grades regulares e descreve como adaptá-lo com a aplicação das matrizes Jacobianas em cada passo, incluindo a resolução de equações de Poisson e dos sistemas lineares esparsos associados, utilizando tanto iterações de Jacobi quanto o método do Gradiente Biconjugado Estabilizado. A técnica é implementada na linguagem de programação CUDA e procura explorar ao máximo a arquitetura massivamente paralela das placas gráficas atuais. / [en] Fluids are extremely common in our world and play a central role in many natural phenomena. Understanding their behavior is of great importance to a broad range of applications and several areas of research, from blood flow analysis to oil transportation, from the exploitation of river flows to the prediction of tidal waves, storms and hurricanes. When simulating fluids, the so-called Eulerian approach can generate quite correct and precise results, but the computations involved can become excessively expensive when curved boundaries and obstacles with complex shapes need to be taken into account. This work addresses this problem and presents a fast and straightforward Eulerian technique to simulate fluid flows in three-dimensional parameterized structured grids. The method's primary design goal is the correct and efficient handling of fluid interactions with curved boundary walls and internal obstacles. This is accomplished by the use of per-cell Jacobian matrices to relate field derivatives in the world and parameter spaces, which allows the Navier-Stokes equations to be solved directly in the latter, where the domain discretization becomes a simple uniform grid. The work builds on a regular-grid-based simulator and describes how to apply Jacobian matrices to each step, including the solution of Poisson equations and the related sparse linear systems using
both Jacobi iterations and a Biconjugate Gradient Stabilized solver. The technique is implemented efficiently in the CUDA programming language and strives to take full advantage of the massively parallel architecture of today's graphics cards.
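The role of the per-cell Jacobian matrices can be illustrated with a small 2D example (an assumed simplification; the thesis works on 3D grids in CUDA): derivatives computed on the uniform parametric grid are mapped to world space through the inverse transpose of the Jacobian.

    import numpy as np

    # Curvilinear mapping from parameter space (u, v) to world space (x, y);
    # a quarter annulus is assumed purely for illustration.
    def world(u, v):
        r, t = 1.0 + u, 0.5 * np.pi * v
        return r * np.cos(t), r * np.sin(t)

    nu = nv = 64
    du = dv = 1.0 / (nu - 1)
    u, v = np.meshgrid(np.linspace(0, 1, nu), np.linspace(0, 1, nv), indexing="ij")
    x, y = world(u, v)
    p = x**2 + 3.0 * y                       # analytic field, gradient = (2x, 3)

    # Derivatives on the uniform parametric grid (central differences).
    dp_du, dp_dv = np.gradient(p, du, dv, edge_order=2)
    dx_du, dx_dv = np.gradient(x, du, dv, edge_order=2)
    dy_du, dy_dv = np.gradient(y, du, dv, edge_order=2)

    # Per-cell Jacobian J = d(x, y)/d(u, v); the world gradient g solves J^T g = grad_param.
    J = np.stack([np.stack([dx_du, dx_dv], -1), np.stack([dy_du, dy_dv], -1)], -2)
    grad_param = np.stack([dp_du, dp_dv], -1)[..., None]
    grad_world = np.linalg.solve(np.swapaxes(J, -1, -2), grad_param)[..., 0]

    print("max error in d/dx:", np.abs(grad_world[..., 0] - 2 * x).max())
    print("max error in d/dy:", np.abs(grad_world[..., 1] - 3.0).max())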
|
424 |
Towards optimal design of multiscale nonlinear structures : reduced-order modeling approaches / Vers une conception optimale des structures multi-échelles non-linéaires : approches de réduction de modèle. Xia, Liang. 25 November 2015 (has links)
L'objectif principal est de faire les premiers pas vers la conception topologique de structures hétérogènes à comportement non-linéaire. Le deuxième objectif est d'optimiser simultanément la topologie de la structure et du matériau. Il requiert la combinaison des méthodes de conception optimale et des approches de modélisation multi-échelle. En raison des lourdes exigences de calcul, nous avons introduit des techniques de réduction de modèle et de calcul parallèle. Nous avons développé tout d'abord un cadre de conception multi-échelle constitué de l'optimisation topologique et la modélisation multi-échelle. Ce cadre fournit un outil automatique pour des structures dont le modèle de matériau sous-jacent est directement régi par la géométrie de la microstructure réaliste et des lois de comportement microscopiques. Nous avons ensuite étendu le cadre en introduisant des variables supplémentaires à l'échelle microscopique pour effectuer la conception simultanée de la structure et de la microstructure. En ce qui concerne les exigences de calcul et de stockage de données en raison de multiples réalisations de calcul multi-échelle sur les configurations similaires, nous avons introduit les approches de réduction de modèle. Nous avons développé un substitut d'apprentissage adaptatif pour le cas de l'élasticité non-linéaire. Pour la viscoplasticité, nous avons collaboré avec le Professeur Felix Fritzen de l'Université de Stuttgart en utilisant son modèle de réduction avec la programmation parallèle sur GPU. Nous avons également adopté une autre approche basée sur le potentiel de réduction issue de la littérature pour améliorer l'efficacité de la conception simultanée. / High-performance heterogeneous materials have been increasingly used nowadays for their advantageous overall characteristics resulting in superior structural mechanical performance. The pronounced heterogeneities of materials have a significant impact on the structural behavior, such that one needs to account for both material microscopic heterogeneities and constituent behaviors to achieve reliable structural designs. Meanwhile, the fast progress of material science and the latest development of 3D printing techniques make it possible to generate more innovative, lightweight, and structurally efficient designs through controlling the composition and the microstructure of material at the microscopic scale. In this thesis, we have made first attempts towards topology optimization design of multiscale nonlinear structures, including design of highly heterogeneous structures, material microstructural design, and simultaneous design of structure and materials. We have primarily developed a multiscale design framework, constituted of two key ingredients: multiscale modeling for structural performance simulation and topology optimization for structural design. With regard to the first ingredient, we employ the first-order computational homogenization method FE2 to bridge structural and material scales. With regard to the second ingredient, we apply the method Bi-directional Evolutionary Structural Optimization (BESO) to perform topology optimization. In contrast to the conventional nonlinear design of homogeneous structures, this design framework provides an automatic design tool for nonlinear highly heterogeneous structures of which the underlying material model is governed directly by the realistic microstructural geometry and the microscopic constitutive laws.
Note that the FE2 method is extremely expensive in terms of computing time and storage requirement. The dilemma of heavy computational burden is even more pronounced when it comes to topology optimization: the time-consuming multiscale problem must be solved not just once, but for many different realizations of the structural topology. Meanwhile we note that the optimization process requires multiple design loops involving similar or even repeated computations at the microscopic scale. For these reasons, we introduce to the design framework a third ingredient: reduced-order modeling (ROM). We develop an adaptive surrogate model using snapshot Proper Orthogonal Decomposition (POD) and Diffuse Approximation to substitute the microscopic solutions. The surrogate model is initially built from the first design iteration and updated adaptively in the subsequent design iterations. This surrogate model has shown promising performance in terms of both computing cost reduction and modeling accuracy when applied to the design framework for nonlinear elastic cases. As for more severe material nonlinearity, we directly employ an established method, potential-based Reduced Basis Model Order Reduction (pRBMOR). The key idea of pRBMOR is to approximate the internal variables of the dissipative material by a precomputed reduced basis computed from snapshot POD. To drastically accelerate the computing procedure, pRBMOR has been implemented by parallelization on modern Graphics Processing Units (GPUs). The implementation of pRBMOR with GPU acceleration enables us to realize the design of multiscale elastoviscoplastic structures using the previously developed design framework in realistic computing time and with affordable memory requirement. We have so far assumed a fixed material microstructure at the microscopic scale. The remaining part of the thesis is dedicated to simultaneous design of both macroscopic structure and microscopic materials. Within the previously established multiscale design framework, topology variables and volume constraints are defined at both scales.
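As a hint of how the reduced-order modeling ingredient works, here is a generic snapshot-POD sketch in Python (truncated SVD of a snapshot matrix, then projection of a new state onto the reduced basis); the snapshot data are synthetic and the sketch is an assumed stand-in, not the adaptive POD/Diffuse Approximation surrogate or the pRBMOR model used in the thesis.

    import numpy as np

    # Synthetic "snapshots": solution fields sampled for several parameter values.
    # In the thesis these would come from microscopic (RVE) computations.
    x = np.linspace(0.0, 1.0, 200)
    params = np.linspace(0.5, 2.0, 30)
    snapshots = np.stack([np.sin(p * np.pi * x) + 0.1 * p * x**2 for p in params], axis=1)

    # Snapshot POD: the left singular vectors form the reduced basis.
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(energy, 0.9999)) + 1     # keep 99.99% of the energy
    basis = U[:, :k]
    print("reduced basis size:", k, "of", snapshots.shape[1], "snapshots")

    # A new full-order state (a parameter not in the snapshot set) is then
    # approximated by its projection onto the reduced basis.
    u_new = np.sin(1.23 * np.pi * x) + 0.1 * 1.23 * x**2
    coeffs = basis.T @ u_new                 # reduced coordinates
    u_rec = basis @ coeffs                   # reconstruction from k modes
    print("relative reconstruction error:",
          np.linalg.norm(u_rec - u_new) / np.linalg.norm(u_new))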
|
425 |
Méthodes de préconditionnement pour la résolution de systèmes linéaires sur des machines massivement parallèles / Preconditioning methods for solving linear systems on massively parallel machines. Qu, Long. 10 April 2014 (has links)
Cette thèse traite d’une nouvelle classe de préconditionneurs qui ont pour but d’accélérer la résolution des grands systèmes creux, courant dans les problèmes scientifiques ou industriels, par les méthodes itératives préconditionnées. Pour appliquer ces préconditionneurs, la matrice d’entrée doit être réorganisée avec un algorithme de dissection emboîtée. Nous introduisons également une technique de recouvrement qui s’adapte à l’idée de chevauchement des sous-domaines provenant des méthodes de décomposition de domaine, aux méthodes de dissection emboîtée pour améliorer la convergence de nos préconditionneurs. Les résultats montrent que cette technique de recouvrement nous permet d’améliorer la vitesse de convergence de Nested SSOR (NSSOR) et Nested Modified Incomplete LU with Rowsum property (NMILUR) qui sont des préconditionneurs que nous étudions. La dernière partie de cette thèse portera sur nos contributions dans le domaine du calcul parallèle. Nous présenterons la distribution des données et les algorithmes parallèles utilisés pour la mise en oeuvre de nos préconditionneurs. Les résultats montrent que sur une grille régulière 400x400x400, le nombre d’itérations nécessaire à la résolution avec un de nos préconditionneurs, Nested Filtering Factorization préconditionneur (NFF), n’augmente que légèrement quand le nombre de sous-domaines augmente jusqu’à 2048. En ce qui concerne les performances d’exécution sur le super-calculateur Curie, il passe à l’échelle jusqu’à 2048 coeurs et il est 2,6 fois plus rapide que le préconditionneur Schwarz Additif Restreint (RAS) qui est un des préconditionneurs basés sur les méthodes de décomposition de domaine implémentés dans la bibliothèque de calcul scientifique PETSc, bien connue de la communauté. / This thesis addresses a new class of preconditioners which aim to accelerate the solution of large sparse systems arising in scientific and engineering problems by preconditioned iterative methods. To apply these preconditioners, the input matrix needs to be reordered with K-way nested dissection. We also introduce an overlapping technique that adapts the idea of overlapping subdomains from domain decomposition methods to nested dissection based methods to improve the convergence of these preconditioners. Results show that such an overlapping technique improves the convergence rate of the Nested SSOR (NSSOR) and Nested Modified Incomplete LU with Rowsum property (NMILUR) preconditioners that we worked on. We also present the data distribution and parallel algorithms for implementing these preconditioners. Results show that on a 400x400x400 regular grid, the number of iterations with the Nested Filtering Factorization preconditioner (NFF) increases only slightly as the number of subdomains grows to 2048. In terms of runtime performance on the Curie supercomputer, it scales up to 2048 cores and it is 2.6 times faster than the domain decomposition preconditioner Restricted Additive Schwarz (RAS) as implemented in PETSc.
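For readers unfamiliar with the setting, the sketch below shows a preconditioned conjugate gradient solver with a simple block-Jacobi (subdomain-wise) preconditioner on a small dense test matrix; it only illustrates the general idea of preconditioned iterative methods on partitioned unknowns, not the NSSOR, NMILUR, NFF, or RAS preconditioners studied in the thesis.

    import numpy as np

    def pcg(A, b, M_solve, tol=1e-8, maxit=500):
        # Preconditioned conjugate gradient; M_solve applies the preconditioner.
        x = np.zeros_like(b)
        r = b - A @ x
        z = M_solve(r)
        p, rz = z.copy(), r @ z
        for it in range(maxit):
            Ap = A @ p
            alpha = rz / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol * np.linalg.norm(b):
                return x, it + 1
            z = M_solve(r)
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x, maxit

    # 1D Laplacian (symmetric positive definite) as a stand-in for a sparse system.
    n = 400
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)

    # Block-Jacobi preconditioner: solve independently on each subdomain block.
    blocks = np.array_split(np.arange(n), 8)
    inv_blocks = [np.linalg.inv(A[np.ix_(idx, idx)]) for idx in blocks]

    def block_jacobi(r):
        z = np.empty_like(r)
        for idx, inv in zip(blocks, inv_blocks):
            z[idx] = inv @ r[idx]
        return z

    _, it_plain = pcg(A, b, lambda r: r.copy())      # identity preconditioner
    _, it_bj = pcg(A, b, block_jacobi)
    print("CG iterations:", it_plain, " block-Jacobi PCG iterations:", it_bj)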
|
426 |
Système d'agents mobiles pour les architectures de calculs auto-adaptatifs / Mobile Agent System dedicated to adaptable numerical architecture. Dumont, Cyril. 28 May 2014 (has links)
Ce travail appartient au domaine de la simulation numérique sur des plates-formes d'exécution distribuées hétérogènes telles que des grilles de calcul. Ce type de plate-forme se caractérise par des possibles changements de condition d'exécution et par une probabilité importante de défaillance de certains composants. Une application qui s'exécute dans un tel environnement se doit d'être adaptable à son contexte d'exécution et tolérante aux pannes. Face à la complexité croissante de la mise en place de cas de calcul sur des grilles de calcul, nous proposons une plateforme logicielle pour la résolution de cas de calcul numérique dans un environnement distribué hétérogène. Nos travaux apportent une solution qui se base sur un système d'agents mobiles, ce qui permet à une application de s'adapter au changement de son environnement d'exécution. Dans un premier temps, nous utilisons le langage pi calcul d'ordre supérieur pour spécifier une « ferme de travailleurs » capable de participer à la résolution de tout type de cas de calcul. Ensuite, nous énonçons des propriétés qui caractérisent le bon fonctionnement de ce système avec une logique temporelle TCTL. Pour cela, nous souhaitons modéliser notre système à l'aide d'automates temporisés à partir des termes définis par la spécification formelle en pi calcul. Dans ce but, nous définissons une transformation de termes écrits en pi calcul en automates temporisés. Les propriétés sont alors vérifiées avec l'outil UppAal. Pour valider ce travail de modélisation, nous avons réalisé le framework MCA (pour Mobile Computing Architecture). Celui-ci propose un ensemble d'outils facilitant la mise en place de composants sur un environnement distribué hétérogène dans le but d'effectuer la résolution de cas de calcul. La librairie avec laquelle sont développés ces composants, qu'ils soient mobiles ou non, est implantée en Java et se base sur les technologies Jini et JavaSpaces. Enfin, nous réalisons l'évaluation du framework MCA en procédant à la résolution de trois cas de calcul différents. Chacune de ces expériences, réalisées sur une grappe de 20 noeuds, nous permet de montrer les caractéristiques essentielles de notre framework : une simplicité de programmation, un faible surcoût en temps d'exécution sans l'activation de la tolérance aux pannes et une tolérance aux pannes efficace. / This work belongs to the domain of numerical simulation on heterogeneous distributed platforms such as grids. This type of platform is characterized by possible changes in execution conditions and a significant probability of failure of some components. An application running in such an environment must be adaptable to its execution context and fault tolerant. Facing the growing complexity of implementing computation cases on grid computing, we propose a software platform which solves numerical computation cases in a distributed heterogeneous environment. Our work provides a solution based on a mobile agent system, which allows an application to adapt to changes in its execution environment. At first, we use the higher-order pi calculus language to specify a "farm of workers" able to take part in solving any type of computation case. Then we state the properties that characterize the system's correct behavior in the temporal logic TCTL. To do this, we model the system with timed automata derived from the terms defined by the formal pi calculus specification; to achieve this transformation, we define a translation of terms written in pi calculus into timed automata.
The properties are verified with the UppAal tool. To validate this modeling work, we develop the MCA (for Mobile Computing Architecture) framework. It offers a set of tools which facilitate the implementation of distributed heterogeneous components in order to solve computation cases. These components, mobile or not, are developed with a library written in Java that uses the Jini and JavaSpaces technologies. Finally, our framework is evaluated through the resolution of three different computation cases. Each of these experiments, performed on a 20-node cluster, allows us to highlight our framework's main characteristics: programming simplicity, low execution-time overhead when fault tolerance is not activated, and efficient fault tolerance.
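As an informal illustration of the 'farm of workers' pattern specified in the thesis, here is a small local Python sketch using a process pool with re-submission of failed tasks; it stands in for the mobile-agent, Jini/JavaSpaces-based MCA framework only at the level of the idea, and the failure rate, task content, and retry policy are assumptions.

    import random
    from multiprocessing import Pool

    def solve_case(case):
        # One computation case; failures are simulated to motivate re-submission.
        if random.random() < 0.2:                     # assumed 20% failure rate
            raise RuntimeError(f"worker failed on case {case}")
        return case, sum(i * i for i in range(case))  # placeholder numerical work

    def farm(cases, workers=4, max_retries=3):
        results, pending = {}, list(cases)
        with Pool(workers) as pool:
            for _ in range(max_retries):
                outcomes = [pool.apply_async(solve_case, (c,)) for c in pending]
                failed = []
                for c, out in zip(pending, outcomes):
                    try:
                        case, value = out.get()
                        results[case] = value
                    except RuntimeError:
                        failed.append(c)              # re-submit on the next round
                pending = failed
                if not pending:
                    break
        return results, pending

    if __name__ == "__main__":
        done, lost = farm(range(10, 20))
        print(len(done), "cases solved,", len(lost), "still unsolved")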
|
427 |
"Processamento distribuído de áudio em tempo real" / "Distributed Real-Time Audio Processing"Nelson Posse Lago 04 June 2004 (has links)
Sistemas computadorizados para o processamento de multimídia em tempo real demandam alta capacidade de processamento. Problemas que exigem grandes capacidades de processamento são comumente abordados através do uso de sistemas paralelos ou distribuídos; no entanto, a conjunção das dificuldades inerentes tanto aos sistemas de tempo real quanto aos sistemas paralelos e distribuídos tem levado o desenvolvimento com vistas ao processamento de multimídia em tempo real por sistemas computacionais de uso geral a ser baseado em equipamentos centralizados e monoprocessados. Em diversos sistemas para multimídia há a necessidade de baixa latência durante a interação com o usuário, o que reforça ainda mais essa tendência para o processamento em um único nó. Neste trabalho, implementamos um mecanismo para o processamento síncrono e distribuído de áudio com características de baixa latência em uma rede local, permitindo o uso de um sistema distribuído de baixo custo para esse processamento. O objetivo primário é viabilizar o uso de sistemas computacionais distribuídos para a gravação e edição de material musical em estúdios domésticos ou de pequeno porte, contornando a necessidade de hardware dedicado de alto custo. O sistema implementado consiste em duas partes: uma, genérica, implementada sob a forma de um middleware para o processamento síncrono e distribuído de mídias contínuas com baixa latência; outra, específica, baseada na primeira, voltada para o processamento de áudio e compatível com aplicações legadas através da interface padronizada LADSPA. É de se esperar que pesquisas e aplicações futuras em que necessidades semelhantes se apresentem possam utilizar o middleware aqui descrito para outros tipos de processamento de áudio bem como para o processamento de outras mídias, como vídeo. / Computer systems for real-time multimedia processing require high processing power. Problems that depend on high processing power are usually solved by using parallel or distributed computing techniques; however, the combination of the difficulties of both real-time and parallel programming has led the development of applications for real-time multimedia processing for general purpose computer systems to be based on centralized and single-processor systems. In several systems for multimedia processing, there is a need for low latency during the interaction with the user, which reinforces the tendency towards single-processor development. In this work, we implemented a mechanism for synchronous and distributed audio processing with low latency on a local area network which makes the use of a low cost distributed system for this kind of processing possible. The main goal is to allow the use of distributed systems for recording and editing of musical material in home and small studios, bypassing the need for high-cost equipment. The system we implemented is made of two parts: the first, generic, implemented as a middleware for synchronous and distributed processing of continuous media with low latency; and the second, based on the first, geared towards audio processing and compatible with legacy applications based on the standard LADSPA interface. We expect that future research and applications that share the needs of the system developed here make use of the middleware we developed, both for other kinds of audio processing as well as for the processing of other media forms, such as video.
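To make the latency concern tangible, the following sketch shows block-based audio processing in the style of a LADSPA plugin chain, where the block size directly sets the minimum buffering latency; the plugin chain, block size, and sample rate are assumptions for illustration, not the middleware described in the dissertation.

    import numpy as np

    SAMPLE_RATE = 48_000
    BLOCK_SIZE = 128            # samples per processing block (an assumed setting)
    print("per-block latency: %.2f ms" % (1000.0 * BLOCK_SIZE / SAMPLE_RATE))

    def gain(block, db=-6.0):
        return block * (10.0 ** (db / 20.0))

    def soft_clip(block, drive=2.0):
        return np.tanh(drive * block) / np.tanh(drive)

    def process_stream(signal, plugins):
        # Feed the signal through a chain of per-block processors, block by block.
        out = np.empty_like(signal)
        for start in range(0, len(signal) - BLOCK_SIZE + 1, BLOCK_SIZE):
            block = signal[start:start + BLOCK_SIZE]
            for plugin in plugins:
                block = plugin(block)
            out[start:start + BLOCK_SIZE] = block
        return out

    t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
    sine = 0.8 * np.sin(2 * np.pi * 440.0 * t)
    processed = process_stream(sine, [gain, soft_clip])
    print("peak before: %.3f  after: %.3f" % (np.abs(sine).max(), np.abs(processed).max()))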
|
428 |
A simulation workflow to evaluate the performance of dynamic load balancing with over decomposition for iterative parallel applications. Tesser, Rafael Keller. January 2018 (has links)
Nesta tese é apresentado um novo workflow de simulação para avaliar o desempenho do balanceamento de carga dinâmico baseado em sobre-decomposição aplicado a aplicações paralelas iterativas. Seus objetivos são realizar essa avaliação com modificações mínimas da aplicação e a baixo custo em termos de tempo e de sua necessidade de recursos computacionais. Muitas aplicações paralelas sofrem com desbalanceamento de carga dinâmico (temporal) que não pode ser tratado a nível de aplicação. Este pode ser causado por características intrínsecas da aplicação ou por fatores externos de hardware ou software. Como demonstrado nesta tese, tal desbalanceamento é encontrado mesmo em aplicações cujo código não aparenta qualquer dinamismo. Portanto, faz-se necessário utilizar mecanismo de balanceamento de carga dinâmico a nível de runtime. Este trabalho foca no balanceamento de carga dinâmico baseado em sobre-decomposição. No entanto, avaliar e ajustar o desempenho de tal técnica pode ser custoso. Isso geralmente requer modificações na aplicação e uma grande quantidade de execuções para obter resultados estatisticamente significativos com diferentes combinações de parâmetros de balanceamento de carga. Além disso, para que essas medidas sejam úteis, são usualmente necessárias grandes alocações de recursos em um sistema de produção. Simulated Adaptive MPI (SAMPI), nosso workflow de simulação, emprega uma combinação de emulação sequencial e replay de rastros para reduzir os custos dessa avaliação. Tanto emulação sequencial como replay de rastros requerem um único nó computacional. Além disso, o replay demora apenas uma pequena fração do tempo de uma execução paralela real da aplicação. Adicionalmente à simulação de balanceamento de carga, foram desenvolvidas técnicas de agregação espacial e rescaling a nível de aplicação, as quais aceleram o processo de emulação. Para demonstrar os potenciais benefícios do balanceamento de carga dinâmico com sobre-decomposição, foram avaliados os ganhos de desempenho empregando essa técnica a uma aplicação iterativa paralela da área de geofísica (Ondes3D). Adaptive MPI (AMPI) foi utilizado para prover o suporte a balanceamento de carga dinâmico, resultando em ganhos de desempenho de até 36.58% em 288 cores de um cluster. Essa avaliação também é usada para ilustrar as dificuldades encontradas nesse processo, assim justificando o uso de simulação para facilitá-la. Para implementar o workflow SAMPI, foi utilizada a interface SMPI do simulador SimGrid, tanto no modo de emulação, como no de replay de rastros. Para validar esse simulador, foram comparadas execuções simuladas (SAMPI) e reais (AMPI) da aplicação Ondes3D. As simulações apresentaram uma evolução do balanceamento de carga bastante similar às execuções reais. Adicionalmente, SAMPI estimou com sucesso a melhor heurística de balanceamento de carga para os cenários testados. Além dessa validação, nesta tese é demonstrado o uso de SAMPI para exploração de parâmetros de balanceamento de carga e para planejamento de capacidade computacional. Quanto ao desempenho da simulação, estimamos que o workflow completo é capaz de simular a execução do Ondes3D com 24 combinações de parâmetros de balanceamento de carga em 5 horas para o nosso cenário de terremoto mais pesado e 3 horas para o mais leve. / In this thesis we present a novel simulation workflow to evaluate, at low cost, the performance of dynamic load balancing with over-decomposition applied to iterative parallel applications.
Its goals are to perform such evaluation with minimal application modification and at a low cost in terms of time and of resource requirements. Many parallel applications suffer from dynamic (temporal) load imbalance that cannot be treated at the application level. It may be caused by intrinsic characteristics of the application or by external software and hardware factors. As demonstrated in this thesis, such dynamic imbalance can be found even in applications whose codes do not hint at any dynamism. Therefore, we need to rely on runtime dynamic load balancing mechanisms, such as dynamic load balancing based on over-decomposition. The problem is that evaluating and tuning the performance of such a technique can be costly. This usually entails modifications to the application and a large number of executions to get statistically sound performance measurements with different load balancing parameter combinations. Moreover, useful and accurate measurements often require big resource allocations on a production cluster. Our simulation workflow, dubbed Simulated Adaptive MPI (SAMPI), employs a combined sequential emulation and trace-replay simulation approach to reduce the cost of such an evaluation. Both sequential emulation and trace-replay require a single computer node. Additionally, the trace-replay simulation lasts a small fraction of the real-life parallel execution time of the application. Besides the basic SAMPI simulation, we developed spatial aggregation and application-level rescaling techniques to speed up the emulation process. To demonstrate the real-life performance benefits of dynamic load balancing with over-decomposition, we evaluated the performance gains obtained by employing this technique on an iterative parallel geophysics application, called Ondes3D. Dynamic load balancing support was provided by Adaptive MPI (AMPI). This resulted in up to 36.58% performance improvement, on 288 cores of a cluster. This real-life evaluation also illustrates the difficulties found in this process, thus justifying the use of simulation. To implement the SAMPI workflow, we relied on SimGrid’s Simulated MPI (SMPI) interface in both emulation and trace-replay modes. To validate our simulator, we compared simulated (SAMPI) and real-life (AMPI) executions of Ondes3D. The simulations presented a load balance evolution very similar to real life and were also successful in choosing the best load balancing heuristic for each scenario. Besides the validation, we demonstrate the use of SAMPI for load balancing parameter exploration and for computational capacity planning. As for the performance of the simulation itself, we roughly estimate that our full workflow can simulate the execution of Ondes3D with 24 different load balancing parameter combinations in 5 hours for our heavier earthquake scenario and in 3 hours for the lighter one.
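The benefit of over-decomposition for dynamic load balancing can be sketched with a tiny synthetic experiment: imbalanced task loads are assigned to cores either one task per core or as many smaller chunks redistributed greedily. This is an assumed toy model, not SAMPI, AMPI, or SimGrid.

    import numpy as np

    rng = np.random.default_rng(42)
    cores = 8

    def makespan(task_loads, cores):
        # Greedy "largest task first" assignment; returns the maximum core load.
        core_load = np.zeros(cores)
        for load in sorted(task_loads, reverse=True):
            core_load[np.argmin(core_load)] += load
        return core_load.max()

    # One imbalanced domain per core (no over-decomposition): no freedom to migrate.
    base_loads = rng.lognormal(mean=0.0, sigma=1.0, size=cores)
    ideal = base_loads.sum() / cores

    # Over-decomposition: split each domain into 8 chunks that can migrate freely.
    factor = 8
    chunks = np.concatenate([rng.dirichlet(np.ones(factor)) * l for l in base_loads])

    print("ideal (perfect balance):", round(ideal, 3))
    print("one task per core      :", round(base_loads.max(), 3))
    print("over-decomposed, greedy:", round(makespan(chunks, cores), 3))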
|
429 |
[en] MCCLOUD SERVICE FRAMEWORK: DEVELOPMENT SERVICES OF MONTE CARLO SIMULATION IN THE CLOUD / [pt] MCCLOUD SERVICE FRAMEWORK: ARCABOUÇO PARA DESENVOLVIMENTO DE SERVIÇOS BASEADOS NA SIMULAÇÃO DE MONTE CARLO NA CLOUD. RAFAEL BARBOSA NASSER. 13 June 2012 (has links)
[pt] O investimento em infraestrutura computacional para suportar picos de processamento de curta duração ou sazonais pode gerar desperdícios financeiros, em razão de, na maior parte do tempo, estes recursos ficarem ociosos. Além disso, em muitas soluções, o tempo de resposta é crítico para atendimento dos requisitos do negócio, tornando, muitas vezes, a solução economicamente inviável. Neste cenário é fundamental a alocação inteligente de recursos computacionais em função da demanda por processamento, custo desta alocação e requisitos do negócio. A Simulação de Monte Carlo é um método estatístico utilizado para resolver uma ampla gama de problemas científicos e de engenharia. Quando aplicado a problemas reais, muitas vezes apresenta os desafios mencionados. Computação na nuvem surge como uma alternativa para disponibilizar recursos computacionais sob demanda, gerando economia de escala sem precedentes e escalabilidade quase infinita. Ao alinhar uma moderna arquitetura à nuvem é possível encapsular funcionalidades e oferecer um leque de serviços que antes seriam restritos a domínios específicos. Neste trabalho propomos um arcabouço genérico, que permite a disponibilização de um leque de serviços baseados na Simulação de Monte Carlo, fazendo uso racional da elasticidade provida pela nuvem, a fim de alcançar melhores patamares de eficiência e reuso. / [en] The investment in computing infrastructure to support short-lived or seasonal processing peaks can waste money, because most of the time these resources sit idle. In addition, in many solutions the response time is critical to meeting business requirements, often making the solution economically unviable. In this scenario, it is essential to allocate computing resources intelligently, taking into account the processing demand, the cost of this allocation, and the business requirements. Monte Carlo simulation is a statistical method widely used to solve a broad range of scientific and engineering problems, and when applied to real problems it often faces the challenges mentioned above. Cloud computing emerges as an alternative for providing computing resources on demand, with unprecedented economies of scale and nearly unlimited scalability. By aligning a modern architecture with the cloud, it is possible to encapsulate functionality and offer a range of services that would previously have been restricted to specific domains. In this work we propose a generic framework that offers a range of services based on Monte Carlo simulation while making rational use of the elasticity provided by the cloud, in order to achieve better levels of efficiency and reuse.
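A minimal sketch of the kind of service the framework targets: a Monte Carlo estimate split into independent chunks that could be dispatched to however many workers are currently allocated. Here a local process pool stands in for elastic cloud workers, and all names and numbers are assumptions rather than part of the MCCloud framework.

    import numpy as np
    from multiprocessing import Pool

    def mc_chunk(args):
        # One independent Monte Carlo chunk: count random points inside the unit circle.
        seed, n = args
        rng = np.random.default_rng(seed)
        pts = rng.random((n, 2))
        return np.count_nonzero((pts ** 2).sum(axis=1) <= 1.0)

    def run_simulation(total_samples=4_000_000, chunks=32, workers=4):
        per_chunk = total_samples // chunks
        jobs = [(seed, per_chunk) for seed in range(chunks)]
        with Pool(workers) as pool:          # stand-in for elastic cloud workers
            hits = pool.map(mc_chunk, jobs)
        return 4.0 * sum(hits) / (per_chunk * chunks)

    if __name__ == "__main__":
        print("pi estimate:", run_simulation())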
|
430 |
Akcelerace ultrazvukových simulací pro axisymetrické medium / Acceleration of Axisymetric Ultrasound Simulations. Kukliš, Filip. January 2018 (has links)
The simulation of ultrasound propagation through soft biological tissue has a wide range of practical applications. These include the design of transducers for diagnostic and therapeutic ultrasound, the development of new signal processing methods and imaging techniques, the study of ultrasound beam anomalies in heterogeneous media, ultrasonic tissue classification, training radiologists to use ultrasound devices and to interpret ultrasound images, model-based overlaying of medical images, and treatment planning for high-intensity ultrasound. Ultrasound simulation is, however, a computationally demanding problem, because the simulation domains are very large compared to the acoustic wavelengths of interest. If the problem is axisymmetric, though, it can be solved in 2D, which allows simulations to run on grids with more points while using fewer computational resources and less time. This thesis models and implements the acceleration of a nonlinear wave-based ultrasound simulation in an axisymmetric coordinate system, originally realized in Matlab with MEX files for the discrete sine and cosine transforms. The axisymmetric simulation was implemented in C++ as an open-source extension of the K-WAVE toolbox. The code is optimized to run on a single node of the Salomon supercomputer (IT4Innovations, Ostrava, Czech Republic) with two twelve-core Intel Xeon E5-2680v3 processors. Several code optimizations were carried out to maximize computational efficiency. First, the Fourier transforms were computed using real-to-complex FFTs from the FFTW library; compared to complex-to-complex FFTs, this reduced the computation time and the memory associated with the FFTs by almost 50%. The discrete sine and cosine transforms were likewise computed with the FFTW library, which in the Matlab version had to be invoked from dynamically loaded MEX files. Second, to reduce the load on memory bandwidth, all operations were computed in single-precision floating point. Third, element-wise operations were parallelized with OpenMP and then vectorized using SIMD (SSE) extensions. Overall, the C++ version is up to 34 times faster and uses less than a third of the memory of the Matlab version of the simulation. A simulation that would have taken almost two days can thus be computed in an hour and a half. This makes it possible to compute simulations on a grid of 16384 × 8192 points in reasonable time.
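The real-to-complex optimization mentioned above can be illustrated with NumPy (an assumed stand-in; the thesis uses FFTW from C++): for a real input of length N, the real-to-complex transform stores only N/2 + 1 complex coefficients, roughly halving the memory and work of a complex-to-complex transform of the same length.

    import numpy as np

    n = 16_384                      # one grid dimension of the axisymmetric domain
    signal = np.random.default_rng(0).standard_normal(n)

    full = np.fft.fft(signal)       # complex-to-complex: n complex coefficients
    half = np.fft.rfft(signal)      # real-to-complex: n // 2 + 1 coefficients
    print("c2c spectrum bytes:", full.nbytes)
    print("r2c spectrum bytes:", half.nbytes)

    # The two half-spectra of a real signal are redundant (X[n - k] = conj(X[k])),
    # so the inverse real transform recovers the signal from the half spectrum alone.
    recovered = np.fft.irfft(half, n=n)
    print("max reconstruction error:", np.abs(recovered - signal).max())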
|