891
Simulation of Biological Tissue using Mass-Spring-Damper Models. Eriksson, Emil. January 2013.
The goal of this project was to evaluate the viability of a mass-spring-damper based model for modeling biological tissue. A method for automatically generating such a model from data taken from 3D medical imaging equipment is presented, including both the generation of point masses and an algorithm for generating the spring-damper links between these points. Furthermore, an implementation of a simulation of this model running in real time, utilizing the parallel computational power of modern GPU hardware through OpenCL, is described. This implementation uses the fourth-order Runge-Kutta method to improve stability over similar implementations. The difficulty of maintaining stability while still providing rigidness to the simulated tissue is thoroughly discussed. Several observations on the influence of the structure of the model on the consistency of the simulated tissue are also presented. The implementation also includes two manipulation tools, a move tool and a cut tool, for interacting with the simulation. From the results, it is clear that the mass-spring-damper model is a viable model that can be simulated in real time on modern commodity hardware. With further development, this can be of great benefit to areas such as medical visualization and surgical simulation.
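Since the abstract names fourth-order Runge-Kutta integration of a mass-spring-damper model, here is a minimal Python sketch of that update rule for a single point mass on one spring-damper link; all parameter values and the 1D setup are illustrative assumptions, not the thesis's OpenCL implementation.

    import numpy as np

    # Toy 1D mass-spring-damper: one point mass anchored by a spring-damper link.
    # State y = (position, velocity); parameters are made up for illustration.
    m, k, c, rest = 0.01, 50.0, 0.2, 1.0    # mass, stiffness, damping, rest length

    def deriv(y):
        x, v = y
        force = -k * (x - rest) - c * v     # Hooke spring plus viscous damper
        return np.array([v, force / m])

    def rk4_step(y, dt):
        # Classic fourth-order Runge-Kutta step, used above for stability.
        k1 = deriv(y)
        k2 = deriv(y + 0.5 * dt * k1)
        k3 = deriv(y + 0.5 * dt * k2)
        k4 = deriv(y + dt * k3)
        return y + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

    y = np.array([1.5, 0.0])                # stretched spring, initially at rest
    for _ in range(1000):
        y = rk4_step(y, 1e-3)
    print(y)                                # position relaxes toward rest length

In a full tissue model the same step would be applied to every point mass, with forces summed over all incident spring-damper links.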
892
Fast Modeling of Radiation and Conduction Heat Transfer and an Application Example. Ghannam, Boutros. 19 October 2012.
The release of CUDA by NVIDIA in 2007 tremendously increased GPU programmability, allowing scientific and engineering applications to take advantage of the high GPU compute capability. In this work, we present ultra-fast solutions for radiation and diffusion heat transfer on the GPU. First, the Multiple Absorption Coefficient Zonal Method (MACZM) for computing direct radiative exchange factors in 3D semi-transparent media is reviewed and validated. Then, an efficient implementation of MACZM is presented, based on discrete geometry algorithms and an optimized GPU CUDA parallelization. The CUDA implementation achieves 300 to 600 times speed-up. The Non-recursive Plating Algorithm (NRPA), a non-recursive version of the plating algorithm for computing total exchange factors, is then formulated. Thanks to low-complexity matrix multiplication algorithms, the NRPA has lower complexity than the PA and runs up to 750 times faster on the GPU than the PA does on the CPU. In addition, an efficient GPU implementation of the Locally One-Dimensional (LOD) finite-difference splitting method for solving heat diffusion is presented, based on an optimized alternation between parallelization schemes and equation solvers, achieving accelerations of 75 to 250 times. Finally, all the methods are applied together to solve 3D heat transfer in a steel reheating furnace. A multi-grid approach is applied for MACZM and a zone-by-zone computation for the NRPA. As a result, high precision and very fast computation times are achieved, making the methods of high interest for building precise and efficient control units.
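As a companion to the LOD splitting method mentioned above, the following Python sketch advances the 2D heat equation by alternating implicit tridiagonal solves along each axis (Thomas algorithm). Grid size, diffusivity and the fixed Dirichlet boundaries are illustrative assumptions, not the thesis's GPU code.

    import numpy as np

    def thomas(a, b, c, d):
        # Solve a tridiagonal system; a: sub-, b: main-, c: super-diagonal.
        n = len(d)
        cp, dp = np.empty(n), np.empty(n)
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):
            m = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / m
            dp[i] = (d[i] - a[i] * dp[i - 1]) / m
        x = np.empty(n)
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

    def lod_step(u, r):
        # One LOD step for u_t = a*laplace(u), with r = a*dt/dx^2.
        # Boundary rows are identity, keeping Dirichlet values fixed.
        n = u.shape[0]
        a = np.full(n, -r); b = np.full(n, 1 + 2 * r); c = np.full(n, -r)
        a[0] = a[-1] = c[0] = c[-1] = 0.0
        b[0] = b[-1] = 1.0
        for j in range(1, n - 1):       # implicit solve along axis 0
            u[:, j] = thomas(a, b, c, u[:, j])
        for i in range(1, n - 1):       # implicit solve along axis 1
            u[i, :] = thomas(a, b, c, u[i, :])
        return u

    u = np.zeros((65, 65)); u[32, 32] = 1.0   # point heat source
    for _ in range(100):
        u = lod_step(u, r=0.5)
    print(u.max())                            # heat spreads and decays

Each line solve is independent of the others within a sweep, which is what makes the method attractive for GPU parallelization.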
893
3D Video Playback: A modular cross-platform GPU-based approach for flexible multi-view 3D video rendering. Andersson, Håkan. January 2010.
The evolution of depth-perception visualization technologies, emerging format standardization work and research within the field of multi-view 3D video and imagery addresses the need for flexible 3D video visualization. The wide variety of available 3D-display types and visualization techniques for multi-view video, as well as the high throughput requirements of high-definition video, motivates a real-time 3D video playback solution that takes advantage of hardware-accelerated graphics while providing a high degree of flexibility through format configuration and cross-platform interoperability. A modular, component-based software solution is proposed, based on FFmpeg for video demultiplexing and decoding, using OpenGL and GLUT for hardware-accelerated graphics and POSIX threads for increased CPU utilization. The solution has been verified to have sufficient throughput to display 1080p video at the native video frame rate on the experimental system, a standard high-end desktop PC using only commodity hardware. In order to evaluate the performance of the proposed solution, a number of throughput evaluation metrics were introduced, measuring average frame rate as a function of video bit rate, video resolution and number of views. The results indicate that the GPU constitutes the primary bottleneck in a multi-view lenticular rendering system and that multi-view rendering performance degrades as the number of views increases. This is a consequence of current GPU square-matrix texture cache architectures, which reduce texture lookups to random memory access patterns when the number of views is high. The proposed solution was found to have low CPU efficiency, i.e., low CPU hardware utilization, and it is recommended to increase performance by investigating the gains of scalable multithreading techniques. It is also recommended to investigate the gains of buffering video frames in video memory, or of moving more calculations to the CPU, in order to increase GPU performance.
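The decode/render split described above is essentially a producer-consumer pipeline. As a language-neutral illustration (the thesis itself uses POSIX threads with FFmpeg and OpenGL), here is a minimal Python sketch in which a bounded queue decouples a decoder thread from a renderer; frame contents and rates are placeholders.

    import queue, threading, time

    # Toy decode/render pipeline: a bounded queue decouples the decoder thread
    # from the renderer, mirroring the producer-consumer split described above.
    frames = queue.Queue(maxsize=8)          # back-pressure when renderer lags

    def decoder(n_frames=120):
        for i in range(n_frames):
            frame = f"frame-{i}"             # stand-in for a decoded video frame
            frames.put(frame)                # blocks while the queue is full
        frames.put(None)                     # end-of-stream sentinel

    def renderer():
        while True:
            frame = frames.get()
            if frame is None:
                break
            time.sleep(1 / 60)               # stand-in for GPU upload + draw at 60 Hz

    t = threading.Thread(target=decoder)
    t.start()
    renderer()
    t.join()

The bounded queue keeps decode and render rates independent while limiting memory use, which is the same design motivation as the multithreaded pipeline in the thesis.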
894
Calculations with the GPU vs CPU: A Comparative Study of Computational Efficiency in Terms of Energy and Time Consumption. Löfgren, Robin; Dahl, Kristoffer. January 2010.
This thesis is a comparative study of computational efficiency, in terms of energy and time consumption, of graphics cards and processors in personal computers and PlayStation 3s. The problem is studied in order to make the public aware that part of the energy problem of computation can be addressed by increasing the energy efficiency of the computational units. The study was conducted in an exploratory way, examining the relationship between processors and graphics cards and which performs best in which context. Performance tests were carried out with the molecular computation program F@H and the file compression program WinRAR, on multi-core and single-core PCs and PS3s of different characteristics. In some tests, power consumption was measured in order to determine how energy-efficient certain systems are. The results clearly show how the average power consumption and energy efficiency of the various test systems differ under load, at idle and across different types of calculations.
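The comparison above boils down to two numbers per system: consumed energy E = P * t and work done per joule. A minimal Python sketch of that bookkeeping follows; every wattage and runtime below is a made-up placeholder, not a measurement from the thesis.

    # Toy energy-efficiency comparison: work per joule, with E = P * t.
    # All power and runtime figures are invented placeholders.
    systems = {
        "CPU (multi-core PC)": {"watts": 140.0, "seconds": 820.0},
        "GPU (same PC)":       {"watts": 210.0, "seconds": 95.0},
        "PS3 (Cell)":          {"watts": 180.0, "seconds": 160.0},
    }
    workload_units = 1_000.0                 # identical workload for every system

    for name, s in systems.items():
        energy_j = s["watts"] * s["seconds"]          # consumed energy in joules
        print(f"{name:22s} {energy_j / 1000:8.1f} kJ "
              f"{workload_units / energy_j:8.4f} units/J")

Note that the fastest system is not automatically the most energy-efficient: a higher-power device wins only if its speed-up outweighs its extra draw.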
895
Efficient Betweenness Centrality Computations on Hybrid CPU-GPU Systems. Mishra, Ashirbad. January 2016.
Analysis of networks is of broad interest because networks can be interpreted for many purposes. Different features require different metrics to measure and interpret them. Measuring the relative importance of each vertex in a network is one of the most fundamental building blocks in network analysis. Betweenness Centrality (BC) is one such metric, and it plays a key role in many real-world applications. BC is an important graph analytics application for large-scale graphs. However, it is one of the most computationally intensive kernels to execute, and measuring centrality in billion-scale graphs is quite challenging.
While there are several existing efforts towards parallelizing BC algorithms on multi-core CPUs and many-core GPUs, in this work we propose a novel fine-grained CPU-GPU hybrid algorithm that partitions a graph into two partitions, one each for the CPU and the GPU. Our method performs BC computations for the graph on both the CPU and GPU resources simultaneously, resulting in a very small number of CPU-GPU synchronizations and hence less time spent on communication. The BC algorithm consists of two phases, the forward phase and the backward phase. In the forward phase, we initially find the paths that are needed by either partition, after which each partition is executed on its processor in an asynchronous manner. We first compute a border matrix for each partition, which stores the relative distances between each pair of border vertices in the partition. These matrices are used in the forward-phase calculations for all the sources. In this way, our hybrid BC algorithm leverages the multi-source property inherent in the BC problem. We present a proof of correctness and bounds on the number of iterations for each source. We also perform a novel hybrid and asynchronous backward phase, in which each partition communicates with the other only when there is a path that crosses the partition, hence performing minimal CPU-GPU synchronizations.
We use a variety of implementations in our work, such as node-based and edge-based parallelism, including data-driven and topology-based techniques. We show that our method also works with a variable partitioning technique, which partitions the graph into unequal parts accounting for the processing power of each processor; with it, our implementations achieve almost equal utilization on both processors. For large-scale graphs the border matrix itself becomes large, so we present various techniques to accommodate it, exploiting properties inherent in the shortest-path problem for reduction. We discuss the drawbacks of performing shortest-path computations at large scale and provide several solutions to them.
Evaluations using a large number of graphs with different characteristics show that our hybrid approach, without variable partitioning and border matrix reduction, gives a 67% improvement in performance and 64-98.5% fewer CPU-GPU communications than the state-of-the-art hybrid algorithm based on the popular Bulk Synchronous Parallel (BSP) approach implemented in TOTEM. This demonstrates our algorithm's strength in reducing the need for large synchronizations. Adding variable partitioning, border matrix reduction and backward-phase optimizations to our hybrid algorithm provides up to 10x speedup. We compare our optimized implementation with CPU-only and GPU-only codes based on our forward-phase and backward-phase kernels, showing around 2-8x speedup over the CPU-only code while accommodating large graphs that do not fit in the GPU-only code. We also show that our method's performance is competitive with state-of-the-art multi-core CPU implementations and 40-52% better than GPU implementations on large graphs. We discuss the drawbacks of CPU-only and GPU-only implementations and highlight the challenges that graph algorithms face in large-scale computing, suggesting that a hybrid or distributed approach is a better way of overcoming these hurdles.
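For reference, the forward and backward phases named above correspond to the two phases of Brandes' algorithm. Below is a minimal serial, unpartitioned Python sketch of one single-source pass on an unweighted graph; the thesis's contribution is splitting this work across CPU and GPU partitions with border matrices, which the sketch does not attempt.

    from collections import deque

    # One single-source pass of Brandes' betweenness algorithm:
    # forward BFS counts shortest paths, backward pass accumulates dependencies.
    def brandes_pass(adj, s, bc):
        n = len(adj)
        sigma = [0] * n; sigma[s] = 1        # shortest-path counts
        dist = [-1] * n; dist[s] = 0
        preds = [[] for _ in range(n)]
        order = []                           # vertices in BFS visitation order
        q = deque([s])
        while q:                             # forward phase
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = [0.0] * n
        for w in reversed(order):            # backward phase
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]

    adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]   # small undirected graph
    bc = [0.0] * len(adj)
    for s in range(len(adj)):
        brandes_pass(adj, s, bc)
    print(bc)                                # vertex 3 has the highest score

Because every source pass is independent, the per-source loop is the natural unit of multi-source parallelism that the hybrid algorithm exploits.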
896
Solving dense linear systems on accelerated multicore architectures. Rémy, Adrien. 8 July 2015.
In this PhD thesis, we study algorithms and implementations to accelerate the solution of dense linear systems on hybrid architectures combining multicore processors with accelerators. We focus on methods based on the LU factorization, and our code development takes place in the context of the MAGMA library. We study different hybrid CPU/GPU solvers based on the LU factorization which aim at reducing the communication overhead due to pivoting. The first is based on a communication-avoiding pivoting strategy (CALU), while the second uses a random preconditioning of the original system to avoid pivoting altogether (RBT). We show that both of these methods outperform the solver using LU factorization with partial pivoting when implemented on hybrid multicore/GPU architectures. We also present new randomization-based solvers for hybrid architectures with Nvidia GPUs or Intel Xeon Phi coprocessors. With this method, we can avoid the high cost of pivoting while remaining numerically stable in most cases. The highly parallel architecture of these accelerators allows us to perform the randomization of our linear system at a very low computational cost compared to the time of the factorization. Finally, we investigate the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular, we illustrate how an appropriate placement of threads and data on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We show how these placements can improve performance when applied to hybrid multicore/GPU solvers.
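To make the RBT idea concrete, here is a minimal Python sketch of a depth-1 random butterfly transformation followed by LU without pivoting. Real RBT implementations use recursive butterflies of greater depth, so treat this only as an illustration of why the randomized system can usually be factored safely without pivoting.

    import numpy as np

    # Toy depth-1 Random Butterfly Transformation (RBT):
    # B = (1/sqrt(2)) * [[R0, R1], [R0, -R1]] with random diagonal blocks.
    rng = np.random.default_rng(0)

    def butterfly(n):
        r0 = np.diag(np.exp(rng.uniform(-0.05, 0.05, n // 2)))
        r1 = np.diag(np.exp(rng.uniform(-0.05, 0.05, n // 2)))
        return np.block([[r0, r1], [r0, -r1]]) / np.sqrt(2.0)

    n = 8
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    U, V = butterfly(n), butterfly(n)
    Ar = U.T @ A @ V                          # randomized matrix U^T A V

    # Doolittle LU on Ar with NO pivoting: randomization makes this safe
    # with high probability for this toy example.
    L, Uf = np.eye(n), Ar.copy()
    for k in range(n - 1):
        L[k + 1:, k] = Uf[k + 1:, k] / Uf[k, k]
        Uf[k + 1:, k:] -= np.outer(L[k + 1:, k], Uf[k, k:])

    y = np.linalg.solve(L, U.T @ b)           # forward substitution
    z = np.linalg.solve(Uf, y)                # back substitution
    x = V @ z                                 # recover the original unknowns
    print(np.linalg.norm(A @ x - b))          # small residual for Ax = b

Since U^T A V z = U^T b implies A(Vz) = b, the random transforms are applied once up front at low cost, and the expensive factorization then proceeds with no pivot search or row swaps.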
897
Finance and Stochastic Simulation on GPU. Souza, Thársis Tuani Pinto. 26 April 2013.
Given the uncertainty of their variables, it is common to model financial problems with stochastic processes. Furthermore, real problems in this domain have a high computational cost, which suggests the use of High Performance Computing (HPC) to handle them. New generations of graphics hardware (GPU) enable general-purpose computing while maintaining high memory bandwidth and large computing power. Therefore, this type of architecture is an excellent alternative in HPC and computational finance. The main purpose of this work is to study the computational and mathematical tools needed for stochastic modeling in finance using GPUs as an acceleration platform. We present GPUs as a platform for general-purpose computing, then analyze a variety of random number generators, in both sequential and parallel architectures, and introduce the fundamental mathematical tools of stochastic calculus and Monte Carlo simulation. With this background, we present two case studies in finance: "Optimal Trading Stops" and "Market Risk Management". In the first case, we solve the problem of obtaining the optimal gain in a "Stop Gain" stock trading strategy. The proposed solution is scalable and inherently parallel on the GPU. For the second case, we propose a parallel algorithm to compute market risk, as well as techniques for improving the quality of the solutions. In our experiments, there was a 4-fold improvement in the quality of the stochastic simulation and an acceleration of over 50 times.
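Since the "Stop Gain" case study is a natural Monte Carlo exercise, here is a minimal Python sketch that evaluates such a rule on simulated geometric Brownian motion paths; drift, volatility and the 10% threshold are invented parameters, and a GPU version like the thesis's would evaluate the independent paths in parallel.

    import numpy as np

    # Toy Monte Carlo study of a "Stop Gain" rule under geometric Brownian
    # motion: sell when the price first rises by a chosen percentage,
    # otherwise sell at expiry. All parameters are illustrative.
    rng = np.random.default_rng(42)
    s0, mu, sigma = 100.0, 0.08, 0.30        # spot, annual drift, annual volatility
    dt, steps, n_paths = 1 / 252, 252, 100_000
    stop_gain = 1.10                         # sell after a +10% move

    z = rng.standard_normal((n_paths, steps))
    log_paths = np.cumsum((mu - 0.5 * sigma**2) * dt
                          + sigma * np.sqrt(dt) * z, axis=1)
    paths = s0 * np.exp(log_paths)           # daily prices, one row per path

    hit = paths >= s0 * stop_gain            # days on which the stop triggers
    first_hit = np.where(hit.any(axis=1), hit.argmax(axis=1), steps - 1)
    proceeds = paths[np.arange(n_paths), first_hit]
    print(f"mean proceeds per share: {proceeds.mean():.2f}")

The paths are mutually independent, so the simulation is embarrassingly parallel; on a GPU each thread would typically own one path, with the random number generator being the main design question, as the abstract notes.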
898
A parallel iterative solver for large sparse linear systems enhanced with randomization and GPU accelerator, and its resilience to soft errors. Jamal, Aygul. 28 September 2017.
In this PhD thesis, we address three challenges faced by linear algebra solvers in the perspective of future exascale systems: accelerating convergence using innovative techniques at the algorithm level, taking advantage of GPU (Graphics Processing Units) accelerators to enhance the performance of computations on hybrid CPU/GPU systems, and evaluating the impact of errors in the context of an increasing level of parallelism in supercomputers.
We are interested in studying methods that accelerate the convergence and execution time of iterative solvers for large sparse linear systems. The solver specifically considered in this work is the parallel Algebraic Recursive Multilevel Solver (pARMS), a distributed-memory parallel solver based on Krylov subspace methods. First, we integrate a randomization technique referred to as Random Butterfly Transformations (RBT), which has been successfully applied to remove the cost of pivoting in the solution of dense linear systems. Our objective is to apply this method in the ARMS preconditioner to solve the last Schur complement system in the application of the recursive multilevel process in pARMS more efficiently. The experimental results show an improvement in convergence and accuracy. Due to memory concerns for some test problems, we also propose using a sparse variant of RBT followed by a sparse direct solver (SuperLU), resulting in an improvement in execution time. Then we explain how a non-intrusive approach can be applied to implement GPU computing in the pARMS solver, especially for the local preconditioning phase, which represents a significant part of the time needed to compute the solution. We compare the CPU-only and hybrid CPU/GPU variants of the solver on several test problems coming from physical applications. The performance results of the hybrid CPU/GPU solver using ARMS preconditioning combined with RBT, or ILU(0) preconditioning, show a performance gain of up to 30% on the test problems considered in our experiments. Finally, we study the effect of soft fault errors on the convergence of the commonly used flexible GMRES (FGMRES) algorithm, which is also used to solve the preconditioned system in pARMS. The test problem in our experiments is an elliptic PDE problem on a regular grid. We consider two types of preconditioners: an incomplete LU factorization with dual threshold (ILUT), and the ARMS preconditioner combined with RBT randomization. We consider two soft-fault error modeling approaches, in which we perturb the matrix-vector multiplication and the application of the preconditioner, and we compare their potential impact on the convergence of the solver.
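To illustrate the perturbed matrix-vector product idea in the last paragraph, the following Python sketch flips a single bit in A@x at one iteration of a simple stationary solve (Jacobi stands in here for preconditioned FGMRES) and compares final residuals; the system, fault site, iteration and bit index are all arbitrary assumptions.

    import numpy as np

    # Toy soft-fault experiment: flip one bit of the matrix-vector product at a
    # single iteration and observe the residual of the solve.
    rng = np.random.default_rng(1)
    n = 100
    A = 4.0 * np.eye(n) + rng.uniform(-1, 1, (n, n)) / n   # diagonally dominant
    b = rng.standard_normal(n)
    d = np.diag(A)

    def jacobi(fault_iter=None, fault_bit=60, iters=80):
        x = np.zeros(n)
        for k in range(iters):
            y = A @ x
            if k == fault_iter:                  # inject a single bit flip
                u = y.view(np.uint64)            # reinterpret float64 bits
                u[n // 2] ^= np.uint64(1) << np.uint64(fault_bit)
            x = x + (b - y) / d                  # Jacobi update
        return np.linalg.norm(b - A @ x)

    print("fault-free residual:", jacobi())
    print("faulty residual:    ", jacobi(fault_iter=20))

A stationary iteration like Jacobi eventually contracts away a transient perturbation; the interest of the thesis's study is that Krylov methods such as FGMRES keep state (the Krylov basis), so a single corrupted matvec can have a much longer-lived effect on convergence.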
899
Multi-Person Tracking in Video from a Mono-Camera. Vojvoda, Jakub. January 2016.
Multi-person detection and tracking is a challenging problem with high application potential. Its difficulty is caused mainly by the complexity of the scene and by large variations in the articulation and appearance of people. The aim of this work is to design and implement a system capable of detecting and tracking people in video from a static mono-camera. For this purpose, an online tracking method based on the tracking-by-detection approach has been proposed. The method combines detection, tracking and fusion of responses to achieve accurate results. The implementation was evaluated on an available dataset, and the results show that it is suitable for this task. A motion segmentation method was proposed and implemented to improve the tracking results. Furthermore, the implementation of a detector based on histograms of oriented gradients was accelerated on the graphics processing unit (GPU).
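A core step of any tracking-by-detection pipeline is associating new detections with existing tracks. Here is a minimal Python sketch using greedy intersection-over-union (IoU) matching; the box format and 0.3 threshold are illustrative assumptions, and the thesis additionally fuses detector and tracker responses rather than relying on geometry alone.

    # Toy tracking-by-detection association step: greedily match detections to
    # tracks by IoU; unmatched detections start new tracks.
    def iou(a, b):
        # IoU of two boxes given as (x1, y1, x2, y2).
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    def associate(tracks, detections, thresh=0.3):
        # Greedy one-to-one matching in decreasing order of overlap.
        pairs = sorted(((iou(t, d), ti, di) for ti, t in enumerate(tracks)
                        for di, d in enumerate(detections)), reverse=True)
        used_t, used_d, matches = set(), set(), []
        for score, ti, di in pairs:
            if score >= thresh and ti not in used_t and di not in used_d:
                matches.append((ti, di)); used_t.add(ti); used_d.add(di)
        new_tracks = [d for di, d in enumerate(detections) if di not in used_d]
        return matches, new_tracks

    tracks = [(10, 10, 50, 80), (200, 40, 240, 110)]
    detections = [(12, 14, 52, 84), (400, 50, 440, 120)]
    print(associate(tracks, detections))   # first detection matches track 0

In a full system, matched tracks are updated with their detections, unmatched tracks coast or die after a few frames, and unmatched detections spawn new tracks.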
900
Efficient Parallel Monte-Carlo Simulations for Large-Scale Studies of Surface Growth Processes. Kelling, Jeffrey. 21 August 2018.
Lattice Monte Carlo methods are used to investigate far-from-equilibrium and out-of-equilibrium systems, including surface growth, spin systems and solid mixtures. Applications range from the determination of universal growth or aging behaviors to palpable systems, where the coarsening of nanocomposites or the self-organization of functional nanostructures is of interest. Such studies require observing large systems over long time scales to allow structures to grow over orders of magnitude, which necessitates massively parallel simulations.
This work addresses the problem that parallel processing introduces correlations in Monte Carlo updates, and proposes a virtually correlation-free domain decomposition scheme to solve it. The effect of correlations on the scaling and dynamical properties of surface growth systems and related lattice gases is investigated further by comparing results obtained from correlation-free and intrinsically correlated but highly efficient simulations using a stochastic cellular automaton (SCA). Efficient massively parallel implementations on graphics processing units (GPUs) were developed, enabling large-scale simulations with unprecedented precision in the final results.
The primary subject of study is Kardar–Parisi–Zhang (KPZ) surface growth in (2 + 1) dimensions, which is simulated using a dimer lattice gas and the restricted solid-on-solid (RSOS) model. Using extensive simulations, conjectures regarding growth, autocorrelation and autoresponse properties are tested, and new precise numerical predictions for several universal parameters are made. (A toy sketch of the domain decomposition idea follows the table of contents below.)

Contents:

1. Introduction
1.1. Motivations and Goals
1.2. Overview
2. Methods and Models
2.1. Estimation of Scaling Exponents and Error Margins
2.2. From Continuum- to Atomistic Models
2.3. Models for Phase Ordering and Nanostructure Evolution
2.3.1. The Kinetic Metropolis Lattice Monte-Carlo Method
2.3.2. The Potts Model
2.4. The Kardar–Parisi–Zhang and Edwards–Wilkinson Universality Classes
2.4.0.1. Physical Aging
2.4.1. The Octahedron Model
2.4.2. The Restricted Solid on Solid Model
3. Parallel Implementation: Towards Large-Scale Simulations
3.1. Parallel Architectures and Programming Models
3.1.1. CPU
3.1.2. GPU
3.1.3. Heterogeneous Parallelism and MPI
3.1.4. Bit-Coding of Lattice Sites
3.2. Domain Decomposition for Stochastic Lattice Models
3.2.1. DD for Asynchronous Updates
3.2.1.1. Dead border (DB)
3.2.1.2. Double tiling (DT)
3.2.1.3. DT DD with random origin (DTr)
3.2.1.4. Implementation
3.2.2. Second DD Layer on GPUs
3.2.2.1. Single-Hit DT
3.2.2.2. Single-Hit dead border (DB)
3.2.2.3. DD Parameters for the Octahedron Model
3.2.3. Performance
3.3. Lattice Level DD: Stochastic Cellular Automaton
3.3.1. Local Approach for the Octahedron Model
3.3.2. Non-Local Approach for the Octahedron Model
3.3.2.1. Bit-Vectorized GPU Implementation
3.3.3. Performance of SCA Implementations
3.4. The Multi-Surface Coding Approach
3.4.0.1. Vectorization
3.4.0.2. Scalar Updates
3.4.0.3. Domain Decomposition
3.4.1. Implementation: SkyMC
3.4.1.1. 2d Restricted Solid on Solid Model
3.4.1.2. 2d and 3d Potts Model
3.4.1.3. Sequential CPU Reference
3.4.2. SkyMC Benchmarks
3.5. Measurements
3.5.0.1. Measurement Intervals
3.5.0.2. Measuring using Heterogeneous Resources
4. Monte-Carlo Investigation of the Kardar–Parisi–Zhang Universality Class
4.1. Evolution of Surface Roughness
4.1.1. Comparison of Parallel Implementations of the Octahedron Model
4.1.1.1. The Growth Regime
4.1.1.2. Distribution of Interface Heights in the Growth Regime
4.1.1.3. KPZ Ansatz for the Growth Regime
4.1.1.4. The Steady State
4.1.2. Investigations using RSOS
4.1.2.1. The Growth Regime
4.1.2.2. The Steady State
4.1.2.3. Consistency of Fine-Size Scaling with Respect to DD
4.1.3. Results for Growth Phase and Steady State
4.2. Autocorrelation Functions
4.2.1. Comparison of DD Methods for RS Dynamics
4.2.1.1. Device-Layer DD
4.2.1.2. Block-Layer DD
4.2.2. Autocorrelation Properties under RS Dynamics
4.2.3. Autocorrelation Properties under SCA Dynamics
4.2.3.1. Autocorrelation of Heights
4.2.3.2. Autocorrelation of Slopes
4.2.4. Autocorrelation in the SCA Steady State
4.2.5. Autocorrelation in the EW Case under SCA
4.2.5.1. Autocorrelation of Heights
4.2.5.2. Autocorrelations of Slopes
4.3. Autoresponse Functions
4.3.1. Autoresponse Properties
4.3.1.1. Autoresponse of Heights
4.3.1.2. Autoresponse of Slopes
4.3.1.3. Self-Averaging
4.4. Summary
5. Further Topics
5.1. Investigations of the Potts Model
5.1.1. Testing Results from the Parallel Implementations
5.1.2. Domain Growth in Disordered Potts Models
5.2. Local Scale Invariance in KPZ Surface Growth
6. Conclusions and Outlook
Acknowledgements
A. Coding Details
A.1. Bit-Coding
A.2. Packing and Unpacking Signed Integers
A.3. Random Number Generation
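As promised above, here is a minimal Python sketch of a double-tiling style of domain decomposition for lattice Monte Carlo: each sweep updates only one randomly chosen sub-quadrant of every tile, so sites updated concurrently in different tiles can never be nearest neighbours. The lattice, tile size and toy Metropolis flip rule are invented stand-ins, not the thesis's octahedron model, and the tile loop that a GPU would run in parallel is serial here.

    import numpy as np

    # Toy double-tiling domain decomposition for parallel lattice Monte-Carlo.
    rng = np.random.default_rng(7)
    L, T = 32, 8                       # lattice edge, tile edge (T divides L)
    spins = rng.integers(0, 2, (L, L)) # toy binary lattice gas

    def sweep(spins, beta=0.5):
        # Pick one sub-quadrant offset, shared by all tiles this sweep; sites
        # in different tiles are then separated by at least T/2 > 1 cells.
        ox, oy = rng.integers(0, 2, 2) * (T // 2)
        for tx in range(0, L, T):            # tiles are mutually independent,
            for ty in range(0, L, T):        # so a GPU could run them in parallel
                for _ in range(T * T // 4):  # random updates inside the quadrant
                    x = tx + ox + rng.integers(0, T // 2)
                    y = ty + oy + rng.integers(0, T // 2)
                    nb = (spins[(x + 1) % L, y] + spins[x - 1, y]
                          + spins[x, (y + 1) % L] + spins[x, y - 1])
                    # Toy Metropolis flip favouring alignment with neighbours.
                    dE = (1 - 2 * spins[x, y]) * (2 - nb)
                    if dE <= 0 or rng.random() < np.exp(-beta * dE):
                        spins[x, y] ^= 1

    for _ in range(100):
        sweep(spins)
    print(spins.mean())

Because the updated region of every tile changes from sweep to sweep, no lattice site is permanently frozen at a domain border, which is the kind of bias the dead-border and double-tiling schemes in the thesis are designed to control.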