  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
191

Paralelização de um modelo global de previsão do tempo em malhas localmente refinadas / Parallelization of a numerical weather prediction global model with local refinement grids

Nelson Leonardo Vidaurre Navarrete 31 October 2014
The main goal of this work is the parallelization of a numerical weather prediction model employing finite differences on locally refined meshes. The model is based on the primitive equations and uses a three-time-level semi-implicit semi-Lagrangian temporal discretization on a Lorenz-type vertical grid combined with a horizontal Arakawa C-grid. The horizontal discretization is performed by means of second-order finite differences. The resulting three-dimensional scalar elliptic equation is decoupled into a set of two-dimensional Helmholtz-type equations, which are solved by a multigrid method. The parallelization targets distributed-memory machines, employs the MPI message-passing standard, and is based on domain decomposition techniques. The local coupling of the finite difference operators is exploited in a two-dimensional horizontal decomposition. A vertical decomposition is avoided because of the strong coupling of the physical parameterization routines in that direction. The parallelization strategy was designed to allow the efficient use of hundreds to a few thousand processors, depending on the model resolution. To achieve this, the locally refined mesh is split into three regions: a coarse, a transition, and a fine one, each decomposed independently. The number of processors allocated to each region is proportional to the number of grid points it contains, which guarantees a good load balance. To solve the set of two-dimensional Helmholtz-type equations, however, it was necessary to change the parallelization strategy, splitting the domain only in the vertical and latitudinal directions. The two parts of the model with different parallelizations are connected by means of a data transposition strategy. We tested the model using up to 1024 processors, and the results still showed good scalability.
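The proportional processor allocation described in the abstract can be sketched in a few lines. This is a hypothetical illustration only: the region names and point counts are invented, and the MPI machinery is omitted.

```python
# Sketch of the load-balancing idea: processors are allocated to the coarse,
# transition and fine regions in proportion to the number of grid points each
# region stores. Region names and counts below are illustrative, not taken
# from the thesis.

def allocate_processors(region_points, total_procs):
    """Split total_procs among regions proportionally to their point counts,
    handing rounding leftovers to the largest fractional remainders."""
    total_points = sum(region_points.values())
    shares = {r: n * total_procs / total_points for r, n in region_points.items()}
    alloc = {r: int(s) for r, s in shares.items()}
    leftover = total_procs - sum(alloc.values())
    # give remaining processors to regions with the largest fractional part
    for r in sorted(shares, key=lambda r: shares[r] - alloc[r], reverse=True)[:leftover]:
        alloc[r] += 1
    return alloc

regions = {"coarse": 600_000, "transition": 150_000, "fine": 250_000}
print(allocate_processors(regions, 1024))
# -> {'coarse': 614, 'transition': 154, 'fine': 256}
```

The same largest-remainder rule works for any number of regions and guarantees that every processor is assigned.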
192

Uma arquitetura sistólica para solução de sistemas lineares implementada com circuitos FPGAs. / A systolic architecture to solving linear systems implemented with FPGAs devices.

Antônio Carlos de Oliveira Souza Aragão 17 December 1998
This dissertation presents the design of a dedicated parallel machine for solving systems of linear equations. This problem appears in a great variety of scientific and engineering applications, and its solution becomes a computationally intensive task as the number of unknowns grows. A one-dimensional systolic architecture connected in a ring topology was implemented, mapping iterative solution methods. This class of parallel architectures offers simplicity, regularity, and modularity, which facilitate hardware implementation; such architectures are widely used in computing systems dedicated to specific problems characterized by high computational demand and real-time response requirements. Advanced hardware design methodologies and tools were adopted to accelerate the development cycle, and the architecture was implemented and verified on FPGAs (Field Programmable Gate Arrays). The performance results are presented and discussed, indicating the configuration of the architecture that achieves a speedup over implementations on sequential machines; the advantages and disadvantages of this kind of approach for problems with timing requirements are also examined.
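To make the class of computation concrete, the sketch below shows a classic Jacobi iteration, a representative of the iterative linear-system solvers the abstract refers to; the dissertation's exact method may differ, and the systolic version would pipeline these updates around the processor ring rather than loop sequentially.

```python
# A minimal Jacobi iteration for A x = b. Each component update is
# independent of the others within a sweep, which is what makes the method
# amenable to a systolic (ring-pipelined) hardware mapping.

def jacobi(A, b, iterations=100):
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = [0.0] * n
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - s) / A[i][i]   # solve row i for x_i
        x = x_new
    return x

# diagonally dominant 3x3 system with exact solution [1, 1, 1]
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x = jacobi(A, b)
print([round(v, 6) for v in x])   # -> [1.0, 1.0, 1.0]
```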
193

Non-Equilibrium Many-Body Influence on Mode-Locked Vertical External-Cavity Surface-Emitting Lasers

Kilen, Isak Ragnvald January 2017
Vertical external-cavity surface-emitting lasers are ideal testbeds for studying the influence of non-equilibrium many-body dynamics on mode locking. As we show in this thesis, ultrashort pulse generation involves a marked departure from the Fermi carrier distributions assumed in prior theoretical studies. A quantitative model of the mode-locking dynamics is presented, in which the semiconductor Bloch equations are coupled with Maxwell's equations in order to study the influence of quantum-well carrier scattering on the mode-locking dynamics. This is the first work in which the full model is solved without adiabatically eliminating the microscopic polarizations. In many instances we find that higher-order correlation contributions (e.g., polarization dephasing, carrier scattering, and screening) can be represented by rate models, with the effective rates extracted at the level of second Born-Markov approximations. In other circumstances, such as continuous-wave multi-wavelength lasing, we are forced to fully include these higher-order correlation terms. In this thesis we identify the key contributors that control the mode-locking dynamics, the stability of single-pulse mode locking, and the influence of higher-order correlations in sustaining multi-wavelength continuous-wave operation.
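The "effective rate" idea mentioned above can be illustrated in its simplest form: instead of evaluating full scattering integrals, the carrier distribution relaxes toward a quasi-equilibrium Fermi function at an extracted rate 1/tau. The numbers below (energies, temperature, rate) are purely illustrative and not taken from the thesis.

```python
# Relaxation-rate sketch: each occupation n_k decays toward its Fermi target
# f_k as dn/dt = (f_k - n_k) / tau, integrated with forward Euler.

import math

def relax(n, f, tau, dt, steps):
    for _ in range(steps):
        n = [nk + dt * (fk - nk) / tau for nk, fk in zip(n, f)]
    return n

# illustrative Fermi targets at chemical potential 1.0 and "temperature" 0.1
f = [1 / (1 + math.exp((e - 1.0) / 0.1)) for e in [0.8, 0.9, 1.0, 1.1, 1.2]]
n = [0.5] * 5                    # flat (strongly non-equilibrium) start
n = relax(n, f, tau=1.0, dt=0.01, steps=2000)
print([round(v, 3) for v in n])  # -> [0.881, 0.731, 0.5, 0.269, 0.119]
```

After 2000 steps the initial deviation has decayed by roughly e^-20, so the distribution is indistinguishable from the Fermi target, which is exactly the regime where a rate model replaces the full scattering terms.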
194

Random Forests for CUDA GPUs

Lapajne, Mikael Hellborg, Slat, Daniel January 2010
Context. Machine learning is a complex and resource-consuming process that requires a lot of computing power. With the constant growth of information, the need for efficient, high-performance algorithms is increasing. Today's commodity graphics cards are parallel multiprocessors offering high computing capacity at an attractive price and are usually pre-installed in new PCs; they provide an additional resource for machine learning applications. The Random Forest learning algorithm, which has been shown to be competitive within machine learning, has good potential for performance gains through parallelization. Objectives. In this study we implement and review a revised Random Forest algorithm for GPU execution using CUDA. Methods. A review of previous work in the area was done by studying articles from several sources, including Compendex, Inspec, IEEE Xplore, ACM Digital Library and SpringerLink. Additional information regarding GPU architecture and implementation-specific details was obtained mainly from documentation available from Nvidia and the Nvidia developer forums. The implemented algorithm was benchmarked against two state-of-the-art CPU implementations of the Random Forest algorithm, with respect to both the time consumed for training and classification and the classification accuracy. Results. Measurements from benchmarks of the three algorithms are gathered, showing their performance on two publicly available data sets. Conclusion. We conclude that our implementation can, under the right conditions, outperform its competitors, but that this holds only for certain data sets, depending on their size. Moreover, there is potential for further improvement of the algorithm, both in performance and in adaptation to a wider range of real-world applications.
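The reason Random Forests parallelize well is that each tree is trained on an independent bootstrap sample. The toy sketch below illustrates that structure with one-threshold "stumps" on a 1-D data set and a thread pool standing in for GPU thread blocks; it is not the thesis's CUDA algorithm.

```python
# Toy Random Forest: 25 decision stumps trained concurrently on independent
# bootstrap samples, combined by majority vote. Everything here (data, stump
# model, pool) is an invented stand-in for illustration.

import random
from concurrent.futures import ThreadPoolExecutor

def train_stump(data, seed):
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]          # bootstrap resample
    best_t, best_err = 0.0, len(sample) + 1
    for t in sorted({x for x, _ in sample}):           # candidate thresholds
        err = sum((x >= t) != y for x, y in sample)    # misclassifications
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def forest_predict(thresholds, x):
    votes = sum(x >= t for t in thresholds)
    return votes * 2 > len(thresholds)                 # majority vote

data = [(i / 10, i >= 5) for i in range(10)]           # true rule: x >= 0.5
with ThreadPoolExecutor() as pool:
    stumps = list(pool.map(lambda s: train_stump(data, s), range(25)))
print(forest_predict(stumps, 0.9), forest_predict(stumps, 0.1))
```

Because the trees never communicate during training, the outer map is embarrassingly parallel, which is the property the GPU implementation exploits.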
195

A Skeleton Programming Library for Multicore CPU and Multi-GPU Systems

Enmyren, Johan January 2010
This report presents SkePU, a C++ template library which provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is general enough to support other architectures as well: SkePU implements both a sequential CPU and a parallel OpenMP back end, and it also supports multi-GPU systems. Benchmarks show that copying data between the host and the GPU is often a bottleneck; therefore a container which uses lazy memory copying has been implemented to avoid unnecessary memory transfers. SkePU was evaluated with small benchmarks and a larger application, a Runge-Kutta ODE solver. The results show that skeletal parallel programming is a viable approach for GPU computing and that a generalized interface for multiple back ends is reasonable. The best performance gains are obtained when the computational load is large compared to the memory I/O (which the lazy memory copying can help to achieve). SkePU offers good performance on a more complex and realistic task such as ODE solving, with run times up to ten times faster when using SkePU with a GPU back end than with a sequential solver running on a fast CPU. SkePU does, however, have some disadvantages: the dot product and LibSolve benchmarks reveal some overhead in using the library. Although it is not large, it is still there, and if performance is of utmost importance, a hand-coded solution would be best. Nor can all calculations be expressed in terms of skeletons; for such problems, specialized routines must still be written.
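The skeleton idea is that the user supplies one scalar function and the library chooses how to execute it. The sketch below shows that separation with a map skeleton in Python; SkePU's real interface is C++ templates and differs in detail, and the "parallel" back end here is just a thread pool standing in for OpenMP/CUDA/OpenCL.

```python
# Minimal map skeleton with pluggable back ends. The user code (the lambda)
# never changes when the back end does -- the property SkePU provides.

from concurrent.futures import ThreadPoolExecutor

class Map:
    def __init__(self, fn, backend="sequential"):
        self.fn, self.backend = fn, backend

    def __call__(self, xs):
        if self.backend == "parallel":
            with ThreadPoolExecutor() as pool:   # stand-in for GPU back ends
                return list(pool.map(self.fn, xs))
        return [self.fn(x) for x in xs]          # sequential CPU back end

square = Map(lambda x: x * x)
print(square([1, 2, 3, 4]))                      # -> [1, 4, 9, 16]
```

Either back end yields identical results; only the execution strategy changes, which is why a generalized multi-back-end interface is workable.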
196

Simulating Partial Differential Equations using the Explicit Parallelism of ParModelica

Thorslund, Gustaf January 2015
The Modelica language is a modelling and programming language for modelling cyber-physical systems using equations and algorithms. This thesis covers two suggested extensions of the Modelica language: partial differential equations (PDEs) and explicit parallelism in algorithmic code. While PDEs are not yet supported by the Modelica language, this thesis presents a framework for solving them using the algorithmic part of the language, including its parallel extensions. Different numerical solvers have been implemented using the explicit parallel constructs suggested for Modelica by the ParModelica language extensions and implemented as part of OpenModelica. The solvers have been evaluated on different models, and the results show that larger models benefit most from the parallel solver. The intention has been to write a framework suitable for modelling and parallel simulation of PDEs; this work can, however, also be seen as a case study of how to write a custom solver using parallel algorithmic Modelica and how to evaluate the performance of a parallel solver.
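The kind of PDE kernel such a framework expresses can be sketched with an explicit finite-difference step for the 1-D heat equation u_t = a·u_xx. This Python stand-in (ParModelica syntax is not shown, and all numbers are illustrative) makes the key point visible: every interior grid point updates independently, so the inner loop is a natural target for explicit parallel constructs.

```python
# One explicit Euler step of the 1-D heat equation on a uniform grid.
# Boundary values are held fixed (Dirichlet conditions).

def heat_step(u, alpha, dx, dt):
    r = alpha * dt / dx**2          # must satisfy r <= 0.5 for stability
    return [u[0]] + [u[i] + r * (u[i-1] - 2*u[i] + u[i+1])
                     for i in range(1, len(u) - 1)] + [u[-1]]

u = [0.0] * 5 + [1.0] + [0.0] * 5   # hot spot in the middle of 11 points
for _ in range(200):
    u = heat_step(u, alpha=1.0, dx=1.0, dt=0.4)
print(round(sum(u), 6))             # heat leaks out through the fixed ends
```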
197

Strain and defects in irradiated materials : a study using X-ray diffraction and diffuse scattering / Défauts et déformations au sein de matériaux irradiés : Etude par diffraction et diffusion diffuse des rayons X

Channagiri, Jayanth 04 December 2015
Ion beams are commonly used in the study of nuclear materials in order to reproduce, in a controlled way, the different sources of irradiation that these materials are subjected to. The interaction of ions with the material induces the formation of crystalline defects along the path of the ions, associated with high strains in the irradiated region. One of the main issues of the electro-nuclear industry is the long-term encapsulation of nuclear waste. Yttria-stabilized zirconia (YSZ) is one of the materials that could be used as an inert matrix for the transmutation of actinides; understanding its behaviour under different irradiation conditions is therefore of utmost importance. This thesis is divided into two distinct parts. In the first part of this work, we used advanced X-ray diffraction (XRD) techniques to characterize the strain and damage levels within the irradiated region of the crystals. The strain and damage profiles were modelled using cubic B-spline functions, and the XRD data were simulated using the dynamical theory of diffraction combined with a generalized simulated annealing algorithm. This approach was applied to YSZ single crystals irradiated with Au2+ ions over a wide range of temperatures and fluences, and the results were compared with Rutherford backscattering spectrometry in channeling mode (RBS/C) results obtained for the same samples. The second part of the thesis is devoted to the development of a specific model for calculating the two-dimensional XRD intensity from irradiated single crystals with realistic dimensions and defect distributions. To achieve this goal, we implemented a high-performance parallel computing approach (both multi-core and GPU-based) to accelerate the calculations. This approach was used to successfully model the reciprocal space maps of YSZ single crystals exhibiting a complex defect structure.
198

Enforcing Security Policies On GPU Computing Through The Use Of Aspect-Oriented Programming Techniques

Albassam, Bader 29 June 2016
This thesis presents a new security policy enforcer designed for securing parallel computation on CUDA GPUs. We show how the very features that make a GPGPU desirable have already been utilized in existing exploits, fortifying the need for security protections on a GPGPU. An aspect weaver was designed for CUDA with the goal of utilizing aspect-oriented programming for security policy enforcement. Empirical testing verified the ability of our aspect weaver to enforce various policies. Furthermore, a performance analysis was performed to demonstrate that using this policy enforcer provides no significant performance impact over manual insertion of policy code. Finally, future research goals are presented through a plan of work. We hope that this thesis will provide for long term research goals to guide the field of GPU security.
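The aspect-weaving idea can be sketched with a Python decorator standing in for the CUDA aspect weaver: the policy check is woven around every kernel launch instead of being inserted by hand at each call site. The policy and kernel below are invented examples, not the thesis's enforcer.

```python
# A decorator as a miniature aspect weaver: `enforce` weaves `policy` as
# before-advice around the join point (the kernel launch).

def enforce(policy):
    def weave(kernel):
        def woven(*args, **kwargs):
            policy(*args, **kwargs)      # advice runs before the kernel
            return kernel(*args, **kwargs)
        return woven
    return weave

def max_buffer_policy(buf, n):
    # hypothetical security policy: reject launches that overrun the buffer
    if n > len(buf):
        raise ValueError("launch exceeds buffer bounds")

@enforce(max_buffer_policy)
def scale_kernel(buf, n):                # stand-in for a CUDA kernel launch
    return [2 * v for v in buf[:n]]

print(scale_kernel([1, 2, 3], 2))        # -> [2, 4]
```

The point of the weaver is exactly this separation: policy code is written once and applied uniformly, so it cannot be forgotten at a call site, and (as the thesis measures) it costs no more than manually inserted checks.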
199

Techniques for algorithm design on the instruction systolic array

Schmidt, Bertil January 1999
Instruction systolic arrays (ISAs) provide programmable high-performance hardware for specific computationally intensive applications. Typically, such an array is connected to a sequential host and thus operates like a coprocessor, solving only the computationally intensive tasks within a global application. The ISA model is a mesh-connected processor grid which combines the advantages of special-purpose systolic arrays with the flexible programmability of general-purpose machines. The subject of this thesis is the analysis, design, and implementation on the ISA of several special-purpose algorithms and subroutines that take advantage of the special features of the systolic information flow. The ability of ISAs to perform parallel prefix computations extremely efficiently is exploited as a key operation, alongside local operations within each processor. A given sequential algorithm therefore has to be decomposed into simple building blocks of parallel prefix computations and parallel local operations. Several techniques for adapting sequential algorithms to this form are introduced in this thesis, e.g. swapping of loops in the sequential algorithm, shearing of data, and appropriate mapping of the input data onto the processor array. It is demonstrated how these techniques can be exploited to derive efficient ISA algorithms for several computationally intensive applications. These include cryptographic applications (e.g. arithmetic operations on long operands, RSA encryption, RSA key generation) and image processing applications (e.g. convolution, wavelet transform, morphological operators, median filter, Fourier transform, Hough transform, morphological Hough transform, and tomographic image reconstruction). Their implementation on Systola 1024, the first commercial parallel computer with the ISA architecture, shows that the ISA concept is very well suited to these applications and yields significant run-time savings.
The results of this thesis emphasize the suitability of the ISA concept as an accelerator for computationally intensive applications in the areas of cryptography and image processing, and might lead research towards further high-speed, low-cost systems based on ISA hardware.
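The key building block named above, the parallel prefix computation (scan), can be sketched with the classic Hillis-Steele formulation: O(log n) data-parallel passes in which every element updates independently, which is what maps naturally onto a mesh of simple processors.

```python
# Inclusive prefix sums via the Hillis-Steele scan. Each iteration of the
# outer while-loop is one fully parallel step; the sequential list
# comprehension stands in for the simultaneous per-processor updates.

def prefix_sums(xs):
    xs = list(xs)
    step = 1
    while step < len(xs):
        # all updates in this pass are independent -> one systolic step
        xs = [xs[i] + (xs[i - step] if i >= step else 0) for i in range(len(xs))]
        step *= 2
    return xs

print(prefix_sums([1, 2, 3, 4, 5]))   # -> [1, 3, 6, 10, 15]
```

Replacing `+` with any associative operator (max, carry propagation, matrix product) gives the other scan instances the decomposition technique relies on.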
200

Robust and scalable hierarchical matrix-based fast direct solver and preconditioner for the numerical solution of elliptic partial differential equations

Chavez, Gustavo Ivan 10 July 2017
This dissertation introduces a novel fast direct solver and preconditioner for the solution of block tridiagonal linear systems that arise from the discretization of elliptic partial differential equations on a Cartesian product mesh, such as the variable-coefficient Poisson equation, the convection-diffusion equation, and the wave Helmholtz equation in heterogeneous media. The algorithm extends the traditional cyclic reduction method with hierarchical matrix techniques. The resulting method exposes substantial concurrency, and its arithmetic operations and memory consumption grow only log-linearly with problem size, assuming bounded rank of off-diagonal matrix blocks, even for problems with arbitrary coefficient structure. The method can be used as a standalone direct solver with tunable accuracy, or as a black-box preconditioner in conjunction with Krylov methods. The challenges that distinguish this work from other thrusts in this active field are the hybrid distributed-shared parallelism that can demonstrate the algorithm at large-scale, full three-dimensionality, and the three stressors of the current state-of-the-art multigrid technology: high wavenumber Helmholtz (indefiniteness), high Reynolds convection (nonsymmetry), and high contrast diffusion (inhomogeneity). Numerical experiments corroborate the robustness, accuracy, and complexity claims and provide a baseline of the performance and memory footprint by comparisons with competing approaches such as the multigrid solver hypre, and the STRUMPACK implementation of the multifrontal factorization with hierarchically semi-separable matrices. The companion implementation can utilize many thousands of cores of Shaheen, KAUST's Haswell-based Cray XC-40 supercomputer, and compares favorably with other implementations of hierarchical solvers in terms of time-to-solution and memory consumption.
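The cyclic reduction backbone of the method can be illustrated on a plain scalar tridiagonal system; the dissertation's algorithm works on *block* tridiagonal systems and compresses the blocks with hierarchical matrices, which this sketch omits. Each reduction level eliminates the odd-indexed unknowns, halving the system, and the levels expose the concurrency mentioned above.

```python
# Scalar cyclic reduction for a tridiagonal system. a: sub-diagonal,
# b: main diagonal, c: super-diagonal (a[0] and c[-1] are forced to zero).

def cyclic_reduction(a, b, c, d):
    a = [0.0] + list(a[1:])
    c = list(c[:-1]) + [0.0]
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    get = lambda v, j: v[j] if 0 <= j < n else 0.0
    A, B, C, D = [], [], [], []
    for i in range(0, n, 2):      # eliminate the odd-indexed unknowns
        al = a[i] / b[i-1] if i > 0 else 0.0
        ga = c[i] / b[i+1] if i < n - 1 else 0.0
        A.append(-al * get(a, i-1))
        B.append(b[i] - al * get(c, i-1) - ga * get(a, i+1))
        C.append(-ga * get(c, i+1))
        D.append(d[i] - al * get(d, i-1) - ga * get(d, i+1))
    x = [0.0] * n
    for k, xk in zip(range(0, n, 2), cyclic_reduction(A, B, C, D)):
        x[k] = xk                 # solutions of the half-size system
    for i in range(1, n, 2):      # back-substitute the eliminated unknowns
        x[i] = (d[i] - a[i]*x[i-1] - (c[i]*x[i+1] if i < n-1 else 0.0)) / b[i]
    return x

# 1-D Poisson-like system manufactured so the exact solution is [1, 2, 3, 4]
x = cyclic_reduction([0, -1, -1, -1], [2, 2, 2, 2], [-1, -1, -1, 0], [0, 0, 0, 5])
print([round(v, 6) for v in x])   # -> [1.0, 2.0, 3.0, 4.0]
```

In the dissertation the scalar divisions above become block solves, whose off-diagonal fill-in is kept low-rank by the hierarchical-matrix compression; that is what keeps work and memory log-linear.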
