  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
311

Optimization of Monte Carlo Neutron Transport Simulations with Emerging Architectures / Optimisation du code Monte Carlo neutronique à l’aide d’accélérateurs de calculs

Wang, Yunsong 14 December 2017 (has links)
Access to the basic data, namely the cross sections, constitutes the main performance bottleneck in solving the neutron transport equations with the Monte Carlo (MC) method. These cross sections characterize the probabilities of collision between neutrons and the nuclides composing the material being traversed. They are specific to each nuclide and depend on the energy of the incident neutron and on the temperature of the material. Reference MC codes load these data into memory for all the temperatures occurring in the system and use a binary search over the tables storing the cross sections. On many-core architectures (typically the Intel MIC), these methods are dramatically inefficient, because the random memory accesses prevent the different cache levels from being exploited and the algorithms lack vectorization. The first part of this thesis consisted in finding alternatives to this baseline algorithm, proposing the best performance/memory-footprint trade-off that takes advantage of the MIC's specific features (multithreading and vectorization). In a second phase, we took a radically different approach, in which the data are not stored in memory but computed on the fly. A whole series of optimizations of the algorithm and its data structures, together with vectorization, loop unrolling, and the influence of the data-representation precision, yielded considerable gains over the initial implementation. Finally, the two approaches (data in memory versus data computed on the fly) were compared in order to propose the best trade-off in terms of performance and memory footprint.
Beyond the targeted application (MC transport), this work also serves as a general study of how to transform a problem that is initially memory-latency bound into one that saturates the processor (CPU-bound) and can thus take advantage of many-core architectures. / Monte Carlo (MC) neutron transport simulations are widely used in the nuclear community to perform reference calculations with minimal approximations. The conventional MC method converges slowly, following the law of large numbers, which makes simulations computationally expensive. Cross section computation has been identified as the major performance bottleneck of MC neutron codes. Typically, cross section data are precalculated and stored in memory for each nuclide before the simulation, so that during the simulation only table lookups are required to retrieve data from memory and the compute cost is trivial. We implemented and optimized a large collection of lookup algorithms in order to accelerate this data-retrieval process. Results show that a significant speedup over the conventional binary search can be achieved on both CPU and MIC in unit tests, rather than in real-case simulations. Vectorization instructions prove effective on the many-core architecture thanks to its 512-bit vector units; on the CPU this improvement is limited by the smaller register size. Further optimizations such as memory reduction turn out to be very important, since they largely improve computing performance. As can be expected, all energy-lookup proposals are entirely memory-bound: the computing units do little but wait for data. In other words, the computing capability of modern architectures is largely wasted.
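The contrast between a conventional binary search and an accelerated lookup can be sketched in miniature. The following is an illustrative toy, not one of the thesis's actual algorithms: a coarse, uniform-in-log(E) hash table (hypothetical grid data, hypothetical bin count) narrows the window that the binary search must scan.

```python
import bisect
import math

class HashedLookup:
    """Toy energy lookup: a coarse hash over log(E) narrows the window
    that the binary search must scan, one common way to beat a plain
    binary search over the full pointwise energy grid."""
    def __init__(self, energy_grid, n_bins=64):
        self.grid = energy_grid
        self.lo = math.log(energy_grid[0])
        self.hi = math.log(energy_grid[-1])
        self.n = n_bins
        # bounds[k] = first grid index at or above hash-bin edge k
        self.bounds = [
            bisect.bisect_left(
                energy_grid,
                math.exp(self.lo + (self.hi - self.lo) * k / n_bins))
            for k in range(n_bins + 1)
        ]

    def find(self, e):
        """Return i such that grid[i] <= e <= grid[i+1]."""
        k = int((math.log(e) - self.lo) / (self.hi - self.lo) * self.n)
        k = max(0, min(k, self.n - 1))
        lo = max(self.bounds[k] - 1, 0)                   # search a small
        hi = min(self.bounds[k + 1] + 1, len(self.grid))  # window, not all
        i = bisect.bisect_right(self.grid, e, lo, hi) - 1
        return max(0, min(i, len(self.grid) - 2))

table = HashedLookup([1.0, 2.0, 4.0, 8.0, 16.0, 32.0], n_bins=4)
i = table.find(7.9)
print(table.grid[i], table.grid[i + 1])  # bracketing energies around 7.9
```

In a vectorized variant, several neutrons' energies would be hashed and searched in lockstep, which is where the 512-bit vector units mentioned above come into play.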
Another major issue with energy lookup is its huge memory requirement: cross section data at a single temperature for the up to 400 nuclides involved in a real-case simulation require nearly 1 GB of memory, which makes simulations with several thousand temperatures infeasible on current computer systems. In order to solve this problem, we investigated another, on-the-fly cross section proposal called reconstruction. The basic idea behind reconstruction is to perform the Doppler broadening (a convolution integral) of cross sections on the fly, each time a cross section is needed, with a formulation close to standard neutron cross section libraries and based on the same amount of data. Reconstruction converts the problem from memory-bound to compute-bound: only a few variables per resonance are required instead of a conventional pointwise table covering the entire resolved resonance region. Though the memory footprint is largely reduced, this method is very time-consuming. After a series of optimizations, results show that the reconstruction kernel benefits greatly from vectorization and can achieve 1806 GFLOPS (single precision) on a Knights Landing 7250, which represents 67% of its effective peak performance. Even though these optimization efforts significantly improve FLOP usage, the on-the-fly calculation is still slower than the conventional lookup method. We therefore began porting the code to GPGPUs to exploit their potentially higher performance and FLOP usage. In addition, another evaluation has been planned to compare lookup and reconstruction in terms of power consumption: with the help of hardware and software energy-measurement support, we expect to find a compromise between performance and energy consumption in order to face the "power wall" challenge that accompanies hardware evolution.
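The memory-versus-compute trade-off of reconstruction can be illustrated with a toy: evaluate the cross section at a given energy by summing simple resonance profiles, so only a few numbers per resonance are stored. This is a drastic simplification, assuming made-up Lorentzian shapes and parameters; real libraries use Breit-Wigner or Reich-Moore formalisms and apply Doppler broadening at the material temperature.

```python
def reconstruct_xs(e, resonances):
    """Illustrative on-the-fly reconstruction: sum simple Lorentzian
    resonance profiles at energy e. A toy stand-in for 0 K resonance
    reconstruction; real codes also Doppler-broaden at temperature."""
    total = 0.0
    for e_r, amplitude, gamma in resonances:  # (peak energy, height, width)
        x = 2.0 * (e - e_r) / gamma
        total += amplitude / (1.0 + x * x)
    return total

# Two hypothetical resonances: a few numbers per resonance replace a
# dense pointwise table covering the whole resolved resonance region.
res = [(6.67, 100.0, 0.03), (20.9, 40.0, 0.1)]
print(reconstruct_xs(6.67, res))  # near the first peak height, plus a tiny tail
```

Each query is now arithmetic rather than a memory walk, which is exactly the shift from memory-bound to compute-bound described above.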
312

A Parallel Adaptive Mesh Refinement Library for Cartesian Meshes

January 2019 (has links)
abstract: This dissertation introduces FARCOM (Fortran Adaptive Refiner for Cartesian Orthogonal Meshes), a new general library for adaptive mesh refinement (AMR) based on an unstructured hexahedral mesh framework. As a result of the underlying unstructured formulation, the refinement and coarsening operators of the library operate on a single-cell basis and perform in-situ replacement of old mesh elements. This approach allows for h-refinement without the memory and computational expense of calculating masked coarse grid cells, as is done in traditional patch-based AMR approaches, and gives unstructured flow solvers access to the automated domain generation capabilities usually found only in tree AMR formulations. The library lets the user determine where to refine and coarsen through custom refinement selector functions for static mesh generation and dynamic mesh refinement, and can handle smooth fields (such as level sets) or localized markers (e.g. density gradients). The library was parallelized with the Zoltan graph-partitioning library, which provides interfaces to both a graph partitioner (PT-Scotch) and a partitioner based on Hilbert space-filling curves. The partitioned adjacency graph, mesh data, and solution variable data are then packed and distributed across all MPI ranks in the simulation, which then regenerate the mesh, generate domain decomposition ghost cells, and create communication caches. Scalability runs were performed using a LeVeque wave propagation scheme for solving the Euler equations. The results of simulations on up to 1536 cores indicate that the parallel performance is highly dependent on the graph partitioner being used, and the differences between the partitioners were analyzed. FARCOM is found to perform better when each MPI rank holds more than 60,000 cells. / Dissertation/Thesis / Doctoral Dissertation Aerospace Engineering 2019
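The single-cell, in-situ replacement style of refinement described above can be sketched in miniature: a flagged hexahedral cell is replaced by its eight octants, with no masked coarse cell left behind. The `Cell` structure and selector below are hypothetical simplifications (FARCOM itself is Fortran and tracks full unstructured connectivity).

```python
class Cell:
    """Minimal cell record: center, edge length, refinement level."""
    def __init__(self, center, size, level=0):
        self.center, self.size, self.level = center, size, level

def refine(cells, should_refine):
    """One refinement sweep in the spirit of single-cell AMR: flagged
    cells are replaced in place by their 8 octants; unflagged cells
    survive unchanged, so no masked coarse cells are kept."""
    out = []
    for c in cells:
        if should_refine(c):
            h = c.size / 4.0  # child centers sit a quarter cell from the parent's
            x, y, z = c.center
            for dx in (-h, h):
                for dy in (-h, h):
                    for dz in (-h, h):
                        out.append(Cell((x + dx, y + dy, z + dz),
                                        c.size / 2.0, c.level + 1))
        else:
            out.append(c)
    return out

coarse = [Cell((0.5, 0.5, 0.5), 1.0)]
fine = refine(coarse, lambda c: c.level < 1)  # custom refinement selector
print(len(fine))  # 8 children replace the single coarse cell
```

A custom selector function, as in the library, decides per cell whether to refine, driven by a smooth field or a localized marker.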
313

Pending Event Set Management in Parallel Discrete Event Simulation

Gupta, Sounak 02 October 2018 (has links)
No description available.
314

Detection And Classification Of Buried Radioactive Materials

Wei, Wei 09 December 2011 (has links)
This dissertation develops new approaches for the detection and classification of buried radioactive materials. Different spectral transformation methods are proposed to effectively suppress noise and to better distinguish signal features in the transformed space. The contributions of this dissertation are as follows. 1) Propose an unsupervised method for buried radioactive material detection. In the experiments, the original Reed-Xiaoli (RX) algorithm performs similarly to the gross count (GC) method; however, the constrained energy minimization (CEM) method performs better when using feature vectors selected from the RX output. Thus, an unsupervised method is developed by combining the RX and CEM methods, which can efficiently suppress the background noise when applied to dimensionality-reduced data from principal component analysis (PCA). 2) Propose an approach for buried target detection and classification that applies spectral transformation followed by noise-adjusted PCA (NAPCA). To meet the requirements of practical survey mapping, we focus on the circumstance in which the sensor dwell time is very short. The results show that spectral transformation can alleviate the effects of noisy spectral variations and background clutter, while NAPCA, a better choice than PCA, can extract key features for the subsequent detection and classification. 3) Propose a particle swarm optimization (PSO)-based system to automatically determine the optimal partition for spectral transformation. Two PSOs are incorporated in the system, the outer one responsible for selecting the optimal number of bins and the inner one for the optimal bin widths. The experimental results demonstrate that using variable bin widths is better than a fixed bin width, and that PSO provides better results than the traditional Powell's method. 4) Develop parallel implementation schemes for the PSO-based spectral partition algorithm.
Both cluster and graphics processing unit (GPU) implementations are designed. The computational burden of the serial version has been greatly reduced. The experimental results also show that the GPU algorithm achieves a speedup similar to that of the cluster-based algorithm.
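The PSO machinery behind the bin-partition search can be sketched generically. The sketch below is a minimal single-swarm PSO with a toy objective standing in for the cost of a candidate partition; the bounds, coefficients, and objective are placeholders, not the dissertation's two-level spectral-partition system.

```python
import random

def pso(objective, dim, n_particles=20, iters=100, lo=0.0, hi=1.0, seed=1):
    """Minimal particle swarm optimizer: each particle tracks its personal
    best, the swarm tracks a global best, and velocities blend inertia,
    cognitive, and social pulls (coefficients are illustrative)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_f = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            f = objective(pos[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return gbest, gbest_f

# Toy stand-in for "cost of a candidate set of bin widths".
best, best_f = pso(lambda p: sum((x - 0.3) ** 2 for x in p), dim=3)
# The swarm converges near p = (0.3, 0.3, 0.3).
```

In the dissertation's nested arrangement, an outer swarm of this kind would pick the number of bins while an inner swarm like the one above tunes the widths for each candidate count.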
315

An Internal Representation for Adaptive Online Parallelization

Rehme, Koy D. 29 May 2009 (has links) (PDF)
Future computer processors may have tens or hundreds of cores, increasing the need for efficient parallel programming models. The nature of multicore processors will present applications with the challenge of diversity: a variety of operating environments, architectures, and data will be available, and the compiler will have no foreknowledge of the environment until run time. Adaptive Online Parallelization (ADOPAR) is a unifying framework that attempts to overcome diversity by separating the discovery and packaging of parallelism. Scheduling for execution may then occur at run time, when diversity may best be resolved. This work presents a compact representation of parallelism based on the task graph programming model, tailored especially for ADOPAR and for regular and irregular parallel computations. Task graphs can be unmanageably large for fine-grained parallelism. Rather than representing each task individually, similar tasks are grouped into task descriptors. From these, a task descriptor graph, with relationship descriptors forming its edges, may be constructed. While even highly irregular computations often have structure, previous representations have chosen to restrict what can be easily represented, thus limiting full exploitation by the back end. Therefore, in this work, task and relationship descriptors have been endowed with instantiation functions (methods of descriptors that act as factories) so the front end may have a full range of expression when describing the task graph. The representation uses descriptors to express a full range of regular and irregular computations in a very flexible and compact manner. The representation also allows for dynamic optimization and transformation, which assists ADOPAR in its goal of overcoming various forms of diversity.
We have successfully implemented this representation using new compiler intrinsics, allowing ADOPAR schedulers to operate on the described task graph for parallel execution, and we demonstrate the low code-size overhead and the necessity of native schedulers.
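The descriptor idea, one record standing for many similar tasks with an instantiation function acting as a factory, might be sketched like this (all names are hypothetical; ADOPAR's real representation lives in compiler intrinsics, not Python objects):

```python
class TaskDescriptor:
    """Groups many similar fine-grained tasks under one record; the
    instantiation function is a factory that expands concrete tasks
    only when a scheduler actually needs them."""
    def __init__(self, name, count, instantiate):
        self.name, self.count, self.instantiate = name, count, instantiate

    def expand(self):
        # Deferred expansion: the descriptor stays compact until here.
        return [self.instantiate(i) for i in range(self.count)]

# One descriptor stands in for 1000 similar tasks, instead of the task
# graph carrying 1000 individual nodes.
saxpy = TaskDescriptor("saxpy_chunk", 1000,
                       lambda i: {"task": "saxpy_chunk", "chunk": i})
tasks = saxpy.expand()
print(len(tasks), tasks[0])
```

A relationship descriptor would play the same role for edges, with its own factory mapping a producer task index to the consumer indices it feeds.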
316

Scalable Extraction and Visualization of Scientific Features with Load-Balanced Parallelism

Xu, Jiayi January 2021 (has links)
No description available.
317

Coralai: Emergent Ecosystems of Neural Cellular Automata

Barbieux, Aidan A 01 March 2024 (has links) (PDF)
Artificial intelligence has traditionally been approached through centralized architectures and the optimization of specific metrics on large datasets. However, the frontiers of fields spanning cognitive science, biology, physics, and computer science suggest that intelligence is better understood as a multi-scale, decentralized, emergent phenomenon. As such, scaling up approaches that mirror the natural world may be one of the next big advances in AI. This thesis presents Coralai, a framework for efficiently simulating the emergence of diverse artificial life ecosystems integrated with modular physics. The key innovations of Coralai include: 1) Hosting diverse Neural Cellular Automata organisms in the same simulation, where they can interact and evolve; 2) Allowing user-defined physics and weather that organisms adapt to and can utilize to enact environmental changes; 3) Hardware acceleration using Taichi, PyTorch, and HyperNEAT, enabling interactive evolution of ecosystems with 500k evolved parameters on a grid of over a million 16-channel physics-governed cells, all in real time on a laptop. Initial experiments with Coralai demonstrate the emergence of diverse ecosystems of organisms that employ a variety of strategies to compete for resources in dynamic environments. Key observations include competing mobile and sessile organisms, organisms that exploit environmental niches such as dense energy sources, and cyclic dynamics in which greedy dominance is out-competed by resilience.
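A neural cellular automaton update can be sketched at its simplest: every cell applies the same tiny "network" (here just a weighted 3x3 neighborhood sum with a clamp) to produce its next state. This is a drastically simplified, single-channel stand-in for Coralai's evolved, multi-channel, physics-coupled rules; the weights below are made up.

```python
def nca_step(grid, weights, bias):
    """One synchronous update of a toy neural cellular automaton on a
    periodic n x n grid: shared weighted 3x3 stencil, then clamp to [0, 1].
    Evolving the shared weights is what would differentiate 'organisms'."""
    n = len(grid)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = bias
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    acc += weights[di + 1][dj + 1] * grid[(i + di) % n][(j + dj) % n]
            out[i][j] = max(0.0, min(1.0, acc))  # keep cell state in [0, 1]
    return out

# Hypothetical shared stencil weights; a single seeded cell diffuses its
# state to the neighborhood in one step.
w = [[0.05, 0.1, 0.05], [0.1, 0.2, 0.1], [0.05, 0.1, 0.05]]
g = [[0.0] * 8 for _ in range(8)]
g[4][4] = 1.0
g = nca_step(g, w, bias=0.0)
```

In a framework like Coralai, each cell would carry many channels, the update would be a learned network evaluated on an accelerator, and user-defined physics would modify the state between such steps.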
318

Linear Exact Repair Schemes for Distributed Storage and Secure Distributed Matrix Multiplication

Valvo, Daniel William 08 May 2023 (has links)
In this thesis we develop exact repair schemes capable of repairing or circumventing unavailable servers of a distributed network in the context of distributed storage and secure distributed matrix multiplication. We develop the (Λ, Γ, W, ⊙)-exact repair scheme framework for discussing both of these contexts and develop a multitude of explicit exact repair schemes utilizing decreasing monomial-Cartesian codes (DMC codes). Specifically, we construct novel DMC codes in the form of augmented Cartesian codes and rectangular monomial-Cartesian codes, as well as design exact repair schemes utilizing these constructions inspired by the schemes from Guruswami and Wootters [16] and Chen and Zhang [6]. In the context of distributed storage we demonstrate the existence of both high rate and low bandwidth systems based on these schemes, and we develop two methods to extend them to the l-erasure case. Additionally, we develop a family of hybrid schemes capable of attaining high rates, low bandwidths, and a balance in between which proves to be competitive compared to existing schemes. In the context of secure distributed matrix multiplication we develop similarly impactful schemes which have very competitive communication costs. We also construct an encoding algorithm based on multivariate interpolation and prove it is T-secure. / Doctor of Philosophy / Distributed networks may be thought of as networks of computers and/or servers which are capable of transmitting and receiving data from one another. For many applications it is possible for distributed networks to perform better than the sum of their constituent parts. In this thesis we will focus on the particular applications of distributed storage and secure distributed multiplication. A distributed storage system is a system that is capable of storing a single data file over every server in a distributed network. 
Distributed storage systems often come with exact repair schemes which are algorithms designed to reconstruct the data from a server in the network given the data from the other servers. In particular, if a server on the network ever fails or is otherwise unavailable an exact repair scheme can be used to repair the lost data from the server and maintain the original file. A distributed matrix multiplication scheme on the other hand is a process by which two matrices stored on a source server can be multiplied using a distributed network of helper servers. Again if a helper server becomes unavailable during this process we may use an exact repair scheme to circumvent this delay. The main goal of this thesis is to develop exact repair schemes for the distributed storage and secure distributed matrix multiplication contexts utilizing a mathematical object known as an evaluation code. We will develop several families of exact repair schemes which may be finely tuned to fit particular situations within these contexts, and we will compare these schemes to the existing schemes in the field.
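The exact-repair idea can be illustrated with a far simpler code than the DMC codes of the thesis: a single XOR parity shard already lets any one lost server be rebuilt exactly from the survivors. This sketch is only the one-erasure, parity-check special case, not the evaluation-code schemes developed above.

```python
def encode(data_shards):
    """Store k data shards plus one XOR parity shard across k+1 servers.
    A toy stand-in for the evaluation-code storage schemes."""
    parity = [0] * len(data_shards[0])
    for shard in data_shards:
        parity = [p ^ b for p, b in zip(parity, shard)]
    return data_shards + [parity]

def repair(shards, lost):
    """Exact repair of server `lost`: XOR together the surviving shards.
    Works because the XOR of all k+1 shards is the zero vector."""
    rebuilt = [0] * len(shards[0])
    for i, shard in enumerate(shards):
        if i != lost:
            rebuilt = [r ^ b for r, b in zip(rebuilt, shard)]
    return rebuilt

servers = encode([[1, 2, 3], [4, 5, 6], [7, 0, 1]])
print(repair(servers, lost=1))  # the lost server's shard, recovered exactly
```

Schemes like those in the thesis refine this picture on two axes: tolerating multiple erasures, and minimizing the repair bandwidth, i.e. how much each surviving server must transmit rather than its whole shard.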
319

In Situ Visualization of Performance Data in Parallel CFD Applications

Falcao do Couto Alves, Rigel 19 January 2023 (has links)
This thesis summarizes the work of the author on the visualization of performance data in parallel Computational Fluid Dynamics (CFD) simulations. Current performance analysis tools are unable to show their data on top of complex simulation geometries (e.g. an aircraft engine). In CFD simulations, however, performance is expected to be affected by the computations being carried out, which in turn are tightly related to the underlying computational grid. It is therefore imperative that performance data be visualized on top of the same computational geometry from which they originate. However, performance tools have no native knowledge of the underlying mesh of the simulation. This gap can be filled by merging the branches of HPC performance analysis and in situ visualization of CFD simulation data, by integrating existing, well-established, state-of-the-art tools from each field. To this end, an extension for the open-source performance tool Score-P was designed and developed, which intercepts an arbitrary number of manually selected code regions (mostly functions) and sends their respective measurements (number of executions and cumulative time spent) to the visualization software ParaView through its in situ library, Catalyst, as if they were any other flow-related variable. Subsequently the tool was extended with the capacity to also show communication data (messages sent between MPI ranks) on top of the CFD mesh. Testing and evaluation are done with two industry-grade codes: Rolls-Royce's CFD code, Hydra, and Onera, DLR, and Airbus' CFD code, CODA. It has also been noticed that current performance tools have limited capacity for displaying their data on top of three-dimensional, framed (i.e. time-stepped) representations of the cluster's topology.
In parallel, so that the approach is not limited to codes that already have the in situ adapter, it was extended to take the performance data and display it (also for codes without in situ) on a three-dimensional, framed representation of the hardware resources used by the simulation. Testing is done with the Multi-Grid and Block Tri-diagonal NAS Parallel Benchmarks (NPB), as well as with Hydra and CODA again. The benchmarks are used to explain how the new visualizations work, while real performance analyses are done with the industry-grade CFD codes. The proposed solution provides concrete performance insights that would not have been reached with the current performance tools and that motivated beneficial changes in the respective source code in real life. Finally, its overhead is discussed and shown to be suitable for use with CFD codes. The dissertation provides a valuable addition to the state of the art of highly parallel CFD performance analysis and serves as a basis for further suggested research directions.
320

Cumulus - translating CUDA to sequential C++ : Simplifying the process of debugging CUDA programs / Cumulus - översätter CUDA till sekventiell C++ : En studie i hur felsökande av CUDA-program kan förenklas

Blomkvist Karlsson, Vera January 2021 (has links)
Due to their highly parallel architecture, Graphics Processing Units (GPUs) offer increased performance for programs that benefit from parallel execution. A range of technologies exist which allow GPUs to be used for general-purpose programming; NVIDIA's CUDA platform is one example. CUDA makes it possible to combine source code written for GPUs and Central Processing Units (CPUs) in the same program. The sections that benefit from parallel execution can be written as CUDA kernels and will be executed on the GPU. With CUDA it is common to have tens, or even hundreds, of thousands of threads running in parallel. While this high level of parallelism can offer significant performance increases for executed programs, it can also make CUDA programs hard to debug. Although debuggers for CUDA exist, they cannot be used in the same way as standard debuggers, and they do not reduce the difficulty of reasoning about parallel execution. As a result, developers may feel compelled to fall back on inefficient debugging methods, such as relying on print statements. This project examines two possible approaches for creating a tool that simplifies the process of debugging CUDA programs by transforming a parallel CUDA program into a sequential program in another high-level language: one method centered around the Clang Abstract Syntax Tree (AST), and the other centered around LLVM Intermediate Representation (IR) code. The method using Clang was found to be the most suitable for translating CUDA, as it enables modifying only select parts, such as kernels, of the input program. Thus, the tool Cumulus was developed as a Clang plugin. Cumulus translates parallel CUDA code into sequential C++ code, allowing developers to use any method available for C++ debugging to debug their CUDA programs. The evaluation indicates that Cumulus is a potential aid in debugging CUDA programs, providing developers with increased flexibility.
/ Thanks to their highly parallel architecture, graphics processing units can offer increased performance for programs that benefit from parallel execution. A range of technologies exist that allow GPUs to be used not only for graphics computations but also for general-purpose computing; NVIDIA's CUDA platform is one such technology. CUDA makes it possible to combine, in the same program, source code written to execute on a central processing unit with source code written to execute on a graphics processing unit. Code sections that benefit from running in parallel can be written as CUDA kernels, which are functions that execute on the graphics processor. With CUDA it is not unusual to have tens of thousands, or even hundreds of thousands, of threads running in parallel. This very high degree of parallelism can offer markedly increased performance for executed programs, but at the same time it can make CUDA programs difficult to debug. Dedicated debuggers for CUDA exist, but they cannot be used in the same way as standard debuggers, and they do not reduce the difficulty of reasoning about parallel computations. Because of this, developers may feel compelled to use inefficient debugging methods, such as relying on print statements. This project investigates two possible methods for creating a tool that simplifies the debugging of CUDA programs by translating a parallel CUDA program into a sequential program in a classic high-level programming language. One candidate method is centered around Clang's AST, the other around LLVM IR code. The Clang-based method was found to be the most suitable for the purpose of translating CUDA code, since it enables translation of only selected parts of the original program, for example kernels. The tool Cumulus was therefore developed as a Clang plugin.
Cumulus translates parallel CUDA code into sequential C++ code, which lets developers use all the methods available for debugging C++ programs to debug their CUDA programs. The evaluation of Cumulus indicates that the tool can serve as a possible aid in debugging CUDA programs by offering developers increased flexibility.
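The core transformation such a tool performs can be illustrated conceptually: the implicit CUDA thread grid becomes explicit sequential loops, so each logical thread runs one after another under an ordinary debugger. The sketch below mimics this idea in Python with a hypothetical SAXPY kernel (Cumulus itself emits C++, and real kernels use the blockIdx/threadIdx/blockDim built-ins this sketch passes explicitly).

```python
def launch_sequential(kernel, grid_dim, block_dim, *args):
    """What a launch `kernel<<<grid_dim, block_dim>>>(...)` becomes after
    sequentialization: explicit loops over block and thread indices,
    executing one logical thread at a time (1-D grid for simplicity)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Body of a hypothetical SAXPY kernel; `i` mirrors CUDA's usual
    global index blockIdx.x * blockDim.x + threadIdx.x."""
    i = block_idx * block_dim + thread_idx
    if i < len(x):                 # the customary bounds guard
        out[i] = a * x[i] + y[i]

x, y = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4
launch_sequential(saxpy_kernel, 2, 2, 2.0, x, y, out)
print(out)  # → [12.0, 24.0, 36.0, 48.0]
```

Because the threads now run in a fixed order, a breakpoint inside the kernel body stops at one well-defined logical thread, which is precisely the debugging convenience the sequential translation buys.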
