141

Low-rank Approximations in Quantum Transport Simulations

Daniel A. Lemus (5929940) 07 May 2020 (has links)
Quantum-mechanical effects play a major role in the performance of modern electronic devices. In order to predict the behavior of novel devices, quantum effects are often included using Non-Equilibrium Green's Function (NEGF) methods in atomistic device representations. These quantum effects may include realistic inelastic scattering caused by device impurities and phonons. With the inclusion of realistic physical phenomena, the computational load of predictive simulations increases greatly, and a manageable basis through low-rank approximations is desired.

In this work, low-rank approximations are used to reduce the computational load of atomistic simulations, and the benefits of basis reductions on simulation time and peak memory are assessed. The low-rank approximation method is then extended to include more realistic physical effects than those modeled today, including exact calculations of scattering phenomena. The inclusion of these exact calculations is then contrasted with current methods and approximations.
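A minimal sketch of the general basis-reduction idea the abstract describes, assuming a dense symmetric Hamiltonian and the Eigen library: the reduced basis is built from the lowest-energy eigenvectors, and every subsequent matrix operation shrinks accordingly. This illustrates the low-rank concept only; it is not the thesis's actual NEGF implementation.

```cpp
// Low-rank basis reduction: project a Hamiltonian onto its m lowest
// eigenvectors, shrinking subsequent NEGF-style matrix operations from
// n x n to m x m. Illustrative only; a real quantum-transport code would
// build the basis per energy window and handle self-energies as well.
#include <Eigen/Dense>
#include <iostream>

using Matrix = Eigen::MatrixXd;

// Return the projection matrix U (n x m) holding the m lowest eigenvectors.
Matrix lowRankBasis(const Matrix& H, int m) {
    Eigen::SelfAdjointEigenSolver<Matrix> solver(H);
    return solver.eigenvectors().leftCols(m);  // eigenvalues are ascending
}

int main() {
    const int n = 200, m = 20;              // full and reduced basis sizes
    Matrix A = Matrix::Random(n, n);
    Matrix H = 0.5 * (A + A.transpose());   // symmetric test Hamiltonian

    Matrix U = lowRankBasis(H, m);
    Matrix Hred = U.transpose() * H * U;    // m x m reduced Hamiltonian

    std::cout << "full: " << n << "x" << n
              << ", reduced: " << Hred.rows() << "x" << Hred.cols() << "\n";
}
```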
142

Orchestration of HPC Workflows: Scalability Testing and Cross-System Execution

Tronge, Jacob 14 April 2022 (has links)
No description available.
143

A Performance Evaluation of MPI Shared Memory Programming

Karlbom, David January 2016 (has links)
The thesis investigates the Message Passing Interface (MPI) support for shared memory programming on modern hardware architectures with multiple Non-Uniform Memory Access (NUMA) domains. We investigate its performance in two case studies: matrix-matrix multiplication and Conway's Game of Life. We compare MPI shared memory performance, in terms of execution time and memory consumption, with the performance of implementations using OpenMP and MPI point-to-point communication, also called "MPI two-sided". We perform strong scaling tests in both cases. We observe that the MPI two-sided implementation is 21% and 18% faster than the MPI shared and OpenMP implementations, respectively, in matrix-matrix multiplication when using 32 processes. MPI shared uses less memory space: compared to MPI two-sided, it uses 45% less memory. In Conway's Game of Life, we find that the MPI two-sided implementation is 10% and 82% faster than the MPI shared and OpenMP implementations, respectively, when using 32 processes. We also observe that not mapping virtual memory to a specific NUMA domain can increase execution time by up to 64% when using 32 processes. The use of MPI shared is viable for intra-node communication on modern hardware architectures with multiple NUMA domains.
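The MPI-3 shared-memory mechanism evaluated here can be sketched as follows: ranks on the same node split off into a shared-memory communicator and allocate a window that all of them can address through ordinary pointers. A minimal sketch, assuming one double per rank; it is not the thesis's benchmark code.

```cpp
// MPI-3 shared-memory windows: ranks on one node address a common buffer
// directly instead of exchanging point-to-point messages.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Group the ranks that can share physical memory (same node).
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int rank, size;
    MPI_Comm_rank(node, &rank);
    MPI_Comm_size(node, &size);

    // Each rank contributes one slot to a node-wide shared array.
    double* mine;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node, &mine, &win);
    *mine = rank * 10.0;

    MPI_Win_fence(0, win);  // make all writes visible before reading

    if (rank == 0) {
        // Rank 0 reads every rank's slot through ordinary pointers.
        for (int r = 0; r < size; ++r) {
            MPI_Aint bytes; int disp; double* p;
            MPI_Win_shared_query(win, r, &bytes, &disp, &p);
            std::printf("rank %d wrote %.1f\n", r, *p);
        }
    }

    MPI_Win_free(&win);
    MPI_Finalize();
}
```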
144

Gene-Environment Interaction Analysis Using Graphics Cards

Berglund, Daniel January 2015 (has links)
Genome-wide association studies (GWAS) are used to find associations between genetic markers and diseases. One part of GWAS is to study interactions between markers, which can play an important role in the risk for the disease. The search for interactions can be computationally intensive. The aim of this thesis was to improve the performance of software used for gene-environment interaction analysis by using parallel programming techniques on graphical processors. A study of the new program's performance, speedup and efficiency was made using multiple simulated datasets. The program shows significantly better performance compared with the older program.
145

Contributions of hybrid architectures to depth imaging: a CPU, APU and GPU comparative study

Said, Issam 21 December 2015 (has links)
In an exploration context, Oil and Gas (O&G) companies rely on HPC to accelerate depth imaging algorithms. Solutions based on CPU clusters and hardware accelerators are widely embraced by the industry. Graphics Processing Units (GPUs), with their huge compute power and high memory bandwidth, have attracted significant interest. However, deploying heavy imaging workflows, the Reverse Time Migration (RTM) being the most famous, on such hardware has suffered from a few limitations: namely, the lack of memory capacity, frequent CPU-GPU communications that may be bottlenecked by the PCI transfer rate, and high power consumption. Recently, AMD has launched the Accelerated Processing Unit (APU): a processor that merges a CPU and a GPU on the same die, with promising features, notably a unified CPU-GPU memory. Throughout this thesis, we explore how efficiently the APU technology may be applied in an O&G context, and study whether it can overcome the limitations that characterize the CPU- and GPU-based solutions. The APU is evaluated with the help of memory, applicative and power-efficiency OpenCL benchmarks. The feasibility of the hybrid utilization of the APUs is surveyed. The efficiency of a directive-based approach is also investigated. By means of a thorough review of a selection of seismic applications (modeling and RTM) at the node level and at large scale, a comparative study between the CPU, the APU and the GPU is conducted. We show the relevance of overlapping I/O and MPI communications with computations for the APU and GPU clusters, that APUs deliver performances that range between those of CPUs and those of GPUs, and that the APU can be as power efficient as the GPU.
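The conclusion about overlapping MPI communications with computation can be illustrated with a generic non-blocking halo-exchange pattern of the kind stencil codes such as RTM use: post the exchange, compute the interior points that need no remote data while messages are in flight, then complete the boundaries. The function name and the toy 1-D stencil below are illustrative assumptions, not code from the thesis.

```cpp
// Generic communication/computation overlap for a 1-D stencil. Neighbor
// ranks are passed in (MPI_PROC_NULL makes a boundary exchange a no-op).
#include <mpi.h>
#include <vector>

void exchange_and_compute(std::vector<double>& field, int left, int right,
                          MPI_Comm comm) {
    const int n = static_cast<int>(field.size());
    double halo_l = 0.0, halo_r = 0.0;
    MPI_Request reqs[4];

    // 1. Start the halo exchange with both neighbors without blocking.
    MPI_Irecv(&halo_l, 1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(&halo_r, 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(&field[0],     1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(&field[n - 1], 1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    // 2. Update interior points: they need no remote data, so this work
    //    hides the communication latency.
    std::vector<double> next(field);
    for (int i = 1; i < n - 1; ++i)
        next[i] = 0.5 * field[i] + 0.25 * (field[i - 1] + field[i + 1]);

    // 3. Complete the exchange, then update the two boundary points.
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    next[0]     = 0.5 * field[0]     + 0.25 * (halo_l + field[1]);
    next[n - 1] = 0.5 * field[n - 1] + 0.25 * (field[n - 2] + halo_r);
    field.swap(next);
}
```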
146

Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects

Chu, Ching-Hsiang January 2020 (has links)
No description available.
147

A C++ based MPI-enabled Tasking Framework to Efficiently Parallelize Fast Multipole Methods for Molecular Dynamics

Haensel, David 31 August 2018 (has links)
Today's supercomputers gain their performance through a rapidly increasing number of cores per node. To tackle the issues arising from these developments, new parallelization approaches guided by modern software engineering are inevitable. The concept of task-based parallelization is a promising candidate to overcome many of those challenges. However, for latency-critical applications, like molecular dynamics, available tasking frameworks introduce considerable overheads. In this work, a lightweight task engine for latency-critical applications is proposed. The main contributions of this thesis are a static data-flow dispatcher, a type-driven priority scheduler and an extension for communication-enabled tasks. The dispatcher allows a user-configurable mapping of algorithmic dependencies in the task engine at compile time. Resolving these dependencies at compile time reduces the run-time overhead. The scheduler enables the prioritized execution of a critical path of an algorithm. Additionally, the priorities are deduced from the task type at compile time as well. Furthermore, the aforementioned task engine supports inter-node communication via message passing. The provided communication interface drastically simplifies the user interface of inter-node communication without introducing additional performance penalties. This is only possible by distinguishing two developer roles -- the library developer and the algorithm developer. All proposed components follow a strict guideline to increase the maintainability for library developers and the usability for algorithm developers. To reach this goal, a high level of abstraction and encapsulation is required in the software stack. As a proof of concept, the communication-enabled task engine is used to parallelize the Fast Multipole Method (FMM) for molecular dynamics.
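The type-driven, compile-time priority idea can be sketched in a few lines of C++: each task type carries a constexpr priority, so the scheduler's ordering decision costs nothing at run time. The task names below (borrowed from common FMM operator terminology) and the dispatch helper are hypothetical; this illustrates the concept only and is not the thesis's actual engine.

```cpp
// Type-driven priorities: each task type declares a compile-time priority,
// so scheduling decisions need no run-time type inspection.
#include <cstdio>
#include <type_traits>

// Hypothetical task types for an FMM-like pipeline.
struct P2PTask  { static constexpr int priority = 2; };  // critical path
struct M2LTask  { static constexpr int priority = 1; };
struct IdleTask { static constexpr int priority = 0; };

// Pick the higher-priority of two task types, entirely at compile time.
template <typename A, typename B>
using HigherPriority = std::conditional_t<(A::priority >= B::priority), A, B>;

template <typename Task>
void run() { std::printf("running task with priority %d\n", Task::priority); }

int main() {
    using Next = HigherPriority<M2LTask, P2PTask>;   // resolved by compiler
    static_assert(Next::priority == 2, "critical path is scheduled first");
    run<Next>();
}
```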
148

Managing Service Levels in Grid Computing Systems: Quota Policy and Computational Market Approaches

Sandholm, Thomas January 2007 (has links)
We study techniques to enforce and provision differentiated service levels in Computational Grid systems. The Grid offers simplified provisioning of peak capacity for applications with computational requirements beyond local machines and clusters, by sharing resources across organizational boundaries. Current systems have focused on access control, i.e., managing who is allowed to run applications on remote sites. Very little work has been done on providing differentiated service levels for those applications that are admitted. This leads to a number of problems when scheduling jobs in a fair and efficient way. For example, users with a large number of long-running jobs could starve out others, both intentionally and unintentionally. We investigate the requirements of High Performance Computing (HPC) applications that run in academic Grid systems, and propose two models of service-level management. Our first model is based on global real-time quota enforcement, where projects are granted resource quota, such as CPU hours, across the Grid by a centralized allocation authority. We implement the SweGrid Accounting System to enforce quota allocated by the Swedish National Allocations Committee in the SweGrid production Grid, which connects six Swedish HPC centers. A flexible authorization policy framework allows provisioning and enforcement of two different service levels across the SweGrid clusters: high-priority and low-priority jobs. As a solution for more fine-grained control over service levels, we propose and implement a Grid Market system, using a market-based resource allocator called Tycoon. The conclusion of our research is that although the Grid accounting solution offers better service-level enforcement support than state-of-the-art production Grid systems, it turned out to be complex to set the resource price and other policies manually while ensuring fairness and efficiency of the system. Our Grid Market, on the other hand, sets the price according to dynamic demand, and it is further incentive compatible, in that the overall system state remains healthy even in the presence of strategic users.
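Tycoon-style markets are usually described as proportional share: each user's slice of a resource is their bid divided by the sum of all bids, so the effective price rises automatically with demand and no manual price setting is needed. The sketch below illustrates that rule under that assumption, with made-up numbers; it is not code from the thesis.

```cpp
// Proportional-share allocation: share_i = bid_i / sum(bids).
// The per-unit price (total spending / capacity) grows with demand.
#include <cstdio>
#include <vector>

int main() {
    const double capacity = 120.0;                 // e.g. CPU-hours on offer
    std::vector<double> bids = {10.0, 30.0, 60.0}; // three users' spending

    double total = 0.0;
    for (double b : bids) total += b;

    std::printf("price per CPU-hour: %.3f\n", total / capacity);
    for (std::size_t i = 0; i < bids.size(); ++i)
        std::printf("user %zu gets %.1f CPU-hours\n",
                    i, capacity * bids[i] / total);
}
```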
149

A Study of Improving the Parallel Performance of VASP.

Baker, Matthew Brandon 13 August 2010 (has links) (PDF)
This thesis involves a case study in the use of parallelism to improve the performance of an application for computational research on molecules. The application, VASP, was migrated from a machine with 4 nodes and 16 single-threaded processors to a machine with 60 nodes and 120 dual-threaded processors. When initially migrated, VASP's performance deteriorated after about 17 processing elements (PEs), due to network contention. Subsequent modifications that restrict communication amongst VASP processes, together with additional support for threading, allowed VASP to scale up to 112 PEs, the maximum number that was tested. Other performance-enhancing optimizations that were attempted included replacing old libraries, which produced improvements of about 10%, and prefetching, which degraded, rather than enhanced, VASP performance.
150

Analysing Memory Performance when Computing DFTs Using FFTW

Heiskanen, Andreas, Johansson, Erik January 2018 (has links)
Discrete Fourier Transforms (DFTs) are used in a wide variety of different scientific areas. In addition, there is an ever-increasing demand for fast and effective ways of computing DFT problems with large data sets. The FFTW library is one of the most commonly used libraries for computing DFTs. It adapts to the system architecture and predicts the most effective way of solving the input problem. Previous studies have shown the FFTW library to be superior to other DFT-solving libraries. However, not many have specifically examined the cache memory performance, which is a key factor for overall performance. In this study, we examined cache memory utilization when computing 1-D complex DFTs using the FFTW library. Testing was done using bench FFT, Linux Perf and testing scripts. The results from this study show that the cache miss ratio increases with problem size when the input size is smaller than the theoretical input size matching the cache capacity. This is also verified by the results from the L2 prefetcher miss ratio. However, the study shows that the cache miss ratio stabilizes when the problem size exceeds the cache capacity. In conclusion, it is possible to use bench FFT and Linux Perf to measure cache memory utilization. The analysis also shows that cache memory performance is good when computing 1-D complex DFTs using the FFTW library, since the miss ratios stabilize at low values. However, we suggest further examination of the memory behaviour for DFT computations using FFTW with larger input sizes and a more in-depth testing method.
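For reference, the basic FFTW call sequence for a 1-D complex DFT of the kind the study benchmarks looks like this: a generic usage sketch, not the study's test harness. A tool such as Linux perf can then be attached to the resulting binary to read cache-miss counters.

```cpp
// Plan and execute a 1-D complex-to-complex DFT with FFTW. The planner
// (FFTW_MEASURE) times candidate algorithms and picks the fastest for this
// machine, which is the adaptivity the study describes.
#include <fftw3.h>
#include <cstdio>

int main() {
    const int n = 1 << 20;  // problem size; vary this to sweep cache levels
    fftw_complex* in  = fftw_alloc_complex(n);
    fftw_complex* out = fftw_alloc_complex(n);

    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    // Fill inputs only after planning: FFTW_MEASURE overwrites the arrays.
    for (int i = 0; i < n; ++i) { in[i][0] = i % 7; in[i][1] = 0.0; }
    fftw_execute(plan);

    std::printf("out[0] = %f + %fi\n", out[0][0], out[0][1]);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
}
// Example measurement (exact counter names vary by CPU):
//   perf stat -e cache-references,cache-misses ./fft_bench
```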
