Global ETD Search

81	Knihovna operací nad konečnými automaty / Library for Operations over Finite Automata Bartůněk, Petr January 2010 (has links) This work deals with two basic operations over finite automata. Determination of nondeterministic finite automata and minimization of deterministic finite automata. For these two operations I proposed sequential algorithms that are parallelizable. I deal mainly with finding the speedup of SSE instructions, or use the OpenMP library. The trend today is mainly in increasing the number of processors, so I propose parallel algorithms for multiple processors. When searching for the optimal solution, I will be to examine other ways to achieve speedup, for example efficient saving of the data structures in memory.
82	Paralelní genetický algoritmus pro vícejádrové systémy / The Parallel Genetic Algorithm for Multicore Systems Vrábel, Lukáš January 2010 (has links) Genetický algoritmus je optimalizačná metóda zameraná na efektívne hľadanie riešení rozličných problémov. Je založená na princípe evolúcie a prirodzeného výberu najschopnejších jedincov v prírode. Keďže je táto metóda výpočtovo náročná, bolo vymyslených veľa spôsobov na jej paralelizáciu. Avšak väčšina týchto metód je z historických dôvodov založená na superpočítačoch alebo rozsiahlych počítačových systémoch. Moderný vývoj v oblasti informačných technológií prináša na trh osobných počítačov stále lacnejšie a výkonnejšie viacjadrové systémy. Táto práca sa zaoberá návrhom nových metód paralelizácie genetického algoritmu, ktoré sa snažia naplno využiť možnosti práve týchto počítačových systémov. Tieto metódy sú následne naimplementované v programovacom jazyku C za využitia knižnice OpenMP určenej na paralelizáciu. Implementácia je následne použitá na experimentálne ohodnotenie rozličných charakteristík každej z prezentovaných metód (zrýchlenie oproti sekvenčnej verzii, závislosť konvergencie výsledných hodnôt od miery paralelizácie alebo od vyťaženia procesoru, ...). V poslednej časti práce sú prezentované porovnania nameraných hodnôt a závery vyplývajúce z týchto meraní. Následne sú prediskutované možné vylepšenia daných metód vyplývajúce z týchto záverov, ako aj možnosti spracovania väčšieho množstva charakteristík na presnejšie ohodnotenie efektivity paralelizácie genetických algoritmov.
83	A Performance Evaluation of MPI Shared Memory Programming / En utvärdering av MPI shared memory - programmering med inriktning på prestanda Karlbom, David January 2016 (has links) The thesis investigates the Message Passing Interface (MPI) support for shared memory programming on modern hardware architecture with multiple Non-Uniform Memory Access (NUMA) domains. We investigate its performance in two case studies: the matrix-matrix multiplication and Conway’s game of life. We compare MPI shared memory performance in terms of execution time and memory consumption with the performance of implementations using OpenMP and MPI point-to-point communication, also called "MPI two-sided". We perform strong scaling tests in both test cases. We observe that MPI two-sided implementation is 21% and 18% faster than the MPI shared and OpenMP implementations respectively in the matrix-matrix multiplication when using 32 processes. MPI shared uses less memory space: when compared to MPI two-sided, MPI shared uses 45% less memory. In the Conway’s game of life, we find that MPI two-sided implementation is 10% and 82% faster than the MPI shared and OpenMP implementations respectively when using 32 processes. We also observe that not mapping virtual memory to a specific NUMA domain can lead to an increment in execution time of 64% when using 32 processes. The use of MPI shared is viable for intranode communication on modern hardware architecture with multiple NUMA domains. / I detta examensarbete undersöker vi Message Passing Inferfaces (MPI) support för shared memory programmering på modern hårdvaruarkitektur med flera Non-Uniform Memory Access (NUMA) domäner. Vi undersöker prestanda med hjälp av två fallstudier: matris-matris multiplikation och Conway’s game of life. Vi jämför prestandan utav MPI shared med hjälp utav exekveringstid samt minneskonsumtion jämtemot OpenMP och MPI punkt-till-punkt kommunikation, även känd som MPI two-sided. Vi utför strong scaling tests för båda fallstudierna. Vi observerar att MPI-two sided är 21% snabbare än MPI shared och 18% snabbare än OpenMP för matris-matris multiplikation när 32 processorer användes. För samma testdata har MPI shared en 45% lägre minnesförburkning än MPI two-sided. För Conway’s game of life är MPI two-sided 10% snabbare än MPI shared samt 82% snabbare än OpenMP implementation vid användandet av 32 processorer. Vi kunde också utskilja att om ingen mappning av virtuella minnet till en specifik NUMA domän görs, leder det till en ökning av exekveringstiden med upp till 64% när 32 processorer används. Vi kom fram till att MPI shared är användbart för intranode kommunikation på modern hårdvaruarkitektur med flera NUMA domäner. MPI OpenMP Shared Memory HPC Parallel Programming NUMA Computer Sciences Datavetenskap (datalogi)
84	Parallel Kafka Producer Applications : Their performance and its limitations Sundbom, Arvid January 2023 (has links) "This paper examines multi-threaded Kafka producer applications, and how the performance of such applications is affected by how the number of producer instances relates to the number of executing threads. Specifically, the performance of such applications when using a single producer instance, shared among all threads, and when each thread is allotted a separate, private instance, is compared. This comparison is carried out for a number of different producer configurations and varying levels of computational work per message produced.Overall, the data indicates that utilizing private producer instances results in highe rperformance, in terms of data throughput, than sharing a single instance among the executing threads. The magnitude of this difference is affected, to some extent, by the configuration profiles used to create the producer instances, as well as the computational workload of the application hosting the producers. Specifically, configuring producers for reliability seems to increase the difference, and so does increasing the rate at which messages are to be produced.As a result of this, Brod, a wrapper library [56], based on an implementation of a client library for Apache Kafka [25], has been developed. The purpose of the library is to provide functionality which simplifies the development of multi-threadedKafka producer applications." Distriubted Systems Parallel Systems Apache Kafka Rust OpenMP Computer Systems Datorsystem
85	Performance comparison between OOD and DOD with multithreading in games Wingqvist, David, Wickström, Filip January 2022 (has links) Background. The frame rate of a game is important for both the end-user and the developer. Maintaining at least 60 FPS in a PC game is the current standard, and demands for efficient game applications rise. Currently, the industry standard within programming is to use Object-Oriented Design (OOD). But with the trend of larger sized games, this frame rate might not be maintainable using OOD. A design pattern that mitigates this is the Data-Oriented Design (DOD) which focuses on utilizing the CPU and memory efficiently. These design patterns differ in how they handle the data associated with them. Objectives. In this thesis, two games were created with two versions that used either OOD or DOD. The first game had multithreading included. New hardware utilizes several CPU cores, therefore, this thesis compares both singlethreaded and multithreaded versions of these design patterns.Methods. Experiments were made to measure the execution time and cache misses on the CPU. Each experiment started with a baseline that was gradually increased to stress the systems under test.Results. The results gathered from the experiments showed that the sections of the code that used DOD were significantly faster than OOD. DOD also had a better affinity with multithreading and was able to achieve at certain parts up to 13 times the speed of equivalent conditioned OOD. In the special case comparison DOD, even though it had larger objects, proved to be faster than OOD.Conclusions. DOD has shown to be significantly faster in execution time with fewer cache misses compared to OOD. Using multithreading for DOD presented to be the most efficient. Game development C++ Execution time CPU cache OpenMP Computer Systems Datorsystem
86	Approximate Bayesian Inference based on Dense Matrices and New Features using INLA Abdul Fattah, Esmail 30 July 2023 (has links) The Integrated Nested Laplace Approximations (INLA) method has become a commonly used tool for researchers and practitioners to perform approximate Bayesian inference for various fields of applications. It has become essential to incorporate more complex models and expand the method’s capabilities with more features. In this dissertation, we contribute to the INLA method in different aspects. First, we present a new framework, INLA$^+$, based on dense matrices to perform approximate Bayesian inference. An application of the new approach is fitting disease-mapping models for count data with complex interactions. When the precision matrix is dense, the new approach scales better than the existing INLA method and utilizes the power of multiprocessors on shared and distributed memory architectures in today’s computational resources. Second, we propose an adaptive technique to improve gradient estimation for the convex gradient-based optimization framework in INLA. We propose a simple limited-memory technique for improving the accuracy of the numerical gradient of the marginal posterior of the hyperparameter by exploiting a coordinate transformation of the gradient and the history of previously taken descent directions. Third, we extend the commonly utilized Bayesian spatial model in disease mapping, known as the Besag model, into a non-stationary spatial model. This new model considers variations in spatial dependency among a predetermined number of sub-regions. The model incorporates multiple precision parameters, which enable different intensities of spatial dependence in each sub-region. To avoid overfitting and enhance generalization, we derive a joint penalized complexity prior for these parameters. These contributions expand the capabilities of the INLA method, improving its scalability, accuracy, and flexibility for a wider range of applications. INLA Non-Stationary Dense Matrices Bayesian Inference Smart Gradient fbesag INLAPLUS MPI OpenMP Spatial Model
87	A Study of Improving the Parallel Performance of VASP. Baker, Matthew Brandon 13 August 2010 (has links) (PDF) This thesis involves a case study in the use of parallelism to improve the performance of an application for computational research on molecules. The application, VASP, was migrated from a machine with 4 nodes and 16 single-threaded processors to a machine with 60 nodes and 120 dual-threaded processors. When initially migrated, VASP's performance deteriorated after about 17 processing elements (PEs), due to network contention. Subsequent modifications that restrict communication amongst VASP processes, together with additional support for threading, allowed VASP to scale up to 112 PEs, the maximum number that was tested. Other performance-enhancing optimizations that were attempted included replacing old libraries, which produced improvements of about 10%, and prefetching, which degraded, rather than enhanced, VASP performance. cluster parallel performance openmp mpi HPC VASP Computer Sciences Physical Sciences and Mathematics Systems Architecture
88	Predictability of Optimal Core Distribution Based on Weight and Speedup Eriksson, Rasmus January 2022 (has links) Efficient use of hardware resources is a vital part of getting good results within high performance computing. This thesis explores the predictability of optimal CPU-core distribution between two tasks running in parallel on a shared-memory machine, with the intent to reach the shortest total runtime possible. The predictions are based on the weight and speedup of each task, in regards to the CPU-frequency decrease that comes with a growing number of active cores in modern CPUs. The weight of a task is the number of floating point operations needed to compute it to completion. The Intel oneAPI Math Kernel Library is used to create a set of different tasks, where each task consists of a single call to a dgemm-routine. Two prediction algorithms for optimal core distribution are presented and used in this thesis. Their predictions are compared to the fastest distribution observed by either running the tasks back-to-back, with each using all available cores, or running the tasks simultaneously in two parallel regions. Experimental results suggest that there is merit to this method, with the best of the two algorithms having a 14/15 prediction-accuracy of the core distribution resulting in the fastest run. Resource distribution Core distribution Shared memory High performance computing DGEMM OpenMP Computer Sciences Datavetenskap (datalogi)
89	Modernizing and Evaluating the Autotuning Framework of SkePU 3 Nsralla, Basel January 2022 (has links) Autotuning is a method which enables a program to automatically choose the most suitable parameters that optimizes it for a certain goal e.g. speed, cost, etc. In this work autotuning is implemented in the context of the SkePU framework, in order to choose the best backend (CUDA, CPU, OpenCL, Hybrid) that would optimize a skeleton execution in terms of performance. SkePU is a framework that provides different algorithmic skeletons with implementations for the different backends (OpenCL, CUDA, OpenMP, CPU). Skeletons are parameterised with a user-provided per-element function which will run in parallel. This thesis shows how the autotuning of SkePU’s automatic backend selection for skeleton calls is implemented with respect to all the different parameters that a SkePU skeleton could have. The autotuning in this thesis is built upon the sampling technique, which is implemented by applying different combinations of sizes for the vector and matrix parameters to eventually generate an execution plan, which will be used as a lookup table when running the skeleton on all different backends. The execution plan will estimate the best performing backend for the sample. This work finally evaluates the implementation by comparing the results of running the autotuning on the different SkePU programs, to running the same programs without the autotuning. SkePU Autotuning Parallel Computing Multicore OpenCL OpenMP Computer Sciences Datavetenskap (datalogi)
90	Two-Dimensional Convex Hull Algorithm Using the Iterative Orthant Scan / Tvådimensionell Convex Hull Algoritm Med Iterativ Orthant Scan Freimanis, Davis, Hongisto, Jonas January 2017 (has links) Finding the minimal convex polygon on a set of points in space is of great interest in computer graphics and data science. Furthermore, doing this efficiently is financially preferable. This thesis explores how a novel variant of bucketing, called the Iterative orthant scan, can be applied to this problem. An algorithm implementing the Iterative orthant scan was created and implemented in C++ and its real-time and asymptotic performance was compared to the industry standard algorithms along with the traditional textbook convex hull algorithm. The worst-case time complexity was shown to be a logarithmic term worse than the best teoretical algorithm. The real-time performance was 47.3% better than the traditional algorithm and 66.7% worse than the industry standard for a uniform square distribution. The real-time performance was 61.1% better than the traditional algorithm and 73.4% worse than the industry standard for a uniform triangular distribution. For a circular distribution, the algorithm performed 89.6% worse than the traditional algorithm and 98.7% worse than the industry standard algorithm. The asymptotic performance improved for some of the distributions studied. Parallelization of the algorithm led to an average performance improvement of 35.9% across the tested distributions. Although the created algorithm exhibited worse than the industry standard algorithm, the algorithm performed better than the traditional algorithm in most cases and shows promise for parallelization. / Att hitta den minsta konvexa polygonen för en mängds punkter är av intresse i ett flertal områden inom datateknik. Att hitta denna polygon effektivt är av ekonomiskt intresse. Denna rapport utforskar hur metoden Iterative orthant scan kan appliceras på problemet att hitta detta konvexa hölje. En algoritm implementerades i C++ som utnyttjar denna metod och dess prestanda jämfördes i realtid såsom asymptotiskt mot den traditionella och den mest använda algoritmen. Den nya algoritmens asymptotiska värsta fall visades vara ett logaritmiskt gradtal sämre än den bästa teoretiska algoritmen. Realtidsprestandan av den nya algoritmen var 47,3% bättre än den traditionella algoritmen och 66,7% sämre än den mest använda algoritmen för fyrkantsdistribution av punkterna. Realtidsprestandan av den nya algoritmen var 61,1% bättre än den traditionella algoritmen och 73,4% sämre än den mest använda algoritmen för triangulär distribution av punkterna. För cirkulär distribution var vår algoritm 89,6% sämre än den traditionella algoritmen och 98,7% sämre än den vanligaste algoritmen. Parallellisering av vår algoritm ledde i medel till en förbättring på 35,9%. För vissa typer av fördelningar blev denna prestanda bättre. Även då algoritmen hade sämre prestanda än den vanligaste algoritmen, så är det ett lovande steg att prestandan var bättre än den traditionella algoritmen. Convex hull parallelization OpenMP Iterative orthant scan computer graphics minimal polygon Computer Sciences Datavetenskap (datalogi)

Search results