Global ETD Search

1	Structure-based Optimizations for Sparse Matrix-Vector Multiply
Belgin, Mehmet (16 January 2011)
This dissertation introduces two novel techniques, OSF and PBR, to improve the performance of Sparse Matrix-vector Multiply (SMVM) kernels, which dominate the runtime of iterative solvers for systems of linear equations. SMVM computations that use sparse formats typically achieve only a small fraction of peak CPU speeds because they are memory bound due to their low flops:byte ratio, access memory irregularly, and exhibit poor ILP due to inefficient pipelining. We particularly focus on improving the flops:byte ratio, which is the main limiter on performance, by exploiting recurring structures or sub-structures in matrices. Our techniques also support micro-architecture level optimizations to further improve performance. Operation Stacking Framework (OSF) stacks problems in large ensemble computations, which run the same sparse kernel using an identical matrix structure, such that they share a single copy of the indexing information to significantly reduce memory bandwidth usage. OSF provides performance improvements of up to 1.94x on an AMD Opteron compared to the CSR method. We validate performance results using hardware event counters, which demonstrate significantly improved cache and pipeline utilization. Pattern-based Representation (PBR) exploits recurring block nonzero patterns by generating custom code for each recurring block pattern. In this way, no indexing data for individual nonzero elements are read from memory, reducing the overall size of the indices by up to 98%. Our code generator emits highly tuned code that utilizes SSE vectorization and software prefetching. PBR accurately identifies a block size that achieves optimal or near-optimal performance using a multiple linear regression performance model. On recent multicore machines, PBR provides performance improvements of up to 3.4x sequentially and 5x in parallel, compared to the CSR method. The PBR library we provide converts matrices at runtime, allowing our method to be used as a drop-in replacement for existing methods. We compare PBR's overhead relative to its benefits and show that PBR is beneficial for many applications that repeatedly call the SMVM kernel for the same matrix structure. / Ph. D.
Keywords: Code Generators Vectorization Sparse SpMV SMVM Matrix Vector Multiply PBR OSF thread pool parallel SpMV
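
The operation-stacking idea described in the abstract above can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the interleaved value layout, and the sample data are all assumptions made for illustration, and pure Python will not reproduce the reported speedups. It only shows how several problems with an identical sparsity structure can share one copy of the CSR index arrays.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Baseline CSR sparse matrix-vector multiply: returns y = A x."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def stacked_spmv(row_ptr, col_idx, stacked_values, stacked_x):
    """Multiply m same-structure matrices by m vectors in one pass.

    Assumed (hypothetical) layout:
      stacked_values[k][p] is the k-th nonzero of problem p
      stacked_x[j][p]      is entry j of problem p's input vector
    Returns y with y[i][p] = row i of problem p's result.
    """
    m = len(stacked_x[0])
    n_rows = len(row_ptr) - 1
    y = [[0.0] * m for _ in range(n_rows)]
    for i in range(n_rows):
        yi = y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]  # index is read once and reused by all m problems
            vals, xs = stacked_values[k], stacked_x[j]
            for p in range(m):
                yi[p] += vals[p] * xs[p]
    return y


# Two 3x3 problems that share one sparsity structure (made-up data).
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values_a = [1.0, 2.0, 3.0, 4.0, 5.0]
values_b = [10.0, 20.0, 30.0, 40.0, 50.0]
x_a = [1.0, 1.0, 1.0]
x_b = [1.0, 2.0, 3.0]

# Interleave the per-problem data so each nonzero's values sit together.
stacked_values = [list(pair) for pair in zip(values_a, values_b)]
stacked_x = [list(pair) for pair in zip(x_a, x_b)]

y = stacked_spmv(row_ptr, col_idx, stacked_values, stacked_x)
print(y)  # [[3.0, 70.0], [3.0, 60.0], [9.0, 190.0]]
```

In this sketch each entry of `row_ptr` and `col_idx` is loaded once per nonzero regardless of how many problems are stacked, which is the source of the reduced index traffic the abstract attributes to OSF.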
2	Optimizing Multi-Queue in Parallel Systems with Task Batching
Ronestjärna, Jakob (January 2024)
Multi-queue is a proven solution for problems involving high input and output rates on any type of hardware. Improving it with task batching may therefore increase performance and reduce overhead. While multi-queue and task batching have each been studied, combining them has only been mentioned briefly, specifically for priority queues. In this experiment, system metrics from timers and block misses indicate potential areas where task batching is beneficial. The results of the experiment show that the batch size is related to the number of input workers. The results give some insight into further improvements of the multi-queue, and overhead was found to be reduced whenever task batching was used.
Keywords: multi-queue task batching batching parallel systems thread pool overhead scaling scaling vertically non-blocking Computer Sciences Datavetenskap (datalogi)
3	Jack Rabbit: an effective Cell BE programming system for high performance parallelism
Ellis, Apollo Isaac Orion (08 July 2011)
The Cell processor is an example of the trade-offs made when designing a mass market power efficient multi-core machine, but the machine-exposing architecture and raw communication mechanisms of Cell are hard to manage for a programmer. Cell's design is simple and causes software complexity to go up in the areas of achieving low threading overhead, good bandwidth efficiency, and load balance. Several attempts have been made to produce efficient and effective programming systems for Cell, but the attempts have been too specialized and thus fall short. We present Jack Rabbit, an efficient thread pool work queue implementation, with load balancing mechanisms and double buffering. Our system incurs low threading overhead, gets good load balance, and achieves bandwidth efficiency. Our system represents a step towards an effective way to program Cell and any similar current or future processors. / text
Keywords: Cell processor Multi-core systems High performance computing Runtime Barnes Hut LU factorization Mandelbrot Double buffering Thread pool Work queue Load balance
