<p>This thesis describes novel techniques and test implementations for optimizing numerically intensive codes. Our main focus is on how given algorithms can be adapted to run efficiently on modern microprocessors, exploiting several architectural features including instruction selection and access patterns related to having several levels of cache. Our approach is also shown to be relevant for multicore architectures. Our primary target applications are linear algebra routines in the form of matrix multiplication with dense matrices. We analyze how current compilers, microprocessors, and common optimization techniques (such as loop tiling and data relocation) interact. A tunable assembly code generator is developed, built, and tested on a basic BLAS level-3 routine to side-step some of the performance issues of modern compilers. Our generator has been tested on both the Intel Pentium 4 and Intel Core 2 processors. For the Pentium 4, a 10.8% speed-up is achieved over ATLAS's rank2k, and a 17% speed-up is achieved over MKL's implementation for 4000-by-4032 matrices. On the Core 2 we optimize our code for 2000-by-2000 matrices and achieve 24% and 5% speed-ups over ATLAS and MKL, respectively, with our multi-threaded implementation. Decent speed-ups are also shown for other matrix sizes. Considering that our implementation is far from fully tuned, we consider these results very respectable.</p>
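<p>To illustrate the loop tiling mentioned in the abstract, the following is a minimal C sketch of a cache-blocked dense matrix multiply. It is not the thesis's tuned assembly generator or its rank2k kernel; the block size TILE is an assumed, hypothetical tuning parameter that would in practice be matched to the cache hierarchy.</p>
<pre><code>#include &lt;stddef.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;stdio.h&gt;

#define TILE 64  /* hypothetical block size; would be tuned per cache level */

/* Cache-blocked (tiled) C = C + A*B for dense row-major n-by-n matrices.
 * The outer loops walk over TILE-by-TILE blocks so that the working set
 * of the inner loops stays resident in cache between reuses. */
static void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii &lt; n; ii += TILE)
        for (size_t kk = 0; kk &lt; n; kk += TILE)
            for (size_t jj = 0; jj &lt; n; jj += TILE) {
                size_t i_end = ii + TILE &lt; n ? ii + TILE : n;
                size_t k_end = kk + TILE &lt; n ? kk + TILE : n;
                size_t j_end = jj + TILE &lt; n ? jj + TILE : n;
                for (size_t i = ii; i &lt; i_end; ++i)
                    for (size_t k = kk; k &lt; k_end; ++k) {
                        double a = A[i * n + k];
                        /* Unit-stride access over B and C rows */
                        for (size_t j = jj; j &lt; j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}

int main(void)
{
    size_t n = 512;
    double *A = calloc(n * n, sizeof *A);
    double *B = calloc(n * n, sizeof *B);
    double *C = calloc(n * n, sizeof *C);
    if (!A || !B || !C) return 1;
    for (size_t i = 0; i &lt; n * n; ++i) { A[i] = 1.0; B[i] = 2.0; }
    matmul_tiled(n, A, B, C);
    printf("C[0] = %f\n", C[0]);  /* expect 1024.0 for n = 512 */
    free(A); free(B); free(C);
    return 0;
}
</code></pre>
<p>A production kernel, as discussed in the thesis, would additionally choose instruction sequences and data layouts per microarchitecture; the sketch above only shows the blocking idea itself.</p>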
Identifier | oai:union.ndltd.org:UPSALLA/oai:DiVA.org:ntnu-9827 |
Date | January 2009 |
Creators | Jensen, Rune Erlend |
Publisher | Norwegian University of Science and Technology, Department of Computer and Information Science, Institutt for datateknikk og informasjonsvitenskap |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, text |