71

Improving the Performance of Parallel Applications in Chip Multiprocessors with Architectural Techniques

Jahre, Magnus January 2007 (has links)
Chip Multiprocessors (CMPs), or multi-core architectures, are a new class of processor architectures in which multiple processing cores are placed on the same physical chip. For a single application to reach the performance potential of these architectures, it must be multi-threaded. In such applications, the processing cores cooperate to solve a single task, which in many cases requires a large amount of inter-processor communication. Consequently, CMPs need to support this communication efficiently. Investigating inter-processor communication in CMPs requires a good understanding of the state of the art in CMP design options, interconnect network design and cache coherence protocol solutions. Furthermore, a good computer architecture simulator is needed to evaluate both new and conventional architectural solutions. The M5 simulator is used for this purpose and has been extended with a generic split-transaction bus, a crossbar based on the IBM Power 5 crossbar, a butterfly network and an ideal interconnect. The unrealistic ideal interconnect provides an upper bound on the performance improvement available from enhancing the interconnect. In addition, a directory-based coherence protocol proposed by Stenström has been implemented. The performance of 2-, 4- and 8-core CMPs with crossbar and bus interconnects, private L1 caches and shared L2 caches is investigated; the bus and the crossbar are the conventional ways of implementing the L1-to-L2 cache interconnect. These configurations have been evaluated with multiprogrammed workloads from the SPEC2000 benchmark suite and parallel, scientific benchmarks from the SPLASH-2 benchmark suite. With multiprogrammed workloads, the crossbar interconnect configurations perform nearly as well as a configuration with an ideal interconnect. However, the performance of the crossbar CMPs is similar to that of the bus CMPs when there is intensive L1-to-L1 cache communication; the reason is limited L1-to-L1 bandwidth. The bus CMPs experience a severe performance degradation with some benchmarks for all processor counts and workload classes. A butterfly interconnect is proposed to alleviate the L1-to-L1 communication bottleneck. The butterfly CMP performs on average 3.9 times better than the bus CMP and 3.8 times better than the crossbar CMP with 8 processor cores. These numbers are based on the WaterNSquared, Raytrace, Radix and LUNoncontig benchmarks, as the other SPLASH-2 benchmarks had issues with the M5 thread implementation for these configurations. For the multiprogrammed workloads, the butterfly CMPs are slightly slower than the crossbar CMPs.
72

Directional Decomposition of Images: Implementation Issues Including GPU Techniques

Dubois, Jérôme January 2008 (has links)
Directional decomposition of an image consists of separating it into several components, each containing directional information in some specific directions. It has many applications in digital image processing, such as image enhancement or linear feature detection, and could be used on seismic data to help geophysicists find faults. In this thesis, we look at a directional filter bank (DFB) introduced by Bamberger and Smith and how to implement it efficiently on the CPU and GPU. Graphics Processing Units (GPUs) are becoming increasingly suitable for general scientific computing, and applications with suitable properties run much quicker on a GPU than on a CPU. For instance, NVIDIA CUDA (Compute Unified Device Architecture) is a new programming interface that lets users program NVIDIA General Purpose GPUs (GPGPUs) in a C-like fashion for intensive data-parallel computation. We translate the DFB algorithm from a theoretical signal-processing description to an algorithmic description from a computer scientist's point of view, including a readable C implementation. Tools are developed to ease our DFB investigation, including a tailored library for manipulating images in suitable text-based and binary formats and for generating test images with suitable properties. Several implementations of 1D filter banks are also provided. Finally, part of the Bamberger DFB is implemented efficiently using the CUDA environment for NVIDIA GPUs. We show that directional filter banks can be executed efficiently on GPUs and demonstrate that the CPU-GPU bandwidth affects performance considerably. Hence, care should be taken to do as many steps as possible on the GPU before returning results to the CPU.
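For readers unfamiliar with filter banks, the following is a minimal C sketch of the two-channel 1D analysis step that underlies such decompositions: convolve the input with a lowpass and a highpass filter, then downsample each branch by two. It illustrates the general technique only, not the thesis code; the filter coefficients h0/h1 are assumed to be supplied by the caller.

```c
#include <stddef.h>

/* Convolution of x with an FIR filter h at sample i, with zero padding
   at the left border. */
static double conv_at(const double *x, size_t n,
                      const double *h, size_t taps, size_t i)
{
    double acc = 0.0;
    for (size_t k = 0; k < taps; k++)
        if (i >= k)
            acc += h[k] * x[i - k];
    return acc;
}

/* Two-channel analysis filter bank: y0 and y1 each receive n/2 samples
   (filtered, then downsampled by 2). */
void analysis_fb(const double *x, size_t n,
                 const double *h0, const double *h1, size_t taps,
                 double *y0, double *y1)
{
    for (size_t i = 0; i < n; i += 2) {
        y0[i / 2] = conv_at(x, n, h0, taps, i);  /* lowpass branch  */
        y1[i / 2] = conv_at(x, n, h1, taps, i);  /* highpass branch */
    }
}
```

Because each output sample depends only on a small window of the input, both branches are data-parallel, which is the property that makes this family of algorithms a good fit for GPUs.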
73

Co-design implementation of FPGA hardware acceleration of DNA motif identification

Linvåg, Elisabeth January 2008 (has links)
Pattern matching in bio-informatics is a discipline in steady growth, with a great need for searching through large amounts of data. At NTNU, a prototype specified in VHDL has been developed for an FPGA solution that identifies short motifs or patterns in genetic data using a Position-Weight Matrix (PWM). But programming FPGAs using VHDL is a complicated and time-consuming process that requires intimate knowledge of how hardware works, and the prototype is not yet complete in terms of required functionality. Consequently, a desirable alternative is to make use of co-design languages to facilitate the use of hardware for a software developer, as well as to integrate the environments for developing software and hardware. This thesis deals with the specification and implementation of a co-design-based alternative to the existing VHDL-based solution, as well as an evaluation of the productivity and final performance of the newly developed solution compared to the VHDL-based one. The chosen co-design language is Impulse-C, created by Impulse Accelerated Technologies Inc., a co-design language designed for data-flow-oriented applications but with the flexibility to support other programming models as well. The programming model simplifies the expression of highly parallel algorithms through the use of well-defined data communication, message passing and synchronization mechanisms. The affiliated development environment, CoDeveloper, contains tools that allow the FPGA system to be developed and debugged using Impulse-C. The software-to-hardware compiler and optimizer translates C-language processes to (RTL) VHDL code while optimizing the generated logic and identifying opportunities for parallelism. The ease of use of the CoDeveloper environment is evaluated in this thesis, based on the author's experiences with the tools. In total, four variations of the Impulse-C solution have been implemented: a basic solution and a multicore solution, each in a floating-point and a 'fixed-point' version. The implemented solutions are analyzed through various experiments described in this thesis, carried out in simulation using CoDeveloper. Attempts were made to run the solutions on the target platform, the Cray XD1 supercomputer Musculus, but these were unsuccessful. A wrong choice of properties and constraints in Xilinx ISE is believed to have caused the FPGA programming file to be generated incorrectly; there was no time to confirm and correct this. Some information about device utilization and performance could still be extracted from the Xilinx ISE 'Static timing' and 'Place and route' reports.
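For illustration, here is a minimal C sketch of the computation being accelerated, PWM-based motif scanning: slide a fixed-length window over the sequence and sum the per-position weights of the observed bases, reporting windows that score above a threshold. The motif length, names and weights are assumptions for the sketch, not taken from the NTNU prototype.

```c
#include <stddef.h>

#define MOTIF_LEN 8  /* illustrative motif length */

/* Map a DNA base to a PWM column index; -1 for unknown symbols. */
static int base_index(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Scan 'seq' of length n with a PWM (e.g. log-odds weights trained
   elsewhere); store match start positions in 'hits', return the count. */
size_t scan_pwm(const char *seq, size_t n,
                const double pwm[MOTIF_LEN][4], double threshold,
                size_t *hits)
{
    size_t count = 0;
    for (size_t i = 0; i + MOTIF_LEN <= n; i++) {
        double score = 0.0;
        int valid = 1;
        for (size_t j = 0; j < MOTIF_LEN && valid; j++) {
            int b = base_index(seq[i + j]);
            if (b < 0) valid = 0;         /* skip windows with unknowns */
            else score += pwm[j][b];
        }
        if (valid && score >= threshold)
            hits[count++] = i;
    }
    return count;
}
```

The inner loop is short, fixed-length and branch-light, which is why the workload maps well onto FPGA pipelines.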
74

Online Meat Cutting Optimisation

Wikborg, Uno January 2008 (has links)
Nortura, Norway’s largest producer of meat, faces many challenges in its operation. One of these is to decide which products to make out of each slaughtered animal. The meat from an animal can be made into different products, some more valuable than others; however, the products must also find buyers, so it is important to produce what the customers ask for. This thesis describes a computer system based on online optimisation which helps the meat cutters decide what to make. Two different meat cutting plants have been visited to specify how the system should work, and this information has been used to develop a program that recommends what to produce from each carcass during cutting. The system considers both the attributes of the animals and the orders from the customers. The main focus of the thesis is how to deal with the fact that the attributes are only known for a small number of the animals, since they are measured right after slaughtering. A method has been developed to calculate what should be made from the different carcasses, and it has been realised with both exact and heuristic algorithms.
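As a rough illustration of the online decision, consider a greedy sketch in C: each arriving carcass is assigned the cutting pattern whose products best match the remaining customer demand. This is one possible heuristic under assumed data structures, not the thesis's method (which also includes exact algorithms); patterns, values and demands are hypothetical.

```c
#define NUM_PATTERNS 4
#define NUM_PRODUCTS 6

typedef struct {
    int yield[NUM_PRODUCTS];  /* units of each product this pattern gives */
} Pattern;

/* Pick the pattern maximizing the value of still-demanded products,
   then decrement the open demand. Returns the chosen pattern index. */
int choose_pattern(const Pattern patterns[NUM_PATTERNS],
                   const double value[NUM_PRODUCTS],
                   int demand[NUM_PRODUCTS])
{
    int best = 0;
    double best_score = -1.0;
    for (int p = 0; p < NUM_PATTERNS; p++) {
        double score = 0.0;
        for (int k = 0; k < NUM_PRODUCTS; k++) {
            /* only count units some customer still asks for */
            int useful = patterns[p].yield[k] < demand[k]
                       ? patterns[p].yield[k] : demand[k];
            score += value[k] * useful;
        }
        if (score > best_score) { best_score = score; best = p; }
    }
    for (int k = 0; k < NUM_PRODUCTS; k++) {
        int used = patterns[best].yield[k] < demand[k]
                 ? patterns[best].yield[k] : demand[k];
        demand[k] -= used;
    }
    return best;
}
```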
75

Empirical evaluation of metric indexing methods

Fevang, Rune, Fossaa, Arne Bergene January 2008 (has links)
Metric indexing is a branch of search technology designed for searching non-textual data. Examples include image search (where the search query is an image), document search (finding documents that are roughly equal) and search in high-dimensional Euclidean spaces. Metric indexing is based on the theory of metric spaces, where the only thing known about a set of objects is the distance between them (defined by a metric distance function). A large number of methods have been proposed to solve the metric indexing problem. In this thesis, we have concentrated on new approaches to solving these problems, as well as on combining existing methods to create better ones. The methods studied in this thesis include D-Index, GNAT, EMVP-Forest, HC, SA-Tree, SSS-Tree, M-Tree, PM-Tree, M*-Tree and PM*-Tree. These have all been implemented and tested against each other to find strengths and weaknesses. This thesis also studies a group of indexing methods called hybrid methods, which combine tree-based methods (like SA-Tree, SSS-Tree and M-Tree) with pivoting methods (like AESA and LAESA). The thesis also proposes a method for creating hybrid trees from existing trees by using features of the programming language. Hybrid methods are shown in this thesis to be very promising: while they may have considerable overhead in construction time, CPU usage and/or memory usage, they show large benefits in a reduced number of distance computations. We also propose a new way of calculating the minimum spanning tree of a graph of metric objects, and show that it reduces the number of distance computations needed.
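To illustrate the pivoting idea behind methods like AESA and LAESA, here is a minimal C sketch: distances from every object to a few pivots are precomputed, and the triangle inequality then gives a lower bound on d(q, o) that lets many exact distance computations be skipped. The structure and names are illustrative assumptions, not the thesis implementation.

```c
#include <math.h>
#include <stddef.h>

#define NUM_PIVOTS 4

typedef struct {
    double to_pivot[NUM_PIVOTS];  /* precomputed d(object, pivot_i) */
    const void *data;
} Object;

typedef double (*Metric)(const void *a, const void *b);

/* Range query: report objects within 'radius' of 'query'.
   'q_to_pivot' holds d(query, pivot_i), computed once per query. */
size_t range_query(const Object *objs, size_t n,
                   const void *query, const double q_to_pivot[NUM_PIVOTS],
                   double radius, Metric d, const Object **out)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        /* Triangle inequality: max_i |d(q,p_i) - d(o,p_i)| <= d(q,o). */
        double lb = 0.0;
        for (int p = 0; p < NUM_PIVOTS; p++) {
            double diff = fabs(q_to_pivot[p] - objs[i].to_pivot[p]);
            if (diff > lb) lb = diff;
        }
        if (lb > radius)
            continue;  /* pruned without an exact distance computation */
        if (d(query, objs[i].data) <= radius)
            out[hits++] = &objs[i];
    }
    return hits;
}
```

The trade-off mirrored in the thesis results is visible here: the pivot table costs memory and construction-time distance computations, but each successful prune saves a potentially expensive metric evaluation at query time.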
76

Optimizing & Parallelizing a Large Commercial Code for Modeling Oil-well Networks

Rudshaug, Atle January 2008 (has links)
In this project, a complex, serial application that models networks of oil wells is analyzed for today's parallel architectures. Through heavy use of the profiling tool Valgrind, several serial optimizations are achieved, yielding speedups of 30-50x on previously dominant sections of the code on different architectures. Our initial main goal is to parallelize the application for GPGPUs (General Purpose Graphics Processing Units) such as the NVIDIA GeForce 8800GTX. However, our optimized application is shown not to have a high enough computational intensity to be suitable for GPU platforms, with the data transfer over the PCI Express port proving to be a serious bottleneck. We then target our application at another, more common, parallel architecture -- the multi-core CPU. Instead of focusing on the low-level hotspots found by the profiler, a new approach is taken: by analyzing the functionality of the application and the problem it solves, the high-level structure of the application is identified. A thread pool in combination with a task queue, which fits the structure of the application, is implemented using PThreads in Linux; it also supports nested parallel queues while maintaining all serial dependencies. However, the sheer size and complexity of the serial application introduce many problems when trying to go multithreaded: a tight coupling of all parts of the code introduces several race conditions, creating erroneous results for complex cases. Our focus hence shifts to developing models that help analyze how suitable applications that traverse dependence-tree structures, such as our oil-well network application, are for parallelization, given benchmarks of the node times. First, we benchmark the serial execution of each child in the network and predict the overall parallel performance by computing dummy tasks reflecting these times on the same tree structure, for two given well networks: a large and a small case. Based on these benchmarks, we then predict the speedup of the two cases under the assumption of balanced loads on each level in the network. Finally, the minimum amount of time needed to calculate a given network is predicted. Our predictions show low scalability, due to the nature of the oil networks in the test cases. This project thus concludes that the amount of work needed to successfully introduce multithreading in this application might not be worth it, given all the serial dependencies in the problem the application solves. However, if there are multiple individual networks to be calculated, we suggest using Grid technology to manage multiple individual instances of the application simultaneously. This can be done either with script files or by adding DRMAA API calls to the application. This, in combination with further serial optimizations, is the way to achieve good speedup for these types of applications.
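A minimal sketch of the thread-pool-plus-task-queue pattern described above, using PThreads as in the project; the nested parallel queues and dependency tracking are omitted, and all names are illustrative rather than taken from the application.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct task {
    void (*fn)(void *);
    void *arg;
    struct task *next;
} task_t;

typedef struct {
    pthread_mutex_t lock;     /* init with PTHREAD_MUTEX_INITIALIZER */
    pthread_cond_t  nonempty; /* init with PTHREAD_COND_INITIALIZER  */
    task_t *head, *tail;
    int shutdown;
} pool_t;

/* Worker loop: each of the N pool threads runs this function. */
static void *worker(void *p)
{
    pool_t *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (!pool->head && !pool->shutdown)
            pthread_cond_wait(&pool->nonempty, &pool->lock);
        if (pool->shutdown && !pool->head) {
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        task_t *t = pool->head;            /* dequeue one task */
        pool->head = t->next;
        if (!pool->head) pool->tail = NULL;
        pthread_mutex_unlock(&pool->lock);

        t->fn(t->arg);                     /* run outside the lock */
        free(t);
    }
}

/* Enqueue a task and wake one sleeping worker. */
void pool_submit(pool_t *pool, void (*fn)(void *), void *arg)
{
    task_t *t = malloc(sizeof *t);
    t->fn = fn; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&pool->lock);
    if (pool->tail) pool->tail->next = t; else pool->head = t;
    pool->tail = t;
    pthread_cond_signal(&pool->nonempty);
    pthread_mutex_unlock(&pool->lock);
}
```

The race conditions mentioned above arise when the submitted tasks share mutable state; the queue itself is safe, but nothing in this pattern protects data touched inside the task functions.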
77

Discriminating Music, Speech and other Sounds and Language Identification

Strømhaug, Tommy January 2008 (has links)
The tasks of discriminating music, speech and other sounds, and of identifying languages, have a broad range of applications in today's multilingual multimedia community. Both tasks offered many possibilities regarding methods and development tools, which also brings some risk. The Language Identification (LID) problem ended up with two different approaches: one was discarded due to poor results in the pre-study, while the other had some promising potential but did not deliver as hoped. On the other hand, the music/speech discrimination was solved with great accuracy using three simple time-domain features and Support Vector Machines (SVMs). Adding 'other sounds' to this discrimination problem complicated it, but the final solution delivered great results using the enormous BBC Sound Effects library as examples of sounds that are neither speech nor music. Attempts were made to solve both tasks using Gaussian Mixture Models (GMMs) because of their well-known ability to model arbitrary feature-space segmentations. The tools used were Matlab together with a number of different toolboxes explained further in the text.
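The abstract does not name the three time-domain features, so as an assumption, here is a C sketch of two commonly used ones for music/speech discrimination, zero-crossing rate and short-time energy, computed over one analysis frame.

```c
#include <stddef.h>

/* Zero-crossing rate: fraction of adjacent sample pairs that change
   sign. Speech tends to alternate high-ZCR (unvoiced) and low-ZCR
   (voiced) frames more than music does. */
double zero_crossing_rate(const float *x, size_t n)
{
    size_t crossings = 0;
    for (size_t i = 1; i < n; i++)
        if ((x[i-1] >= 0.0f) != (x[i] >= 0.0f))
            crossings++;
    return (double)crossings / (double)(n - 1);
}

/* Short-time energy: mean squared amplitude of the frame. Its variance
   across frames is another common discriminator. */
double short_time_energy(const float *x, size_t n)
{
    double e = 0.0;
    for (size_t i = 0; i < n; i++)
        e += (double)x[i] * x[i];
    return e / (double)n;
}
```

Per-frame values like these would then be aggregated over a clip and fed to the SVM as a feature vector.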
78

Linux Support for AVR32 UC3A : Adaption of the Linux kernel and toolchain

Driveklepp, Pål, Morken, Olav, Rangøy, Gunnar January 2009 (has links)
The use of Linux in embedded systems is steadily growing in popularity. The UC3A is a series of high-performance, low-power 32-bit microcontrollers aimed at several industrial and commercial applications, including PLCs, instrumentation, phones, vending machines and more. The main goal of this project was to complete the adaptation of the Linux kernel, compiler and loader software, in order to enable the Linux kernel to load and run applications on this device. In addition, a set of useful applications should be picked, compiled and tested on the target platform to demonstrate a complete software solution. This master's thesis is a continuation, by the same three students, of the work of a student project during the fall of 2008. In this report we present in detail the findings, challenges, choices and solutions involved in the working process. During the course of this project, we have successfully adapted the Linux kernel, and a toolchain for generating binaries loadable by Linux. A set of test applications has been compiled and tested on the resulting platform. The project has resulted in the submission of a revised patch series for the U-Boot boot loader, one patch series for Linux, and one for the toolchain. Requirements have been created, and tests for the requirements have been carried out.
79

Multimodal Behaviour Generation Frameworks in Virtual Heritage Applications : A Virtual Museum at Sverresborg

Stokes, Michael James January 2009 (has links)
This master's thesis proposes that multimodal behaviour generation frameworks are an appropriate way to increase the believability of animated characters in virtual heritage applications. To investigate this proposal, an existing virtual museum guide application developed by the author is extended by integrating the Behavioural Markup Language (BML) and the open-source BML realiser SmartBody. The architectural and implementation decisions involved in this process are catalogued and discussed. The integration of BML and SmartBody results in a dramatic improvement in the quality of character animation in the application, as well as greater flexibility and extensibility, including the ability to create scripted sequences of behaviour for multiple characters in the virtual museum. The successful integration confirms that multimodal behaviour generation frameworks have a place in virtual heritage applications.
80

Modeling Communication on Multi-GPU Systems

Spampinato, Daniele January 2009 (has links)
Coupling commodity CPUs with modern GPUs yields heterogeneous systems that are cheap and deliver high performance with impressive FLOPS counts. The recent evolution of GPGPU models and technologies makes these systems even more appealing as compute devices for a range of HPC applications, including image processing, seismic processing and other physical modeling, as well as linear programming applications. In fact, graphics vendors such as NVIDIA and AMD are now targeting HPC with some of their products. Due to the power and frequency walls, the trend is now to use multiple GPUs on a given system, much as multiple cores are found on CPU-based systems. However, a deeper resource hierarchy widens the spectrum of factors that may impact the performance of the system, and the lack of good models for GPU-based, heterogeneous systems makes it harder to understand which factors impact performance the most. The goal of this thesis is to analyze such factors by investigating and benchmarking NVIDIA's multi-GPU solution, the recent NVIDIA Tesla S1070 Computing System. This system combines four T10 GPUs, making up to 4 TFLOPS of computational power available. Based on a comparative study of fundamental parallel computing models and on the specific heterogeneous features exposed by the system, we define a test space for performance analysis. As a case study, we develop a red-black SOR PDE solver for Laplace equations with Dirichlet boundaries, well known for requiring constant communication to exchange neighboring data. To aid both design and analysis, we propose a model for multi-GPU systems targeting communication between the several GPUs. The main variables exposed by the benchmark application are: domain size and shape, kind of data partitioning, number of GPUs, width of the borders to exchange, kernels to use, and kind of synchronization between the GPU contexts. Among other results, the framework is able to point out the most critical bounds of the S1070 system when dealing with applications like the one in our case study. We show that the multi-GPU system benefits greatly from using all four of its GPUs on very large data volumes; our results show the four GPUs to be almost four times faster than a single GPU, and twice as fast as two. Our analysis also allows us to refine our static communication model, enriching it with regression-based predictions.
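A minimal single-device C sketch of the red-black SOR update for the 2D Laplace equation with Dirichlet boundaries (the case-study kernel); the memory layout and parameter names are assumptions. In the multi-GPU setting, the border rows or columns touched by each sweep are exactly the data that must be exchanged between GPUs.

```c
#include <stddef.h>

/* One half-sweep over an n-by-n grid u (row-major, u[i*n + j]).
   Boundary cells are held fixed (Dirichlet); 'omega' is the SOR
   relaxation factor; 'color' selects red (0) or black (1) cells. */
void sor_sweep(double *u, size_t n, double omega, int color)
{
    for (size_t i = 1; i + 1 < n; i++) {
        for (size_t j = 1; j + 1 < n; j++) {
            if ((int)((i + j) % 2) != color)
                continue;  /* update only cells of this color */
            double gs = 0.25 * (u[(i-1)*n + j] + u[(i+1)*n + j] +
                                u[i*n + (j-1)] + u[i*n + (j+1)]);
            u[i*n + j] += omega * (gs - u[i*n + j]);
        }
    }
}

/* One full iteration: red cells, then black cells. Every neighbor of a
   red cell is black, so each half-sweep is fully data-parallel -- the
   property that makes the method a natural fit for GPUs. */
void sor_iteration(double *u, size_t n, double omega)
{
    sor_sweep(u, n, omega, 0);
    sor_sweep(u, n, omega, 1);
}
```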
