41

Performance Analysis of kNN on large datasets using CUDA & Pthreads : Comparing between CPU & GPU

Kankatala, Sriram January 2015 (has links)
Many organizations maintain large databases that grow rapidly and must be searched efficiently. Content-based searches retrieve similar items based on features extracted from multimedia data. For applications such as multimedia content retrieval, data mining, and pattern recognition, nearest neighbor search in multidimensional data is a challenging task, and the key factors in k-nearest-neighbor (kNN) search are search speed and accuracy. Implementing kNN on the GPU has been an active research topic in recent years, with the focus on improving its performance; our work started from these considerations and from a gap identified in this research area. This master thesis demonstrates effective and efficient parallelism on multi-core CPUs and GPUs and compares the performance with a single-core CPU. It presents an experimental implementation of kNN on a single-core CPU, a multi-core CPU, and a GPU using C, Pthreads, and CUDA respectively, evaluated over different input sizes and dimensionalities. The experiments show that for kNN the GPU outperforms the single-core CPU by a factor of approximately 5.8 to 16 and the multi-core CPU by a factor of approximately 1.2 to 3, depending on the input.
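A minimal sketch of the brute-force approach such a comparison rests on: the distance pass is embarrassingly parallel, so each Pthread handles a slice of the reference points before a simple selection of the k nearest. The data sizes, thread count, and the naive selection step are illustrative assumptions, not the thesis's implementation.

```c
/* Hedged sketch: brute-force kNN distance pass parallelized with Pthreads. */
#include <pthread.h>
#include <float.h>

#define DIM      16      /* assumed dimensionality */
#define NREF     10000   /* assumed number of reference points */
#define NTHREADS 4

typedef struct {
    const float *query;  /* one query point, DIM values */
    const float *ref;    /* NREF x DIM reference points */
    float *dist;         /* output: NREF squared distances */
    int begin, end;      /* slice of reference points for this thread */
} knn_task_t;

static void *distance_worker(void *arg)
{
    knn_task_t *t = (knn_task_t *)arg;
    for (int i = t->begin; i < t->end; i++) {
        float d = 0.0f;
        for (int j = 0; j < DIM; j++) {
            float diff = t->ref[i * DIM + j] - t->query[j];
            d += diff * diff;
        }
        t->dist[i] = d;
    }
    return NULL;
}

/* Launch NTHREADS workers over disjoint slices, then pick the k smallest. */
void knn_query(const float *query, const float *ref, float *dist,
               int k, int *neighbors)
{
    pthread_t threads[NTHREADS];
    knn_task_t tasks[NTHREADS];
    int chunk = NREF / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        tasks[t] = (knn_task_t){ query, ref, dist, t * chunk,
                                 (t == NTHREADS - 1) ? NREF : (t + 1) * chunk };
        pthread_create(&threads[t], NULL, distance_worker, &tasks[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    /* Naive O(k*NREF) selection of the k nearest; fine for small k. */
    for (int n = 0; n < k; n++) {
        int best = -1;
        float best_d = FLT_MAX;
        for (int i = 0; i < NREF; i++)
            if (dist[i] < best_d) { best_d = dist[i]; best = i; }
        neighbors[n] = best;
        dist[best] = FLT_MAX;   /* exclude from further rounds */
    }
}
```

A GPU version would follow the same structure, with one CUDA thread per reference point computing a distance.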
42

AutoPilot: A Message-Passing Parallel Programming Library for the IMAPCAR2

Kelly, Benjamin 14 March 2013 (has links)
The IMAPCAR2 from Renesas Electronics is an embedded realtime image processor, combining a single core with a 128-way SIMD array. At runtime, sections of the SIMD array can be retasked as additional CPU cores, interconnected via a message passing ring. Using these cores effectively, however, is made difficult by the low-level nature of the message passing API and the lack of cache coherency between processors, which makes developing and debugging software for this platform a difficult task. The AutoPilot library addresses this by providing a high-level message-oriented parallel programming model for the IMAPCAR2. AutoPilot's API is closely based on that of Pilot, a wrapper around the Message Passing Interface (MPI) for cluster computing. By reimplementing the Pilot API for the IMAPCAR2, AutoPilot shows that its processes-and-channels architecture is a viable choice for parallel programming on cache-incoherent multicore architectures. At the same time, it provides a simpler API for programmers, with built-in safety checks that eliminate some common sources of errors.
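For readers unfamiliar with the processes-and-channels style that Pilot wraps, here is a minimal plain-MPI sketch of the point-to-point message passing that such a channel abstracts; it uses only standard MPI calls and is not the AutoPilot or Pilot API.

```c
/* Hedged sketch: the blocking send/receive pair a Pilot-style channel wraps. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            /* "Writer" end of the channel: rank 0 sends a value to rank 1. */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* "Reader" end: rank 1 blocks until the value arrives. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %d\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Run with, for example, `mpirun -np 2 ./channel_demo`.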
43

A Parallel Particle Swarm Optimization Algorithm for Option Pricing

Prasain, Hari 19 July 2010 (has links)
Financial derivatives play a significant role in an investor's success, and financial options are one form of derivative. Option pricing is one of the challenging and fundamental problems of computational finance. Because market conditions are highly volatile and dynamic, closed-form solutions exist only for simple styles of options such as European options, and due to the complex nature of the governing mathematics, several numerical approaches have been proposed to price American-style and other complex options approximately. Bio-inspired and nature-inspired algorithms have been used to solve large, dynamic, and complex scientific and engineering problems; these algorithms are inspired by the strategies insect societies have developed for their own survival. Nature-inspired algorithms, in particular, have gained prominence in real-world optimization problems such as mobile ad hoc networks, and the option pricing problem fits this category of problems well due to the ad hoc nature of the market. Particle swarm optimization (PSO) is a global search algorithm based on swarm intelligence, a class of nature-inspired techniques. In this research, we design a sequential PSO-based option pricing algorithm using the basic principles of PSO. The algorithm is applicable to both European and American options and handles both constant and variable volatility. We show that our results for European options compare well with the Black-Scholes-Merton formula. Since locking in profit-making opportunities quickly is critical in the real market, we also design and develop a parallel algorithm to expedite the computation. We evaluate its performance on a cluster of multicore machines under three configurations: shared memory, distributed memory, and a hybrid of the two. We conclude that for the shared memory and hybrid configurations a one-to-one mapping of particles to processors is recommended for speedup, and we obtain a speedup of 20 on a cluster of four nodes with 8 dual-core processors per node.
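A minimal sketch of the canonical PSO update that such an algorithm builds on: each particle moves under the pull of its own best position and the swarm's global best. The one-dimensional particle encoding and the stand-in fitness function are illustrative assumptions, not the thesis's pricing model.

```c
/* Hedged sketch: one iteration of the textbook PSO velocity/position update. */
#include <stdlib.h>

#define NPART 64
#define W  0.7     /* inertia weight */
#define C1 1.5     /* cognitive coefficient */
#define C2 1.5     /* social coefficient */

typedef struct {
    double x, v;       /* position (candidate solution) and velocity */
    double best_x;     /* personal best position */
    double best_f;     /* personal best fitness */
} particle_t;

static double urand(void) { return (double)rand() / RAND_MAX; }

/* Stand-in fitness (lower is better): squared error against a reference value.
 * In an option-pricing setting this would come from the pricing model. */
static double fitness(double x) { double e = x - 10.0; return e * e; }

void pso_step(particle_t p[NPART], double *gbest_x, double *gbest_f)
{
    for (int i = 0; i < NPART; i++) {
        /* v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x) */
        p[i].v = W * p[i].v
               + C1 * urand() * (p[i].best_x - p[i].x)
               + C2 * urand() * (*gbest_x - p[i].x);
        p[i].x += p[i].v;

        double f = fitness(p[i].x);
        if (f < p[i].best_f) { p[i].best_f = f; p[i].best_x = p[i].x; }
        if (f < *gbest_f)    { *gbest_f = f;    *gbest_x = p[i].x; }
    }
}
```

The caller is assumed to initialize the particles and the global best before iterating `pso_step`; a parallel version would split the particle loop across threads or processes.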
44

Static Timing Analysis of Parallel Systems Using Abstract Execution

Gustavsson, Andreas January 2014 (has links)
The Power Wall has stopped the past trend of increasing processor throughput by increasing the clock frequency and the instruction level parallelism. Therefore, the current trend in computer hardware design is to expose explicit parallelism to the software level. This is most often done using multiple processing cores situated on a single processor chip. The cores usually share some resources on the chip, such as some level of cache memory (which means that they also share the interconnect, e.g. a bus, to that memory and also all higher levels of memory), and to fully exploit this type of parallel processor chip, programs running on it will have to be concurrent. Since multi-core processors are the new standard, even embedded real-time systems will (and some already do) incorporate this kind of processor and concurrent code. A real-time system is any system whose correctness depends on both its functional and temporal output. For some real-time systems, a failure to meet the temporal requirements can have catastrophic consequences. Therefore, it is of utmost importance that methods to analyze and derive safe estimations of the timing properties of parallel computer systems are developed. This thesis presents an analysis that derives safe (lower and upper) bounds on the execution time of a given parallel system. The interface to the analysis is a small concurrent programming language, based on communicating and synchronizing threads, that is formally (syntactically and semantically) defined in the thesis. The analysis is based on abstract execution, which is itself based on abstract interpretation techniques commonly used within the field of timing analysis of single-core computer systems, to derive safe timing bounds in an efficient (although over-approximative) way. Basically, abstract execution simulates many possible real executions of the analyzed program in one go. The thesis also proves the soundness of the presented analysis (i.e. that the estimated timing bounds are indeed safe) and includes some examples, each showing different features or characteristics of the analysis. / Worst-Case Execution Time Analysis of Parallel Systems / RALF3 - Software for Embedded High Performance Architectures
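As a rough illustration of the interval reasoning behind safe lower and upper bounds, the sketch below propagates a [BCET, WCET] pair per program fragment; it is illustrative only, not the thesis's analysis or its analyzed language.

```c
/* Hedged sketch: combining best-case/worst-case timing bounds safely. */
typedef struct { long lo, hi; } tbound_t;   /* [best case, worst case] cycles */

/* Sequential composition: bounds simply add. */
tbound_t tb_seq(tbound_t a, tbound_t b)
{
    return (tbound_t){ a.lo + b.lo, a.hi + b.hi };
}

/* Join of two abstract states (e.g. after both branches of an if):
 * keep the smallest lower bound and the largest upper bound, so the
 * result safely covers every concrete execution. */
tbound_t tb_join(tbound_t a, tbound_t b)
{
    return (tbound_t){ a.lo < b.lo ? a.lo : b.lo,
                       a.hi > b.hi ? a.hi : b.hi };
}

/* Two threads that must both finish before a synchronization point:
 * the slower thread dominates both bounds. */
tbound_t tb_parallel_barrier(tbound_t a, tbound_t b)
{
    return (tbound_t){ a.lo > b.lo ? a.lo : b.lo,
                       a.hi > b.hi ? a.hi : b.hi };
}
```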
46

Hierarchical scheduling for predictable execution of real-time software components and legacy systems

Inam, Rafia January 2014 (has links)
This dissertation presents techniques to achieve predictable execution of coarse-grained software components and to preserve the temporal properties of components during their integration and reuse. The dissertation presents a novel concept, the runnable virtual node (RVN), whose interaction with the environment is bounded by both a functional and a temporal interface, and whose internal temporal behaviour remains valid when it is integrated with other components or reused in a new environment. The realization of the RVN exploits hierarchical scheduling techniques to achieve temporal isolation, and principles from component-based software engineering to achieve functional isolation. Proof-of-concept case studies executed on a micro-controller demonstrate the preservation of real-time properties within software components for predictable integration and reusability in a new environment, in both hierarchical scheduling and RVN contexts. Further, a multi-resource server (MRS) is proposed and implemented to enable predictable execution when composing multiple real-time components on a COTS multicore platform. MRS uses resource reservation for both CPU bandwidth and memory-bus bandwidth to bound the interference between tasks running on the same core, as well as between tasks running on different cores. The latter could, without MRS, interfere with each other due to contention on a shared memory-bus and memory. The results indicate that MRS can be used to "encapsulate" legacy systems and to give them enough resources to fulfill their purpose. The dissertation also provides the compositional schedulability analysis for MRS and an experimental study that brings insight into the correlation between the server budgets. We believe that the proposed approaches enable faster software integration and support legacy reuse, and that this work transcends the boundaries of software engineering and real-time systems. / PPMSched / PROGRESS
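A minimal sketch of the bookkeeping idea behind a server that reserves both CPU time and memory-bus accesses per period; the field names and accounting policy are illustrative assumptions, not the MRS implementation.

```c
/* Hedged sketch: per-period dual-budget accounting for a resource server. */
#include <stdbool.h>

typedef struct {
    long period_us;        /* replenishment period */
    long cpu_budget_us;    /* CPU time allowed per period */
    long mem_budget;       /* memory-bus accesses allowed per period */
    long cpu_used_us;      /* consumed so far in the current period */
    long mem_used;
} mrs_server_t;

/* At the start of each period both budgets are replenished. */
void mrs_replenish(mrs_server_t *s)
{
    s->cpu_used_us = 0;
    s->mem_used = 0;
}

/* Charge the server for work done by one of its tasks.  Returns false once
 * either budget is exhausted; the scheduler then suspends the server's tasks
 * until the next replenishment, bounding the interference the server can
 * cause on its own core and across the shared memory bus. */
bool mrs_charge(mrs_server_t *s, long cpu_us, long mem_accesses)
{
    s->cpu_used_us += cpu_us;
    s->mem_used += mem_accesses;
    return (s->cpu_used_us <= s->cpu_budget_us) &&
           (s->mem_used <= s->mem_budget);
}
```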
47

Exploring the Epiphany manycore architecture for the Lattice Boltzmann algorithm

Raase, Sebastian January 2014 (has links)
Computational fluid dynamics (CFD) plays an important role in many scientific applications, ranging from designing more efficient boat engines or aircraft wings to predicting tomorrow's weather, but at the cost of requiring huge amounts of computing time. Traditional algorithms also suffer from scalability limitations, making them hard to parallelize massively. As a relatively new and promising method for computational fluid dynamics, the Lattice Boltzmann algorithm tries to solve the scalability problems of conventional, well-tested CFD algorithms. Through its inherently local structure it is well suited to parallel processing, and it has been implemented on many different kinds of parallel platforms. Adapteva's Epiphany platform is a modern, low-power manycore architecture designed to scale up to thousands of cores, with even more ambitious plans for the future. Hardware support for floating-point calculations makes it a possible choice in scientific settings. The goal of this thesis is to analyze the performance of the Lattice Boltzmann algorithm on the Epiphany platform. This is done by implementing and testing the lid-driven cavity test case in two and three dimensions. In real applications, high performance on large lattices with millions of nodes is very important. Although the tested Epiphany implementation scales very well, the hardware does not provide adequate amounts of local memory and external memory bandwidth, currently preventing widespread use in computational fluid dynamics.
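To illustrate the locality that makes the method attractive for manycore targets, here is a textbook D2Q9 BGK collision step for a single lattice node in plain C; it is shown only as the standard formulation, not as the thesis's Epiphany implementation.

```c
/* Hedged sketch: one BGK collision step of a D2Q9 Lattice Boltzmann node.
 * Each node updates its nine distributions using only local data, which is
 * what makes the algorithm embarrassingly parallel across the lattice. */

static const double w[9]  = { 4.0/9,
                              1.0/9, 1.0/9, 1.0/9, 1.0/9,
                              1.0/36, 1.0/36, 1.0/36, 1.0/36 };
static const int    ex[9] = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
static const int    ey[9] = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };

/* Relax the nine distributions f[0..8] of one node towards equilibrium. */
void bgk_collide(double f[9], double tau)
{
    /* Macroscopic density and velocity from the distributions. */
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < 9; i++) {
        rho += f[i];
        ux  += f[i] * ex[i];
        uy  += f[i] * ey[i];
    }
    ux /= rho;
    uy /= rho;

    /* BGK relaxation: f_i <- f_i - (f_i - f_i^eq) / tau. */
    double usq = ux * ux + uy * uy;
    for (int i = 0; i < 9; i++) {
        double eu  = ex[i] * ux + ey[i] * uy;
        double feq = w[i] * rho * (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * usq);
        f[i] -= (f[i] - feq) / tau;
    }
}
```

The streaming step that follows collision only exchanges distributions with immediate neighbours, which is why the method maps well onto mesh-connected manycore chips.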
48

Improving Code Overlay Performance by Pre-fetching in Scratch Pad Memory Systems

January 2011 (has links)
abstract: Advances in electronics technology and innovative manufacturing processes have driven the semiconductor industry towards extensive miniaturization and ever greater integration in chip design. One consequence of this sustained evolution has been the growing relative cost of accessing off-chip components, with external memory being one of the dominant contributors. In embedded systems and applications, where power consumption and cost are crucial factors, on-chip Scratch Pad Memories (SPMs) have proven to be a good alternative to caches. SPMs are more efficient than on-chip caches in a wide variety of aspects, including energy consumption, power dissipation, speed, area, and timing predictability. At the same time, however, they require explicit software-level management: system performance depends on the overlay scheme used to map code and data onto the size-limited SPMs, and for applications with large code sizes the overlay overhead becomes significant. This work evaluates and implements pre-fetching as a performance improvement technique for SPMs. It is implemented in the code overlay manager provided with IBM's Cell Broadband Engine (CBE) Synergistic Processing Unit (SPU) compiler, spu-gcc. Four different approaches proposed in this work use profiling information to predict pre-fetch calls. The pre-fetching technique achieves considerable performance improvement by hiding some of the code overlay cost behind active computations, fetching the required code segment into the SPM in advance. Experimental results supporting this claim are obtained on the IBM Cell architecture platform, with a substantial gain of more than 30%. / Dissertation/Thesis / M.S. Computer Science 2011
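A minimal sketch of the pre-fetching idea: start the transfer of the profile-predicted next code segment before running the current one, so the transfer is hidden behind computation. The DMA calls, profile table, and segment runner below are hypothetical stubs, not the spu-gcc overlay manager's interface.

```c
/* Hedged sketch: overlapping a scratch-pad code fetch with computation. */
#include <stdio.h>

#define NSEGMENTS 4

/* Profile-predicted most likely successor of each code segment
 * (a profiling run would produce this table in the real scheme). */
static const int likely_next[NSEGMENTS] = { 1, 2, 3, 0 };

/* Hypothetical stubs standing in for the DMA engine and the overlaid code. */
static void spm_dma_start(int seg) { printf("start fetch of segment %d\n", seg); }
static void spm_dma_wait(int seg)  { printf("segment %d resident\n", seg); }
static void run_segment(int seg)   { printf("running segment %d\n", seg); }

/* Run a segment and, before it starts, kick off the fetch of the segment the
 * profile predicts will follow, so the transfer hides behind computation. */
static void run_with_prefetch(int seg)
{
    int next = likely_next[seg];
    spm_dma_start(next);   /* asynchronous in the real scheme */
    run_segment(seg);      /* active computation hides the transfer */
    spm_dma_wait(next);    /* typically already complete by this point */
}

int main(void)
{
    for (int seg = 0; seg < NSEGMENTS; seg++)
        run_with_prefetch(seg);
    return 0;
}
```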
49

Uma metodologia de avaliação de desempenho para identificar as melhores regiões paralelas para reduzir o consumo de energia / A performance evaluation methodology to find the best parallel regions to reduce energy consumption

Millani, Luís Felipe Garlet January 2015 (has links)
Due to the energy limitations imposed on supercomputers, parallel applications developed for high performance computing (HPC) are increasingly evaluated with energy efficiency metrics, the goal being to reduce the energy footprint of these applications. While some energy reduction strategies consider the application as a whole, others adjust the core frequency only for certain regions of the parallel code; load balancing or blocking communication phases, for instance, can be opportunities for energy reduction. The efficiency of such strategies is usually analyzed with traditional methodologies derived from the performance analysis domain. A finer-grained methodology, in which the energy reduction is evaluated for each code region and frequency configuration, could lead to a better understanding of how energy consumption can be minimized for a particular implementation. The main challenges are: (a) detecting the possibly large number of parallel code regions; (b) deciding which frequency should be adopted for each region so as to limit the impact on the runtime; and (c) the cost of dynamically adjusting the core frequency. The work described in this dissertation presents a performance analysis methodology to find, among the parallel regions, the best candidates for reducing energy consumption. The proposal is threefold: (a) a careful design of experiments based on Plackett-Burman screening, especially important when a large number of parallel regions is detected in the application; (b) a traditional energy and performance evaluation of the regions considered good candidates for energy reduction; and (c) a Pareto-based analysis showing how hard it is to obtain energy gains in optimized codes, together with the trade-offs between performance loss and energy gains that may interest the application developer. Our approach is validated against three HPC application codes: Graph500, Breadth-First Search, and Delaunay Refinement.
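A minimal sketch of the Pareto filtering used in such a trade-off analysis: among measured (runtime, energy) configurations, keep only the non-dominated ones. The sample measurements are made-up values for illustration, not results from the dissertation.

```c
/* Hedged sketch: Pareto-efficient (runtime, energy) points, both minimized. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { double time_s; double energy_j; } config_t;

/* c is dominated if some other point is no worse in both objectives and
 * strictly better in at least one. */
static bool dominated(config_t c, const config_t *all, int n, int self)
{
    for (int i = 0; i < n; i++) {
        if (i == self) continue;
        bool no_worse = all[i].time_s <= c.time_s && all[i].energy_j <= c.energy_j;
        bool better   = all[i].time_s <  c.time_s || all[i].energy_j <  c.energy_j;
        if (no_worse && better) return true;
    }
    return false;
}

int main(void)
{
    /* Made-up measurements for a few (frequency, region) configurations. */
    config_t cfg[] = { {1.00, 50.0}, {1.20, 42.0}, {1.50, 45.0}, {0.90, 60.0} };
    int n = sizeof cfg / sizeof cfg[0];

    for (int i = 0; i < n; i++)
        if (!dominated(cfg[i], cfg, n, i))
            printf("Pareto-efficient: %.2f s, %.1f J\n",
                   cfg[i].time_s, cfg[i].energy_j);
    return 0;
}
```

Each surviving point is a trade-off the developer might accept, e.g. a small slowdown in exchange for a larger energy saving.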
