  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

Programmable MIMO detectors

Janhunen, J. (Janne) 22 November 2011
Abstract The multiple-input multiple-output (MIMO) technique combined with orthogonal frequency division multiplexing (MIMO-OFDM) has been introduced as a promising approach for meeting the ever-increasing capacity and quality of service (QoS) requirements of wireless communication systems. Efficient radio spectrum utilization requires a flexible transceiver solution, which has motivated the development of software defined radio (SDR) technologies, which in turn are expected to enable the creation of cognitive radios. As a result, any radio solution could be invoked on demand on any platform. In this thesis, we study detector algorithms and programmable processor architectures in order to find practical solutions for future wireless systems. A programmable receiver can reduce its energy dissipation by changing the detection algorithm based on the current channel realization. To give a realistic view of the implementations under different channel realizations, we present a wide comparison of state-of-the-art detectors. In addition, we present an extensive number arithmetic and word length study in order to evaluate the realistic hardware complexity and energy dissipation of the implementations. The study covers a comprehensive design chain from algorithm development to the actual processor design and finally to programming software for the platforms. We evaluate single- and multi-core processor implementations by comparing the achieved results to the Long Term Evolution (LTE) performance requirements. We implement detectors on digital signal processors (DSPs), a graphics processing unit (GPU) and a transport triggered architecture (TTA). The implementation results are compared in terms of throughput, silicon area and energy efficiency. Finally, we discuss the advantages and disadvantages of the architectures and the implementation effort.
/ Tiivistelmä Usean antennin tekniikka yhdistettynä ortogonaaliseen taajuusvaihtelumodulointiin lähetin-vastaanotimessa on esitetty eräänä lupaavana ratkaisuna jatkuvasti kasvaviin kapasiteetti- ja palvelunlaatuvaatimuksiin langattomissa tietoliikennejärjestelmissä. Tehokas radiospektrin käyttö edellyttää joustavaa lähetin-vastaanotinratkaisua, mikä on ollut syynä ohjelmistoradioteknologioiden kehitykselle. Ohjelmistoradioiden kehityksen on puolestaan odotettu mahdollistavan kognitiiviradioiden syntymisen. Tuloksena, mikä tahansa radiosovellus voitaisiin herättää tarpeen mukaan millä tahansa ohjelmoitavalla sovellusalustalla. Tässä väitöskirjatyössä tutkitaan ilmaisinalgoritmeja sekä ohjelmoitavia prosessoriarkkitehtuureja tarkoituksena löytää käytännöllisiä ratkaisuja tulevaisuuden langattomiin järjestelmiin. Ohjelmoitavalla vastaanottimella voidaan vähentää vastaanottimen energiankulutusta vaihtamalla ilmaisinalgoritmeja vallitsevan kanavatilan mukaan. Työssä esitellään laaja, viimeisintä tutkimusta edustava ilmaisinalgoritmivertailu, joka antaa realistisen näkökannan toteutuksiin erilaisissa kanavatiloissa. Lisäksi työssä esitellään numeroaritmetiikka- ja sananpituustutkimus, jonka tarkoituksena on arvioida toteutusten realistista kovokompleksisuutta sekä energiankulutusta. Tutkimus sisältää kattavan suunnitteluketjun algoritmikehityksestä todelliseen prosessorisuunnitteluun ja lopulta algoritmin ohjelmointiin tietylle sovellusalustalle. Väitöskirjatyössä arvioidaan yksi- ja moniytimisiä prosessoritoteutuksia vertaamalla saavutettuja tuloksia Long Term Evolution -standardin suorituskykyvaatimuksiin. Ilmaisimia toteutetaan digitaalisilla signaaliprosessoreilla, grafiikkaprosessorilla sekä siirtoliipaisuarkkitehtuurilla. Toteutustuloksia vertaillaan laskentatehona, pinta-alana sekä energiatehokkuutena. Lopuksi käsitellään arkkitehtuurien hyviä ja huonoja puolia sekä suunnittelun työläyttä.
52

Faster upper body pose recognition and estimation using compute unified device architecture

Brown, Dane January 2013
Magister Scientiae - MSc / The SASL project is developing a machine translation system that can translate fully-fledged phrases between SASL and English in real time. To date, the project has developed several systems focusing on the recognition and estimation of facial expression, hand shape, hand motion, hand orientation and hand location. Achmed developed a highly accurate upper body pose recognition and estimation system. The system is capable of recognizing and estimating the location of the arms from a two-dimensional video captured from a monocular view with an accuracy of 88%. The system operates at well below real-time speeds. This research investigates the use of optimizations and parallel processing techniques, using the CUDA framework, on Achmed’s algorithm to achieve real-time upper body pose recognition and estimation. A detailed analysis of Achmed’s algorithm identified potential improvements to the algorithm. A re-implementation of Achmed’s algorithm on the CUDA framework, coupled with these improvements, culminated in an enhanced upper body pose recognition and estimation system that operates in real time with increased accuracy.
53

Detekce objektů na GPU / Object Detection on GPU

Macenauer, Pavel January 2015
This thesis addresses object detection on graphics processing units. As part of it, a system for object detection using NVIDIA CUDA was designed and implemented, allowing for real-time video object detection and bulk processing. Its main contribution is a study of the options that NVIDIA CUDA technology and current graphics processing units offer for accelerating object detection. Parallel algorithms for object detection are also discussed and suggested.
54

Numerical solution of the two-phase incompressible Navier-Stokes equations using a GPU-accelerated meshless method

Kelly, Jesse 01 January 2009
This project presents the development and implementation of a GPU-accelerated meshless two-phase incompressible fluid flow solver. The solver uses a variant of the Generalized Finite Difference Meshless Method presented by Gerace et al. [1]. The Level Set Method [2] is used for capturing the fluid interface. The Compute Unified Device Architecture (CUDA) language for general-purpose computing on the graphics-processing-unit is used to implement the GPU-accelerated portions of the solver. CUDA allows the programmer to take advantage of the massive parallelism offered by the GPU at a cost that is significantly lower than other parallel computing options. Through the combined use of GPU-acceleration and a radial-basis function (RBF) collocation meshless method, this project seeks to address the issue of speed in computational fluid dynamics. Traditional mesh-based methods require a large amount of user input in the generation and verification of a computational mesh, which is quite time consuming. The RBF meshless method seeks to rectify this issue through the use of a grid of data centers that need not meet stringent geometric requirements like those required by finite-volume and finite-element methods. Further, the use of the GPU to accelerate the method has been shown to provide a 16-fold increase in speed for the solver subroutines that have been accelerated.
55

Evaluating the OpenACC API for Parallelization of CFD Applications

Pickering, Brent Phillip 06 September 2014
Directive-based programming of graphics processing units (GPUs) has recently appeared as a viable alternative to using specialized low-level languages such as CUDA C and OpenCL for general-purpose GPU programming. This technique, which uses directive or pragma statements to annotate source codes written in traditional high-level languages, is designed to permit a unified code base to serve multiple computational platforms and to simplify the transition of legacy codes to new architectures. This work analyzes the popular OpenACC programming standard, as implemented by the PGI compiler suite, in order to evaluate its utility and performance potential in computational fluid dynamics (CFD) applications. Of particular interest is the handling of stencil algorithms, which are an important component of finite-difference and finite-volume numerical methods. To this end, the process of applying the OpenACC Fortran API to a preexisting finite-difference CFD code is examined in detail, and all modifications that must be made to the original source in order to run efficiently on the GPU are noted. Optimization techniques for OpenACC are also explored, and it is demonstrated that tuning the code for a particular accelerator architecture can result in performance increases of over 30%. There are also some limitations and programming restrictions imposed by the API: it is observed that certain useful features of modern Fortran (2003/8) are effectively disabled within OpenACC regions. Finally, a combination of OpenACC and OpenMP directives is used to create a truly cross-platform Fortran code that can be compiled for either CPU or GPU hardware. The performance of the OpenACC code is measured on several contemporary NVIDIA GPU architectures, and a comparison is made between double and single precision arithmetic showing that if reduced precision can be tolerated, it can lead to significant speedups. 
To assess the performance gains relative to a typical CPU implementation, the execution time for a standard benchmark case (lid-driven cavity) is used as a reference. The OpenACC version is compared against the identical Fortran code recompiled to use OpenMP on multicore CPUs, as well as a highly-optimized C++ version of the code that utilizes hardware aware programming techniques to attain higher performance on the Intel Xeon platforms being tested. Low-level optimizations specific to these architectures are analyzed and it is observed that the stencil access pattern required by the structured-grid CFD code sometimes leads to performance degrading conflict misses in the hardware managed CPU caches. The GPU code, which primarily uses software managed caching, is found to be free from these issues. Overall, it is observed that the OpenACC GPU code compares favorably against even the best optimized CPU version: using a single NVIDIA K20x GPU, the Fortran+OpenACC code is seen to outperform the optimized C++ version by 20% and the Fortran+OpenMP version by more than 100% with both CPU codes running on a 16-core Xeon workstation. / Master of Science
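The stencil access pattern mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis code: it is a plain Python Jacobi sweep of a 5-point Laplace stencil, and the names `jacobi_sweep` and `make_grid` are hypothetical. The doubly nested loop over interior points is the kind of loop nest that a directive-based approach would annotate for parallel execution, since every interior update depends only on the previous grid.

```python
def make_grid(n, top=1.0):
    """Build an n x n grid of zeros with a fixed value along the top boundary."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [top] * n
    return grid


def jacobi_sweep(grid):
    """Return a new grid after one Jacobi sweep of the 5-point Laplace stencil."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary values are carried over unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each interior point becomes the average of its four neighbours
            # in the OLD grid, so all updates are independent of one another.
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


if __name__ == "__main__":
    g = make_grid(4)
    for _ in range(10):
        g = jacobi_sweep(g)
    for row in g:
        print(" ".join(f"{v:.3f}" for v in row))
```

Because each update reads only the old grid, the iterations can run in any order, which is what makes such loops a natural target for GPU offloading; the neighbour reads at `i - 1` and `i + 1` are also the strided accesses that can stress CPU caches on large grids.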
56

Eismo dalyvių kelyje atpažinimas naudojant dirbtinius neuroninius tinklus ir grafikos procesorių / On-road vehicle recognition using neural networks and graphics processing unit

Kinderis, Povilas 27 June 2014
Kasmet daugybė žmonių būna sužalojami autoįvykiuose, iš kurių dalis sužalojimų būna rimti arba pasibaigia mirtimi. Dedama vis daugiau pastangų kuriant įvairias sistemas, kurios padėtų mažinti nelaimių skaičių kelyje. Tokios sistemos gebėtų perspėti vairuotojus apie galimus pavojus, atpažindamos eismo dalyvius ir sekdamos jų padėtį kelyje. Eismo dalyvių kelyje atpažinimas iš vaizdo yra pakankamai sudėtinga, daug skaičiavimų reikalaujanti problema. Šiame darbe šiai problemai spręsti pasitelkti stereo vaizdai, nesugretinamumo žemėlapis bei konvoliuciniai neuroniniai tinklai. Konvoliuciniai neuroniniai tinklai reikalauja daug skaičiavimų, todėl jie optimizuoti pasitelkus grafikos procesorių ir OpenCL. Gautas iki 33,4% spartos pagerėjimas lyginant su centriniu procesoriumi. Stereo vaizdai ir nesugretinamumo žemėlapis leidžia atmesti didelius kadro regionus, kurių nereikia klasifikuoti su konvoliuciniu neuroniniu tinklu. Priklausomai nuo scenos vaizde, reikalingų klasifikavimo operacijų skaičius sumažėja vidutiniškai apie 70-95% ir tai leidžia kadrą apdoroti atitinkamai greičiau. / Many people are injured in car accidents each year; some of the injuries are serious or end in death. Increasing effort is being put into developing various systems that could help to reduce the number of accidents on the road. Such systems could warn drivers of potential danger by recognizing on-road vehicles and tracking their position on the road. On-road vehicle recognition from images is a complex and computationally very intensive problem. In this work, stereo images, a disparity map and convolutional neural networks are used to solve this problem. Convolutional neural networks are computationally very intensive, so they were optimized using a GPU and OpenCL. A speed improvement of up to 33.4% was achieved compared to the central processor. Stereo images and the disparity map make it possible to discard large areas of the frame that do not need to be classified by the convolutional neural network. Depending on the scene in the image, the number of required classification operations decreases on average by about 70-95%, which allows the frame to be processed correspondingly faster.
57

Medical Image Processing on the GPU : Past, Present and Future

Eklund, Anders, Dufort, Paul, Forsberg, Daniel, LaConte, Stephen January 2013
Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges.
58

Método automático para descoberta de funções de ordenação utilizando programação genética paralela em GPU / Automatic ranking function discovery method using parallel genetic programming on GPU

Coimbra, Andre Rodrigues 28 March 2014
Ranking functions have a vital role in the performance of information retrieval systems, ensuring that the documents most related to the user’s search need – represented as a query – are shown in the top results, so that the user does not have to examine a range of documents that are not really relevant. This work therefore uses Genetic Programming (GP), an Evolutionary Computation technique, to find ranking functions automatically and systematically. Moreover, in this project the GP technique was developed following a strategy that exploits parallelism through graphics processing units. Other known methods in the context of information retrieval, such as classification committees and the Lazy strategy, were combined with the proposed approach, called Finch. These combinations were only feasible due to the nature of GP and the use of parallelism. The experimental results with Finch, regarding the quality of the ranking functions, surpassed the results of several strategies known in the literature. Considering time performance, significant gains were also achieved. The solution developed to exploit parallelism spends around twenty times less time than the solution using only the central processing unit. / Funções de ordenação têm um papel vital no desempenho de sistemas de recuperação de informação garantindo que os documentos mais relacionados com o desejo do usuário – representado através de uma consulta – sejam trazidos no topo dos resultados, evitando que o usuário tenha que analisar uma série de documentos que não sejam realmente relevantes. Assim, utiliza-se a Programação Genética (PG), uma técnica da Computação Evolucionária, para descobrir de forma automática e sistemática funções de ordenação. Além disso, neste trabalho a técnica de PG foi desenvolvida seguindo uma estratégia que explora o paralelismo através de unidades gráficas de processamento. Foram agregados ainda na abordagem proposta – denominada Finch – outros métodos conhecidos no contexto de recuperação de informação como os comitês de classificação e a estratégia Lazy. Sendo que essa complementação só foi viável devido a natureza da PG e em virtude da utilização do paralelismo. Os resultados experimentais encontrados com a Finch, em relação à qualidade das funções de ordenação descobertas, superaram os resultados de diversas estratégias conhecidas na literatura. Considerando o desempenho da abordagem em função do tempo, também foram alcançados ganhos significativos. A solução desenvolvida explorando o paralelismo gasta, em média, vinte vezes menos tempo que a solução utilizando somente a unidade central de processamento.
59

A Runtime Framework for Regular and Irregular Message-Driven Parallel Applications on GPU Systems

Rengasamy, Vasudevan January 2014
The effective use of GPUs for accelerating applications depends on a number of factors, including effective asynchronous use of heterogeneous resources, reducing data transfer between CPU and GPU, increasing occupancy of GPU kernels, overlapping data transfers with computations, reducing GPU idling and kernel optimizations. Overcoming these challenges requires considerable effort on the part of application developers. Most optimization strategies are proposed and tuned specifically for individual applications. Message-driven execution with over-decomposition of tasks constitutes an important model for parallel programming and provides multiple benefits, including communication-computation overlap and reduced idling on resources. Charm++ is one such message-driven language, which employs over-decomposition of tasks, computation-communication overlap and a measurement-based load balancer to achieve high CPU utilization. This research has developed an adaptive runtime framework for efficient execution of Charm++ message-driven parallel applications on GPU systems. In the first part of our research, we developed a runtime framework, G-Charm, with the focus primarily on optimizing regular applications. At runtime, G-Charm automatically combines multiple small GPU tasks into a single larger kernel, which reduces the number of kernel invocations while improving CUDA occupancy. G-Charm also enables reuse of existing data in GPU global memory, and performs GPU memory management and dynamic scheduling of tasks across CPU and GPU in order to reduce idle time. To combine the partial results obtained from the computations performed on CPU and GPU, G-Charm allows the user to specify an operator with which the partial results are combined at runtime. We also perform compile-time code generation to reduce programming overhead. For Cholesky factorization, a regular parallel application, G-Charm provides a 14% improvement over a highly tuned implementation.
In the second part of our research, we extended our runtime to overcome the challenges presented by irregular applications, such as aperiodic generation of tasks, irregular memory access patterns and varying workloads during application execution. We developed models for deciding the number of tasks that can be combined into a kernel based on the rate of task generation and the GPU occupancy of the tasks. For irregular applications, data reuse results in uncoalesced GPU memory access. We evaluated the effect of altering the global memory access pattern on improving coalesced access. We have also developed adaptive methods for hybrid execution on CPU and GPU, in which we consider the varying workloads while scheduling tasks across the CPU and GPU. We demonstrate that our dynamic strategies result in an 8-38% reduction in execution times for an N-body simulation application and a molecular dynamics application over the corresponding static strategies that are amenable to regular applications.
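The idea of combining several small GPU tasks into one larger kernel launch can be sketched as follows. This is only an illustration under stated assumptions, not G-Charm's actual policy: the function name `combine_tasks`, the greedy packing rule and the single `max_threads_per_kernel` capacity are all hypothetical, whereas the abstract describes models that also account for the task generation rate and GPU occupancy.

```python
def combine_tasks(task_sizes, max_threads_per_kernel):
    """Greedily pack consecutive tasks into batches; one batch = one kernel launch.

    task_sizes[i] is the number of threads task i needs (hypothetical unit).
    A task larger than the capacity is still launched, alone in its own batch.
    """
    batches = []
    current, used = [], 0
    for task_id, size in enumerate(task_sizes):
        # Close the current batch if adding this task would exceed the capacity.
        if current and used + size > max_threads_per_kernel:
            batches.append(current)
            current, used = [], 0
        current.append(task_id)
        used += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sizes = [64, 64, 128, 256, 32]
    batches = combine_tasks(sizes, max_threads_per_kernel=256)
    print(f"{len(sizes)} tasks -> {len(batches)} kernel launches: {batches}")
```

In this toy setting five tasks are served by three launches. A real runtime would have to weigh the launch overhead saved against the latency added by waiting for further tasks to arrive, which is where a model of the task generation rate would come in.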
60

Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices

Pandit, Prasanna Vasant January 2013
Computing systems have become heterogeneous with the increasing prevalence of multi-core CPUs, Graphics Processing Units (GPU) and other accelerators in them. OpenCL has emerged as an attractive programming framework for heterogeneous systems. However, utilizing multiple devices in OpenCL is a challenge as it requires the programmer to explicitly map data and computation to each device. Utilizing multiple devices simultaneously to speed up execution of a kernel is even more complex, as the relative execution time of the kernel on different devices can vary significantly. Also, after each kernel execution, a coherent version of the data needs to be established. This means that, in order to utilize all devices effectively, the programmer has to spend considerable time and effort to distribute work across all devices, keep track of modified data in these devices and correctly perform a merging step to put the data together. Further, the relative performance of a program may vary across different inputs, which means a statically determined work distribution may not work well. In this work, we present FluidiCL, an OpenCL runtime that takes a program written for a single device and uses multiple heterogeneous devices to execute each kernel. The runtime performs dynamic work distribution and cooperatively executes each kernel on all available devices. Since we consider a setup with devices having discrete address spaces, our solution ensures that execution of OpenCL work-groups on devices is adjusted by taking into account the overheads for data management. The data transfers and data merging needed to ensure coherence are handled transparently without requiring any effort from the programmer. FluidiCL also does not require prior training or profiling and is completely portable across different machines. Because it is dynamic, the runtime is able to adapt to system load. We have developed several optimizations for improving the performance of FluidiCL.
We evaluate the runtime across different sets of devices. On a machine with an Intel quad-core processor and an NVidia Fermi GPU, FluidiCL shows a geomean speedup of nearly 64% over the GPU, 88% over the CPU and 14% over the best of the two devices in each benchmark. In all benchmarks, performance of our runtime comes to within 13% of the best of the two devices. FluidiCL shows similar results on a machine with a quad-core CPU and an NVidia Kepler GPU, with up to 26% speedup over the best of the two. We also present results considering an Intel Xeon Phi accelerator and a CPU and find that FluidiCL performs up to 45% faster than the best of the two devices. We extend FluidiCL from a CPU–GPU scenario to a three-device setup having a quad-core CPU, an NVidia Kepler GPU and an Intel Xeon Phi accelerator and find that FluidiCL obtains a geomean improvement of 6% in kernel execution time over the best of the three devices considered in each case.
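Dynamic work distribution across devices of unequal speed can be illustrated with a small simulation. This is a sketch under assumptions, not FluidiCL's algorithm: the function `distribute`, the fixed chunk size and the per-work-group cost table are hypothetical, and the model ignores the data transfer and merging overheads that the abstract says the real runtime accounts for. Whichever device becomes free first simply takes the next chunk of work-groups.

```python
import heapq


def distribute(num_groups, chunk, cost_per_group):
    """Simulate devices pulling chunks of work-groups from a shared pool.

    cost_per_group maps a device name to its (assumed) time per work-group.
    Returns (work-groups executed per device, simulated finish time).
    """
    # Min-heap of (time at which the device becomes free, device name).
    ready = [(0.0, name) for name in sorted(cost_per_group)]
    heapq.heapify(ready)
    assigned = {name: 0 for name in cost_per_group}
    next_group = 0
    finish = 0.0
    while next_group < num_groups:
        free_at, name = heapq.heappop(ready)  # device that is free earliest
        take = min(chunk, num_groups - next_group)
        next_group += take
        assigned[name] += take
        done = free_at + take * cost_per_group[name]
        finish = max(finish, done)
        heapq.heappush(ready, (done, name))
    return assigned, finish


if __name__ == "__main__":
    share, total = distribute(100, 10, {"cpu": 3.0, "gpu": 1.0})
    print(share, total)
```

With the assumed costs, the faster device ends up with most of the work-groups without any prior profiling, and the simulated finish time is shorter than running all 100 work-groups on the faster device alone (100 time units in this model). Smaller chunks balance the load more finely at the price of more scheduling steps.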
