Global ETD Search

51	Hybrid Spectral Ray Tracing Method for Multi-scale Millimeter-wave and Photonic Propagation Problems Hailu, Daniel 30 September 2011 This thesis presents an efficient self-consistent Hybrid Spectral Ray Tracing (HSRT) technique for analysis and design of multi-scale sub-millimeter wave problems, where sub-wavelength features are modeled using rigorous methods, and complex structures with dimensions on the order of tens or even hundreds of wavelengths are modeled by asymptotic methods. Quasi-optical devices are used in imaging arrays for sub-millimeter and terahertz applications, THz time-domain spectroscopy (THz-TDS), high-speed wireless communications, and space applications to couple terahertz radiation from space to a hot electron bolometer. These devices and structures, physically small as they have become, are very large in terms of the wavelength of the driving quasi-optical sources and may have dimensions of tens or even hundreds of wavelengths. Simulation and design optimization of these devices and structures is an extremely challenging electromagnetic problem. The analysis of complex electrically large unbounded wave structures using rigorous methods such as the method of moments (MoM), the finite element method (FEM), and the finite difference time domain (FDTD) method can become almost impossible due to the need for large computational resources. Asymptotic high-frequency techniques are used for analysis of electrically large quasi-optical systems, and hybrid methods are used for solving multi-scale problems. Spectral Ray Tracing (SRT) has a number of unique advantages as a candidate for hybridization. The SRT method has the advantages of the Spectral Theory of Diffraction (STD). STD can model reflection, refraction and diffraction of an arbitrary wave incident on the complex structure, which is not the case for diffraction theories such as the Geometrical Theory of Diffraction (GTD), the Uniform Theory of Diffraction (UTD) and the Uniform Asymptotic Theory (UAT). 
By including complex rays, SRT can analyze both near-fields and far-fields accurately with minimal approximations. In this thesis, a novel matrix representation of SRT is presented that uses only one spectral integration per observation point and is applied to modeling a hemispherical and a hyper-hemispherical lens. The hybridization of SRT with commercially available FEM and MoM software is proposed in this work to handle the complexity of multi-scale analysis. This yields a computationally efficient self-consistent HSRT algorithm. Various arrangements of the Hybrid SRT method, such as FEM-SRT and MoM-SRT, are investigated and validated through comparison of radiation patterns with Ansoft HFSS for the FEM method, FEKO for MoM, the Multi-level Fast Multipole Method (MLFMM) and physical optics. For this purpose, a bow-tie terahertz antenna backed by a hyper-hemispherical silicon lens, an on-chip planar dipole fabricated in SiGe:C BiCMOS technology and attached to a hyper-hemispherical silicon lens, and a double-slot antenna backed by a silica lens are used as sample structures to be analyzed using the HSRT. The computational performance (memory requirement, CPU/GPU time) of the developed algorithm is compared to other methods in commercially available software. It is shown that MoM-SRT, in its present implementation, is more accurate than MoM-PO but comparable in speed. However, as shown in this thesis, MoM-SRT can take advantage of parallel processing and the GPU. The HSRT algorithm is applied to simulation of an on-chip dipole antenna backed by a silicon lens and integrated with a 180-GHz VCO, and the radiation pattern is compared with measurements. The radiation pattern is measured in a quasi-optical configuration using a power detector. In addition, it is shown that the matrix formulation of SRT and HSRT are promising approaches for solving complex electrically large problems with high accuracy. 
This thesis also describes a new measurement setup specifically developed for measuring integrated antennas, including the radiation pattern and gain of the embedded on-chip antenna in the mmW/terahertz range. In this method, the radiation pattern is first measured in a quasi-optical configuration using a power detector. Subsequently, the radiated power is estimated from the integration over the radiation pattern. Finally, the antenna gain is obtained from the measurement of a two-antenna system. Antennas Electromagnetic radiation Method of Moments Physical Optics graphics processing unit (GPU) lens antennas ray tracing Spectral Ray Tracing SiGe technology Terahertz Electrical and Computer Engineering
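The last two steps of the measurement procedure in entry 51 (integrating the radiation pattern, then deriving gain from a two-antenna measurement) can be illustrated numerically. The following is only a sketch, not the thesis's own processing code. It assumes the pattern is available as a function of the spherical angles, and that the two-antenna step uses the Friis transmission equation with two identical, matched antennas in the far field. The abstract states neither assumption, and all function names are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def radiated_power(pattern, n_theta=180, n_phi=72):
    """Integrate a radiation intensity U(theta, phi) over the full sphere.

    Uses a midpoint rule on a regular (theta, phi) grid; `pattern` is any
    callable returning the intensity for angles given in radians.
    """
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            total += pattern(theta, phi) * math.sin(theta) * d_theta * d_phi
    return total


def directivity(pattern, u_max, n_theta=180, n_phi=72):
    """Directivity D = 4*pi*U_max / P_rad for the given pattern."""
    return 4.0 * math.pi * u_max / radiated_power(pattern, n_theta, n_phi)


def two_antenna_gain(p_received, p_transmitted, distance_m, freq_hz):
    """Gain of each antenna from a two-antenna link (Friis equation).

    Assumption: both antennas are identical, so
    Pr/Pt = G^2 * (lambda / (4*pi*R))^2 and G = (4*pi*R/lambda) * sqrt(Pr/Pt).
    """
    wavelength = SPEED_OF_LIGHT / freq_hz
    return (4.0 * math.pi * distance_m / wavelength) * math.sqrt(
        p_received / p_transmitted
    )
```

For example, `directivity(lambda theta, phi: math.sin(theta) ** 2, 1.0)` evaluates the pattern of an ideal short dipole, whose directivity is known analytically to be 1.5.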
52	Programmable MIMO detectors Janhunen, J. (Janne) 22 November 2011 The multiple-input multiple-output (MIMO) technique combined with orthogonal frequency division multiplexing (MIMO-OFDM) has been introduced as a promising approach to the ever-increasing capacity and quality of service (QoS) requirements of wireless communication systems. Efficient radio spectrum utilization requires a flexible transceiver solution, which has motivated the development of software defined radio (SDR) technologies, which in turn are expected to enable the creation of cognitive radios. As a result, any radio solution could be invoked on demand on any programmable platform. In this thesis, we study detector algorithms and programmable processor architectures in order to find practical solutions for future wireless systems. A programmable receiver can reduce the energy dissipation of the receiver by changing the detection algorithm based on the current channel realization. To give a realistic view of the implementations under different channel realizations, we present a wide state-of-the-art detector comparison. In addition, we present an extensive number arithmetic and word length study in order to evaluate the realistic hardware complexity and energy dissipation of the implementations. The study covers a comprehensive design chain from algorithm development to the actual processor design and finally to programming software for the platforms. We evaluate single and multi-core processor implementations by comparing the achieved results to the Long Term Evolution (LTE) performance requirements. We implement detectors on digital signal processors (DSPs), a graphics processing unit (GPU) and a transport triggered architecture (TTA). The implementation results are compared in terms of throughput, silicon area and energy efficiency. Finally, we discuss the advantages and disadvantages of the architectures and the implementation effort. 
MIMO OFDM digital signal processor graphics processing unit list detection programmable architecture transport triggered architecture digitaalinen signaaliprosessori grafiikkaprosessointiyksikkö listailmaisu moniantennijärjestelmä ohjelmoitava arkkitehtuuri ortogonaalinen taajuusjakomodulointi siirtoliipaisuarkkitehtuuri
53	Faster upper body pose recognition and estimation using compute unified device architecture Brown, Dane January 2013 Magister Scientiae - MSc / The SASL project is in the process of developing a machine translation system that can translate fully-fledged phrases between SASL and English in real-time. To date, several systems have been developed by the project focusing on facial expression, hand shape, hand motion, hand orientation and hand location recognition and estimation. Achmed developed a highly accurate upper body pose recognition and estimation system. The system is capable of recognizing and estimating the location of the arms from a two-dimensional video captured from a monocular view at an accuracy of 88%. The system operates at well below real-time speeds. This research aims to investigate the use of optimizations and parallel processing techniques using the CUDA framework on Achmed's algorithm to achieve real-time upper body pose recognition and estimation. A detailed analysis of Achmed's algorithm identified potential improvements to the algorithm. A re-implementation of Achmed's algorithm on the CUDA framework, coupled with these improvements, culminated in an enhanced upper body pose recognition and estimation system that operates in real-time with increased accuracy. Pose recognition and estimation Graphics processing unit Compute unified device architecture Face detection Skin detection Background subtraction Morphological operations Haar features Support vector machine Blender
54	Detekce objektů na GPU / Object Detection on GPU Macenauer, Pavel January 2015 This thesis addresses the topic of object detection on graphics processing units. As part of it, a system for object detection using NVIDIA CUDA was designed and implemented, allowing for real-time video object detection and bulk processing. Its main contribution is a study of the options that NVIDIA CUDA technology and current graphics processing units offer for accelerating object detection. Parallel algorithms for object detection are also discussed and proposed.
55	Numerical solution of the two-phase incompressible Navier-Stokes equations using a GPU-accelerated meshless method Kelly, Jesse 01 January 2009 This project presents the development and implementation of a GPU-accelerated meshless two-phase incompressible fluid flow solver. The solver uses a variant of the Generalized Finite Difference Meshless Method presented by Gerace et al. [1]. The Level Set Method [2] is used for capturing the fluid interface. The Compute Unified Device Architecture (CUDA) language for general-purpose computing on the graphics processing unit is used to implement the GPU-accelerated portions of the solver. CUDA allows the programmer to take advantage of the massive parallelism offered by the GPU at a cost that is significantly lower than other parallel computing options. Through the combined use of GPU acceleration and a radial-basis function (RBF) collocation meshless method, this project seeks to address the issue of speed in computational fluid dynamics. Traditional mesh-based methods require a large amount of user input in the generation and verification of a computational mesh, which is quite time-consuming. The RBF meshless method seeks to rectify this issue through the use of a grid of data centers that need not meet stringent geometric requirements like those required by finite-volume and finite-element methods. Further, the use of the GPU to accelerate the method has been shown to provide a 16-fold increase in speed for the solver subroutines that have been accelerated. Mechanical Engineering
56	Evaluating the OpenACC API for Parallelization of CFD Applications Pickering, Brent Phillip 06 September 2014 Directive-based programming of graphics processing units (GPUs) has recently appeared as a viable alternative to using specialized low-level languages such as CUDA C and OpenCL for general-purpose GPU programming. This technique, which uses directive or pragma statements to annotate source codes written in traditional high-level languages, is designed to permit a unified code base to serve multiple computational platforms and to simplify the transition of legacy codes to new architectures. This work analyzes the popular OpenACC programming standard, as implemented by the PGI compiler suite, in order to evaluate its utility and performance potential in computational fluid dynamics (CFD) applications. Of particular interest is the handling of stencil algorithms, which are an important component of finite-difference and finite-volume numerical methods. To this end, the process of applying the OpenACC Fortran API to a preexisting finite-difference CFD code is examined in detail, and all modifications that must be made to the original source in order to run efficiently on the GPU are noted. Optimization techniques for OpenACC are also explored, and it is demonstrated that tuning the code for a particular accelerator architecture can result in performance increases of over 30%. There are also some limitations and programming restrictions imposed by the API: it is observed that certain useful features of modern Fortran (2003/2008) are effectively disabled within OpenACC regions. Finally, a combination of OpenACC and OpenMP directives is used to create a truly cross-platform Fortran code that can be compiled for either CPU or GPU hardware. 
The performance of the OpenACC code is measured on several contemporary NVIDIA GPU architectures, and a comparison is made between double and single precision arithmetic showing that if reduced precision can be tolerated, it can lead to significant speedups. To assess the performance gains relative to a typical CPU implementation, the execution time for a standard benchmark case (lid-driven cavity) is used as a reference. The OpenACC version is compared against the identical Fortran code recompiled to use OpenMP on multicore CPUs, as well as a highly optimized C++ version of the code that utilizes hardware-aware programming techniques to attain higher performance on the Intel Xeon platforms being tested. Low-level optimizations specific to these architectures are analyzed, and it is observed that the stencil access pattern required by the structured-grid CFD code sometimes leads to performance-degrading conflict misses in the hardware-managed CPU caches. The GPU code, which primarily uses software-managed caching, is found to be free from these issues. Overall, it is observed that the OpenACC GPU code compares favorably against even the best optimized CPU version: using a single NVIDIA K20x GPU, the Fortran+OpenACC code is seen to outperform the optimized C++ version by 20% and the Fortran+OpenMP version by more than 100%, with both CPU codes running on a 16-core Xeon workstation. / Master of Science graphics processing unit (GPU) computational fluid dynamics (CFD) directive-based programming parallel programming OpenACC Fortran 2003 stencil code finite-volume method
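Entry 56 singles out stencil algorithms as the core pattern of finite-difference codes. As a rough illustration of what such a loop nest looks like, here is a minimal sketch of a five-point Jacobi sweep for the Laplace equation on a structured grid. It is written in plain Python for readability and is not taken from the thesis, which works with Fortran and OpenACC directives. The function name and grid layout are assumptions.

```python
def jacobi_sweep(grid):
    """Apply one five-point Jacobi sweep to a 2-D grid (list of lists).

    Each interior point is replaced by the average of its four neighbours,
    read from the old grid; boundary values are left unchanged. Every
    interior update is independent of the others, which is why this kind
    of loop nest is a natural target for directive-based parallelization.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so reads always see the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new
```

Each output point reads neighbours from adjacent rows as well as adjacent columns, which gives the strided memory access pattern that the abstract associates with cache conflict misses on CPUs.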
57	Eismo dalyvių kelyje atpažinimas naudojant dirbtinius neuroninius tinklus ir grafikos procesorių / On-road vehicle recognition using neural networks and graphics processing unit Kinderis, Povilas 27 June 2014 Many people are injured in road accidents each year; some of the injuries are serious or fatal. Increasing effort is being put into developing systems that could help reduce the number of accidents on the road. Such systems could warn drivers of potential danger by recognizing on-road vehicles and tracking their position on the road. Recognizing on-road vehicles in images is a complex and computationally intensive problem. In this work, stereo images, a disparity map and convolutional neural networks are used to solve it. Convolutional neural networks are computationally intensive, so they were optimized using a GPU and OpenCL, achieving a speed improvement of up to 33.4% compared to the central processor. Stereo images and the disparity map make it possible to discard large areas of the frame that do not need to be classified with the convolutional neural network. Depending on the scene in the image, the number of required classification operations decreases on average by about 70-95%, which allows the frame to be processed correspondingly faster. Eismo dalyvių atpažinimas Konvoliucinis neuroninis tinklas Grafinis procesorius GP OpenCL OpenCV Stereo vaizdas Nesugretinamumo žemėlapis On-road vehicle recognition Convolutional neural network Graphics processing unit GPU Stereo image Disparity map
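Entry 57 reports that the disparity map lets most of the frame be skipped before classification. The abstract does not say how regions are rejected, so the sketch below rests on one possible rule that is purely an assumption: a sliding window is kept only if its mean disparity reaches a threshold, meaning that something is close enough to the camera. All names and parameters are hypothetical.

```python
def candidate_windows(disparity, win, stride, min_mean_disparity):
    """Select sliding windows worth passing to a classifier.

    `disparity` is a 2-D list of per-pixel disparity values. A window is
    kept when its mean disparity is at least `min_mean_disparity`.
    Returns (kept_windows, total_windows), where each kept window is the
    (top, left) corner of a win x win square.
    """
    rows, cols = len(disparity), len(disparity[0])
    kept, total = [], 0
    for top in range(0, rows - win + 1, stride):
        for left in range(0, cols - win + 1, stride):
            total += 1
            window_sum = sum(
                disparity[r][c]
                for r in range(top, top + win)
                for c in range(left, left + win)
            )
            if window_sum / (win * win) >= min_mean_disparity:
                kept.append((top, left))
    return kept, total


def classification_savings(kept, total):
    """Fraction of classifier calls avoided by the pre-filter."""
    return 1.0 - len(kept) / total
```

With a filter of this kind, the classifier runs only on the kept windows, so the saving depends on how much of the scene is background.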
58	Medical Image Processing on the GPU: Past, Present and Future Eklund, Anders, Dufort, Paul, Forsberg, Daniel, LaConte, Stephen January 2013 Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing and are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU-accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges. Graphics processing unit (GPU) OpenGL DirectX CUDA OpenCL Filtering Interpolation Histogram estimation Distance transforms Image registration Image segmentation Image denoising CT PET SPECT MRI fMRI DTI Ultrasound Optical imaging Microscopy
59	Método automático para descoberta de funções de ordenação utilizando programação genética paralela em GPU / Automatic ranking function discovery method using parallel genetic programming on GPU Coimbra, Andre Rodrigues 28 March 2014 Ranking functions play a vital role in the performance of information retrieval systems, ensuring that the documents most related to the user's search need (represented as a query) are shown at the top of the results, so that the user does not have to examine a range of documents that are not really relevant. This work therefore uses Genetic Programming (GP), an Evolutionary Computation technique, to find ranking functions automatically and systematically. Moreover, the GP technique was developed following a strategy that exploits parallelism through graphics processing units. Other methods known in the context of information retrieval, such as classification committees and the Lazy strategy, were combined with the proposed approach, called Finch. These combinations were only feasible because of the nature of GP and the use of parallelism. The experimental results obtained with Finch, regarding the quality of the discovered ranking functions, surpassed the results of several strategies known in the literature. Significant gains were also achieved in time performance: the solution that exploits parallelism takes, on average, about twenty times less time than the solution using only the central processing unit. 
Programação genética Computação paralela Sistema de recuperação de informação Ordenação de documentos Computação evolucionária CUDA Unidade gráfica de processamento Inteligência computacional Genetic programming Parallel computing Information retrieval system Document ranking Evolutionary computation CUDA Graphics processing unit Machine learning
60	A Runtime Framework for Regular and Irregular Message-Driven Parallel Applications on GPU Systems Rengasamy, Vasudevan January 2014 The effective use of GPUs for accelerating applications depends on a number of factors including effective asynchronous use of heterogeneous resources, reducing data transfer between CPU and GPU, increasing occupancy of GPU kernels, overlapping data transfers with computations, reducing GPU idling and kernel optimizations. Overcoming these challenges requires considerable effort on the part of the application developers. Most optimization strategies are proposed and tuned specifically for individual applications. Message-driven executions with over-decomposition of tasks constitute an important model for parallel programming and provide multiple benefits including communication-computation overlap and reduced idling on resources. Charm++ is one such message-driven language which employs over-decomposition of tasks, computation-communication overlap and a measurement-based load balancer to achieve high CPU utilization. This research has developed an adaptive runtime framework for efficient executions of Charm++ message-driven parallel applications on GPU systems. In the first part of our research, we have developed a runtime framework, G-Charm, with the focus primarily on optimizing regular applications. At runtime, G-Charm automatically combines multiple small GPU tasks into a single larger kernel, which reduces the number of kernel invocations while improving CUDA occupancy. G-Charm also enables reuse of existing data in GPU global memory, and performs GPU memory management and dynamic scheduling of tasks across CPU and GPU in order to reduce idle time. In order to combine the partial results obtained from the computations performed on CPU and GPU, G-Charm allows the user to specify an operator with which the partial results are combined at runtime. 
We also perform compile-time code generation to reduce programming overhead. For Cholesky factorization, a regular parallel application, G-Charm provides a 14% improvement over a highly tuned implementation. In the second part of our research, we extended our runtime to overcome the challenges presented by irregular applications, such as aperiodic generation of tasks, irregular memory access patterns and varying workloads during application execution. We developed models for deciding the number of tasks that can be combined into a kernel based on the rate of task generation and the GPU occupancy of the tasks. For irregular applications, data reuse results in uncoalesced GPU memory access. We evaluated the effect of altering the global memory access pattern in improving coalesced access. We have also developed adaptive methods for hybrid execution on CPU and GPU wherein we consider the varying workloads while scheduling tasks across the CPU and GPU. We demonstrate that our dynamic strategies result in an 8-38% reduction in execution times for an N-body simulation application and a molecular dynamics application over the corresponding static strategies that are suited to regular applications. Graphics Processing Unit (GPU) Parallel Programming (Computer Science) Parallel Programming Models Parallel Programming Frameworks Charm++ (Computer Program Language) HybridAPI-GPU Management Framework G-Charm Framework Accelerator Based Computing Cholesky Factorization Computer Science
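Entry 60 describes two ideas, merging many small tasks into larger launches and combining partial results with a user-specified operator. The sketch below imitates both in plain Python and does not use the G-Charm or Charm++ API. The function names are hypothetical, and each "launch" is just an ordinary loop standing in for a combined kernel. It also assumes the combining operator is associative, so that the grouping of tasks does not change the result.

```python
from functools import reduce
import operator


def batch_tasks(tasks, max_batch):
    """Group small tasks into batches of at most `max_batch` items.

    Each batch stands in for one combined kernel launch.
    """
    return [tasks[i:i + max_batch] for i in range(0, len(tasks), max_batch)]


def run_batched(tasks, work, combine, max_batch):
    """Run `work` on every task, one batch at a time, and merge the results.

    `combine` is the user-supplied binary operator; it is applied first
    within each batch and then across the per-batch partial results.
    Returns (final_result, number_of_launches).
    """
    batches = batch_tasks(tasks, max_batch)
    partials = [reduce(combine, (work(t) for t in batch)) for batch in batches]
    return reduce(combine, partials), len(batches)


if __name__ == "__main__":
    total, launches = run_batched(
        list(range(1, 11)), lambda x: x * x, operator.add, max_batch=3
    )
    print(total, launches)
```

Raising `max_batch` lowers the number of launches for the same set of tasks, which is the trade-off the abstract describes when it models how many tasks to combine into one kernel.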
