
Hardware Acceleration of a Monte Carlo Simulation for Photodynamic Therapy Treatment Planning

Lo, William Chun Yip 15 February 2010 (has links)
Monte Carlo (MC) simulations are widely used in the field of medical biophysics, particularly for modelling light propagation in biological tissue. The iterative nature of MC simulations and their high computation time currently limit their use to solving the forward solution for a given set of source characteristics and tissue optical properties. However, applications such as photodynamic therapy treatment planning or image reconstruction in diffuse optical tomography require solving the inverse problem given a desired light dose distribution or absorber distribution, respectively. A faster means for performing MC simulations would enable the use of MC-based models for such tasks. In this thesis, a gold standard MC code called MCML was accelerated using two distinct hardware-based approaches, namely designing custom hardware on field-programmable gate arrays (FPGAs) and programming commodity graphics processing units (GPUs). Currently, the GPU-based approach is promising, offering approximately 1000-fold speedup with 4 GPUs compared to an Intel Xeon CPU.
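For context, the computational core that such hardware ports parallelize is remarkably compact. Below is a minimal, single-threaded C++ sketch of the "hop-drop-spin" photon-packet loop used by MCML-style codes; the optical coefficients are illustrative placeholders, and layer boundaries, refractive-index mismatches and detection grids are omitted. In an infinite medium essentially all launched weight is eventually absorbed, so the printed mean should approach 1 — a quick sanity check.

```cpp
#include <cmath>
#include <cstdio>
#include <random>

// Illustrative optical properties (placeholders, not taken from the thesis):
// absorption mu_a and scattering mu_s in 1/cm, anisotropy factor g.
constexpr double mu_a = 0.1, mu_s = 100.0, g = 0.9;
constexpr double mu_t = mu_a + mu_s;
constexpr double PI = 3.14159265358979323846;

int main() {
    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const int n_photons = 100000;
    double absorbed = 0.0;  // total weight deposited in the (infinite) medium

    for (int p = 0; p < n_photons; ++p) {
        double x = 0, y = 0, z = 0;     // packet position
        double ux = 0, uy = 0, uz = 1;  // direction cosines
        double w = 1.0;                 // photon-packet weight

        while (w > 0.0) {
            // Hop: sample a free path length from Beer-Lambert statistics.
            double s = -std::log(1.0 - uni(rng)) / mu_t;
            x += s * ux; y += s * uy; z += s * uz;  // (boundary checks omitted)

            // Drop: deposit the absorbed fraction of the packet weight.
            double dw = w * (mu_a / mu_t);
            absorbed += dw;
            w -= dw;

            // Spin: sample the deflection angle from Henyey-Greenstein.
            double tmp = (1 - g * g) / (1 - g + 2 * g * uni(rng));
            double cost = (1 + g * g - tmp * tmp) / (2 * g);
            double sint = std::sqrt(1 - cost * cost);
            double phi = 2 * PI * uni(rng);
            if (std::fabs(uz) > 0.99999) {  // packet travelling ~vertically
                ux = sint * std::cos(phi);
                uy = sint * std::sin(phi);
                uz = cost * (uz >= 0 ? 1 : -1);
            } else {                         // general rotation (MCML form)
                double t = std::sqrt(1 - uz * uz);
                double nux = sint * (ux * uz * std::cos(phi) - uy * std::sin(phi)) / t + ux * cost;
                double nuy = sint * (uy * uz * std::cos(phi) + ux * std::sin(phi)) / t + uy * cost;
                double nuz = -sint * std::cos(phi) * t + uz * cost;
                ux = nux; uy = nuy; uz = nuz;
            }

            // Roulette: terminate low-weight packets without biasing the result.
            if (w < 1e-4) {
                if (uni(rng) < 0.1) w *= 10.0;
                else w = 0.0;
            }
        }
    }
    std::printf("mean absorbed weight per packet: %f\n", absorbed / n_photons);
    return 0;
}
```

Each packet's random walk is independent of every other packet's, which is what makes the method embarrassingly parallel on FPGAs and GPUs.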

Hybrid Spectral Ray Tracing Method for Multi-scale Millimeter-wave and Photonic Propagation Problems

Hailu, Daniel 30 September 2011 (has links)
This thesis presents an efficient, self-consistent Hybrid Spectral Ray Tracing (HSRT) technique for the analysis and design of multi-scale sub-millimeter-wave problems, where sub-wavelength features are modeled by rigorous methods and complex structures with dimensions of tens or even hundreds of wavelengths are modeled by asymptotic methods. Quasi-optical devices are used in imaging arrays for sub-millimeter and terahertz applications, THz time-domain spectroscopy (THz-TDS), high-speed wireless communications, and space applications, where they couple terahertz radiation from space to a hot-electron bolometer. As physically small as these devices and structures have become, they are very large in terms of the wavelength of the driving quasi-optical sources, with dimensions of tens or even hundreds of wavelengths. Their simulation and design optimization is therefore an extremely challenging electromagnetic problem: analyzing complex, electrically large, unbounded wave structures with rigorous methods such as the method of moments (MoM), the finite element method (FEM), or the finite-difference time-domain (FDTD) method can become practically impossible because of the computational resources required. Asymptotic high-frequency techniques are instead used to analyze electrically large quasi-optical systems, and hybrid methods to solve multi-scale problems. Spectral Ray Tracing (SRT) has a number of unique advantages as a candidate for hybridization. SRT inherits the strengths of the Spectral Theory of Diffraction (STD): it can model reflection, refraction and diffraction of an arbitrary wave incident on a complex structure, which is not the case for diffraction theories such as the Geometrical Theory of Diffraction (GTD), the Uniform Theory of Diffraction (UTD) and the Uniform Asymptotic Theory (UAT). By including complex rays, SRT can analyze both near fields and far fields accurately with minimal approximations. In this thesis, a novel matrix representation of SRT is presented that uses only one spectral integration per observation point, and it is applied to modeling hemispherical and hyper-hemispherical lenses. The hybridization of SRT with commercially available FEM and MoM software is proposed to manage the complexity of multi-scale analysis, yielding a computationally efficient, self-consistent HSRT algorithm. Various arrangements of the hybrid method, such as FEM-SRT and MoM-SRT, are investigated and validated by comparing radiation patterns against Ansoft HFSS (FEM), FEKO (MoM and the Multi-Level Fast Multipole Method, MLFMM) and physical optics. Three sample structures are analyzed with the HSRT for this purpose: a bow-tie terahertz antenna backed by a hyper-hemispherical silicon lens, an on-chip planar dipole fabricated in SiGe:C BiCMOS technology and attached to a hyper-hemispherical silicon lens, and a double-slot antenna backed by a silica lens. The computational performance (memory requirements, CPU/GPU time) of the developed algorithm is compared with other methods in commercially available software. It is shown that MoM-SRT, in its present implementation, is more accurate than MoM-PO but comparable in speed; however, as shown in this thesis, MoM-SRT can take advantage of parallel processing and GPUs. The HSRT algorithm is also applied to the simulation of an on-chip dipole antenna backed by a silicon lens and integrated with a 180-GHz VCO, and the simulated radiation pattern is compared with measurements.
In addition, it is shown that the matrix formulation of SRT and the HSRT are promising approaches for solving complex, electrically large problems with high accuracy. This thesis also describes a new measurement setup developed specifically for characterizing integrated antennas — the radiation pattern and gain of an embedded on-chip antenna — in the mmW/terahertz range. In this method, the radiation pattern is first measured in a quasi-optical configuration using a power detector. The radiated power is then estimated from integration over the radiation pattern. Finally, the antenna gain is obtained from the measurement of a two-antenna system.
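As a toy illustration of the spectral (plane-wave) representation that underlies STD and SRT — and of the "one spectral integration per observation point" idea — the following C++ sketch evaluates a 2-D scalar field radiated by a Gaussian aperture at a single observation point by direct quadrature of its plane-wave spectrum, retaining evanescent components through a complex kz. This is not the thesis's matrix SRT formulation; the frequency, beam waist and quadrature settings are assumed for demonstration only.

```cpp
#include <cmath>
#include <complex>
#include <cstdio>

using cplx = std::complex<double>;
constexpr double PI = 3.14159265358979323846;
constexpr double c0 = 2.99792458e8;
const double k0 = 2 * PI * 300e9 / c0;  // free-space wavenumber at 300 GHz

// Placeholder plane-wave spectrum: a Gaussian aperture field of waist w0 at z = 0.
cplx spectrum(double kx, double w0) {
    return w0 * std::sqrt(PI) * std::exp(-0.25 * kx * kx * w0 * w0);
}

// One spectral integration per observation point (x, z >= 0). Propagating
// waves get a real kz; evanescent ones a positive-imaginary kz that decays.
cplx field_at(double x, double z, double w0) {
    const int N = 20000;           // quadrature points
    const double kmax = 3.0 * k0;  // spectrum truncation
    const double dk = 2.0 * kmax / N;
    cplx sum = 0.0;
    for (int i = 0; i < N; ++i) {
        double kx = -kmax + (i + 0.5) * dk;
        cplx kz = (std::fabs(kx) <= k0)
                      ? cplx(std::sqrt(k0 * k0 - kx * kx), 0.0)
                      : cplx(0.0, std::sqrt(kx * kx - k0 * k0));
        sum += spectrum(kx, w0) * std::exp(cplx(0.0, 1.0) * (kx * x + kz * z)) * dk;
    }
    return sum / (2.0 * PI);
}

int main() {
    const double w0 = 2e-3;  // 2 mm beam waist (assumed)
    for (double z : {0.0, 5e-3, 10e-3})
        std::printf("z = %4.1f mm   |E(0,z)| = %.4f\n",
                    z * 1e3, std::abs(field_at(0.0, z, w0)));
    return 0;
}
```

Because every observation point needs just one such independent integral, points can be evaluated in parallel — the property that makes the matrix SRT formulation amenable to GPUs.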

Evaluating the OpenACC API for Parallelization of CFD Applications

Pickering, Brent Phillip 06 September 2014 (has links)
Directive-based programming of graphics processing units (GPUs) has recently appeared as a viable alternative to using specialized low-level languages such as CUDA C and OpenCL for general-purpose GPU programming. This technique, which uses directive or pragma statements to annotate source codes written in traditional high-level languages, is designed to permit a unified code base to serve multiple computational platforms and to simplify the transition of legacy codes to new architectures. This work analyzes the popular OpenACC programming standard, as implemented by the PGI compiler suite, in order to evaluate its utility and performance potential in computational fluid dynamics (CFD) applications. Of particular interest is the handling of stencil algorithms, which are an important component of finite-difference and finite-volume numerical methods. To this end, the process of applying the OpenACC Fortran API to a preexisting finite-difference CFD code is examined in detail, and all modifications that must be made to the original source in order to run efficiently on the GPU are noted. Optimization techniques for OpenACC are also explored, and it is demonstrated that tuning the code for a particular accelerator architecture can result in performance increases of over 30%. There are also some limitations and programming restrictions imposed by the API: it is observed that certain useful features of modern Fortran (2003/8) are effectively disabled within OpenACC regions. Finally, a combination of OpenACC and OpenMP directives is used to create a truly cross-platform Fortran code that can be compiled for either CPU or GPU hardware. The performance of the OpenACC code is measured on several contemporary NVIDIA GPU architectures, and a comparison is made between double- and single-precision arithmetic, showing that if reduced precision can be tolerated, it can lead to significant speedups. To assess the performance gains relative to a typical CPU implementation, the execution time for a standard benchmark case (lid-driven cavity) is used as a reference. The OpenACC version is compared against the identical Fortran code recompiled to use OpenMP on multicore CPUs, as well as a highly optimized C++ version of the code that utilizes hardware-aware programming techniques to attain higher performance on the Intel Xeon platforms being tested. Low-level optimizations specific to these architectures are analyzed, and it is observed that the stencil access pattern required by the structured-grid CFD code sometimes leads to performance-degrading conflict misses in the hardware-managed CPU caches. The GPU code, which primarily uses software-managed caching, is found to be free from these issues. Overall, it is observed that the OpenACC GPU code compares favorably against even the best optimized CPU version: using a single NVIDIA K20x GPU, the Fortran+OpenACC code is seen to outperform the optimized C++ version by 20% and the Fortran+OpenMP version by more than 100%, with both CPU codes running on a 16-core Xeon workstation. / Master of Science
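The thesis works with the OpenACC Fortran API; as a compact illustration of the same directive-based technique, here is a hedged C++ sketch of a 5-point Jacobi stencil — the structured-grid access pattern discussed above — annotated with OpenACC directives. Built with an OpenACC compiler (for example nvc++ -acc) the annotated loops execute on the GPU; built with an ordinary compiler, the pragmas are ignored and the same source runs serially, which is the single-code-base property directive-based programming aims for. Grid size and sweep count are illustrative.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1024, m = 1024, iters = 1000;
    std::vector<double> A(n * m, 0.0), Anew(n * m, 0.0);
    for (int i = 0; i < m; ++i) A[i] = 1.0;  // hot top boundary row

    double* a = A.data();
    double* anew = Anew.data();

    // Keep both arrays resident on the GPU for the whole sweep loop so data
    // crosses the PCIe bus only at entry and exit of the region.
    #pragma acc data copy(a[0:n*m]) create(anew[0:n*m])
    {
        for (int it = 0; it < iters; ++it) {
            // 5-point Jacobi stencil: each interior point averages its four
            // neighbours -- the access pattern typical of structured-grid
            // finite-difference CFD kernels.
            #pragma acc parallel loop collapse(2) present(a[0:n*m], anew[0:n*m])
            for (int j = 1; j < n - 1; ++j)
                for (int i = 1; i < m - 1; ++i)
                    anew[j * m + i] = 0.25 * (a[j * m + i - 1] + a[j * m + i + 1] +
                                              a[(j - 1) * m + i] + a[(j + 1) * m + i]);

            // Write the sweep back for the next iteration.
            #pragma acc parallel loop collapse(2) present(a[0:n*m], anew[0:n*m])
            for (int j = 1; j < n - 1; ++j)
                for (int i = 1; i < m - 1; ++i)
                    a[j * m + i] = anew[j * m + i];
        }
    }
    std::printf("centre value after %d sweeps: %f\n",
                iters, a[(n / 2) * m + m / 2]);
    return 0;
}
```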

Medical Image Processing on the GPU : Past, Present and Future

Eklund, Anders, Dufort, Paul, Forsberg, Daniel, LaConte, Stephen January 2013 (has links)
Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges.
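As a concrete instance of the "basic image processing operations" the review surveys, the sketch below is a plain C++ reference implementation of 2-D filtering with a 3x3 kernel. Each output pixel depends only on a small input neighborhood, which is why such operations map naturally onto GPUs with one thread per pixel; the image and kernel here are placeholders.

```cpp
#include <cstdio>
#include <vector>

// Naive 3x3 convolution: every output pixel depends only on a small
// neighborhood, so on a GPU each pixel can be computed by its own thread.
std::vector<float> convolve3x3(const std::vector<float>& img, int w, int h,
                               const float k[3][3]) {
    std::vector<float> out(img.size(), 0.0f);
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    acc += k[dy + 1][dx + 1] * img[(y + dy) * w + (x + dx)];
            out[y * w + x] = acc;
        }
    return out;
}

int main() {
    const int w = 8, h = 8;
    std::vector<float> img(w * h, 0.0f);
    img[4 * w + 4] = 1.0f;  // single bright pixel
    const float blur[3][3] = {{1 / 9.f, 1 / 9.f, 1 / 9.f},
                              {1 / 9.f, 1 / 9.f, 1 / 9.f},
                              {1 / 9.f, 1 / 9.f, 1 / 9.f}};
    auto out = convolve3x3(img, w, h, blur);
    std::printf("blurred center: %f\n", out[4 * w + 4]);  // 1/9
    return 0;
}
```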

A Runtime Framework for Regular and Irregular Message-Driven Parallel Applications on GPU Systems

Rengasamy, Vasudevan January 2014 (has links) (PDF)
The effective use of GPUs for accelerating applications depends on a number of factors, including effective asynchronous use of heterogeneous resources, reducing data transfer between CPU and GPU, increasing the occupancy of GPU kernels, overlapping data transfers with computations, reducing GPU idling, and kernel optimizations. Overcoming these challenges requires considerable effort on the part of application developers, and most optimization strategies are proposed and tuned for individual applications. Message-driven execution with over-decomposition of tasks constitutes an important model for parallel programming, providing multiple benefits including communication-computation overlap and reduced idling of resources. Charm++ is one such message-driven language; it employs over-decomposition of tasks, computation-communication overlap and a measurement-based load balancer to achieve high CPU utilization. This research has developed an adaptive runtime framework for efficient execution of Charm++ message-driven parallel applications on GPU systems. In the first part of our research, we developed a runtime framework, G-Charm, focused primarily on optimizing regular applications. At runtime, G-Charm automatically combines multiple small GPU tasks into a single larger kernel, which reduces the number of kernel invocations while improving CUDA occupancy. G-Charm also enables reuse of data already in GPU global memory, performs GPU memory management, and dynamically schedules tasks across the CPU and GPU to reduce idle time. To combine partial results obtained from computations performed on the CPU and GPU, G-Charm lets the user specify an operator with which the partial results are merged at runtime. We also perform compile-time code generation to reduce programming overhead. For Cholesky factorization, a regular parallel application, G-Charm provides a 14% improvement over a highly tuned implementation. In the second part of our research, we extended the runtime to address the challenges presented by irregular applications, such as periodic generation of tasks, irregular memory access patterns and workloads that vary during application execution. We developed models for deciding the number of tasks to combine into a kernel based on the rate of task generation and the GPU occupancy of the tasks. For irregular applications, data reuse results in uncoalesced GPU memory access; we evaluated the effect of altering the global memory access pattern to improve coalescing. We also developed adaptive methods for hybrid execution on CPU and GPU that account for the varying workloads when scheduling tasks across the two. We demonstrate that our dynamic strategies yield 8-38% reductions in execution time for an N-body simulation application and a molecular dynamics application over the corresponding static strategies that are amenable to regular applications.
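The kernel-coalescing idea can be sketched in a few lines of host-side C++. In the hypothetical sketch below (the names TaskBatcher, submit and flush are invented for illustration and are not the G-Charm API), small tasks accumulate in a queue and are executed together, with the single combined loop standing in for one large GPU kernel launch — fewer launches, better occupancy.

```cpp
#include <cstdio>
#include <vector>

struct Task { int offset, count; };  // a small slice of element-wise work

// Hypothetical batcher (not the G-Charm API): queue small tasks and run them
// in one pass, standing in for a single combined GPU kernel launch.
class TaskBatcher {
    std::vector<Task> pending;
    std::vector<float>& data;
    size_t threshold;
public:
    TaskBatcher(std::vector<float>& d, size_t t) : data(d), threshold(t) {}

    void submit(const Task& t) {
        pending.push_back(t);
        if (pending.size() >= threshold) flush();  // enough work: launch now
    }

    void flush() {  // one "launch" covering every queued task
        if (pending.empty()) return;
        std::printf("launching 1 combined kernel for %zu tasks\n", pending.size());
        for (const Task& t : pending)        // on a GPU: one grid over all tasks
            for (int i = t.offset; i < t.offset + t.count; ++i)
                data[i] *= 2.0f;             // the per-element work
        pending.clear();
    }
};

int main() {
    std::vector<float> data(1024, 1.0f);
    TaskBatcher batcher(data, 4);            // coalesce 4 small tasks per launch
    for (int t = 0; t < 10; ++t)
        batcher.submit({t * 64, 64});        // 10 tasks -> 2 full launches...
    batcher.flush();                         // ...plus one drain for the rest
    std::printf("data[0] = %.1f\n", data[0]);  // 2.0
    return 0;
}
```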

Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices

Pandit, Prasanna Vasant January 2013 (has links) (PDF)
Computing systems have become heterogeneous with the increasing prevalence of multi-core CPUs, Graphics Processing Units (GPU) and other accelerators in them. OpenCL has emerged as an attractive programming framework for heterogeneous systems. However, utilizing multiple devices in OpenCL is a challenge as it requires the programmer to explicitly map data and computation to each device. Utilizing multiple devices simultaneously to speed up execution of a kernel is even more complex, as the relative execution time of the kernel on different devices can vary significantly. Also, after each kernel execution, a coherent version of the data needs to be established. This means that, in order to utilize all devices effectively, the programmer has to spend considerable time and effort to distribute work across all devices, keep track of modified data in these devices and correctly perform a merging step to put the data together. Further, the relative performance of a program may vary across different inputs, which means a statically determined work distribution may not work well. In this work, we present FluidiCL, an OpenCL runtime that takes a program written for a single device and uses multiple heterogeneous devices to execute each kernel. The runtime performs dynamic work distribution and cooperatively executes each kernel on all available devices. Since we consider a setup with devices having discrete address spaces, our solution ensures that execution of OpenCL work-groups on devices is adjusted by taking into account the overheads for data management. The data transfers and data merging needed to ensure coherence are handled transparently without requiring any effort from the programmer. FluidiCL also does not require prior training or profiling and is completely portable across different machines. Because it is dynamic, the runtime is able to adapt to system load. We have developed several optimizations for improving the performance of FluidiCL. We evaluate the runtime across different sets of devices. On a machine with an Intel quad-core processor and an NVidia Fermi GPU, FluidiCL shows a geomean speedup of nearly 64% over the GPU, 88% over the CPU and 14% over the best of the two devices in each benchmark. In all benchmarks, performance of our runtime comes to within 13% of the best of the two devices. FluidiCL shows similar results on a machine with a quad-core CPU and an NVidia Kepler GPU, with up to 26% speedup over the best of the two. We also present results considering an Intel Xeon Phi accelerator and a CPU and find that FluidiCL performs up to 45% faster than the best of the two devices. We extend FluidiCL from a CPU–GPU scenario to a three-device setup having a quad-core CPU, an NVidia Kepler GPU and an Intel Xeon Phi accelerator and find that FluidiCL obtains a geomean improvement of 6% in kernel execution time over the best of the three devices considered in each case.
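A minimal C++ sketch of the dynamic work-distribution idea follows (illustrative only, not the FluidiCL implementation): two "devices" of different speeds pull fixed-size chunks of work-groups from a shared atomic counter, so the faster device automatically executes more of the kernel. The data transfers and merging that FluidiCL performs to keep discrete address spaces coherent are omitted.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int total_groups = 1000, chunk = 50;
    std::atomic<int> next{0};              // shared work-group queue head
    std::vector<float> out(total_groups);
    int done[2] = {0, 0};

    // A "device" repeatedly grabs the next chunk of work-groups until the
    // queue is drained; speed_us models its per-chunk execution time.
    auto device = [&](int id, int speed_us) {
        for (;;) {
            int begin = next.fetch_add(chunk);
            if (begin >= total_groups) break;
            int end = std::min(begin + chunk, total_groups);
            for (int g = begin; g < end; ++g) out[g] = g * 0.5f;  // the "kernel"
            std::this_thread::sleep_for(std::chrono::microseconds(speed_us));
            done[id] += end - begin;
        }
    };

    std::thread fast(device, 0, 100);  // stands in for the GPU
    std::thread slow(device, 1, 400);  // stands in for the CPU
    fast.join();
    slow.join();
    std::printf("fast device ran %d work-groups, slow device ran %d\n",
                done[0], done[1]);
    return 0;
}
```

Because chunks are claimed on demand rather than split statically, the partition adapts automatically to relative device speed, input-dependent behavior, and system load — the property the abstract emphasizes.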

Investigation of hierarchical deep neural network structure for facial expression recognition

Motembe, Dodi 01 1900 (has links)
Facial expression recognition (FER) remains a challenging problem, and machines still struggle to comprehend the dynamic shifts in facial expressions of human emotions. The existing systems that have proven effective consist of deep network structures that need powerful and expensive hardware; the deeper the network, the longer the training and testing, and many systems use expensive GPUs to make the process faster. To remedy these challenges while maintaining the main goal of improving recognition accuracy, we create a generic hierarchical structure with variable settings. This generic structure has a hierarchy of three convolutional blocks, two dropout blocks and one fully connected block. From this generic structure we derived four different network structures to be investigated according to their performance. From each of these cases we again derived six network structures by varying the parameters under analysis: the filter sizes of the convolutional maps and of the max-pooling, as well as the number of convolutional maps. In total, we have 24 network structures to investigate, six per case (see the configuration sketch after this abstract). After many repeated experiments, case 1a emerged as the top performer of its group, while cases 2a, 3c and 4c outperformed the others in their respective groups. Comparing the four group winners indicates that case 2a is the optimal structure with optimal parameters, outperforming the other group winners. The criteria for choosing the best network structure were the minimum, average and maximum accuracy over 15 repeated training runs. All 24 proposed network structures were tested on two of the most widely used FER datasets, CK+ and JAFFE. After repeated simulations, the results demonstrate that our inexpensive optimal network architecture achieved 98.11 % accuracy on the CK+ dataset. Tested on the JAFFE dataset, the optimal architecture reached 84.38 % using just a standard CPU and simpler procedures. We also compared the four group winners with other existing FER models whose performance was recently reported in two studies using the same two datasets. Three of our four group winners (cases 1a, 2a and 4c) came within 1.22 % of the accuracy of the top-performing model on the CK+ dataset, and two of our network structures, cases 2a and 3c, placed third, beating other models on the JAFFE dataset. / Electrical and Mining Engineering
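A hedged sketch of the 4 x 6 configuration grid described above; since the abstract does not give the concrete filter sizes or map counts, the values below are hypothetical placeholders.

```cpp
#include <cstdio>

// Hypothetical enumeration of the 4 x 6 = 24 investigated structures.
// The generic hierarchy (three convolutional blocks, two dropout blocks,
// one fully connected block) is fixed; the variable parameters are the
// filter sizes and the number of convolutional maps. All concrete values
// here are placeholders, not the thesis's actual settings.
struct NetConfig {
    int baseCase;    // 1..4: derived network structure
    char variant;    // 'a'..'f'
    int convFilter;  // e.g. 3 means 3x3 filters
    int numMaps;     // convolutional maps in the first block
};

int main() {
    const int filterSizes[3] = {3, 5, 7};  // placeholder
    const int mapCounts[2] = {32, 64};     // placeholder
    int total = 0;
    for (int c = 1; c <= 4; ++c) {
        char v = 'a';
        for (int f : filterSizes)
            for (int m : mapCounts) {
                NetConfig cfg{c, v++, f, m};
                std::printf("case %d%c: %dx%d filters, %d maps\n",
                            cfg.baseCase, cfg.variant, cfg.convFilter,
                            cfg.convFilter, cfg.numMaps);
                ++total;
            }
    }
    std::printf("total structures: %d\n", total);  // 24
    return 0;
}
```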

On GPU Assisted Polar Decoding : Evaluating the Parallelization of the Successive Cancellation Algorithm using Graphics Processing Units

Nordqvist, Siri January 2023 (has links)
In telecommunication, messages sent through a wireless medium often experience noise that corrupts them. As the demand for high throughput in mobile networks increases, algorithms that can detect and correct these corrupted messages quickly and accurately are of interest to the industry. Polar codes have been chosen by the Third Generation Partnership Project as the error-correction code for 5G New Radio control channels. This thesis work aimed to investigate whether the Successive Cancellation (SC) polar decoding algorithm could be parallelized and whether a graphics processing unit (GPU) can be used to optimize the execution time of the algorithm. The SC decoder was enhanced with tree pruning and with GPU support to exploit parallelization. The difference in execution time between the concurrent and sequential versions of the SC algorithm, with and without tree pruning, was evaluated. The tree-pruning SC algorithm almost always offered shorter execution times than the SC algorithm without tree pruning. However, GPU support did not reduce the execution time in these tests; based on these results, it is not certain that a GPU can improve this type of enhanced SC algorithm.
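For reference, below is a minimal recursive successive-cancellation decoder in C++, using the standard LLR-domain min-sum f-function and exact g-function. The frozen-bit pattern and channel LLRs are illustrative placeholders; tree-pruned SC variants like the one evaluated in the thesis accelerate this same recursion by short-circuiting subtrees that contain only frozen or only information bits.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

using Bits = std::vector<int>;

// Recursive SC decoding in the LLR domain (min-sum f, exact g).
// Returns {decoded u-bits, re-encoded codeword bits (partial sums)}.
std::pair<Bits, Bits> sc_decode(const std::vector<double>& llr,
                                const std::vector<bool>& frozen) {
    const size_t n = llr.size();
    if (n == 1) {
        int u = (!frozen[0] && llr[0] < 0.0) ? 1 : 0;  // frozen bits decode to 0
        return {{u}, {u}};
    }
    const size_t h = n / 2;

    // f-step: min-sum LLRs for the left (upper) subtree.
    std::vector<double> left(h);
    for (size_t i = 0; i < h; ++i)
        left[i] = std::copysign(1.0, llr[i]) * std::copysign(1.0, llr[i + h]) *
                  std::min(std::fabs(llr[i]), std::fabs(llr[i + h]));
    std::vector<bool> fl(frozen.begin(), frozen.begin() + h);
    auto [ul, xl] = sc_decode(left, fl);

    // g-step: LLRs for the right (lower) subtree, using left partial sums.
    std::vector<double> right(h);
    for (size_t i = 0; i < h; ++i)
        right[i] = llr[i + h] + (1 - 2 * xl[i]) * llr[i];
    std::vector<bool> fr(frozen.begin() + h, frozen.end());
    auto [ur, xr] = sc_decode(right, fr);

    // Combine decisions and partial sums for the parent level.
    Bits u(ul), x(n);
    u.insert(u.end(), ur.begin(), ur.end());
    for (size_t i = 0; i < h; ++i) { x[i] = xl[i] ^ xr[i]; x[i + h] = xr[i]; }
    return {u, x};
}

int main() {
    // Toy N = 8 code: freeze four positions (placeholder reliability order).
    std::vector<bool> frozen = {true, true, true, false, true, false, false, false};
    // Channel LLRs consistent with the all-zero codeword plus mild noise.
    std::vector<double> llr = {2.1, 1.7, -0.3, 2.5, 1.9, 2.2, 0.8, 1.4};
    auto [u, x] = sc_decode(llr, frozen);
    std::printf("decoded u-bits:      ");
    for (int b : u) std::printf("%d", b);
    std::printf("\nre-encoded codeword: ");
    for (int b : x) std::printf("%d", b);
    std::printf("\n");
    return 0;
}
```

The sequential dependence is visible in the recursion: the g-step cannot start until the left subtree's decisions are known, which is why SC parallelizes poorly across the decoding of a single codeword and why the thesis's GPU results were mixed.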

Implementation and optimization of LDPC decoding algorithms tailored for Nvidia GPUs in 5G

Salomonsson, Benjamin January 2022 (has links)
Low-Density Parity-Check (LDPC) codes are linear error-correcting codes used to establish reliable communication between units on a noisy transmission channel in mobile telecommunications. LDPC decoding algorithms detect and recover altered or corrupted message bits using sparse parity-check matrices in order to decipher messages correctly. LDPC codes have been shown to be fitting coding schemes for fifth generation (5G) New Radio (NR), according to the Third Generation Partnership Project (3GPP). TietoEvry, a consultancy in telecom, has found that LDPC decoding algorithms can be optimized using a parallel computing platform called Compute Unified Device Architecture (CUDA), developed by NVIDIA. This platform exploits the capabilities of a graphics processing unit (GPU) rather than a central processing unit (CPU), thereby providing parallel computation. An optimized version of an LDPC decoding algorithm, the Min-Sum Algorithm (MSA), is implemented in CUDA and, for comparison of execution times, in C++, to explore the capabilities that CUDA offers. The testing is done with a set of 12 sparse parity-check matrices and input-channel messages of different sizes. As a result, the CUDA implementation executes approximately 55% faster than a standard, unoptimized C++ implementation.
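To make the algorithm under test concrete, here is a minimal flooding min-sum decoder sketch in plain C++, corresponding to an unoptimized CPU reference rather than the CUDA implementation; the parity-check matrix and channel LLRs are tiny placeholders, not a 5G NR base graph. Within one iteration every check-node and variable-node update is independent of the others, which is exactly the parallelism a CUDA implementation exploits by assigning one thread per node.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Parity-check matrix H (4 checks x 8 variable nodes), placeholder only.
    const int M = 4, N = 8;
    const int H[M][N] = {{1,1,0,1,1,0,0,0},
                         {0,1,1,0,1,1,0,0},
                         {0,0,1,1,0,1,1,0},
                         {1,0,0,0,0,1,1,1}};
    // Channel LLRs for a received noisy codeword (placeholder values).
    std::vector<double> llr = {1.2, -0.7, 0.9, 2.0, 1.5, -0.2, 1.1, 0.6};

    std::vector<std::vector<double>> v2c(M, std::vector<double>(N, 0.0));
    std::vector<std::vector<double>> c2v(M, std::vector<double>(N, 0.0));
    // Initialize variable-to-check messages with the channel LLRs.
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            if (H[m][n]) v2c[m][n] = llr[n];

    std::vector<int> hard(N, 0);
    for (int iter = 0; iter < 20; ++iter) {
        // Check-node update (min-sum): sign product and minimum magnitude
        // over all *other* incident edges.
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n) {
                if (!H[m][n]) continue;
                double sign = 1.0, mag = 1e30;
                for (int k = 0; k < N; ++k) {
                    if (!H[m][k] || k == n) continue;
                    sign *= (v2c[m][k] < 0) ? -1.0 : 1.0;
                    mag = std::min(mag, std::fabs(v2c[m][k]));
                }
                c2v[m][n] = sign * mag;
            }
        // Variable-node update and hard decision.
        for (int n = 0; n < N; ++n) {
            double total = llr[n];
            for (int m = 0; m < M; ++m)
                if (H[m][n]) total += c2v[m][n];
            hard[n] = total < 0 ? 1 : 0;
            for (int m = 0; m < M; ++m)
                if (H[m][n]) v2c[m][n] = total - c2v[m][n];  // extrinsic only
        }
        // Stop early once all parity checks are satisfied.
        bool ok = true;
        for (int m = 0; m < M && ok; ++m) {
            int parity = 0;
            for (int n = 0; n < N; ++n) parity ^= H[m][n] & hard[n];
            ok = (parity == 0);
        }
        if (ok) { std::printf("converged after %d iteration(s)\n", iter + 1); break; }
    }
    for (int n = 0; n < N; ++n) std::printf("%d", hard[n]);
    std::printf("\n");
    return 0;
}
```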
