41 |
Evaluating a GPU-based TRNG in an entropy-starved virtual Linux environment. Plesiuk, Christopher, 14 April 2016 (has links)
A secure system requires cryptography, and effective cryptography requires high-quality system entropy. Within a virtualized Linux environment, the quality and the amount of system entropy can be overestimated. These virtualized environments can also have difficulty generating entropy data.
To address the problems with entropy in virtualized Linux environments, my thesis investigates and evaluates exposing a unique true random number generator via an entropy-sharing tool called Entropy Broker. Entropy Broker distributes entropy data generated by the true random number generator to several virtualized Linux guest systems to increase the entropy of each system and, in turn, the security of their cryptographic libraries.
Entropy Broker and the true random number generator are evaluated against the Linux pseudo-random number generator, the Haveged pseudo-random number generator, and an on-chip random number generator developed by Intel. / May 2016
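The abstract does not describe how the GPU-based TRNG harvests randomness, so no attempt is made to reproduce it here. As a loose, hypothetical illustration of the kind of raw output a GPU can produce, the CUDA sketch below folds timing jitter from the device cycle counter; the kernel name and structure are assumptions, and such raw samples would still require conditioning and statistical testing before being fed to a daemon such as Entropy Broker.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Each thread samples the GPU cycle counter repeatedly and folds the deltas
// together. This is NOT a vetted TRNG design, only an illustration of raw,
// unconditioned samples; real entropy sources need whitening and validation.
__global__ void jitter_samples(uint32_t *out, int rounds)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t acc = 0;
    long long prev = clock64();
    for (int i = 0; i < rounds; ++i) {
        long long now = clock64();
        acc = (acc << 1) ^ (uint32_t)(now - prev);  // fold the timing delta in
        prev = now;
    }
    out[tid] = acc;
}

int main()
{
    const int n = 256;
    uint32_t *d_out, h_out[256];
    cudaMalloc((void **)&d_out, n * sizeof(uint32_t));
    jitter_samples<<<1, n>>>(d_out, 1024);
    cudaMemcpy(h_out, d_out, n * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    printf("first raw sample: 0x%08x\n", h_out[0]);
    cudaFree(d_out);
    return 0;
}
```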
|
42 |
Design and Implementation of C Programming Language Extension for Parallel GPU Computing. Yang, Yu-Wei, 27 July 2010 (has links)
In 2006, NVIDIA introduced CUDA (Compute Unified Device Architecture), a technique for executing general-purpose programs on the GPU. The CUDA programming model lets the same instructions execute across many threads simultaneously, which gives parallel programs a significant reduction in execution time. Although CUDA provides a series of C-like APIs (Application Programming Interfaces) so that programmers can use the language with relative ease, becoming familiar with the development process still takes considerable effort. In this thesis, we propose a tool that automatically translates C programs into corresponding CUDA programs, which effectively reduces program development time.
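To make the C-to-CUDA mapping concrete, here is a hand-written sketch (an assumption of the general pattern, not output from the thesis's translator) of how a simple C loop corresponds to a CUDA kernel plus the launch and memory-management boilerplate; automating exactly this kind of boilerplate is what saves development time.

```cuda
#include <cuda_runtime.h>

// Plain C version: element-wise vector addition.
void vec_add_c(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// CUDA counterpart: the loop body becomes a kernel and the loop index
// becomes the global thread index.
__global__ void vec_add_kernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

void vec_add_cuda(const float *h_a, const float *h_b, float *h_c, int n)
{
    float *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    vec_add_kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```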
|
43 |
Performance Analysis of Graph Algorithms using Graphics Processing Units. Weng, Hui-Ze, 02 September 2010 (has links)
In recent years, GPUs have significantly improved computing power by increasing the number of cores. The GPU's design principle focuses on parallelism in data processing, which limits the kinds of applications that can exploit this computing power. For example, the processing of highly dependent data cannot be parallelized well and therefore cannot take advantage of the computing power offered by the GPU. Most GPU research has focused on improving raw computing power; we instead study cost effectiveness by comparing the GPU with the multi-core CPU. Taking the different hardware architectures of the GPU and the multi-core CPU into account, we implement several typical graph algorithms on each and present the experimental results. Furthermore, we analyze cost effectiveness in terms of both execution time and monetary cost.
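The abstract does not name the algorithms that were implemented, so the following CUDA sketch is only a hedged illustration of the kind of data-dependent graph workload in question: one level-synchronous step of a breadth-first search over a CSR graph, where irregular memory accesses and atomics limit how well the GPU's parallelism can be exploited.

```cuda
// One level-synchronous BFS step over a graph in CSR form (row_offsets,
// column_indices). Each thread expands one vertex of the current frontier.
// Illustrative sketch only; not code from the thesis.
__global__ void bfs_level(const int *row_offsets, const int *column_indices,
                          const int *frontier, int frontier_size,
                          int *levels, int *next_frontier, int *next_size,
                          int current_level)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;

    int v = frontier[i];
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int u = column_indices[e];
        // Claim unvisited neighbours atomically; irregular, data-dependent
        // accesses like these make graph workloads hard to parallelize well.
        if (atomicCAS(&levels[u], -1, current_level + 1) == -1) {
            int pos = atomicAdd(next_size, 1);
            next_frontier[pos] = u;
        }
    }
}
```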
|
44 |
Architectures and limits of GPU-CPU heterogeneous systems. Wong, Henry Ting-Hei, 11 1900 (has links)
As we continue to be able to put an increasing number of transistors on a single chip, the answer to the perpetual question of what is the best processor we could build with those transistors remains uncertain.
Past work has shown that heterogeneous multiprocessor systems provide benefits in performance and efficiency. This thesis explores heterogeneous systems composed of a traditional sequential processor (CPU) and highly parallel graphics processors (GPU). This thesis presents a tightly-coupled heterogeneous chip multiprocessor architecture for general-purpose non-graphics computation and a limit study exploring the potential benefits of GPU-like cores for accelerating a set of general-purpose workloads.
Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with GMA X4500 GPU cores. Pangaea introduces a resource partitioning of the GPU, where 3D graphics-specific hardware is removed to reduce area or add more processing cores, and a 3-instruction extension to the IA32 ISA that supports fast communication between CPU and GPU by building user-level interrupts on top of existing cache coherency mechanisms.
By removing graphics-specific hardware on a 65 nm process, the area saved is equivalent to 9 GPU cores, while the power saved is equivalent to 5 cores. Our FPGA prototype shows thread spawn latency improvements from thousands of clock cycles down to 26. A set of non-graphics workloads demonstrates speedups of up to 8.8x.
This thesis also presents a limit study, where we measure the limit of algorithm parallelism in the context of a heterogeneous system that can be usefully extracted from a set of general-purpose applications. We measure sensitivity to the sequential performance (register read-after-write latency) of the low-cost parallel cores, and latency and bandwidth of the communication channel between the two cores. Using these measurements, we propose system characteristics that maximize area and power efficiencies.
As in previous limit studies, we find a high amount of parallelism. We show, however, that the potential speedup on GPU-like systems is low (2.2x - 12.7x) due to poor sequential performance. Communication latency and bandwidth have comparatively small performance effects (<25%). Optimal area efficiency requires a lower-cost parallel processor while optimal power efficiency requires a higher-performance parallel processor than today's GPUs.
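The quantitative finding above can be given a rough intuition. This is a textbook-style Amdahl estimate with assumed numbers, not the thesis's limit-study model: if a fraction p of the work parallelizes perfectly across N GPU-like cores whose per-core performance is a fraction r of the CPU's, the speedup is

\[
S(N, r) = \frac{1}{(1 - p) + \dfrac{p}{N\,r}} .
\]

With, say, p = 0.95, N = 32, and r = 0.1, this gives S ≈ 2.9, illustrating how low per-core sequential performance (small r) keeps speedups in the single digits even when abundant parallelism is available.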
|
45 |
Evaluation of Computer Vision Algorithms Optimized for Embedded GPUs / Utvärdering av bildbehandlingsalgoritmer optimerade för inbyggda GPU:er. Nilsson, Mattias, January 2014 (has links)
The interest in using GPUs as general processing units for heavy computations (GPGPU) has increased in the last couple of years. Manufacturers such as Nvidia and AMD make GPUs powerful enough to outperform CPUs by an order of magnitude for suitable algorithms. For embedded systems, GPUs are not yet as popular. The embedded GPUs available on the market have often not been able to justify hardware changes from current systems (CPUs and FPGAs) to systems using embedded GPUs: they have been too hard to obtain, too energy-consuming, and unsuitable for some algorithms. At SICK IVP, advanced computer vision algorithms run on FPGAs. This master's thesis optimizes two such algorithms for embedded GPUs and evaluates the result. It also evaluates the status of the embedded GPUs on the market today. The results indicate that embedded GPUs perform well enough to run the evaluated algorithms as fast as needed. The implementations are also easier to understand than implementations for FPGAs, the competing hardware.
|
46 |
GPGPU-Sim / A study on GPGPU-Sim. Andersson, Filip, January 2014 (has links)
This thesis studies the impact of graphics-card hardware features on GPU computing performance using the GPGPU-Sim simulation tool. GPU computing is a growing topic in the world of computing and could be an important milestone for computers. A study that seeks to identify the performance bottlenecks of a program with respect to the hardware parameters of the device is therefore an important step towards tuning devices for higher efficiency. In this work we selected a convolution algorithm, a typical GPGPU application, and conducted several tests to study different performance parameters. These tests were performed on two simulated graphics cards (NVIDIA GTX480, NVIDIA Tesla C2050), which are supported by GPGPU-Sim. By changing hardware parameters of the graphics card such as memory cache sizes, frequency, and the number of cores, we can make a fine-grained analysis of the effect of these parameters on the performance of the program. A graphics card working on an image convolution task relies on the L1 cache, and performs worst with a small shared memory. Using this simulator to run performance tests on a theoretical GPU architecture could lead to better GPU designs for embedded systems.
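The thesis's convolution benchmark is not reproduced here; as a generic, hedged sketch of why shared-memory capacity shows up so strongly in the simulation results, the CUDA kernel below stages an image tile plus halo in shared memory so that neighbouring threads reuse loads instead of refetching through the L1 cache.

```cuda
#define TILE   16   // width/height of the output tile computed per block
#define RADIUS 1    // 3x3 stencil

// Tiled 2D convolution with a 3x3 kernel. Each block stages a
// (TILE + 2*RADIUS)^2 input tile in shared memory; if shared memory is too
// small for the tile, data must be refetched from L1/global memory instead.
// Generic sketch only, not the benchmark code used in the thesis.
__global__ void conv3x3(const float *in, float *out, const float *kern,
                        int width, int height)
{
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    int base_x = blockIdx.x * TILE;   // top-left corner of this block's output tile
    int base_y = blockIdx.y * TILE;

    // Cooperative load of the tile plus halo, clamping reads at the border.
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += blockDim.y)
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += blockDim.x) {
            int gx = min(max(base_x + dx - RADIUS, 0), width - 1);
            int gy = min(max(base_y + dy - RADIUS, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    int out_x = base_x + threadIdx.x;
    int out_y = base_y + threadIdx.y;
    if (out_x < width && out_y < height) {
        float sum = 0.0f;
        for (int ky = -RADIUS; ky <= RADIUS; ++ky)
            for (int kx = -RADIUS; kx <= RADIUS; ++kx)
                sum += kern[(ky + RADIUS) * 3 + (kx + RADIUS)]
                     * tile[threadIdx.y + RADIUS + ky][threadIdx.x + RADIUS + kx];
        out[out_y * width + out_x] = sum;
    }
}
// Launch with dim3 block(TILE, TILE) and
// dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE).
```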
|
48 |
Multiscale FWI: a GPU implementation / FWI multiescala: uma implementação em GPU. Ramalho, Victor Koehene, 05 March 2018 (has links)
Full-waveform inversion (FWI) is nowadays one of the main tools for estimating high-resolution subsurface velocity models. In this dissertation, time-domain FWI is introduced from an algorithmic point of view, in the following sequence: seismic modeling, reverse time migration (RTM), and FWI. It is shown that RTM, from a computational point of view, is equivalent to two seismic modeling processes, or three if the effective-boundary implementation is used.
The approach of FWI as an iterative problem, which aims to minimize the seismic data residual, shows that the gradient of each iteration, obtained via the adjoint-state method, is equivalent to the RTM of the residual. In this manner, using RTM with effective boundaries and a step-length estimation method, it is shown that one iteration of FWI is computationally equivalent to four seismic modeling processes.
It is known that seismic modeling is a highly intensive computational process, and among the techniques to mitigate this cost the use of parallel computing stands out. In this dissertation we chose parallelization on the graphics processing unit (GPU), which has high floating-point computation capability but low efficiency in data transfer. These characteristics fit the problem of wave-field extrapolation in time very well, especially the calculation of the Laplacian of the acoustic wave equation at each point of the model, whose computational cost, when parallelized on the GPU, more than compensates for the CPU-GPU data transfers inherent to the problem.
The GPU, however, has a memory constraint, typically 2, 5, or 12 GB. In this dissertation we therefore focused on techniques that allow memory savings in exchange for extra processing. Among these implementations, we highlight the use of effective boundaries in RTM and the rapid expansion method (REM) for time extrapolation, which allows marching at longer time steps, reducing the total number of time samples that need to be stored and transferred.
The GPU implementation also made it possible to test, in a timely manner, some of the most important factors influencing FWI. We tested, on a regular grid: four modeling operators (finite differences (FD), pseudo-spectral (PS), REM-FD, and REM-PS); the absorbing boundary conditions taper and perfectly matched layer (PML); and the inversion methods L-BFGS and nonlinear conjugate gradient (NLCG). The results of these tests allowed us to select the best criteria for running FWI on synthetic velocity models with complex geology. The use of the multiscale methodology, essential to avoid convergence to local minima, together with an extra modeling step to ensure an adequate step length, allowed final results of high resolution to be obtained for the three tested models.
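As a minimal sketch (assuming a simple second-order scheme, not the dissertation's actual operators) of the kernel that dominates modeling cost, the CUDA code below performs one explicit time step of the 2D acoustic wave equation, evaluating the Laplacian at every grid point.

```cuda
// Minimal sketch of the kernel that dominates FWI modeling cost: one explicit
// time step of the 2D acoustic wave equation with a second-order 5-point
// Laplacian (production codes use high-order FD, REM, or pseudo-spectral
// operators). Not the dissertation's code.
__global__ void acoustic_step(const float *p_prev, const float *p_cur,
                              float *p_next, const float *vel2_dt2,
                              int nx, int nz, float inv_dx2, float inv_dz2)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iz = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix <= 0 || ix >= nx - 1 || iz <= 0 || iz >= nz - 1) return;

    int idx = iz * nx + ix;

    // 5-point Laplacian of the current pressure field.
    float lap = (p_cur[idx - 1] - 2.0f * p_cur[idx] + p_cur[idx + 1]) * inv_dx2
              + (p_cur[idx - nx] - 2.0f * p_cur[idx] + p_cur[idx + nx]) * inv_dz2;

    // Leapfrog update: p(t+dt) = 2*p(t) - p(t-dt) + v^2 * dt^2 * Laplacian(p).
    p_next[idx] = 2.0f * p_cur[idx] - p_prev[idx] + vel2_dt2[idx] * lap;
}
// On the host, the three pressure arrays are rotated each time step, keeping
// all wavefields resident in GPU memory to minimize CPU-GPU transfers.
```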
|
49 |
Techniques of design optimisation for algorithms implemented in software. Hopson, Benjamin Thomas Ken, January 2016 (has links)
The overarching objective of this thesis was to develop tools for parallelising, optimising, and implementing algorithms on parallel architectures, in particular General Purpose Graphics Processors (GPGPUs). Two projects were chosen from different application areas in which GPGPUs are used: a defence application involving image compression, and a modelling application in bioinformatics (computational immunology). Each project had its own specific objectives, as well as supporting the overall research goal.

The defence / image compression project was carried out in collaboration with the Jet Propulsion Laboratory. The specific questions were: to what extent an algorithm designed for bit-serial hardware implementation, for the lossless compression of hyperspectral images on board unmanned aerial vehicles (UAVs), could be parallelised; whether GPGPUs could be used to implement that algorithm; and whether a software implementation, with or without GPGPU acceleration, could match the throughput of a dedicated hardware (FPGA) implementation. The dependencies within the algorithm were analysed and the algorithm parallelised. The algorithm was implemented in software for GPGPU and optimised. During the optimisation process, profiling revealed less than optimal device utilisation, but no further optimisations resulted in an improvement in speed: the design had hit a local maximum of performance. Analysis of the arithmetic intensity and data flow exposed flaws in kernel occupancy, the standard metric used for GPU optimisation. Redesigning the implementation with revised criteria (fused kernels, lower occupancy, and greater data locality) led to a new implementation with 10x higher throughput. GPGPUs were shown to be viable for on-board implementation of the CCSDS lossless hyperspectral image compression algorithm, exceeding the performance of the hardware reference implementation and providing sufficient throughput for the next generation of image sensor as well.

The second project was carried out in collaboration with biologists at the University of Arizona and involved modelling a complex biological system: VDJ recombination, involved in the formation of T-cell receptors (TCRs). Generation of immune receptors (T-cell receptors and antibodies) by VDJ recombination is an enormously complex process, which can theoretically synthesize greater than 10^18 variants. Originally thought to be a random process, the underlying mechanisms clearly have a non-random nature that preferentially creates a small subset of immune receptors in many individuals. Understanding this bias is a longstanding problem in the field of immunology. Modelling the process of VDJ recombination to determine the number of ways each immune receptor can be synthesized, previously thought to be untenable, is a key first step in determining how this special population is made. The computational tools developed in this thesis have allowed immunologists for the first time to comprehensively test and invalidate a longstanding theory (convergent recombination) for how this special population is created, while generating the data needed to develop novel hypotheses.
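As a hedged illustration of the fused-kernel redesign described above (the stages shown are toy examples, not the CCSDS compression pipeline), the CUDA sketch below contrasts two separate kernels with a fused one that keeps the intermediate value on-chip, trading some occupancy for data locality.

```cuda
// Unfused version: two passes over global memory with an intermediate array.
__global__ void scale(const float *in, float *tmp, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * in[i];
}
__global__ void offset(const float *tmp, float *out, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + b;
}

// Fused version: one pass, no intermediate array. Each thread may use more
// registers (potentially lowering occupancy), but the intermediate value
// stays on-chip, improving data locality and cutting global-memory traffic.
__global__ void scale_offset_fused(const float *in, float *out,
                                   float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i] + b;
}
```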
|