  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Accelerating BGV Scheme of Fully Homomorphic Encryption Using GPUs

Dong, Jiyang 27 April 2016 (has links)
After Gentry designed the first plausible fully homomorphic encryption (FHE) scheme, interest in building a practical FHE scheme has kept increasing. This thesis presents an engineering study of accelerating FHE under the BGV scheme and demonstrates the feasibility of implementing certain parts of HElib on a GPU. The BGV scheme is an RLWE-based FHE scheme that introduces a set of algorithms for polynomial arithmetic, and the encryption scheme operates over a finite field. Accelerating large-polynomial arithmetic with efficient modular reduction is therefore the most crucial part of our research effort. Note that our implementation does not yet include noise management, so all of the work is still at the stage of somewhat homomorphic encryption (SWHE). Finally, our implementation of the encryption procedure, compared with HElib compiled against version 9.3.0 of the NTL library on a Xeon CPU, achieves a 3.4x speedup on a platform with a GTX 780 Ti GPU.
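The polynomial arithmetic this entry refers to can be illustrated with a toy sketch (not HElib's implementation): schoolbook multiplication in the ring Z_q[x]/(x^n + 1), the operation that dominates RLWE-based schemes such as BGV and is the natural target for GPU parallelization. The parameters n and q below are illustrative toy values, far smaller than real BGV parameters.

```python
def polymul_negacyclic(a, b, n, q):
    """Multiply two degree-<n polynomials mod (x^n + 1) with coefficients mod q."""
    res = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                res[k] = (res[k] + ai * bj) % q          # ordinary term
            else:
                res[k - n] = (res[k - n] - ai * bj) % q  # x^n = -1: wrap with sign flip
    return res

if __name__ == "__main__":
    n, q = 4, 97
    # x^3 * x = x^4 = -1 mod (x^4 + 1), i.e. q - 1 in the constant coefficient
    print(polymul_negacyclic([0, 0, 0, 1], [0, 1, 0, 0], n, q))  # [96, 0, 0, 0]
```

A real implementation replaces this O(n^2) loop with an NTT-based multiply, which is where the GPU speedup comes from.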
12

Background Estimation with GPU Speed Up

Chen, Xida 11 1900 (has links)
Given a set of images of the same scene in which occlusions are present, background estimation outputs an image containing only the stationary objects in the scene. It is an important step in many computer vision problems, such as object detection and recognition, and with the growing interest in more sophisticated video surveillance systems, the accuracy requirements for background estimation increase as well. In this thesis, we present two novel methods with the same fundamental objective: to estimate the background of a set of related images. To make our methods more general, we assume that the input images may be taken either from the same viewpoint or from different viewpoints. Both methods combine information from multiple input images by selecting appropriate pixels to construct the background. Our first method is a scanline energy optimization method, and our second is based on graph-cuts optimization. We apply the two methods to datasets with different characteristics, and the results are encouraging. Furthermore, we use CUDA (Compute Unified Device Architecture) to exploit the parallel processing power of the GPU (Graphics Processing Unit). In particular, we implement an efficient graph-based image segmentation algorithm as well as a linear blending method in CUDA for acceleration, both of which are used in our first method. Our GPU implementation achieves speedups of up to 20x.
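The pixel-selection idea underlying this entry can be shown with a much simpler baseline than the thesis's energy-optimization and graph-cuts methods: from a fixed viewpoint, the temporal median at each pixel recovers the stationary scene whenever each pixel is unoccluded in most frames. This sketch is ours, for illustration only.

```python
from statistics import median

def median_background(frames):
    """frames: list of equally sized 2-D grayscale images (lists of rows)."""
    h, w = len(frames[0]), len(frames[0][0])
    return [[median(f[y][x] for f in frames) for x in range(w)]
            for y in range(h)]

if __name__ == "__main__":
    # Three 1x3 "frames": a moving occluder (value 255) passes over a
    # static background of 10s; the per-pixel median removes it.
    frames = [[[255, 10, 10]], [[10, 255, 10]], [[10, 10, 255]]]
    print(median_background(frames))  # [[10, 10, 10]]
```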
13

Perspective-Driven Radiosity on Graphics Hardware

Bozalina, Justin Taylor 2011 May 1900 (has links)
Radiosity is a global illumination algorithm used by artists, architects, and engineers for its realistic simulation of lighting. Since the illumination model is global, complexity and run time grow as larger environments are provided. Algorithms exist which generate an incremental result and provide weighting based on the user's view of the environment. This thesis introduces an algorithm for directing and focusing radiosity calculations relative to the user's point-of-view and within the user's field-of-view, generating visually interesting results for a localized area more quickly than a traditional global approach. The algorithm, referred to as perspective-driven radiosity, is an extension of the importance-driven radiosity algorithm, which itself is an extension of the progressive refinement radiosity algorithm. The software implemented during research into the point-of-view/field-of-view-driven algorithm can demonstrate both of these algorithms, and can generate results for arbitrary geometry. Parameters can be adjusted by the user to provide results that favor speed or quality. To take advantage of the scalability of programmable graphics hardware, the algorithm is implemented as an extension of progressive refinement radiosity on the GPU, using OpenGL and GLSL. Results from each of the three implemented radiosity algorithms are compared using a variety of geometry.
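The progressive refinement algorithm that this entry extends can be sketched in a few lines: at each iteration, the patch with the most unshot energy "shoots" it to every other patch, weighted by form factors. The form factors and reflectances below are made-up toy values (a real implementation computes form factors from geometry, and the thesis additionally weights the shooting order by the user's point of view).

```python
def shoot_once(radiosity, unshot, areas, reflect, ff):
    """One progressive-refinement shooting step; ff[i][j] is the form factor i->j."""
    # Pick the patch with the most unshot energy (power = radiosity * area).
    i = max(range(len(unshot)), key=lambda k: unshot[k] * areas[k])
    shot = unshot[i]
    unshot[i] = 0.0
    for j in range(len(radiosity)):
        if j == i:
            continue
        # Energy received by patch j, scaled by its reflectance.
        dr = reflect[j] * shot * ff[i][j] * areas[i] / areas[j]
        radiosity[j] += dr
        unshot[j] += dr
    return i

if __name__ == "__main__":
    B, U = [1.0, 0.0], [1.0, 0.0]   # patch 0 is an emitter
    shoot_once(B, U, [1.0, 1.0], [0.5, 0.5], [[0.0, 0.5], [0.5, 0.0]])
    print(B)  # [1.0, 0.25]
```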
14

A hybrid fluid simulation on the Graphics Processing Unit (GPU)

Flannery, Rebecca Lynn 10 October 2008 (has links)
This thesis presents a method to implement a hybrid particle/grid fluid simulation on graphics hardware. The goal is to speed up the simulation by exploiting the parallelism of the graphics processing unit (GPU). The Fluid Implicit Particle method is adapted to the programming style of the GPU, and the methods were implemented on a current-generation graphics card. The GPU-based program exhibited a small speedup over its CPU-based counterpart.
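The particle-to-grid transfer at the heart of the Fluid Implicit Particle (FLIP) method can be illustrated in 1-D: particle velocities are splatted onto grid nodes with linear weights, forces and incompressibility are solved on the grid, and the velocity change is interpolated back to the particles. The function below shows only the scatter step; names and parameters are ours, not the thesis's GPU implementation.

```python
def particles_to_grid(positions, velocities, nx, dx):
    """Linear (hat-function) scatter of particle velocities onto nx grid nodes."""
    momentum = [0.0] * nx
    weight = [0.0] * nx
    for p, v in zip(positions, velocities):
        i = int(p / dx)
        f = p / dx - i                       # fractional offset within the cell
        for node, w in ((i, 1.0 - f), (i + 1, f)):
            if 0 <= node < nx and w > 0.0:
                momentum[node] += w * v
                weight[node] += w
    # Normalize accumulated momentum by accumulated weight.
    return [m / w if w > 0.0 else 0.0 for m, w in zip(momentum, weight)]

if __name__ == "__main__":
    # One particle halfway between nodes 1 and 2 contributes equally to both.
    print(particles_to_grid([1.5], [2.0], nx=4, dx=1.0))  # [0.0, 2.0, 2.0, 0.0]
```

On a GPU this scatter is the tricky step, since many particles may write to the same node and need atomic accumulation.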
15

GPU parallelization of the Mishchenko method for solving Fredholm equations of the first kind

Nordström, Johan January 2015 (has links)
Fredholm integral equations of the first kind are known to be ill-posed and may be impossible to solve analytically. A. S. Mishchenko et al. have developed a method to generate numerical solutions to Fredholm equations that occur in physics. Mishchenko's method is a Monte Carlo method that can run in parallel. The purpose of this project was to investigate how a parallel version of the Mishchenko method can be implemented on a Graphics Processing Unit (GPU). The developed program uses the CUDA platform for GPU programming. The conclusion of the project is that it is certainly possible to implement the Mishchenko method on a GPU; however, some properties of the algorithm are not optimal for the GPU, and a more thorough analysis of the implementation is needed for a complete understanding of its performance and bottlenecks.
16

GPU based IP forwarding

Blomquist, Linus, Engström, Hampus January 2015 (has links)
This thesis investigates whether it is feasible to implement an IP-forwarding data plane on a GPU. A GPU is energy-efficient compared to other, more powerful processors on the market today and should in theory be efficient for routing purposes. An IP-forwarding data plane consists of several components, of which we focused on a subset: IP-forwarding lookup operations, packet header rewriting, prioritization between packets, and a traffic shaper to restrict packet throughput. To test these concepts we implemented a prototype in CUDA on a Tegra platform and evaluated its performance. We are able to forward 28 Mpackets/second with a best-case latency of 27 µs given locally simulated packets. The conclusion we can draw from this thesis work is that using a GPU for IP forwarding appears to be an energy-efficient solution compared to other routers on the market today. We also tried launching the GPU kernel only once and letting it run indefinitely, which shows promising results for future work.
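The lookup operation this entry parallelizes is longest-prefix match (LPM): each packet's destination address is matched against the most specific routing prefix that covers it. A minimal sequential sketch using a binary trie over IPv4 prefixes is below; the thesis runs many such lookups in parallel in a CUDA kernel, and the prefixes and next-hop names here are invented examples.

```python
def ip_to_bits(ip):
    """Dotted-quad IPv4 address -> 32-character bit string."""
    a, b, c, d = (int(x) for x in ip.split("."))
    return format((a << 24) | (b << 16) | (c << 8) | d, "032b")

class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, prefix, length, next_hop):
        node = self.root
        for bit in ip_to_bits(prefix)[:length]:
            node = node.setdefault(bit, {})
        node["hop"] = next_hop

    def lookup(self, ip):
        node, best = self.root, None
        for bit in ip_to_bits(ip):
            if "hop" in node:
                best = node["hop"]          # remember the longest match so far
            if bit not in node:
                break
            node = node[bit]
        return node.get("hop", best)

if __name__ == "__main__":
    t = Trie()
    t.insert("10.0.0.0", 8, "eth0")
    t.insert("10.1.0.0", 16, "eth1")        # more specific route wins
    print(t.lookup("10.1.2.3"))   # eth1
    print(t.lookup("10.2.2.3"))   # eth0
```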
17

Orchestrating thread scheduling and cache management to improve memory system throughput in throughput processors

Li, Dong, active 21st century 10 July 2014 (has links)
Throughput processors such as GPUs continue to provide higher peak arithmetic capability. Designing a high-throughput memory system to keep the computational units busy is very challenging. Future throughput processors must continue to exploit data locality and utilize on-chip and off-chip memory system resources more effectively to further improve memory system throughput. This dissertation advocates orchestrating the thread scheduler with the cache management algorithms to alleviate GPU cache thrashing and pollution, avoid bandwidth saturation, and maximize GPU memory system throughput. Based on this principle, this thesis work proposes three mechanisms to improve cache efficiency and memory throughput. This thesis work enhances the thread throttling mechanism with the Priority-based Cache Allocation mechanism (PCAL). By estimating the cache miss ratio with a variable number of cache-feeding threads and monitoring the usage of key memory system resources, PCAL determines the number of threads that share the cache and the minimum number of cache-bypassing threads that saturate memory system resources. This approach reduces the cache thrashing problem and effectively employs chip resources that would otherwise go unused by a pure thread throttling approach. We observe a 67% improvement over the original as-is benchmarks and an 18% improvement over a better-tuned warp-throttling baseline. This work proposes the AgeLRU and Dynamic-AgeLRU mechanisms to address the inter-thread cache thrashing problem. AgeLRU prioritizes cache blocks at replacement based on the scheduling priority of their fetching warp. Dynamic-AgeLRU selects between the AgeLRU algorithm and the LRU algorithm adaptively to avoid degrading the performance of non-thrashing applications. There are three variants of the AgeLRU algorithm: (1) replacement-only, (2) bypassing, and (3) bypassing with traffic optimization.
Compared to the LRU algorithm, the three variants of the AgeLRU algorithm enable performance increases of 4%, 8%, and 28%, respectively, across a set of cache-sensitive benchmarks. This thesis work develops the Reuse-Prediction-based cache Replacement scheme (RPR) for the GPU L1 data cache to address the intra-thread cache pollution problem. By combining the GPU thread scheduling priority with the fetching Program Counter (PC) to generate a signature as the index of the prediction table, RPR identifies and prioritizes near-reuse and high-reuse blocks to maximize cache efficiency. Compared to the AgeLRU algorithm, the experimental results show that RPR yields a throughput improvement of 5% on average for regular applications, and a speedup of 3.2% on average across a set of cache-sensitive benchmarks. The techniques proposed in this dissertation alleviate the cache thrashing, cache pollution, and resource saturation problems effectively. We believe that, when combined, these techniques will synergistically further improve GPU cache efficiency and overall memory system throughput.
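The AgeLRU idea, as we read it, can be modeled in software: on replacement, prefer to evict blocks fetched by younger (lower-scheduling-priority) warps, so the working sets of older warps survive inter-warp thrashing. This toy model is ours, not the dissertation's hardware design; the victim rule and the tie-break on LRU age are our assumptions.

```python
class AgeLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}   # addr -> (warp_age, lru_tick); larger warp_age = older warp
        self.tick = 0

    def access(self, addr, warp_age):
        """Touch a block; returns True on hit, False on miss (with insertion)."""
        self.tick += 1
        hit = addr in self.blocks
        if not hit and len(self.blocks) >= self.capacity:
            # Victim selection: youngest fetching warp first, then least recently used.
            victim = min(self.blocks,
                         key=lambda a: (self.blocks[a][0], self.blocks[a][1]))
            del self.blocks[victim]
        self.blocks[addr] = (warp_age, self.tick)
        return hit

if __name__ == "__main__":
    c = AgeLRUCache(capacity=2)
    c.access("A", warp_age=3)   # fetched by an old warp
    c.access("B", warp_age=1)   # fetched by a young warp
    c.access("C", warp_age=2)   # evicts B (youngest warp), not the LRU block A
    print(c.access("A", warp_age=3))  # True: A survived
    print(c.access("B", warp_age=1))  # False: B was evicted
```

Under plain LRU the third access would have evicted A instead, hurting the oldest warp; that difference is the behavior AgeLRU targets.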
18

Interactive simulation and visualization of complex physics problems using the GPU

Zhao, Cailu Unknown Date
No description available.
20

An in-depth performance analysis of irregular workloads on VLIW APU

Doerksen, Matthew 26 May 2014 (has links)
Heterogeneous multi-core architectures have a higher performance/power ratio than traditional homogeneous architectures. Due to their heterogeneity, these architectures support diverse applications, but developing parallel algorithms for them can be difficult. Implementing algorithms for heterogeneous systems often requires proprietary languages, limiting portability. Although general-purpose graphics processing units (GPUs) have shown great promise in accelerating throughput computing applications, they are still limited by the memory wall, which can greatly affect application performance for problems that exhibit amorphous parallelism or irregular workloads. AMD's Fusion series of Accelerated Processing Units (APUs) attempts to solve this problem. The APU is a radical change from the traditional systems of a few years ago: it couples a capable CPU with a powerful, compute-capable GPU built on a Very Long Instruction Word (VLIW) architecture. In this thesis, I address the suitability of irregular-workload problems on APU architectures. I consider four scientific computing problems of varying characteristics and map them onto the architectural features of the APU. I develop several software optimizations for each problem by making effective use of VLIW static scheduling through techniques such as loop unrolling and vectorization. Using AMD's OpenCL profiler, I analyze the execution of the various optimizations and provide an in-depth performance analysis using metrics such as kernel occupancy, ALUFetchRatio, ALUBusy percentage, and ALUPacking. Finally, I show the effect of register pressure due to vectorization and the limitations of the APU architecture for irregular workloads.
