61 |
Global address spaces for efficient resource provisioning in the data center. Young, Jeffrey Scott. 13 January 2014.
The rise of large data sets, or "Big Data", has coincided with the rise of clusters with large amounts of memory and GPU accelerators that can be used to process rapidly growing data footprints. However, the complexity and performance limitations of sharing memory and accelerators in a cluster limit the options for efficient management and allocation of resources for applications. The global address space model (GAS), and specifically hardware-supported GAS, is proposed as a means to provide a high-performance resource management platform upon which resource sharing between nodes and resource aggregation across nodes
can take place. This thesis builds on the initial concept of GAS with a model that is matched to "Big Data" computing and its data transfer requirements.
The proposed model, Dynamic Partitioned Global Address Spaces (DPGAS), is implemented using a commodity converged interconnect, HyperTransport over Ethernet (HToE), and a software framework, the Oncilla runtime and API. The DPGAS model and associated hardware and software components are used to investigate two application spaces, resource sharing for time-varying workloads and
resource aggregation for GPU-accelerated data warehousing applications. This work demonstrates that hardware-supported GAS can be used to improve the performance and reduce the power consumption of memory-intensive applications, and that it can be used to simplify host and accelerator resource management in the data center.
|
62 |
Cellular GPU Models to Euclidean Optimization Problems: Applications from Stereo Matching to Structured Adaptive Meshing and Traveling Salesman Problem. ZHANG, Naiyu. 02 December 2013.
The work presented in this PhD thesis studies and proposes cellular parallel computation models able to address different types of NP-hard optimization problems defined in the Euclidean space, and their implementation on the Graphics Processing Unit (GPU) platform. The goal is both to handle large problem instances and to provide substantial acceleration through massive parallelism. The field of applications concerns vehicle-embedded systems for stereovision as well as transportation problems in the plane, such as vehicle routing problems. The main characteristic of the cellular model is that it decomposes the plane into an appropriate number of cellular units, each responsible for a constant part of the input data, such that each cell corresponds to a single processing unit. Hence, the number of processing units and the required memory grow linearly with the size of the optimization problem, which makes the model able to deal with very large problem instances.

The effectiveness of the proposed cellular models has been tested on the GPU parallel platform on four applications. The first application is a stereo-matching problem in color stereovision. The input is a stereo image pair, and the output is a disparity map that represents depths in the 3D scene. The goal is to implement and compare GPU and CPU winner-takes-all local dense stereo-matching methods dealing with CFA (color filter array) image pairs. The second application focuses on the GPU improvements able to reach near real-time stereo-matching computation. The third and fourth applications deal with a cellular GPU implementation of the self-organizing map neural network in the plane. The third application concerns structured mesh generation according to the disparity map, to allow compressed representation of 3D surfaces. The fourth application addresses large Euclidean traveling salesman problems (TSP) with up to 33708 cities.

In all applications, the GPU implementations achieve substantial acceleration over the CPU versions as the problem size increases, for similar or higher quality results. The GPU speedup over the CPU is about 20 times for CFA image pairs, with a GPU computation time of about 0.2 s for a small image pair from the Middlebury database. The near real-time stereovision algorithm takes about 0.017 s for a small image pair, which is one of the fastest entries in the Middlebury benchmark at moderate quality. The structured mesh generation is evaluated on the Middlebury data set to gauge the GPU acceleration factor and the quality obtained. The acceleration factor of the GPU parallel self-organizing map over the CPU version is about 30 times on the largest TSP instance with 33708 cities.
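To make the one-cell-per-processing-unit mapping concrete, the following CUDA sketch assigns one GPU thread to each cell of a regular decomposition of the plane. The Cell layout and the per-cell operation (a simple centroid of the points falling in the cell) are hypothetical stand-ins for illustration only, not the kernels developed in the thesis.

```cuda
// One thread per cell of the planar decomposition; work and memory per cell
// are bounded by a constant, so resources grow linearly with problem size.
#include <cuda_runtime.h>

struct Cell {
    float2 pts[8];   // points assigned to this cell (capacity 8 is an assumption)
    int    count;    // number of slots actually used
};

__global__ void processCells(const Cell* cells, float2* out, int numCells)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numCells) return;

    const Cell& cell = cells[c];
    float2 acc = make_float2(0.f, 0.f);
    for (int i = 0; i < cell.count; ++i) {   // constant-bounded work per cell
        acc.x += cell.pts[i].x;
        acc.y += cell.pts[i].y;
    }
    if (cell.count > 0) {
        acc.x /= cell.count;
        acc.y /= cell.count;
    }
    out[c] = acc;   // one result per cell
}
```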
|
63 |
Contributions to parallel stochastic simulation: application of good software engineering practices to the distribution of pseudorandom streams in hybrid Monte Carlo simulations / Contributions à la simulation stochastique parallèle : architectures logicielles pour la distribution de flux pseudo-aléatoires dans les simulations Monte Carlo sur CPU/GPU. Passerat-Palmbach, Jonathan. 11 October 2013.
The race for computing power intensifies every day in the simulation community. A few years ago, scientists started to harness the computing power of Graphics Processing Units (GPUs) to parallelize their simulations. As with any parallel architecture, not only does the simulation model implementation have to be ported to the new platform, but all the supporting tools must be reimplemented as well. In the particular case of stochastic simulations, one of the major elements of the implementation is the source of pseudorandom numbers. Employing pseudorandom numbers in parallel applications is not a straightforward task, and it has to be done with caution in order not to introduce bias into the results of the simulation. This problem, known as pseudorandom stream distribution, has been studied ever since parallel architectures became available. While the literature is full of solutions for handling pseudorandom stream distribution on CPU-based parallel platforms, the younger GPU programming community does not yet have the same body of experience.

In this thesis, we study how to correctly distribute pseudorandom streams on GPUs. From the existing solutions, we identified a need for good software engineering practice coupled with sound theoretical choices in the implementation. We propose a set of guidelines to follow when a PRNG has to be ported to the GPU, and put this advice into practice in a software library called ShoveRand. This library is used in a stochastic polymer folding model that we have implemented in C++/CUDA. Pseudorandom stream distribution on manycore architectures is also one of our concerns. It resulted in a contribution named TaskLocalRandom, which targets parallel Java applications that use pseudorandom numbers and task frameworks.

Finally, we share a reflection on how to choose the right parallel platform for a given application. To this end, we propose automatically building prototypes of the parallel application that run on a wide range of architectures. This approach relies on existing software engineering tools from the Java and Scala communities, most of them generating OpenCL source code from a high-level abstraction layer.
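As a minimal illustration of pseudorandom stream distribution on a GPU (using NVIDIA's cuRAND device API for the sketch; the ShoveRand interface itself is not reproduced here), each thread below is initialized on its own subsequence of a shared generator, so no two threads draw from the same stream:

```cuda
#include <curand_kernel.h>

__global__ void initStates(curandState* states, unsigned long long seed, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    // Same seed everywhere, but a distinct subsequence per thread,
    // which is what keeps the per-thread streams from overlapping.
    curand_init(seed, /*subsequence=*/tid, /*offset=*/0, &states[tid]);
}

__global__ void sample(curandState* states, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    curandState local = states[tid];    // work on a register copy
    out[tid] = curand_uniform(&local);  // one draw from this thread's stream
    states[tid] = local;                // save the state for the next kernel call
}
```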
|
64 |
Enhancing GPGPU Performance through Warp Scheduling, Divergence Taming and Runtime Parallelizing Transformations. Anantpur, Jayvant P. January 2017.
There has been tremendous growth in the use of Graphics Processing Units (GPUs) for the acceleration of general-purpose applications. The growth is primarily due to the huge computing power offered by GPUs and the emergence of programming languages such as CUDA and OpenCL. A typical GPU consists of several hundred to a few thousand Single Instruction Multiple Data (SIMD) cores, organized as tens of Streaming Multiprocessors (SMs), each having several SIMD cores which operate in lock-step, offering a few TeraFLOPS of performance in a single socket. SMs execute instructions from groups of consecutive threads, called warps. At each cycle, an SM schedules a warp from a group of active warps and can context switch among the active warps to hide various stalls. However, various factors, such as global memory latency, divergence among warps of a thread block (TB), branch divergence among threads of a warp (control divergence), and the number of active warps, can significantly impact the ability of a warp scheduler to hide stalls. This reduces the speedup of applications running on the GPU. Further, applications containing loops with potential cross-iteration dependences do not utilize the available resources (SIMD cores) effectively and hence suffer in terms of performance. In this thesis, we propose several mechanisms which address these issues and enhance the performance of GPU applications through efficient warp scheduling, taming branch and warp divergence, and runtime parallelization.
First, we propose RLWS, a Reinforcement Learning (RL) based warp scheduler which uses unsupervised learning to schedule warps based on the current state of the core and the long-term benefits of scheduling actions. As the design space involving the state variables used by the RL and the RL parameters (such as learning and exploration rates, reward and penalty values) is large, we use a Genetic Algorithm to identify the useful subset of state variables and RL parameter values. We evaluated the proposed RL-based scheduler using the GPGPU-SIM simulator on a large number of applications from the Rodinia, Parboil, CUDA-SDK and GPGPU-SIM benchmark suites. Our RL-based implementation achieved an average speedup of 1.06x over the Loose Round Robin (LRR) strategy and 1.07x over the Two-Level (TL) strategy. A salient feature of RLWS is that it is robust, i.e., it performs nearly as well as the best-performing warp scheduler, consistently across a wide range of applications. Using the insights obtained from RLWS, we designed PRO, a heuristic warp scheduler which, in addition to hiding the long latencies of certain operations, reduces the waiting time of warps at synchronization points. Evaluation of the proposed algorithm using the GPGPU-SIM simulator on a diverse set of applications showed an average speedup of 1.07x over the LRR warp scheduler and 1.08x over the TL warp scheduler.
In the second part of the thesis, we address problems due to warp and branch divergence. First, many GPU kernels exhibit warp divergence for various reasons, such as different amounts of work, cache misses, and thread divergence. We also observed that some kernels contain code which is redundant across TBs, i.e., all TBs execute the code identically and hence compute the same results. To improve the performance of such kernels, we propose a solution based on the concept of virtual TBs and loop-independent code motion. We propose the code transformations necessary to enable one virtual TB to execute the kernel code for multiple real TBs. We evaluated this technique using the GPGPU-SIM simulator on a diverse set of applications and observed an average improvement of 1.08x over the LRR and 1.04x over the Greedy Then Old (GTO) warp scheduling algorithms. Second, branch divergence causes the execution of diverging branches to be serialized, so that only one control flow path executes at a time. The existing stack-based hardware mechanism for reconverging threads causes duplicate execution of code for unstructured control flow graphs (CFGs). We propose a simple and elegant transformation to convert an unstructured CFG to a structured CFG. The transformation eliminates duplicate execution of user code while incurring only a linear increase in the number of basic blocks and in the number of instructions. We implemented the proposed transformation at the PTX level using the Ocelot compiler infrastructure and demonstrate that the proposed technique is effective in handling the performance problem due to divergence in unstructured CFGs.
Our third proposal enables efficient execution of loops with indirect memory accesses that can potentially cause cross-iteration dependences. Such dependences are hard to detect using existing compilation techniques. We present an algorithm to compute, at run time, the cross-iteration dependences in such loops, using both the CPU and the GPU. It effectively uses the compute capabilities of the GPU to collect the memory accesses performed by the iterations. Using the dependence information, the loop iterations are levelized such that each level contains independent iterations which can be executed in parallel. Experimental evaluation on real hardware (NVIDIA GPUs) reveals that the proposed technique can achieve an average speedup of 6.4x on loops with a reasonable number of cross-iteration dependences.
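A minimal host-side sketch of the levelization step, assuming a prior GPU pass has already recorded each iteration's read and write address sets; the data structures and function name are hypothetical, not the thesis's implementation:

```cpp
#include <vector>
#include <unordered_map>
#include <algorithm>
#include <cstdint>

// reads[i] / writes[i]: addresses touched by loop iteration i (collected on the GPU).
// Returns groups of iterations; iterations inside one group are independent and
// can be launched together on the GPU, groups are executed in order.
std::vector<std::vector<int>> levelize(const std::vector<std::vector<uint64_t>>& reads,
                                       const std::vector<std::vector<uint64_t>>& writes)
{
    const int n = static_cast<int>(reads.size());
    if (n == 0) return {};

    std::unordered_map<uint64_t, int> lastWrite, lastRead;  // address -> deepest level touching it
    std::vector<int> level(n, 0);

    for (int i = 0; i < n; ++i) {
        int lvl = 0;
        for (uint64_t a : reads[i])                         // flow (read-after-write) dependences
            if (auto it = lastWrite.find(a); it != lastWrite.end()) lvl = std::max(lvl, it->second + 1);
        for (uint64_t a : writes[i]) {                      // output and anti dependences
            if (auto it = lastWrite.find(a); it != lastWrite.end()) lvl = std::max(lvl, it->second + 1);
            if (auto it = lastRead.find(a);  it != lastRead.end())  lvl = std::max(lvl, it->second + 1);
        }
        level[i] = lvl;
        for (uint64_t a : reads[i])  { int& r = lastRead[a];  r = std::max(r, lvl); }
        for (uint64_t a : writes[i]) { int& w = lastWrite[a]; w = std::max(w, lvl); }
    }

    std::vector<std::vector<int>> levels(*std::max_element(level.begin(), level.end()) + 1);
    for (int i = 0; i < n; ++i) levels[level[i]].push_back(i);
    return levels;
}
```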
|
65 |
Hybrid Simulation Methods for Systems in Condensed Phase. Feldt, Jonas. 08 March 2018.
No description available.
|
66 |
Application specific programmable processors for reconfigurable self-powered devices. Nyländen, T. (Teemu). 27 April 2018.
Abstract
Current Internet of Things solutions for simple measurement and monitoring tasks are evolving into ubiquitous sensor networks that constantly observe both our well-being and the conditions of our living environment. The coming omnipresent wireless infrastructure is expected to feature artificial intelligence capabilities that can interpret human actions, gestures and even needs. All of this will require processing power on a par with, and energy efficiency far beyond, that of current mobile devices.
Current Internet of Things devices rely mostly on commercial low-power off-the-shelf microcontrollers. Optimized solely for low power consumption, with little attention to computing performance, the present solutions are far from achieving the energy efficiency, let alone the compute capability, required by future Internet of Things solutions. Since this domain is application specific by nature, the use of general-purpose processors for signal processing tasks is counterintuitive. Instead, dedicated accelerator-based solutions are more likely to meet these strict demands.
This thesis proposes one potential solution for meeting the low-energy, flexibility and performance requirements of the Internet of Things domain in a cost-effective manner, using reconfigurable heterogeneous processing solutions. A novel graphics-processing-unit-style accelerator for the Internet of Things application domain is presented. Since the accelerator can be reconfigured, it can be used for most applications in the Internet of Things domain, as well as in other application domains.
The solution is assessed using two computer vision applications and is demonstrated to achieve an excellent combination of performance and energy efficiency. The accelerator is designed using an efficient and rapid hardware/software co-design flow, with ease of development close to that of commercial off-the-shelf solutions, which also enables a cost-efficient design flow. / Tiivistelmä (Finnish abstract, translated):
The Internet of Things will completely transform our living environment in the future. It will enable interactive environments in place of today's passive ones. In addition, our living environment will react to our actions and speech, and even to our emotions. This omnipresent wireless infrastructure will demand unprecedented computing performance combined with extreme energy efficiency.
Current Internet of Things solutions rely almost entirely on commercial off-the-shelf general-purpose microcontrollers. These, however, are optimized solely for low power consumption rather than for energy efficiency, let alone for the computing performance that future Internet of Things applications will require. Since Internet of Things computing is inherently application specific, using general-purpose processors for signal processing tasks is illogical. Using application-specific accelerators for the computation would more likely make it possible to reach the targeted requirements.
This thesis presents one possible solution for simultaneously achieving low energy consumption, high performance and flexibility in a cost-effective manner, using reconfigurable heterogeneous processing solutions. A new graphics-processing-unit-style reconfigurable accelerator for the Internet of Things application domain is presented, which can be exploited by most computationally demanding applications.
The properties of the proposed accelerator are evaluated using two computer vision applications as examples, and it is shown to achieve an excellent combination of energy efficiency and performance. The accelerator is designed using an efficient and fast hardware/software co-design flow that achieves nearly the same ease of development as commercial off-the-shelf processors, which in turn enables cost-effective development and design work.
|
67 |
Využití GPU pro akcelerované zpracování obrazu / Image Processing on GPUs. Bačík, Ladislav. January 2008.
This master's thesis deals with modern graphics hardware technologies and their use for general-purpose computing. It focuses primarily on the unified processor architecture and on algorithm implementation via the CUDA programming interface. The basis of the thesis is to choose a suitable algorithm for demonstrating the computational power of the GPU. The main aim of this work is the implementation of a multiplatform library offering algorithms for the vectorization of discrete volumetric data. For this purpose, the Marching Cubes algorithm, which extracts the surface of the processed object, was chosen. The created library contains a variant of the algorithm runnable on the graphics device as well as one runnable on the CPU. Finally, we compare both variants and discuss their pros and cons.
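As an illustration of the data-parallel structure that makes Marching Cubes well suited to the GPU (a sketch only, not the library produced in the thesis), the CUDA kernel below lets each thread classify one cell of the volume by sampling its eight corners against the iso-value; the standard edge and triangle lookup tables, and the actual triangle emission, are omitted:

```cuda
#include <cuda_runtime.h>

__global__ void classifyCells(const float* volume, int nx, int ny, int nz,
                              float isoValue, unsigned char* cubeIndex)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx - 1 || y >= ny - 1 || z >= nz - 1) return;

    // Sample the eight corners of cell (x, y, z). The corner ordering here is
    // arbitrary; a real implementation must match the ordering its tables assume.
    unsigned char idx = 0;
    for (int c = 0; c < 8; ++c) {
        int cx = x + (c & 1);
        int cy = y + ((c >> 1) & 1);
        int cz = z + ((c >> 2) & 1);
        float v = volume[(cz * ny + cy) * nx + cx];
        if (v < isoValue) idx |= (1u << c);   // mark corners below the iso-surface
    }
    // idx selects the triangulation case (0 = empty cell, 255 = fully inside);
    // a second pass would use the triangle table to emit geometry per cell.
    cubeIndex[((z * (ny - 1)) + y) * (nx - 1) + x] = idx;
}
```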
|
68 |
An Experimental Evaluation of Probabilistic Deep Networks for Real-time Traffic Scene Representation using Graphical Processing Units. El-Shaer, Mennat Allah. 03 September 2019.
No description available.
|
69 |
Using GPU-aware message passing to accelerate high-fidelity fluid simulations / Användning av grafikprocessormedveten meddelandeförmedling för att accelerera nogranna strömningsmekaniska datorsimuleringar. Wahlgren, Jacob. January 2022.
Motivated by the end of Moore's law, graphics processing units (GPUs) are replacing general-purpose processors as the main source of computational power in emerging supercomputing architectures. A challenge in systems with GPU accelerators is the cost of transferring data between the host memory and the GPU device memory. On supercomputers, the standard for communication between compute nodes is the Message Passing Interface (MPI). Recently, many MPI implementations have added support for using GPU device memory directly as communication buffers, known as GPU-aware MPI. One of the most computationally demanding applications on supercomputers is high-fidelity simulation of turbulent fluid flow. Improved performance in high-fidelity fluid simulations can enable cases that are intractable today, such as a complete aircraft in flight. In this thesis, we compare MPI performance with host memory and GPU device memory, and demonstrate how GPU-aware MPI can be used to accelerate high-fidelity incompressible fluid simulations in the spectral element code Neko. On a test system with NVIDIA A100 GPUs, we find that MPI performance is similar using host memory and device memory, except for intra-node messages in the range of 1-64 KB, which are significantly slower using device memory, and above 1 MB, which are faster using device memory. We also find that the performance of high-fidelity simulations in Neko can be improved by up to 2.59 times by using GPU-aware MPI in the gather–scatter operation, which avoids several transfers between host and device memory. / [Swedish abstract, translated:] Motivated by the end of Moore's law, graphics processing units (GPUs) have begun to replace conventional processors as the main source of computing power in supercomputers. A challenge in systems with GPU accelerators is the cost of transferring data between host memory and accelerator memory. On supercomputers, the Message Passing Interface (MPI) is a standard for communication between compute nodes. Recently, many MPI implementations have added direct support for using accelerator memory as communication buffers, known as GPU-aware MPI. One of the most computationally intensive applications on supercomputers is high-fidelity simulation of turbulent flows. Improved performance in high-fidelity flow computations can enable cases that are impossible today, for example a complete aircraft in flight. In this thesis we compare MPI performance with host memory and accelerator memory, and demonstrate how GPU-aware MPI can be used to accelerate high-fidelity simulations of incompressible flows in the spectral element code Neko. On a test system with NVIDIA A100 GPUs we find that MPI performance is similar with host memory and accelerator memory. This does not, however, hold for intra-node messages in the range 1-64 KB, which are significantly slower with accelerator memory, or above 1 MB, which are significantly faster with accelerator memory. We also find that the performance of high-fidelity simulations in Neko can be improved by up to 2.59 times by using GPU-aware MPI in the so-called gather–scatter operation, which avoids several transfers between host memory and accelerator memory.
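To illustrate the difference GPU-aware message passing makes (a minimal sketch, not code from Neko, and assuming a CUDA-aware MPI build), compare handing the device pointer directly to MPI with staging the data through a host buffer:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

// Send `count` doubles that live in GPU device memory (d_buf) to rank `peer`.
void send_field(const double* d_buf, int count, int peer, bool gpu_aware)
{
    if (gpu_aware) {
        // GPU-aware path: the device pointer is passed straight to MPI.
        MPI_Send(d_buf, count, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD);
    } else {
        // Traditional path: copy to a host buffer first, then send from the host.
        std::vector<double> h_buf(count);
        cudaMemcpy(h_buf.data(), d_buf, count * sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf.data(), count, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD);
    }
}
```

The GPU-aware path removes the explicit host/device copies around every message, which is exactly the saving exploited in the gather–scatter operation described above.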
|
70 |
Efficient generation and rendering of tube geometry in Unreal Engine: Utilizing compute shaders for 3D line generation / Effektiv generering och rendering av tubgeometri i Unreal Engine : Generering av 3D-linjer med compute shaders. Woxler, Platon. January 2021.
Massive graph visualization in an immersive environment, such as virtual reality (VR) or augmented reality (AR), has the potential to improve users' understanding when exploring data in new ways. Making the most of such a visualization requires interactive components that are fast enough to sustain interactivity. By rendering the edges of the graph as shaded lines that imitate three-dimensional (3D) lines or tubes, one can circumvent technical limitations. This method works well enough on traditional two-dimensional (2D) monitors, but representing tubes as flat lines in a virtual environment (VE) makes for a less immersive user experience than visualizing true 3D geometry. To accommodate these requirements, i.e., speed and visual fidelity, we need a time-efficient way of producing tubular meshes. This thesis project explores how tubular geometry can be generated using compute shaders in the modern game engine Unreal Engine (UE). Exploiting the parallel computing power of the graphics processing unit (GPU), we use compute shaders to generate a tubular mesh following a predetermined path. The result of the project is an open-source plugin for UE able to generate tubular geometry at rapid rates. While it gives no major advantage when generating smaller models compared to a sequential implementation, the compute shader implementation creates and renders models > 40× faster when generating 10⁶ tube segments. A secondary effect of generating most of the data on the GPU is that we avoid the bottleneck that can occur when the bandwidth of the CPU-to-GPU data transfer is exceeded. Using this tool, researchers can more easily explore information visualization in a VE. Furthermore, this thesis promotes extended development of mesh generation using compute shaders in UE. / [Swedish abstract, translated:] Visualizing large graphs in an immersive environment, such as VR or AR, can improve a user's understanding when exploring data in new ways. Getting the most out of this type of visualization requires interactive components that are fast enough to support interactivity. By drawing the lines that connect a graph's nodes as flat lines that imitate 3D lines or tubes, one can avoid the ceiling imposed by technical limitations. This method is acceptable on traditional 2D screens, but representing tubes as flat lines in a VE gives a less immersive user experience, in contrast to visualizing true 3D geometry. To meet these requirements, i.e., time efficiency and visual quality, we need an efficient way of producing 3D lines. This thesis investigates how tubular geometry can be generated using compute shaders in the modern game engine Unreal Engine (UE). With compute shaders we can exploit the parallel computing power of a GPU to generate a tubular mesh that follows a predetermined path. The result of the project is an open-source plugin for UE that can generate tubular geometry at high speed. While it shows no major advantage when generating smaller models, compared to a sequential implementation, the compute shader implementation creates and renders models > 40× faster when generating 10⁶ tube segments. Since most of the data is created on the GPU, we also avoid the bottleneck that can arise when the bandwidth between the CPU and the GPU is exceeded.
With the help of the tool created for this thesis, people can more easily explore information visualization in a VE. In addition, this thesis promotes further development of mesh generation using compute shaders in UE.
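As a sketch of the parallel tube-generation idea (written as a CUDA kernel for illustration; the actual plugin uses HLSL compute shaders inside UE), each thread below builds one ring of vertices around one point of the input polyline; index-buffer generation and a robust moving frame are omitted, and at least two path points are assumed:

```cuda
#include <cuda_runtime.h>
#include <math.h>

__global__ void buildRings(const float3* path, int numPoints,
                           float radius, int ringVerts, float3* vertices)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPoints) return;

    // Tangent from a clamped central difference along the polyline.
    int prev = (i > 0) ? i - 1 : 0;
    int next = (i + 1 < numPoints) ? i + 1 : numPoints - 1;
    float3 p = path[i];
    float3 t = make_float3(path[next].x - path[prev].x,
                           path[next].y - path[prev].y,
                           path[next].z - path[prev].z);
    float tl = sqrtf(t.x * t.x + t.y * t.y + t.z * t.z);
    t.x /= tl; t.y /= tl; t.z /= tl;

    // Orthonormal frame around the tangent; the fixed "up" vector is a
    // simplification and degenerates for (nearly) vertical segments.
    float3 up = make_float3(0.f, 0.f, 1.f);
    float3 n = make_float3(t.y * up.z - t.z * up.y,
                           t.z * up.x - t.x * up.z,
                           t.x * up.y - t.y * up.x);
    float nl = sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
    n.x /= nl; n.y /= nl; n.z /= nl;
    float3 b = make_float3(t.y * n.z - t.z * n.y,
                           t.z * n.x - t.x * n.z,
                           t.x * n.y - t.y * n.x);

    // Emit one ring of ringVerts vertices around path point i.
    for (int k = 0; k < ringVerts; ++k) {
        float a = 2.0f * 3.14159265f * k / ringVerts;
        float c = cosf(a) * radius, s = sinf(a) * radius;
        vertices[i * ringVerts + k] = make_float3(p.x + c * n.x + s * b.x,
                                                  p.y + c * n.y + s * b.y,
                                                  p.z + c * n.z + s * b.z);
    }
}
```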
|