601

Real-Time 2D Digital Image Correlation to Measure Surface Deformation on Graphics Processing Unit using CUDA C

Vechalapu, Uday Bhaskar 05 June 2018
No description available.
602

Matrix Multiplications on Apache Spark through GPUs / Matrismultiplikationer på Apache Spark med GPU

Safari, Arash January 2017
In this report, we consider the distribution of large-scale matrix multiplications across a group of systems through Apache Spark, where each individual system utilizes Graphics Processing Units (GPUs) to perform the matrix multiplication. The purpose of this thesis is to investigate whether the GPU's advantage in performing parallel work can be applied to a distributed environment, and whether it scales noticeably better than a CPU implementation in such an environment. This question was resolved by benchmarking the different implementations under conditions where peak performance could be expected. Based on these benchmarks, it was concluded that GPUs do indeed perform better, as long as single-precision support is available in the distributed environment. When single-precision operations are not supported, GPUs perform much worse due to the low double-precision performance of most GPU devices.
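To make the single-precision point concrete, here is a minimal, hypothetical CUDA sketch (not from the thesis, which drives the GPUs from Spark) of the kind of naive single-precision matrix-multiply kernel each worker node might run locally; the kernel name and square row-major layout are assumptions for illustration:

```cuda
#include <cuda_runtime.h>

// One thread computes one element of C = A * B (n x n, row-major).
// Using float keeps the kernel on the fast single-precision units that
// most consumer GPUs provide; swapping float for double is exactly the
// change the thesis found to be costly on such devices.
__global__ void sgemmNaive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}
```

In practice a production setup would call a tuned library such as cuBLAS rather than a hand-written kernel; the sketch only illustrates where the precision choice enters.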
603

ART vs. NDK vs. GPU acceleration: A study of performance of image processing algorithms on Android / ART, NDK eller GPU acceleration: En prestandastudie av bildbehandlingsalgoritmer på Android

Pålsson, Andreas January 2017
The Android ecosystem contains three major execution platforms suited to different purposes. Android applications are normally written in the Java programming language, but computationally intensive parts of Android applications can be sped up by switching to a native language or by utilising the parallel architecture found in graphics processing units (GPUs). The experiments conducted in this thesis measure the performance benefits of switching from Java to C++ or RenderScript, Google's GPU acceleration framework. The experiments consist of common image-processing tasks. For some of these tasks, optimized libraries and implementations already exist; the performance of these third-party implementations is compared with our own. Our results show that for advanced image processing on large images, the benefits are large enough to warrant using C++ or RenderScript instead of Java on modern smartphones. However, if the image processing is conducted on very small images (e.g. thumbnails) or the task contains few calculations, moving to a native language or RenderScript is not worth the added development time and static complexity. RenderScript is the best choice if the GPU vendors provide an optimized implementation of the processing task. If no such implementation is provided, both C++ and RenderScript are viable choices. If full precision is required in the floating-point arithmetic, a C++ implementation is recommended. If the desired effect can be achieved without compliance with the IEEE floating-point arithmetic standard, RenderScript provides better run-time performance.
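RenderScript itself is not shown in this listing, but the data-parallel pattern all three platforms compete on is easy to illustrate. The following hypothetical CUDA kernel (an analog, not the thesis's Java/C++/RenderScript code) applies a per-pixel grayscale conversion, one independent output per thread, which is the same mapping RenderScript performs on Android GPUs:

```cuda
#include <cuda_runtime.h>

// Convert an RGBA image to 8-bit grayscale: each thread handles one
// pixel, so the work is embarrassingly parallel. The 0.299/0.587/0.114
// luma weights are the standard BT.601 coefficients.
__global__ void rgbaToGray(const uchar4* in, unsigned char* out,
                           int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        uchar4 p = in[y * width + x];
        out[y * width + x] =
            (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
    }
}
```

For a task this small per pixel, the thesis's conclusion applies: dispatch overhead can outweigh the speedup unless the image is large.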
604

Modeling, Simulation, and Visualization of 3D Lung Dynamics

Santhanam, Anand 01 January 2006
Medical simulation has facilitated the understanding of complex biological phenomena through its inherent explanatory power. It is a critical component for planning clinical interventions and analyzing their effect on a human subject. The success of medical simulation is evidenced by the fact that over one third of all medical schools in the United States augment their teaching curricula using patient simulators. Medical simulators present combat medics and emergency providers with video-based descriptions of patient symptoms along with step-by-step instructions on clinical procedures that alleviate the patient's condition. Recent advances in clinical imaging technology have led to effective medical visualization by coupling medical simulations with patient-specific anatomical models and their physically and physiologically realistic organ deformation. 3D physically-based deformable lung models obtained from a human subject are tools for regional lung structure and function analysis. Static imaging techniques such as Magnetic Resonance Imaging (MRI), chest X-rays, and Computed Tomography (CT) are conventionally used to estimate the extent of pulmonary disease and to establish available courses for clinical intervention. The predictive accuracy and evaluative strength of these static imaging techniques may be augmented by improved computer technologies and graphical rendering techniques that can transform static images into dynamic representations of subject-specific organ deformations. Physically based 3D simulation and visualization of deformable models obtained from subject-specific lung images will better represent lung structure and function. Variations in overall lung deformations may indicate tissue pathologies; thus 3D visualization of functioning lungs may also provide a visual complement to current diagnostic methods. The feasibility of medical visualization using static 3D lungs as an effective tool for endotracheal intubation was previously shown using Augmented Reality (AR) based techniques in one of several research efforts at the Optical Diagnostics and Applications Laboratory (ODALAB). That effort also shed light on the potential of coupling such medical visualization with dynamic 3D lungs. The purpose of this dissertation is to develop 3D deformable lung models, built from subject-specific high-resolution CT data, that can be visualized in the AR-based environment. A review of the literature shows that techniques for modeling real-time 3D lung dynamics can be roughly grouped into two categories: geometrically based and physically based. Additional classifications include treating a 3D lung model as either a volumetric or a surface model, modeling the lungs as a single compartment or as multiple compartments, modeling either the air-blood interaction or the air-blood-tissue interaction, and considering either normal or pathophysical lung behavior. Validating the simulated lung dynamics is a complex problem and has previously been approached by tracking a set of landmarks on the CT images. An area that needs to be explored is the relationship between the choice of deformation method for the 3D lung dynamics and its visualization framework. Constraints on the choice of deformation method and the 3D model resolution arise from the visualization framework; the constraints of interest here are the real-time requirement and the level of interaction required with the 3D lung models.
The work presented here discusses a framework that facilitates a physics-based and physiology-based deformation of a single-compartment surface lung model while maintaining the frame-rate requirements of the visualization system. The framework is part of several research efforts at ODALAB toward an AR-based medical visualization framework. It consists of three components: (i) modeling the Pressure-Volume (PV) relation, (ii) modeling the lung deformation using a Green's function based deformation operator, and (iii) optimizing the deformation using state-of-the-art Graphics Processing Units (GPUs). The validation of the results obtained in the first two modeling steps is also discussed for normal human subjects. Disease states such as pneumothorax and lung tumors are modeled using the proposed deformation method. Additionally, a method to synchronize the instantiations of the deformation across a network is discussed.
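As a rough illustration of component (ii), the sketch below shows what a Green's-function-style deformation operator can look like on the GPU: each thread accumulates the displacement of one surface node as a kernel-weighted sum of contributions from all other nodes. The 1/r weighting, the stiffness parameter, and all names are assumptions for illustration, not the dissertation's actual operator:

```cuda
#include <cuda_runtime.h>

// disp[i] = sum over j of G(x_i, x_j) * pressure[j], with an assumed
// G ~ 1/(stiffness * r) kernel acting along the direction from i to j.
__global__ void deformNodes(const float3* pos, const float* pressure,
                            float3* disp, int n, float stiffness) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 u = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float3 d = make_float3(pos[j].x - pos[i].x,
                               pos[j].y - pos[i].y,
                               pos[j].z - pos[i].z);
        float r = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z) + 1e-6f;
        float g = pressure[j] / (stiffness * r);  // assumed 1/r Green's kernel
        u.x += g * d.x / r;                       // push along unit direction
        u.y += g * d.y / r;
        u.z += g * d.z / r;
    }
    disp[i] = u;
}
```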
605

Towards Scalable Nanomanufacturing: Modeling the Interaction of Charged Droplets from Electrospray Using GPU

Yang, Weiwei 01 January 2012
Electrospray is an atomization method that has recently been the subject of intense study due to its monodispersity and the wide size range of droplets it can produce, from nanometers to hundreds of micrometers. This thesis focuses on the numerical and theoretical modeling of the interaction of charged droplets from single and multiplexed electrospray. We studied two typical scenarios: large-area film deposition using multiplexed electrospray, and fine-pattern printing assisted by linear electrostatic quadrupole focusing. Because of the high computational demands of the unsteady n-body problem, a graphics processing unit (GPU), which delivers 10 teraflops of computational power, is used to speed up the numerical simulation dramatically and at low cost. For large-area film deposition, both the spray profile and the deposition number density are studied for different arrangements of electrosprays and electrodes. Multiplexed electrospray with a hexagonal nozzle configuration cannot produce uniform deposition even though it has the highest packing density. Uniform film deposition with variation < 5% in thickness was observed with the linear nozzle configuration combined with relative motion between the electrospray source and the deposition substrate. For fine-pattern printing, a linear quadrupole is used to focus the droplets in the radial direction while maintaining a constant driving field in the axial direction. Simulation shows that the linear quadrupole can quickly focus the droplets to a resolution of a few nanometers when the inter-droplet separation is larger than a certain value; resolution deteriorates drastically when the separation falls below that value. This study sheds light on using electrospray as a scalable nanomanufacturing approach.
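The n-body computation the thesis offloads to the GPU is the classic all-pairs pattern. Below is a hedged CUDA sketch (names and the softening constant are illustrative, not from the thesis) in which each thread accumulates the Coulomb repulsion acting on one charged droplet from every other droplet:

```cuda
#include <cuda_runtime.h>

// d[i] = (x, y, z, charge). Brute-force O(n^2) Coulomb forces: one
// thread per droplet sums k * q_i * q_j / r^2 along the unit vector
// from j to i. A small softening term avoids division by zero.
__global__ void coulombForces(const float4* d, float3* force, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float k = 8.9875517e9f;   // Coulomb constant, N*m^2/C^2
    float4 di = d[i];
    float3 f = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float4 dj = d[j];
        float3 r = make_float3(di.x - dj.x, di.y - dj.y, di.z - dj.z);
        float r2 = r.x * r.x + r.y * r.y + r.z * r.z + 1e-12f;
        float inv = rsqrtf(r2);
        float s = k * di.w * dj.w * inv * inv * inv;  // k*q_i*q_j / r^3
        f.x += s * r.x; f.y += s * r.y; f.z += s * r.z;
    }
    force[i] = f;
}
```

Tiling this loop through shared memory (as in the well-known GPU Gems n-body chapter) is the usual next optimization step.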
606

GPU-Accelerated Point-Based Color Bleeding

Schmitt, Ryan Daniel 01 June 2012
Traditional global illumination techniques like radiosity and Monte Carlo sampling are computationally expensive. This prompted Pixar to develop the Point-Based Color Bleeding (PBCB) algorithm to approximate complex indirect illumination while meeting the demands of movie production; namely, reduced memory usage, surface-shading-independent run time, and faster renders than the aforementioned techniques. The PBCB algorithm works by discretizing a scene's directly illuminated geometry into a point-cloud (surfel) representation. When computing the indirect illumination at a point, the surfels are rasterized onto cube faces surrounding that point, and the constituent pixels are combined into the final approximate indirect lighting value. In this thesis we present a performance enhancement to the Point-Based Color Bleeding algorithm through hardware acceleration; our contribution incorporates GPU-accelerated rasterization into the cube-face raster phase. The goal is to leverage the powerful rasterization capabilities of modern graphics processors to speed up the PBCB algorithm over standard software rasterization. Additionally, we contribute a preprocess that generates triangular surfels suited to fast rasterization by the GPU, and show that new heterogeneous-architecture chips (e.g. Sandy Bridge from Intel) simplify the code required to leverage the power of the GPU. Our algorithm reproduces the output of the traditional Monte Carlo technique with a speedup of 41.65x, and additionally achieves a 3.12x speedup over software-rasterized PBCB.
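To make the gather step concrete, the following much-simplified hypothetical sketch rasterizes surfels onto a single cube face around a shade point: each thread owns one raster pixel and keeps the color of the nearest surfel projecting onto it. Real PBCB rasterizes all six faces and weights pixels by solid angle; the 8x8 face, struct layout, and names here are illustrative assumptions:

```cuda
#include <cuda_runtime.h>

#define FACE 8  // tiny raster; production sizes are larger

struct Surfel { float3 pos; float3 color; };

// Gather indirect light arriving through the +z cube face at 'origin':
// per-pixel nearest-surfel search with a 90-degree frustum (x/z and y/z
// mapped from [-1,1] to pixel coordinates).
__global__ void gatherPlusZFace(const Surfel* surfels, int nSurfels,
                                float3 origin, float3* face) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= FACE || py >= FACE) return;
    float nearest = 1e30f;
    float3 c = make_float3(0.f, 0.f, 0.f);
    for (int s = 0; s < nSurfels; ++s) {
        float3 v = make_float3(surfels[s].pos.x - origin.x,
                               surfels[s].pos.y - origin.y,
                               surfels[s].pos.z - origin.z);
        if (v.z <= 0.f) continue;                       // behind this face
        int sx = (int)((v.x / v.z * 0.5f + 0.5f) * FACE);
        int sy = (int)((v.y / v.z * 0.5f + 0.5f) * FACE);
        if (sx == px && sy == py && v.z < nearest) {
            nearest = v.z;                              // depth test
            c = surfels[s].color;
        }
    }
    face[py * FACE + px] = c;  // later averaged into the indirect term
}
```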
607

PARIS: A PArallel RSA-Prime InSpection Tool

White, Joseph R. 01 June 2013
Modern-day computer security relies heavily on cryptography as a means to protect the data that we have become increasingly reliant on. As the Internet becomes more ubiquitous, methods of security must be better than ever. Validation tools can be leveraged to help increase our confidence in, and accountability for, the methods we employ to secure our systems. Security validation, however, can be difficult and time-consuming. As our computational ability increases, calculations that were once considered “hard” due to the length of computation can now be done in minutes. We constantly increase the size of our keys and attempt to make computations harder in order to protect our information. This increase in “cracking” difficulty often has the unfortunate side effect of making validation equally difficult. We can leverage massive parallelism and the computational power granted by today’s commodity hardware, such as GPUs, to make checks attainable that would otherwise be impossible to perform. Our work presents a practical tool for validating RSA keys against poorly generated primes: a fundamental problem that has led to significant security holes, despite the RSA algorithm’s mathematical soundness. Our tool, PARIS, leverages NVIDIA’s CUDA framework to perform a complete set of greatest-common-divisor calculations between all keys in a provided set. Our implementation offers a 27.5x speedup using a GTX 480 and a 33.9x speedup using a Tesla K20Xm, both compared to a reference sequential implementation for sets of fewer than 200,000 keys. This level of speedup brings such validation into the realm of practicality.
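The core check PARIS performs is well known: if two RSA moduli share a prime factor, a single GCD reveals it and breaks both keys. Here is a hedged toy CUDA sketch of that all-pairs check; real moduli are 1024+ bits and require multiword arithmetic, so the 64-bit toy moduli and all names are stand-ins for brevity:

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Euclid's algorithm on toy 64-bit moduli.
__device__ uint64_t gcd64(uint64_t a, uint64_t b) {
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

// Thread i checks modulus i against every later modulus; a GCD > 1
// means the two keys share a prime factor and both are compromised.
__global__ void pairwiseGcd(const uint64_t* moduli, int n, int* flagged) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int j = i + 1; j < n; ++j)
        if (gcd64(moduli[i], moduli[j]) > 1) {
            flagged[i] = 1;
            flagged[j] = 1;
        }
}
```

A large-scale batch-GCD tool would typically use a product/remainder tree instead, but the pairwise kernel conveys the structure of the computation.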
608

Out-of-Core GPU Path Tracing on Large Instanced Scenes via Geometry Streaming

Berchtold, Jeremy 01 June 2022
We present a technique for out-of-core GPU path tracing of arbitrarily large scenes that is compatible with hardware-accelerated ray tracing. Our technique improves upon previous work by subdividing the scene spatially into streamable chunks that are loaded using a priority system that maximizes ray throughput and minimizes GPU memory usage. This allows scene complexity to scale arbitrarily. Our system required under 19 minutes to render a solid-color version of Disney's Moana Island scene (39.3 million instances, 261.1 million unique quads, and 82.4 billion instanced quads) at a resolution of 1024x429 and 1024 spp on an RTX 5000 (24 GB memory total, 22 GB used, 13 GB geometry cache, with the remainder for temporary buffers and storage) (Wald et al.). As a scalability test, our system rendered 26 Moana Island scenes without multi-level instancing (1.02 billion instances, 2.14 trillion instanced quads, ~230 GB if all resident) in under 1 h 28 m. Compared to state-of-the-art hardware-accelerated renders of the Moana Island scene, our system can render larger scenes on a single GPU. Our system is faster than the previous out-of-core approach and is able to render larger scenes than previous in-core approaches under the same memory constraints (Hellmuth; Zellman et al.; Wald).
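The priority-driven chunk loading can be sketched in a few lines of host code. This hypothetical C++ fragment (all names, the demand metric, and the simple stop-on-full policy are assumptions; the thesis's system also evicts and re-queues) uploads the chunk with the most rays queued against it until the GPU budget is reached:

```cuda
#include <queue>
#include <unordered_set>
#include <vector>

struct Chunk { int id; size_t bytes; int pendingRays; };

// Greedily upload the most-demanded chunks first, maximizing the number
// of queued rays that can make progress per byte of GPU memory used.
void streamChunks(std::vector<Chunk>& chunks, size_t budgetBytes) {
    auto byDemand = [](const Chunk* a, const Chunk* b) {
        return a->pendingRays < b->pendingRays;      // max-heap on demand
    };
    std::priority_queue<Chunk*, std::vector<Chunk*>, decltype(byDemand)>
        pending(byDemand);
    for (auto& c : chunks)
        if (c.pendingRays > 0) pending.push(&c);

    size_t resident = 0;
    std::unordered_set<int> onGpu;
    while (!pending.empty()) {
        Chunk* c = pending.top(); pending.pop();
        if (resident + c->bytes > budgetBytes) break;  // a full system evicts here
        // cudaMemcpyAsync(...) would stream c's geometry to the GPU here.
        onGpu.insert(c->id);
        resident += c->bytes;
    }
}
```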
609

Implementing Streaming Parallel Decision Trees on Graphic Processing Units / En implementering av Streaming Parallel Decision Trees på grafikkort

Svantesson, David January 2018
Decision trees have long been a prevalent area within machine learning. With streaming-data environments and large datasets becoming increasingly common, researchers have developed decision tree algorithms adapted to streaming data. One such algorithm is SPDT, which approaches the streaming-data problem by making use of workers on a network combined with a dynamic histogram approximation of the data. Several GPU implementations of decision trees exist, but they are uncommon in a streaming-data setting. In this research, conducted at RISE SICS, the possibility of accelerating the SPDT algorithm on the GPU is investigated. An implementation is successfully created using the CUDA platform. The implementation uses a set number of data samples per layer to better fit the GPU platform. Experiments were conducted to investigate the impact on both accuracy and speed. It is found that the GPU implementation performs as well as the CPU implementation in terms of accuracy, suggesting that using small subsets of the data in each layer is sufficient for making accurate split decisions. The GPU implementation is found to be up to 113 times faster than the reference Scala CPU implementation for one of the tested datasets, and 13 times faster on average over all the tested datasets. Weak parts of the implementation are identified, and further improvements are suggested to increase both accuracy and runtime performance.
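SPDT's split decisions come from histogram aggregates rather than sorted feature values, which is what makes the per-layer statistics pass GPU-friendly. The sketch below is a hedged simplification: SPDT proper uses dynamic Ben-Haim & Tom-Tov histograms, whereas this hypothetical kernel bins one feature into fixed-width bins per class with atomics:

```cuda
#include <cuda_runtime.h>

// One thread per sample: bin its feature value into [lo, hi) and bump
// the class-conditional count. The resulting 2 x bins table is enough
// to evaluate candidate split points for a binary-label problem.
__global__ void buildHistogram(const float* values, const int* labels,
                               int n, float lo, float hi, int bins,
                               int* histPerClass /* size 2 * bins */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float t = (values[i] - lo) / (hi - lo);
    int b = min(bins - 1, max(0, (int)(t * bins)));
    atomicAdd(&histPerClass[labels[i] * bins + b], 1);
}
```

Sampling a fixed number of rows per layer, as the thesis does, simply bounds n in this pass without changing the kernel.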
610

Accelerating Physical Design Algorithms Using CUDA

Abhinav Agarwal (17623890) 13 December 2023
Chip design encompasses the creation of intricate blueprints for integrated circuits (ICs), and algorithms play a pivotal role in optimizing IC performance and functionality. This thesis examines the use of algorithms within chip design, highlighting their potential to improve the efficiency and efficacy of the design process, and undertakes a comprehensive comparison of algorithmic performance on Central Processing Units (CPUs) and Graphics Processing Units (GPUs). A cornerstone application of algorithms in chip design is logic synthesis, which transmutes a high-level circuit description into a silicon-compatible, low-level representation; by minimizing gate requirements, curtailing power consumption, and bolstering performance, algorithms serve as architects of optimized logic synthesis. Physical design likewise harnesses algorithms to translate logical designs into physically realizable layouts on silicon wafers, with careful attention to considerations such as routing congestion and power efficiency. This thesis explores in depth the implementation of two pivotal physical design algorithms: the Kernighan-Lin (KL) partitioning algorithm for optimizing placement and partitioning, and Lee's algorithm for routing. Through a careful comparison of dataset efficiency and run time across both hardware platforms, noteworthy insights emerged. For the KL algorithm on small datasets (sizes < 10^5), the CPU demonstrates a 1.2x faster processing speed than the GPU; as dataset sizes surpass this threshold, a distinct trend emerges: GPU run times remain relatively consistent, while CPU run times undergo a threefold increase at select points. For Lee's algorithm, the CPU demonstrated superior execution time despite having fewer cores and threads than the GPU. This can be attributed to the inherently sequential nature of Lee's algorithm, where each step depends on the preceding one, aligning with the CPU's strength in handling sequential tasks.
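The sequential character of Lee's algorithm that the thesis observes is visible in its structure: a breadth-first wavefront where each cell's label depends on the previous wave. Below is a hedged host-side sketch (the grid encoding and all names are illustrative):

```cuda
#include <queue>
#include <vector>

// Lee's maze routing, phase 1: BFS from the source labels every
// reachable free cell with its wavefront distance. Phase 2 (not shown)
// backtraces from the target along strictly decreasing labels.
std::vector<int> leeWavefront(const std::vector<int>& blocked,
                              int w, int h, int src) {
    std::vector<int> dist(w * h, -1);
    std::queue<int> frontier;
    dist[src] = 0;
    frontier.push(src);
    const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
    while (!frontier.empty()) {
        int c = frontier.front(); frontier.pop();
        int x = c % w, y = c / w;
        for (int k = 0; k < 4; ++k) {
            int nx = x + dx[k], ny = y + dy[k];
            if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
            int nc = ny * w + nx;
            if (blocked[nc] || dist[nc] != -1) continue;
            dist[nc] = dist[c] + 1;   // expand the wave by one cell
            frontier.push(nc);
        }
    }
    return dist;
}
```

A GPU version must expand a whole frontier per kernel launch, which is why small or narrow grids leave the CPU ahead, matching the thesis's finding.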
