About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Efficient Dynamic Automatic Memory Management And Concurrent Kernel Execution For General-Purpose Programs On Graphics Processing Units

Pai, Sreepathi 11 1900 (has links) (PDF)
Modern supercomputers now use accelerators to achieve their performance, with the most widely used accelerator being the Graphics Processing Unit (GPU). However, achieving the performance potential of systems that combine a GPU and CPU is an arduous task which could be made easier with the assistance of the compiler or runtime. In particular, exploiting two features of GPU architectures -- distributed memory and concurrent kernel execution -- is critical to achieve good performance, but in current GPU programming systems, programmers must exploit them manually. This can lead to poor performance. In this thesis, we propose automatic techniques that: i) perform data transfers between the CPU and GPU, ii) allocate resources for concurrent kernels, and iii) schedule concurrent kernels efficiently without programmer intervention.

Most GPU programs access data in GPU memory for performance. Manually inserting data transfers that move data to and from this GPU memory is an error-prone and tedious task. In this work, we develop a software coherence mechanism to fully automate all data transfers between the CPU and GPU without any assistance from the programmer. Our mechanism uses compiler analysis to identify potential stale data accesses and uses a runtime to initiate transfers as necessary. This avoids the redundant transfers that are exhibited by all other existing automatic memory management proposals for general-purpose programs. We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers. Tested on eight programs ported from the Rodinia benchmark suite, it achieves (i) a 1.06x speedup over hand-tuned manual memory management, and (ii) a 1.29x speedup over another recently proposed compiler-runtime automatic memory management system. Compared to other existing runtime-only (ADSM) and compiler-only (OpenMPC) proposals, it also transfers 2.2x to 13.3x less data on average.

Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite. Current GPUs therefore allow concurrent execution of kernels to improve utilization. We study concurrent execution of GPU kernels using multiprogrammed workloads on current NVIDIA Fermi GPUs. On two-program workloads from Parboil2 we find concurrent execution is often no better than serialized execution. We identify lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels, which permit fine-grained control over their resource usage. We then propose several elastic-kernel-aware runtime concurrency policies that offer significantly better performance and concurrency than the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads over the current CUDA concurrency implementation.

Recent NVIDIA GPUs use a FIFO policy in their thread block scheduler (TBS) to schedule thread blocks of concurrent kernels. We show that FIFO leaves performance to chance, resulting in significant loss of performance and fairness. To improve performance and fairness, we propose the use of the Shortest Remaining Time First (SRTF) policy instead. Since SRTF requires an estimate of runtime (i.e., execution time), we introduce Structural Runtime Prediction, which uses the grid structure of GPU programs for predicting runtimes. Using a novel Staircase model of GPU kernel execution, we show that kernel runtime can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench and find that predictions made after the execution of a single thread block are between 0.48x and 1.08x of actual runtime. We implement the SRTF policy for concurrent kernels that uses this predictor and evaluate it on two-program workloads from ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. Compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also propose SRTF/Adaptive, which controls the resource usage of concurrently executing kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by 2.23x and fairness by 2.95x compared to FIFO. Overall, our implementation of SRTF achieves STP to within 12.64% of Shortest Job First (SJF, an oracle optimal scheduling policy), bridging 49% of the gap between FIFO and SJF.
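The concurrent kernel execution studied above is exposed in CUDA through streams: kernels launched into different non-default streams are allowed to overlap, provided each leaves enough resources (thread blocks, registers, shared memory) free for the other. A minimal sketch of this mechanism, with toy kernels and launch sizes that are purely illustrative and not taken from the thesis:

```cuda
#include <cuda_runtime.h>

// Two independent toy kernels; on Fermi-class and later GPUs they may
// execute concurrently if each leaves enough resources free for the other.
__global__ void scaleKernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

__global__ void offsetKernel(float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launching into distinct streams is what *permits* concurrency; whether
    // the kernels actually overlap depends on the resources each requests.
    scaleKernel<<<256, 256, 0, s1>>>(a, n);
    offsetKernel<<<256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Whether the two kernels actually run concurrently is decided by the hardware's resource allocation, which is precisely the lack of control that the elastic-kernel transformations above are designed to address.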
22

Multi-level Parallelism with MPI and OpenACC for CFD Applications

McCall, Andrew James 14 June 2017 (has links)
High-level parallel programming approaches, such as OpenACC, have recently become popular in complex fluid dynamics research since they are cross-platform and easy to implement. OpenACC is a directive-based programming model that, unlike low-level programming models, abstracts the details of implementation on the GPU. Although OpenACC generally limits the performance of the GPU, this model significantly reduces the work required to port an existing code to any accelerator platform, including GPUs. The purpose of this research is twofold: to investigate the effectiveness of OpenACC in developing a portable and maintainable GPU-accelerated code, and to determine the capability of OpenACC to accelerate large, complex programs on the GPU. In both of these studies, the OpenACC implementation is optimized and extended to a multi-GPU implementation while maintaining a unified code base. OpenACC is shown as a viable option for GPU computing with CFD problems. In the first study, a CFD code that solves incompressible cavity flows is accelerated using OpenACC. Overlapping communication with computation improves performance for the multi-GPU implementation by up to 21%, achieving up to 400 times faster performance than a single CPU and 99% weak scalability efficiency with 32 GPUs. The second study ports the execution of a more complex CFD research code to the GPU using OpenACC. Challenges using OpenACC with modern Fortran are discussed. Three test cases are used to evaluate performance and scalability. The multi-GPU performance using 27 GPUs is up to 100 times faster than a single CPU and maintains a weak scalability efficiency of 95%. / Master of Science
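The overlap of communication with computation mentioned above is achieved in the thesis with OpenACC async clauses and MPI; purely as an illustration of the same pattern, the sketch below expresses it directly with CUDA streams (the kernel, halo size, and function names are hypothetical):

```cuda
#include <cuda_runtime.h>

// Toy 1D Jacobi-style interior update (not the thesis's CFD solver).
__global__ void interiorUpdate(const float *u, float *u_new, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) u_new[i] = 0.5f * (u[i - 1] + u[i + 1]);
}

// Overlap pattern: stage the boundary value needed by a neighbouring MPI
// rank on one stream while the interior update runs on another.
// h_halo must be pinned host memory (cudaMallocHost) for the copy to
// proceed asynchronously.
void stepWithOverlap(const float *d_u, float *d_u_new, float *h_halo, int n,
                     cudaStream_t copyStream, cudaStream_t computeStream) {
    cudaMemcpyAsync(h_halo, d_u + n - 2, sizeof(float),
                    cudaMemcpyDeviceToHost, copyStream);        // 1. start transfer
    interiorUpdate<<<(n + 255) / 256, 256, 0, computeStream>>>(d_u, d_u_new, n);  // 2. compute
    cudaStreamSynchronize(copyStream);                          // 3. halo ready for MPI_Send (omitted)
}
```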
23

Comparative Study of CPU and GPGPU Implementations of the Sieves of Eratosthenes, Sundaram and Atkin

Månsson, Jakob January 2021 (has links)
Background. Prime numbers are integers divisible only by 1 and themselves, and one of the oldest methods of finding them is through a process known as sieving. A prime number sieving algorithm produces every prime number in a span, usually from the number 2 up to a given number n. In this thesis, we will cover the three sieves of Eratosthenes, Sundaram, and Atkin. Objectives. We shall compare their sequential CPU implementations to their parallel GPGPU (General-Purpose Graphics Processing Unit) counterparts in terms of performance, accuracy, and suitability. GPGPU is a method in which one utilizes hardware intended for graphics rendering to achieve a high degree of parallelism. Our goal is to establish whether GPGPU sieving can be more effective than the sequential approach, which is currently commonplace. Method. We utilize the C++ and CUDA programming languages to implement the algorithms and then extract data regarding their execution time and accuracy. Experiments are set up and run at several sieving limits, with the upper bound set by the memory capacity of the available GPU hardware. Furthermore, we study each sieve to identify what characteristics make it fit or unfit for a GPGPU approach. Results. Our results show that the sieve of Eratosthenes is slow and ill-suited for GPGPU computing, that the sieve of Sundaram is both efficient and fit for parallelization, and that the sieve of Atkin is the fastest but suffers from imperfect accuracy. Conclusions. Finally, we address how the smaller concurrent memory capacity available for GPGPU limits the ranges that can be sieved, compared to the CPU. Utilizing the beneficial characteristics of the sieve of Sundaram, we propose a batch-divided implementation that would allow the GPGPU sieve to cover a range of numbers equal to that of any of the CPU variants.
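As a concrete illustration of why the sieve of Sundaram maps well to the GPU, the sketch below marks composites with one thread per value of i; it is a minimal, unoptimized example (the limit, block size, and flag layout are illustrative) rather than the thesis's implementation or its proposed batch-divided variant:

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Sieve of Sundaram: to find primes up to n, mark every k = i + j + 2ij
// (1 <= i <= j, k <= m where m = (n - 1) / 2).  Unmarked k then yield the
// odd primes 2k + 1; the prime 2 is added separately.
__global__ void sundaramMark(unsigned char *composite, long long m) {
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x + 1;
    if (i > m) return;
    for (long long j = i; i + j + 2 * i * j <= m; ++j)
        composite[i + j + 2 * i * j] = 1;   // racy writes store the same value, so they are harmless
}

int main() {
    const long long n = 1000000, m = (n - 1) / 2;
    unsigned char *d_flags;
    cudaMalloc(&d_flags, m + 1);
    cudaMemset(d_flags, 0, m + 1);

    const int threads = 256;
    int blocks = (int)((m + threads - 1) / threads);
    sundaramMark<<<blocks, threads>>>(d_flags, m);

    std::vector<unsigned char> flags(m + 1);
    cudaMemcpy(flags.data(), d_flags, m + 1, cudaMemcpyDeviceToHost);
    cudaFree(d_flags);

    long long count = 1;                      // the prime 2
    for (long long k = 1; k <= m; ++k)
        if (!flags[k]) ++count;               // the prime is 2k + 1
    printf("primes <= %lld: %lld\n", n, count);
    return 0;
}
```

Every marked index is independent of all others, which is what makes the marking phase embarrassingly parallel and, as the thesis concludes, a good fit for GPGPU.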
24

Toward Reliable, Secure, and Energy-Efficient Multi-Core System Design

Basu, Prabal 01 August 2019 (has links)
Computer hardware researchers have perennially focused on improving the performance of computers while keeping energy consumption within a strict budget. While several innovations over the years have led to high-performance and energy-efficient computers, new challenges have also emerged as a fallout. For example, smaller transistor devices in modern multi-core systems are afflicted with several reliability and security concerns that were inconceivable even a decade ago. Tackling these bottlenecks, in turn, tends to negatively impact the power and performance of computers. This dissertation explores novel techniques to gracefully solve some of the pressing challenges of modern computer design. Specifically, the proposed techniques improve the reliability of the on-chip communication fabric under high power supply noise, increase the energy efficiency of low-power graphics processing units, and demonstrate an unprecedented security loophole of the low-power computing paradigm through rigorous hardware-based experiments.
25

Generalizing the Utility of Graphics Processing Units in Large-Scale Heterogeneous Computing Systems

Xiao, Shucai 03 July 2013 (has links)
Today, heterogeneous computing systems are widely used to meet the increasing demand for high-performance computing. These systems commonly use powerful and energy-efficient accelerators to augment general-purpose processors (i.e., CPUs). The graphics processing unit (GPU) is one such accelerator. Originally designed solely for graphics processing, GPUs have evolved into programmable processors that can deliver massive parallel processing power for general-purpose applications. Using SIMD (Single Instruction, Multiple Data) components as building units, the current GPU architecture is well suited for data-parallel applications where the execution of each task is independent. With the advent of programming models such as the Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL), programming GPUs has become much easier than before. However, developing and optimizing an application on a GPU is still a challenging task, even for well-trained computing experts. Such programming tasks will be even more challenging in large-scale heterogeneous systems, particularly in the context of utility computing, where GPU resources are used as a service. These challenges are largely due to the limitations of the current programming models: (1) there are no natively supported intra- and inter-GPU cooperative mechanisms; (2) current programming models only support the utilization of locally installed GPUs; and (3) to use GPUs on another node, application programs need to explicitly call application programming interface (API) functions for data communication.

To reduce the mapping effort and to better utilize GPU resources, we investigate generalizing the utility of GPUs in large-scale heterogeneous systems with GPUs as accelerators. We generalize the utility of GPUs through the transparent virtualization of GPUs, which enables applications to view all GPUs in the system as if they were installed locally. As a result, all GPUs in the system can be used as local GPUs. Moreover, GPU virtualization is a key capability to support the notion of "GPU as a service." Specifically, we propose the Virtual OpenCL (VOCL) framework for the transparent virtualization of GPUs. To achieve good performance, we optimize and extend the framework in three aspects: (1) we optimize VOCL by reducing the data transfer overhead between the local node and the remote node; (2) we propose GPU synchronization to reduce the overhead of switching back and forth when multiple kernel launches are needed for data communication across different compute units on a GPU; and (3) we extend VOCL to support live virtual GPU migration for quick system maintenance and load rebalancing across GPUs.

With the above optimizations and extensions, we thoroughly evaluate VOCL along three dimensions: (1) we show the performance improvement for each of our optimization strategies; (2) we evaluate the overhead of using remote GPUs via several microbenchmark suites as well as a few real-world applications; and (3) we demonstrate the overhead as well as the benefit of live virtual GPU migration. Our experimental results indicate that VOCL can generalize the utility of GPUs in large-scale systems at a reasonable virtualization and migration cost. / Ph. D.
26

Performance Modeling, Optimization, and Characterization on Heterogeneous Architectures

Panwar, Lokendra Singh 21 October 2014 (has links)
Today, heterogeneous computing has truly reshaped the way scientists think about and approach high-performance computing (HPC). Hardware accelerators such as general-purpose graphics processing units (GPUs) and the Intel Many Integrated Core (MIC) architecture continue to make inroads in accelerating large-scale scientific applications. These advancements, however, introduce new sets of challenges to the scientific community, such as selecting the best processor for an application, devising effective performance optimization strategies, and maintaining performance portability across architectures. In this thesis, we present our techniques and approach to address some of these significant issues.

Firstly, we present a fully automated approach to project the relative performance of an OpenCL program over different GPUs. Performance projections can be made within a small amount of time, and the projection overhead stays relatively constant with the input data size. As a result, the technique can help runtime tools make dynamic decisions about which GPU would run faster for a given kernel. Use cases of this technique include scheduling or migrating GPU workloads over a heterogeneous cluster with different types of GPUs.

We then present our approach to accelerating a seismology modeling application that is based on the finite-difference method (FDM), using MPI and CUDA over a hybrid CPU+GPU cluster. We describe the generic computational complexities involved in porting such applications to GPUs and present our strategy for efficient performance optimization and characterization. We also show how performance modeling can be used to reason about and drive the hardware-specific optimizations on the GPU. The performance evaluation of our approach delivers a maximum speedup of 23-fold with a single GPU and 33-fold with dual GPUs per node over the serial version of the application, which in turn results in a many-fold speedup when coupled with the MPI distribution of the computation across the cluster. We also study the efficacy of GPU-integrated MPI, with MPI-ACC as an example implementation, in a seismology modeling application and discuss the lessons learned. / Master of Science
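The seismology application above rests on explicit finite-difference time stepping; as a point of reference, the sketch below shows the kind of second-order wave-equation stencil such codes offload to the GPU (the kernel, coefficient, and boundary handling are illustrative and not taken from the application):

```cuda
// Explicit second-order finite-difference update of the 2D scalar wave
// equation u_tt = c^2 (u_xx + u_yy); one thread per interior grid point.
// Launch with e.g. dim3 block(16, 16) and a grid covering nx x ny points.
__global__ void waveStep(const float *u_prev, const float *u_curr,
                         float *u_next, int nx, int ny, float coef) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < 1 || iy < 1 || ix >= nx - 1 || iy >= ny - 1) return;  // boundaries handled elsewhere

    int idx = iy * nx + ix;
    float lap = u_curr[idx - 1] + u_curr[idx + 1]
              + u_curr[idx - nx] + u_curr[idx + nx]
              - 4.0f * u_curr[idx];
    // coef = (c * dt / h)^2 for grid spacing h and time step dt
    u_next[idx] = 2.0f * u_curr[idx] - u_prev[idx] + coef * lap;
}
```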
27

Coping Uncertainty in Wireless Network Optimization

Li, Shaoran 24 October 2022 (has links)
Network optimization plays an important role in 5G/next-G networks and requires knowledge of network parameters (e.g., channel state information). The majority of existing works assume that all network parameters are either given a priori or can be accurately estimated. However, in many practical scenarios, some parameters are uncertain at the time of allocating resources and can only be modeled by random variables. Further, we only have limited knowledge of those uncertain parameters. For instance, channel gains are not exactly known due to channel estimation errors, network delay, limited feedback, and a lack of cooperation (between networks). Therefore, a practical solution to network optimization must address such uncertainty inside wireless networks.

There are three approaches to address such network uncertainty: stochastic programming, worst-case optimization, and chance-constrained programming (CCP). Among the three, CCP has some unique benefits compared to the other two approaches. Stochastic programming explicitly requires full distribution knowledge, which is usually unavailable in practice. In comparison, CCP can work with various settings of available knowledge such as first- and second-order statistics, symmetry properties, or limited data samples. Therefore, CCP is more flexible in handling different network settings, which is important for addressing problems in 5G/next-G networks. Further, worst-case optimization assumes upper or lower bounds (i.e., worst cases) for the uncertain parameters and is known to be conservative due to its focus on extreme cases. In contrast, CCP allows occasional and controllable violations of some constraints and thus offers much better resource utilization than worst-case optimization. The only drawback of CCP is that it may lead to intractability due to its probabilistic formulation and limited knowledge of the underlying random variables. To date, CCP has not been well utilized in the wireless communication and networking community. The goal of this dissertation is to extend the state of the art of CCP techniques and address a number of challenging network optimization problems.

This dissertation is correspondingly organized into two parts. In the first part, we assume the uncertain parameters are known only by their mean and covariance (without distribution knowledge). We assume these statistics are stationary (i.e., time-invariant for a sufficiently long time) and thus can be accurately estimated. In this setting, we introduce a novel reformulation technique based on the mean and covariance to derive a solution. In the second part, we assume these statistics are time-varying and thus cannot be accurately estimated. In this setting, we employ limited data samples that are collected in a small time window and use them to derive a solution.

For the first part, we investigate four research problems based on the mean and covariance of the uncertain parameters:
- In the first problem, we study how to maximize spectrum efficiency in underlay coexistence. The interference from all secondary users to each primary user must be kept below a given threshold. However, there is much uncertainty about the channel gains between the primary users and the secondary users due to a lack of cooperation between them. We formulate probabilistic interference constraints using CCP for the primary users. For tractability, we introduce a novel and powerful reformulation technique called Exact Conic Reformulation (ECR). With limited knowledge of mean and covariance, ECR offers an equivalent reformulation of the intractable chance constraints into tractable deterministic constraints without relaxation errors. After reformulation, we apply linearization techniques to the mixed-integer non-linear problem to reduce the computational complexity. We show that our proposed approach can achieve near-optimal performance and stands as a performance benchmark for the underlay coexistence problem.
- To find a solution for the same underlay coexistence problem that can be used in the real world, we need to find a solution in "real time". The real-time requirement here refers to finding a solution in 125 us (the minimum time slot for small cells in 5G). Our proposed solution has three steps. First, it employs ECR to reformulate the original CCP into a deterministic optimization problem. Then it decomposes the problem and narrows down the search space into a smaller but promising one. By random sampling inside the promising search space and through local search, our proposed solution can meet the 125 us requirement in 5G while achieving 90% optimality on average.
- We further apply CCP, predicated on the reformulation technique ECR, to two other problems:
  * We study the problem of power control in concurrent transmissions. Our objective is to maximize energy efficiency for all transmitter-receiver pairs with capacity requirements. This problem is challenging due to mutual interference among different transmitter-receiver pairs and the uncertain channel gain between any transmitter and receiver. We formulate a CCP and reformulate it into a deterministic problem using ECR. Then we employ Geometric Programming (GP) with a tight approximation to derive a near-optimal solution.
  * We study task offloading in Mobile Edge Computing (MEC), where the number of processing cycles of a task is unknown until completion. The goal is to minimize the energy consumption of the users while meeting probabilistic deadlines for the tasks. We formulate the probabilistic deadlines as chance constraints and then use ECR to reformulate them into deterministic constraints. We propose a solution that consists of periodic scheduling and schedule updates to choose the offloaded tasks and task-to-processor assignments at the base station.

In the second part, we investigate two research problems based on limited data samples of the uncertain parameters:
- We study MU-MIMO beamforming based on Channel State Information (CSI). The goal is to derive a beamforming solution, minimizing power consumption at the base station (BS) while meeting the probabilistic data rate requirements of the users, by using very limited CSI data samples. For our CCP formulation, we explore the idea of a Wasserstein ambiguity set to quantify the distance between the true (but unknown) distribution and the empirical distribution based on the limited data samples. Our proposed solution, Data-Driven Beamforming (D^2BF), reformulates the CCP into a non-convex deterministic optimization problem based on the properties of the Wasserstein ambiguity set. Then D^2BF employs a novel convex approximation to the non-convex deterministic problem, which can be directly solved by commercial solvers.
- For a solution to the MU-MIMO beamforming problem to be useful in the real world, it must meet the "real-time" requirement. Here, the real-time requirement refers to 1 ms, which is one transmission time interval (TTI) under 5G numerology 0. We present ReDBeam, a Real-time Data-driven Beamforming solution for the MU-MIMO beamforming problem (minimizing power consumption while offering probabilistic data rate guarantees to the users) with limited CSI data samples. ReDBeam is a parallel algorithm and is purposefully designed to take advantage of the vast parallel processing capability offered by the GPU. ReDBeam generates a large number of initial solutions from a promising search space and then refines each solution by a local search. We show that ReDBeam meets the 1 ms real-time requirement on a commercial GPU and is orders of magnitude faster than other state-of-the-art algorithms for the same problem.

/ Doctor of Philosophy / Network optimization plays an important role in 5G/next-G networks. In a wireless network optimization problem, we typically want to maximize or minimize an objective function under a set of performance or resource constraints. Knowledge of network parameters is typically required in these problems. The majority of existing works assume that all network parameters are either given a priori or can be accurately estimated. However, in many practical scenarios, some parameters are uncertain in nature and cannot be accurately estimated beforehand. This dissertation addresses uncertainty in wireless network optimization using chance-constrained programming (CCP). CCP can work with limited knowledge of uncertain parameters, such as statistics or data samples, instead of full distribution information. In a CCP formulation, violations of certain target performance or requirement thresholds are expressed as probabilistic constraints, and the frequency of such violations is bounded through a risk parameter. By changing this risk level, CCP offers a unique trade-off between the guaranteed threshold violation probabilities and the achieved objective value. The only drawback of CCP is that it may lead to intractability due to its probabilistic formulation and limited knowledge of the underlying random variables. The goal of this dissertation is to extend the state of the art of CCP techniques to address a number of challenging network optimization problems. This dissertation is organized into two parts. In the first part, the mean and covariance of the uncertain parameters are assumed to be stationary and thus can be accurately estimated. Our main contribution is a novel reformulation technique for CCP called Exact Conic Reformulation (ECR). Based on knowledge of mean and covariance, ECR offers an equivalent reformulation of the intractable chance constraints into tractable deterministic constraints without relaxation errors. We apply CCP, predicated on ECR, to address three problems: (i) scheduling and power control in underlay coexistence; (ii) power control in concurrent transmissions; and (iii) task offloading in Mobile Edge Computing (MEC). For the first problem, we further address the "real-time" requirement and propose a solution that can meet the stringent timing requirement. In the second part, when the uncertain parameters are non-stationary and their statistics cannot be accurately estimated, we propose to employ limited data samples that are collected over a small window and use them to develop a solution. To demonstrate the efficacy of this approach, we investigate the MU-MIMO beamforming problem that minimizes the power consumption of the base station while providing probabilistic guarantees to users' data rates.
We further address the timing requirement for such a solution in practice, and present a real-time data-driven beamforming solution for MU-MIMO.
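To make the chance-constrained setting concrete, a generic uncertain linear constraint and the classical mean-and-covariance-based deterministic reformulation are shown below. This moment-based form is the standard, generally conservative textbook reformulation and is given only for illustration; the dissertation's ECR technique is reported to be exact, and its details are not reproduced here.

```latex
% Generic chance constraint on an uncertain linear constraint a^T x <= b,
% where only the mean \mu and covariance \Sigma of the random vector a are known.
\[
\Pr\{\mathbf{a}^{\top}\mathbf{x} \le b\} \ge 1 - \epsilon
\]
% A classical distributionally robust reformulation (worst case over all
% distributions with this mean and covariance, via Cantelli's inequality)
% is the deterministic second-order cone constraint
\[
\boldsymbol{\mu}^{\top}\mathbf{x}
  + \sqrt{\tfrac{1-\epsilon}{\epsilon}}\,
    \sqrt{\mathbf{x}^{\top}\boldsymbol{\Sigma}\,\mathbf{x}} \;\le\; b .
\]
```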
28

Otimização de multidões em jogos digitais utilizando CUDA (Crowd optimization in digital games using CUDA)

Bardella, Tiago Ungaro 19 October 2015 (has links)
Since their beginnings, digital games have featured many types of enemy models to confront and many types of characters to control, as in Real-Time Strategy games, for example. Large numbers of such models composing a single scene are called crowds. Crowds require high computational performance and specific algorithms to control their interactions, in order to avoid the loss of immersion caused by the problems that can arise when crowds are not handled properly. With the popularization of graphics-card programming languages such as NVIDIA CUDA, new algorithms have been created to improve the performance of crowds in digital games, and their clear superiority over sequential implementations has been demonstrated in many studies. The goal of this work is to build on these GPU techniques to implement a new API, using CUDA, that improves on existing crowd-handling algorithms for digital games in terms of performance and simplicity of implementation. With the project concluded, the resulting API makes crowd handling easier for digital game developers using the Unity3D game engine integrated with the TBX crowd-simulation API: they now only need to include a DLL in their project instead of writing their own crowd-handling algorithm from scratch, which would otherwise consume a large amount of development time.
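As an illustration of the per-agent work such a crowd API parallelizes, the sketch below advances each agent toward its goal while steering away from nearby agents using a brute-force O(n^2) neighbour check; the agent structure, names, and constants are hypothetical and not taken from the API described above:

```cuda
#include <cuda_runtime.h>
#include <math.h>

struct Agent { float x, y, gx, gy; };   // position and goal

// One thread per agent: seek the goal, repel from close neighbours.
// `in` and `out` must be distinct buffers (double buffering) so that
// every thread reads a consistent snapshot of the crowd.
__global__ void updateCrowd(const Agent *in, Agent *out, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Agent a = in[i];

    float vx = a.gx - a.x, vy = a.gy - a.y;          // seek goal
    for (int j = 0; j < n; ++j) {                    // brute-force separation
        if (j == i) continue;
        float dx = a.x - in[j].x, dy = a.y - in[j].y;
        float d2 = dx * dx + dy * dy + 1e-6f;
        if (d2 < 4.0f) { vx += dx / d2; vy += dy / d2; }
    }
    float len = sqrtf(vx * vx + vy * vy) + 1e-6f;
    out[i].x  = a.x + dt * vx / len;                 // unit-speed step
    out[i].y  = a.y + dt * vy / len;
    out[i].gx = a.gx;
    out[i].gy = a.gy;
}
```

A production implementation would replace the brute-force loop with a spatial grid or hierarchy to keep the per-step cost manageable for large crowds.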
29

Ray Tracing Bézier Surfaces on GPU

Löw, Joakim January 2006 (has links)
In this report, we show how to implement direct ray tracing of Bézier surfaces on graphics processing units (GPUs), in particular bicubic rectangular Bézier surfaces and nonparametric cubic Bézier triangles. We use Newton's method for the rectangular case and show how to use this method to find the ray-surface intersection. For Newton's method to work, we must build a spatial partitioning hierarchy around each surface patch, and in general, hierarchies are essential to speed up the process of ray tracing. We have chosen to use bounding box hierarchies and show how to implement stackless traversal of such a structure on a GPU. For the nonparametric triangular case, we show how to find the wanted intersection by simply solving a cubic polynomial. Because of the limited precision of current GPUs, we also propose a numerical approach to solve the problem, using a one-dimensional Newton search.
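For the nonparametric triangular case described above, the intersection reduces to a real root of a cubic; a minimal sketch of the one-dimensional Newton search mentioned in the abstract might look as follows (the coefficients would come from the ray/patch setup, which is not reproduced here, and the iteration count and tolerances are illustrative):

```cuda
// Find a root of f(t) = a*t^3 + b*t^2 + c*t + d near an initial guess t0
// with a fixed number of Newton iterations, as is typical on GPUs where
// bounded, divergence-free loops are preferred.
__device__ float newtonCubic(float a, float b, float c, float d, float t0) {
    float t = t0;
    for (int it = 0; it < 16; ++it) {
        float f  = ((a * t + b) * t + c) * t + d;      // Horner evaluation
        float df = (3.0f * a * t + 2.0f * b) * t + c;  // derivative
        if (fabsf(df) < 1e-12f) break;                 // avoid division blow-up
        float tn = t - f / df;
        if (fabsf(tn - t) < 1e-6f) return tn;          // converged
        t = tn;
    }
    return t;
}
```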
