Global ETD Search

21	Um ambiente para exploração de paralelismo na programação em lógica / A environment to explotation of parallelism in the logic programming Yamin, Adenauer Correa January 1994 (has links) Este trabalho e dedicado ao estudo da exploração de paralelismo na Programação em Lógica. O aspecto declarativo das linguagens de Programação em Lógica permite uma exploração eficiente do paralelismo implícito no código, de forma mais simples que as linguagens imperativas. Ao mesmo tempo, o paralelismo tem-se mostrado uma forte opção para procura de aumentos significativos do desempenho dos computadores. Como conseqüência, nos últimos anos, diversas maquinas paralelas tem surgido no mercado. No entanto, a sua efetiva utilização ainda ressente-se de uma dificuldade de programação maior que a das maquinas sequênciais. Por outro lado, o alto nível das linguagens de Programação em Lógica permite o desenvolvimento de programas de forma mais rápida e concisa do que as linguagens tradicionais (imperativas). Porem, apesar dos importantes progressos nas técnicas de compilação destas linguagens, elas permanecem menos eficientes que as linguagens imperativas. 0 aumento na eficiência de execução da Programação em Lógica, com o use do paralelismo, certamente estenderá o seu emprego. Em função disto, a unido da Programação em Lógica e maquinas paralelas tem sido proposta como uma alternativa para facilitar a programação das maquinas paralelas, bem como para aumentar o desempenho na Programação em Lógica. O ponto central do trabalho e a concepção de um modelo para exploração do paralelismo E Restrito na execução de Prolog, voltado para arquiteturas multiprocessadoras sem memória comum. Como ponto de partida foi utilizado o modelo já definido para exploração do paralelismo OU do projeto OPERA, do Instituto de Informática da UFRGS, de maneira que o modelo de paralelismo E proposto possa vir a compor, com aquele, uma plataforma que integre a exploração simultânea dos paralelismos E e OU. O modelo concebido compreende uma proposta de compilação e um ambiente de execução. A detecção e o controle do paralelismo é iniciado na compilação. Nesta fase, a gerada uma Expressão Condicional de Execução para cada clausula do programa Prolog, cuja avaliação em tempo de processamento determina a execução, em paralelo ou não, dos literais que compõem a clausula. A Maquina Abstrata Prolog, projetada para o emulador paralelo, é baseada na WAM (Warren Abstract Machine), uma das mais eficientes e difundidas técnicas para compilação Prolog. Isto, dentre outros aspectos, confere uma boa portabilidade ao modelo. O ambiente de execução compreende a concepção de uma arquitetura de processos formada por trabalhadores OPERA, uma filosofia de escalonamento de serviço entre estes trabalhadores, uma política para gerencia de sua memória e uma estratégia para as comunicações. Para validar o modelo proposto para exploração do paralelismo E, o mesmo foi implementado em rede local de estações Unix, obtendo bons resultados. / This work is devoted to the study of the exploration of parallelism in Logic Programming. The declarative aspect of the Logic Programming languages allows an efficient exploration of the implicit parallelism in the code, in a simpler form than the imperative languages. At the same time, parallelism has been shown as a strong option to the search for significant increases in the performance of the computers. As a consequence, in the last years, several parallel machines have been sprung up into the market. Nevertheless, their effective usefulness still undergoes some difficulties in programming which are greater than those of the sequential machines. On the other hand, the high level of Logic Programming languages allows programs development to be faster and concise than in the traditional languages (imperatives). However, despite the important progress in compiling techniques for these languages, they remain less efficient than the imperatives languages. The increase in execution efficiency of logic programs, with the use of parallelism, will probabily extend their use. Having this in mind, the union of the Logic Programming and parallel machines has been proposed as an alternative to make programming of the parallel machines easier, as well as to increase the performance of Logic Programming. The central aspect of the work is the conception of a model to explore the Restricted AND Parallelism in the execution of Prolog, turned to multiprocessing architectures without a common memory. As a starting point, the already defined model for exploring OR parallelism of the OPERA project, from the Instituto de Informatica da UFRGS was used. This happened so that the proposed model of AND parallelism can make up a plataform with that one to integrate the simultaneous exploration of the AND and OR parallelisms. The conceived model holds a proposal of compilation and execution environment. The detection and the control of the parallelism is started in the compilation. A Conditional Expression of Execution to each clause of the Prolog program is generated on this phase. Its evaluation, during the time of processing, determines the execution, whether or not in parallel, of the literals that constitute the clause. The Abstract Prolog Machine, projected for the parallel emulator, is based on the WAM (Warren Abstract Machine) which is one of the most efficient and spread techniques for Prolog compilation. This aspects, among others, gives a good portability to the model. The environmente of execution comprises the conception of an architecture of processes formed by OPERA workers and a philosophy of scheduling service among these workers; it also comprise a policy to manage its memory and a strategy for the communications. So that the proposed model for the exploitation of AND parallelism got validated, it was implemented on a local net of Unix workstations, obtaining good results. Linguagens : Programacao Prolog Processamento paralelo Programacao em logica Parallel processing Parallelism in logic programming AND parallelism OR parallelism Parallel prolog Computers architecture
22	Um ambiente para exploração de paralelismo na programação em lógica / A environment to explotation of parallelism in the logic programming Yamin, Adenauer Correa January 1994 (has links) Este trabalho e dedicado ao estudo da exploração de paralelismo na Programação em Lógica. O aspecto declarativo das linguagens de Programação em Lógica permite uma exploração eficiente do paralelismo implícito no código, de forma mais simples que as linguagens imperativas. Ao mesmo tempo, o paralelismo tem-se mostrado uma forte opção para procura de aumentos significativos do desempenho dos computadores. Como conseqüência, nos últimos anos, diversas maquinas paralelas tem surgido no mercado. No entanto, a sua efetiva utilização ainda ressente-se de uma dificuldade de programação maior que a das maquinas sequênciais. Por outro lado, o alto nível das linguagens de Programação em Lógica permite o desenvolvimento de programas de forma mais rápida e concisa do que as linguagens tradicionais (imperativas). Porem, apesar dos importantes progressos nas técnicas de compilação destas linguagens, elas permanecem menos eficientes que as linguagens imperativas. 0 aumento na eficiência de execução da Programação em Lógica, com o use do paralelismo, certamente estenderá o seu emprego. Em função disto, a unido da Programação em Lógica e maquinas paralelas tem sido proposta como uma alternativa para facilitar a programação das maquinas paralelas, bem como para aumentar o desempenho na Programação em Lógica. O ponto central do trabalho e a concepção de um modelo para exploração do paralelismo E Restrito na execução de Prolog, voltado para arquiteturas multiprocessadoras sem memória comum. Como ponto de partida foi utilizado o modelo já definido para exploração do paralelismo OU do projeto OPERA, do Instituto de Informática da UFRGS, de maneira que o modelo de paralelismo E proposto possa vir a compor, com aquele, uma plataforma que integre a exploração simultânea dos paralelismos E e OU. O modelo concebido compreende uma proposta de compilação e um ambiente de execução. A detecção e o controle do paralelismo é iniciado na compilação. Nesta fase, a gerada uma Expressão Condicional de Execução para cada clausula do programa Prolog, cuja avaliação em tempo de processamento determina a execução, em paralelo ou não, dos literais que compõem a clausula. A Maquina Abstrata Prolog, projetada para o emulador paralelo, é baseada na WAM (Warren Abstract Machine), uma das mais eficientes e difundidas técnicas para compilação Prolog. Isto, dentre outros aspectos, confere uma boa portabilidade ao modelo. O ambiente de execução compreende a concepção de uma arquitetura de processos formada por trabalhadores OPERA, uma filosofia de escalonamento de serviço entre estes trabalhadores, uma política para gerencia de sua memória e uma estratégia para as comunicações. Para validar o modelo proposto para exploração do paralelismo E, o mesmo foi implementado em rede local de estações Unix, obtendo bons resultados. / This work is devoted to the study of the exploration of parallelism in Logic Programming. The declarative aspect of the Logic Programming languages allows an efficient exploration of the implicit parallelism in the code, in a simpler form than the imperative languages. At the same time, parallelism has been shown as a strong option to the search for significant increases in the performance of the computers. As a consequence, in the last years, several parallel machines have been sprung up into the market. Nevertheless, their effective usefulness still undergoes some difficulties in programming which are greater than those of the sequential machines. On the other hand, the high level of Logic Programming languages allows programs development to be faster and concise than in the traditional languages (imperatives). However, despite the important progress in compiling techniques for these languages, they remain less efficient than the imperatives languages. The increase in execution efficiency of logic programs, with the use of parallelism, will probabily extend their use. Having this in mind, the union of the Logic Programming and parallel machines has been proposed as an alternative to make programming of the parallel machines easier, as well as to increase the performance of Logic Programming. The central aspect of the work is the conception of a model to explore the Restricted AND Parallelism in the execution of Prolog, turned to multiprocessing architectures without a common memory. As a starting point, the already defined model for exploring OR parallelism of the OPERA project, from the Instituto de Informatica da UFRGS was used. This happened so that the proposed model of AND parallelism can make up a plataform with that one to integrate the simultaneous exploration of the AND and OR parallelisms. The conceived model holds a proposal of compilation and execution environment. The detection and the control of the parallelism is started in the compilation. A Conditional Expression of Execution to each clause of the Prolog program is generated on this phase. Its evaluation, during the time of processing, determines the execution, whether or not in parallel, of the literals that constitute the clause. The Abstract Prolog Machine, projected for the parallel emulator, is based on the WAM (Warren Abstract Machine) which is one of the most efficient and spread techniques for Prolog compilation. This aspects, among others, gives a good portability to the model. The environmente of execution comprises the conception of an architecture of processes formed by OPERA workers and a philosophy of scheduling service among these workers; it also comprise a policy to manage its memory and a strategy for the communications. So that the proposed model for the exploitation of AND parallelism got validated, it was implemented on a local net of Unix workstations, obtaining good results. Linguagens : Programacao Prolog Processamento paralelo Programacao em logica Parallel processing Parallelism in logic programming AND parallelism OR parallelism Parallel prolog Computers architecture
23	Um ambiente para exploração de paralelismo na programação em lógica / A environment to explotation of parallelism in the logic programming Yamin, Adenauer Correa January 1994 (has links) Este trabalho e dedicado ao estudo da exploração de paralelismo na Programação em Lógica. O aspecto declarativo das linguagens de Programação em Lógica permite uma exploração eficiente do paralelismo implícito no código, de forma mais simples que as linguagens imperativas. Ao mesmo tempo, o paralelismo tem-se mostrado uma forte opção para procura de aumentos significativos do desempenho dos computadores. Como conseqüência, nos últimos anos, diversas maquinas paralelas tem surgido no mercado. No entanto, a sua efetiva utilização ainda ressente-se de uma dificuldade de programação maior que a das maquinas sequênciais. Por outro lado, o alto nível das linguagens de Programação em Lógica permite o desenvolvimento de programas de forma mais rápida e concisa do que as linguagens tradicionais (imperativas). Porem, apesar dos importantes progressos nas técnicas de compilação destas linguagens, elas permanecem menos eficientes que as linguagens imperativas. 0 aumento na eficiência de execução da Programação em Lógica, com o use do paralelismo, certamente estenderá o seu emprego. Em função disto, a unido da Programação em Lógica e maquinas paralelas tem sido proposta como uma alternativa para facilitar a programação das maquinas paralelas, bem como para aumentar o desempenho na Programação em Lógica. O ponto central do trabalho e a concepção de um modelo para exploração do paralelismo E Restrito na execução de Prolog, voltado para arquiteturas multiprocessadoras sem memória comum. Como ponto de partida foi utilizado o modelo já definido para exploração do paralelismo OU do projeto OPERA, do Instituto de Informática da UFRGS, de maneira que o modelo de paralelismo E proposto possa vir a compor, com aquele, uma plataforma que integre a exploração simultânea dos paralelismos E e OU. O modelo concebido compreende uma proposta de compilação e um ambiente de execução. A detecção e o controle do paralelismo é iniciado na compilação. Nesta fase, a gerada uma Expressão Condicional de Execução para cada clausula do programa Prolog, cuja avaliação em tempo de processamento determina a execução, em paralelo ou não, dos literais que compõem a clausula. A Maquina Abstrata Prolog, projetada para o emulador paralelo, é baseada na WAM (Warren Abstract Machine), uma das mais eficientes e difundidas técnicas para compilação Prolog. Isto, dentre outros aspectos, confere uma boa portabilidade ao modelo. O ambiente de execução compreende a concepção de uma arquitetura de processos formada por trabalhadores OPERA, uma filosofia de escalonamento de serviço entre estes trabalhadores, uma política para gerencia de sua memória e uma estratégia para as comunicações. Para validar o modelo proposto para exploração do paralelismo E, o mesmo foi implementado em rede local de estações Unix, obtendo bons resultados. / This work is devoted to the study of the exploration of parallelism in Logic Programming. The declarative aspect of the Logic Programming languages allows an efficient exploration of the implicit parallelism in the code, in a simpler form than the imperative languages. At the same time, parallelism has been shown as a strong option to the search for significant increases in the performance of the computers. As a consequence, in the last years, several parallel machines have been sprung up into the market. Nevertheless, their effective usefulness still undergoes some difficulties in programming which are greater than those of the sequential machines. On the other hand, the high level of Logic Programming languages allows programs development to be faster and concise than in the traditional languages (imperatives). However, despite the important progress in compiling techniques for these languages, they remain less efficient than the imperatives languages. The increase in execution efficiency of logic programs, with the use of parallelism, will probabily extend their use. Having this in mind, the union of the Logic Programming and parallel machines has been proposed as an alternative to make programming of the parallel machines easier, as well as to increase the performance of Logic Programming. The central aspect of the work is the conception of a model to explore the Restricted AND Parallelism in the execution of Prolog, turned to multiprocessing architectures without a common memory. As a starting point, the already defined model for exploring OR parallelism of the OPERA project, from the Instituto de Informatica da UFRGS was used. This happened so that the proposed model of AND parallelism can make up a plataform with that one to integrate the simultaneous exploration of the AND and OR parallelisms. The conceived model holds a proposal of compilation and execution environment. The detection and the control of the parallelism is started in the compilation. A Conditional Expression of Execution to each clause of the Prolog program is generated on this phase. Its evaluation, during the time of processing, determines the execution, whether or not in parallel, of the literals that constitute the clause. The Abstract Prolog Machine, projected for the parallel emulator, is based on the WAM (Warren Abstract Machine) which is one of the most efficient and spread techniques for Prolog compilation. This aspects, among others, gives a good portability to the model. The environmente of execution comprises the conception of an architecture of processes formed by OPERA workers and a philosophy of scheduling service among these workers; it also comprise a policy to manage its memory and a strategy for the communications. So that the proposed model for the exploitation of AND parallelism got validated, it was implemented on a local net of Unix workstations, obtaining good results. Linguagens : Programacao Prolog Processamento paralelo Programacao em logica Parallel processing Parallelism in logic programming AND parallelism OR parallelism Parallel prolog Computers architecture
24	Scaling managed runtime systems for future multicore hardware Ha, Jung Woo 27 August 2010 (has links) The exponential improvement in single processor performance has recently come to an end, mainly because clock frequency has reached its limit due to power constraints. Thus, processor manufacturers are choosing to enhance computing capabilities by placing multiple cores into a single chip, which can improve performance given parallel software. This paradigm shift to chip multiprocessors (also called multicore) requires scalable parallel applications that execute tasks on each core, otherwise the additional cores are worthless. Making an application scalable requires more than simply parallelizing the application code itself. Modern applications are written in managed languages, which require automatic memory management, type and memory abstractions, dynamic analysis and just-in-time (JIT) compilation. These managed runtime systems monitor and interact frequently with the executing application. Hence, the managed runtime itself must be scalable, and the instrumentation that monitors the application should not perturb its scalability. While multicore hardware forces a redesign of managed runtimes for scalability, it also provides opportunities when applications do not fully utilize all of the cores. Using available cores for concurrent helper threads that enhance the software, with debugging, security, and software support will make the runtime itself more capable and more scalable. This dissertation presents two novel techniques that improve the scalability of managed runtimes by utilizing unused cores. The first technique is a concurrent dynamic analysis framework that provides a low-overhead buffering mechanism called Cache-friendly Asymmetric Buffering (CAB) that quickly offloads data from the application to helper threads that perform specific dynamic analyses. Our framework minimizes application instrumentation overhead, prevents microarchitectural side-effects, and supports a variety of dynamic analysis clients, ranging from call graph and path profiling to cache simulation. The use of this framework ensures that helper threads perturb the performance of application as little as possible. Our second technique is concurrent trace-based just-in-time compilation, which exploits available cores for the JavaScript runtime. The JavaScript language limits applications to a single-thread, so extra cores are worthless unless they are used by the runtime components. We redesigned a production trace-based JIT compiler to run concurrently with the interpreter, and our technique is the first to improve both responsiveness and throughput in a trace-based JIT compiler. This thesis presents the design and implementation of both techniques and shows that they improve scalability and core utilization when running applications in managed runtimes. Industry is already adopting our approaches, which demonstrates the urgency of the scalable runtime problem and the utility of these techniques. / text Scalability Multicore Managed Language Runtime System Parallelism
25	Exploiting parallelism within multidimensional multirate digital signal processing systems Peng, Dongming 30 September 2004 (has links) The intense requirements for high processing rates of multidimensional Digital Signal Processing systems in practical applications justify the Application Specific Integrated Circuits designs and parallel processing implementations. In this dissertation, we propose novel theories, methodologies and architectures in designing high-performance VLSI implementations for general multidimensional multirate Digital Signal Processing systems by exploiting the parallelism within those applications. To systematically exploit the parallelism within the multidimensional multirate DSP algorithms, we develop novel transformations including (1) nonlinear I/O data space transforms, (2) intercalation transforms, and (3) multidimensional multirate unfolding transforms. These transformations are applied to the algorithms leading to systematic methodologies in high-performance architectural designs. With the novel design methodologies, we develop several architectures with parallel and distributed processing features for implementing multidimensional multirate applications. Experimental results have shown that those architectures are much more efficient in terms of execution time and/or hardware cost compared with existing hardware implementations. multirate signal processing multidmensional signal processing parallelism
26	Dynamic Data Race Detection for Structured Parallelism Raman, Raghavan 24 July 2013 (has links) With the advent of multicore processors and an increased emphasis on parallel computing, parallel programming has become a fundamental requirement for achieving available performance. Parallel programming is inherently hard because, to reason about the correctness of a parallel program, programmers have to consider large numbers of interleavings of statements in different threads in the program. Though structured parallelism imposes some restrictions on the programmer, it is an attractive approach because it provides useful guarantees such as deadlock-freedom. However, data races remain a challenging source of bugs in parallel programs. Data races may occur only in few of the possible schedules of a parallel program, thereby making them extremely hard to detect, reproduce, and correct. In the past, dynamic data race detection algorithms have suffered from at least one of the following limitations: some algorithms have a worst-case linear space and time overhead, some algorithms are dependent on a specific scheduling technique, some algorithms generate false positives and false negatives, some have no empirical evaluation as yet, and some require sequential execution of the parallel program. In this thesis, we introduce dynamic data race detection algorithms for structured parallel programs that overcome past limitations. We present a race detection algorithm called ESP-bags that requires the input program to be executed sequentially and another algorithm called SPD3 that can execute the program in parallel. While the ESP-bags algorithm addresses all the above mentioned limitations except sequential execution, the SPD3 algorithm addresses the issue of sequential execution by scaling well across highly parallel shared memory multiprocessors. Our algorithms incur constant space overhead per memory location and time overhead that is independent of the number of processors on which the programs execute. Our race detection algorithms support a rich set of parallel constructs (including async, finish, isolated, and future) that are found in languages such as HJ, X10, and Cilk. Our algorithms for async, finish, and future are precise and sound for a given input. In the presence of isolated, our algorithms are precise but not sound. Our experiments show that our algorithms (for async, finish, and isolated) perform well in practice, incurring an average slowdown of under 3x over the original execution time on a suite of 15 benchmarks. SPD3 is the first practical dynamic race detection algorithm for async-finish parallel programs that can execute the input program in parallel and use constant space per memory location. This takes us closer to our goal of building dynamic data race detectors that can be "always-on" when developing parallel applications. Structured Parallelism Data Races Program Analysis
27	Improving ILP with the Vectorized Computing Mechanism in VLIW DSP Architecture Yang, Te-Shin 25 June 2003 (has links) In order to improving the performance for real-time application, current digital signal processors use VLIW architectures to increase the degree of instruction level parallelism (ILP). Two factors will limit the ILP, one is enough hardware resource for all parallel instructions. Another is the dependence relations between instructions. This thesis designs a VLIW architecture processing core called DVBTDSP molded by FFT algorithm and uses the software pipelining mechanism to schedule the loop to achieve the highest ILP degree when used to execute FFT butterfly operations. Furthermore, in order to provide the smooth data stream for pipeline operations, we design a mechanism to improve the modulo addressing, which will collect the discrete vectors into one continuous vector. The simulation results show that the DVBTDSP has double performance of the C6200 for the FFT processing, and has good performance for FIR, IIR and DCT algorithm computing. VLIW vector computing instruction level parallelism
28	Core-characteristic-aware off-chip memory management in a multicore system-on-chip Jeong, Min Kyu 30 January 2013 (has links) Future processors will integrate an increasing number of cores because the scaling of single-thread performance is limited and because smaller cores are more power efficient. Off-chip memory bandwidth that is shared between those many cores, however, scales slower than the transistor (and core) count does. As a result, in many future systems, off-chip bandwidth will become the bottleneck of heavy demand from multiple cores. Therefore, optimally managing the limited off-chip bandwidth is critical to achieving high performance and efficiency in future systems. In this dissertation, I will develop techniques to optimize the shared use of limited off-chip memory bandwidth in chip-multiprocessors. I focus on issues that arise from the sharing and exploit the differences in memory access characteristics, such as locality, bandwidth requirement, and latency sensitivity, between the applications running in parallel and competing for the bandwidth. First, I investigate how the shared use of memory by many cores can result in reduced spatial locality in memory accesses. I propose a technique that partitions the internal memory banks between cores in order to isolate their access streams and eliminate locality interference. The technique compensates for the reduced bank-level parallelism of each thread by employing memory sub-ranking to effectively increase the number of independent banks. For three different workload groups that consist of benchmarks with high spatial locality, low spatial locality, and mixes of the two, the average system efficiency improves by 10%, 7%, 9% for 2-rank systems, and 18%, 25%, 20% for 1-rank systems, respectively, over the baseline shared-bank system. Next, I improve the performance of a heterogeneous system-on-chip (SoC) in which cores have distinct memory access characteristics. I develop a deadline-aware shared memory bandwidth management scheme for SoCs that have both CPU and GPU cores. I show that statically prioritizing the CPU can severely constrict GPU performance, and propose to dynamically adapt the priority of CPU and GPU memory requests based on the progress of GPU workload. The proposed dynamic bandwidth management scheme provides the target GPU performance while prioritizing CPU performance as much as possible, for any CPU-GPU workload combination with different complexities. / text Memory CMP Locality Parallelism SoC GPU QoS
29	Exploiting parallelism within multidimensional multirate digital signal processing systems Peng, Dongming 30 September 2004 (has links) The intense requirements for high processing rates of multidimensional Digital Signal Processing systems in practical applications justify the Application Specific Integrated Circuits designs and parallel processing implementations. In this dissertation, we propose novel theories, methodologies and architectures in designing high-performance VLSI implementations for general multidimensional multirate Digital Signal Processing systems by exploiting the parallelism within those applications. To systematically exploit the parallelism within the multidimensional multirate DSP algorithms, we develop novel transformations including (1) nonlinear I/O data space transforms, (2) intercalation transforms, and (3) multidimensional multirate unfolding transforms. These transformations are applied to the algorithms leading to systematic methodologies in high-performance architectural designs. With the novel design methodologies, we develop several architectures with parallel and distributed processing features for implementing multidimensional multirate applications. Experimental results have shown that those architectures are much more efficient in terms of execution time and/or hardware cost compared with existing hardware implementations. multirate signal processing multidmensional signal processing parallelism
30	Linking Scheme code to data-parallel CUDA-C code 2013 December 1900 (has links) In Compute Unified Device Architecture (CUDA), programmers must manage memory operations, synchronization, and utility functions of Central Processing Unit programs that control and issue data-parallel general purpose programs running on a Graphics Processing Unit (GPU). NVIDIA Corporation developed the CUDA framework to enable and develop data-parallel programs for GPUs to accelerate scientific and engineering applications by providing a language extension of C called CUDA-C. A foreign-function interface comprised of Scheme and CUDA-C constructs extends the Gambit Scheme compiler and enables linking of Scheme and data-parallel CUDA-C code to support high-performance parallel computation with reasonably low overhead in runtime. We provide six test cases — implemented both in Scheme and CUDA-C — in order to evaluate performance of our implementation in Gambit and to show 0–35% overhead in the usual case. Our work enables Scheme programmers to develop expressive programs that control and issue data-parallel programs running on GPUs, while also reducing hands-on memory management. Data parallelism GPGPU Scheme Skeletons CUDA Linking

Search results