Spelling suggestions: "subject:"3dgraphics aprocessing units."" "subject:"3dgraphics aprocessing knits.""
41 |
Modeling the performance of many-core programs on GPUs with advanced featuresPei, Mo Mo January 2012 (has links)
University of Macau / Faculty of Science and Technology / Department of Computer and Information Science
|
42 |
Flexible architecture methods for graphics processingDutton, Marcus 29 March 2011 (has links)
The FPGA GPU architecture proposed in this thesis was motivated by underserved markets for graphics processing that desire flexibility, long-term device availability, scalability, certifiability, and high reliability. These markets of industrial,
medical, and avionics applications often are forced to rely on the latest GPUs that were
actually designed for gaming PCs or handheld consumer devices.
The architecture for the GPU in this thesis was crafted specifically for an FPGA and therefore takes advantage of its capabilities while also avoiding its limitations. Previous work did not specifically exploit the FPGA's structures and instead used FPGA implementations merely as an integration platform prior to proceeding on to a final ASIC design. The target of an FPGA for this architecture is also important because its
flexibility and programmability allow the GPU's performance to be scaled or supplemented to fit unique application requirements. This tailoring of the architecture to specific requirements minimizes power consumption and device cost while still satisfying performance, certification, and device availability requirements.
To demonstrate the feasibility of the flexible FPGA GPU architectural concepts, the architecture is applied to an avionics application and analyzed to confirm satisfactory results. The architecture is further validated through the development of extensions to support more comprehensive graphics processing applications. In addition, the breadth of this research is illustrated through its applicability to general-purpose computations and more specifically, scientific visualizations.
|
43 |
Δημιουργία, μελέτη και βελτιστοποίηση φωτορεαλιστικών απεικονίσεων πραγματικού χρόνου με χρήση προγραμματιζόμενων επεξεργαστών γραφικώνΣταυρόπουλος, Ασημάκης 22 September 2009 (has links)
Οι προγραμματιζόμενοι επεξεργαστές γραφικών (Graphics Processing Units -
GPUs), είναι πανίσχυροι παράλληλοι επεξεργαστές και πλέον υπάρχουν σε κάθε
σύγχρονο προσωπικό υπολογιστή (PC). Οι GPUs αναλαμβάνουν κι επιταχύνουν την
σχεδίαση δισδιάστατων και τρισδιάστατων γραφικών στην οθόνη του υπολογιστή.
Η εξέλιξή τους είναι τόσο ραγδαία τα τελευταία χρόνια, που πλέον ξεπερνούν
σε πολυπλοκότητα τις σύγχρονες κεντρικές μονάδες επεξεργασίας (CPUs), ενώ
είναι ικανές να επιταχύνουν εκτός από γραφικά κι άλλες απαιτητικές σε
επεξεργαστική ισχύ εφαρμογές, όπως είναι η τεχνητή νοημοσύνη και η
προσομοίωση φυσικών αλληλεπιδράσεων μεταξύ αντικειμένων (συγκρούσεις,
εκρήξεις, προσομοίωση κίνησης υγρών) κ.α.
Σκοπός της συγκεκριμένης εργασίας είναι η δημιουργία, η μελέτη και η
βελτιστοποίηση αλγορίθμων σκίασης με χρήση GPUs. Ο όρος σκίαση (shading)
αναφέρεται στην αλληλεπίδραση του φωτός με τα αντικείμενα ενός εικονικού
περιβάλλοντος. Παρουσιάζονται τα εργαλεία (APIs) και οι γλώσσες
προγραμματισμού των GPUs καθώς και τρόποι βελτιστοποίησης της εκτέλεσης των
σκιάσεων που είναι ένα θέμα μείζονος σημασίας σε προσομοιώσεις πραγματικού
χρόνου. / Graphics processing units (GPUs), are powerful parallel processors and today are found in every modern Personal Computer (PC). The GPUs accelerate the drawing of two and three dimensional graphics on the monitor of the PCs. The evolution of this hardware is very rapid the last decade and today these circuits are more complex than CPUs. They are capable of accelerating many demanding applications except graphics, like Artificial Intelligence and Physics Simulation.
The purpose of this thesis is to implement, study and optimize the execution of shading algorithms that run on GPUs in real time. The term shading refers to the interactions between light and the material of every object in a virtual three dimensional environment. In this thesis we present the tools, the programming languages and techniques for optimizing the execution of the shaders which is a matter of major importance in real time simulations.
|
44 |
A portable relational algebra library for high performance data-intensive query processingSaeed, Ifrah 09 April 2014 (has links)
A growing number of industries are turning to data warehousing applications such as forecasting and risk assessment to process large volumes of data. These data warehousing applications, which utilize queries comprised of a mix of arithmetic and relational algebra (RA) operators, currently run on systems that utilize commodity multi-core CPUs. If we acknowledge the data-intensive nature of these applications, general purpose graphics processing units (GPUs) with high throughput and memory bandwidth seem to be natural candidates to host these applications. However, since such relational queries exhibit irregular parallelism and data accesses, their efficient implementation on GPUs remains challenging. Thus, although tailored solutions for individual processors using their native programming environments have evolved, these solutions are not accessible to other processors. This thesis addresses this problem by providing a portable implementation of RA, mathematical, and related primitives required to implement and accelerate relational queries over large data sets in the form of the library. These primitives can run on any modern multi- and many-core architecture that supports OpenCL, thereby enhancing the performance potential of such architectures for warehousing applications. In essence, this thesis describes the implementation of primitives and the results of their performance evaluation on a range of platforms and concludes with insights, the identification of opportunities, and lessons learned. One of the major insights from our analysis is that for complex relational queries, the time taken to transfer data between host CPUs and discrete GPUs can render the performance of discrete and integrated GPUs comparable in spite of the higher computing power and memory bandwidth of discrete GPUs. Therefore, data movement optimization is the key to eff ectively harnessing the high performance of discrete GPUs; otherwise, cost eff ectiveness would encourage the use of integrated GPUs. Furthermore, portability also enables the complete utilization of all GPUs and CPUs in the system at run time by opportunistically using any type of available processor when a kernel is ready for execution.
|
45 |
Accelerating computational diffusion MRI using Graphics Processing UnitsFernandez, Moises Hernandez January 2017 (has links)
Diffusion magnetic resonance imaging (dMRI) allows uniquely the study of the human brain non-invasively and in vivo. Advances in dMRI offer new insight into tissue microstructure and connectivity, and the possibility of investigating the mechanisms and pathology of neurological diseases. The great potential of the technique relies on indirect inference, as modelling frameworks are necessary to map dMRI measurements to neuroanatomical features. However, this mapping can be computationally expensive, particularly given the trend of increasing dataset sizes and/or the increased complexity in biophysical modelling. Limitations on computing can restrict data exploration and even methodology development. A step forward is to take advantage of the power offered by recent parallel computing architectures, especially Graphics Processing Units (GPUs). GPUs are massive parallel processors that offer trillions of floating point operations per second, and have made possible the solution of computationally intensive scientific problems that were intractable before. However, they are not inherently suited for all types of problems, and bespoke computational frameworks need to be developed in many cases to take advantage of their full potential. In this thesis, we propose parallel computational frameworks for the analysis of dMRI using GPUs within different contexts. We show that GPU-based designs can offer accelerations of more than two orders of magnitude for a number of scientific computing tasks with different parallelisability requirements, ranging from biophysical modelling for tissue microstructure estimation to white matter tractography for connectome generation. We develop novel and efficient GPUaccelerated solutions, including a framework that automatically generates GPU parallel code from a user-specified biophysical model. We also present a parallel GPU framework for performing probabilistic tractography and generating whole-brain connectomes. Throughout the thesis, we discuss several strategies for parallelising scientific applications, and we show the great potential of the accelerations obtained, which change the perspective of what is computationally feasible.
|
46 |
Adaptive signal processing for multichannel sound using high performance computingLorente Giner, Jorge 02 December 2015 (has links)
[EN] The field of audio signal processing has undergone a major development in recent years. Both the consumer and professional marketplaces continue to show growth in audio applications such as immersive audio schemes that offer optimal listening experience, intelligent noise reduction in cars or improvements in audio teleconferencing or hearing aids. The development of these applications has a common interest in increasing or improving the number of discrete audio channels, the quality of the audio or the sophistication of the algorithms. This often gives rise to problems of high computational cost, even when using common signal processing algorithms, mainly due to the application of these algorithms to multiple signals with real-time requirements. The field of High Performance Computing (HPC) based on low cost hardware elements is the bridge needed between the computing problems and the real multimedia signals and systems that lead to user's applications. In this sense, the present thesis goes a step further in the development of these systems by using the computational power of General Purpose Graphics Processing Units (GPGPUs) to exploit the inherent parallelism of signal processing for multichannel audio applications.
The increase of the computational capacity of the processing devices has been historically linked to the number of transistors in a chip. However, nowadays the improvements in the computational capacity are mainly given by increasing the number of processing units and using parallel processing. The Graphics Processing Units (GPUs), which have now thousands of computing cores, are a representative example. The GPUs were traditionally used to graphic or image processing, but new releases in the GPU programming environments such as CUDA have allowed the use of GPUS for general processing applications. Hence, the use of GPUs is being extended to a wide variety of intensive-computation applications among which audio processing is included. However, the data transactions between the CPU and the GPU and viceversa have questioned the viability of the use of GPUs for audio applications in which real-time interaction between microphones and loudspeakers is required. This is the case of the adaptive filtering applications, where an efficient use of parallel computation in not straightforward. For these reasons, up to the beginning of this thesis, very few publications had dealt with the GPU implementation of real-time acoustic applications based on adaptive filtering. Therefore, this thesis aims to demonstrate that GPUs are totally valid tools to carry out audio applications based on adaptive filtering that require high computational resources. To this end, different adaptive applications in the field of audio processing are studied and performed using GPUs. This manuscript also analyzes and solves possible limitations in each GPU-based implementation both from the acoustic point of view as from the computational point of view. / [ES] El campo de procesado de señales de audio ha experimentado un desarrollo importante en los últimos años. Tanto el mercado de consumo como el profesional siguen mostrando un crecimiento en aplicaciones de audio, tales como: los sistemas de audio inmersivo que ofrecen una experiencia de sonido óptima, los sistemas inteligentes de reducción de ruido en coches o las mejoras en sistemas de teleconferencia o en audífonos. El desarrollo de estas aplicaciones tiene un propósito común de aumentar o mejorar el número de canales de audio, la propia calidad del audio o la sofisticación de los algoritmos. Estas mejoras suelen dar lugar a sistemas de alto coste computacional, incluso usando algoritmos comunes de procesado de señal. Esto se debe principalmente a que los algoritmos se suelen aplicar a sistemas multicanales con requerimientos de procesamiento en tiempo real. El campo de la Computación de Alto Rendimiento basado en elementos hardware de bajo coste es el puente necesario entre los problemas de computación y los sistemas multimedia que dan lugar a aplicaciones de usuario. En este sentido, la presente tesis va un paso más allá en el desarrollo de estos sistemas mediante el uso de la potencia de cálculo de las Unidades de Procesamiento Gráfico (GPU) en aplicaciones de propósito general. Con ello, aprovechamos la inherente capacidad de paralelización que poseen las GPU para procesar señales de audio y obtener aplicaciones de audio multicanal.
El aumento de la capacidad computacional de los dispositivos de procesado ha estado vinculado históricamente al número de transistores que había en un chip. Sin embargo, hoy en día, las mejoras en la capacidad computacional se dan principalmente por el aumento del número de unidades de procesado y su uso para el procesado en paralelo. Las GPUs son un ejemplo muy representativo. Hoy en día, las GPUs poseen hasta miles de núcleos de computación. Tradicionalmente, las GPUs se han utilizado para el procesado de gráficos o imágenes. Sin embargo, la aparición de entornos sencillos de programación GPU, como por ejemplo CUDA, han permitido el uso de las GPU para aplicaciones de procesado general. De ese modo, el uso de las GPU se ha extendido a una amplia variedad de aplicaciones que requieren cálculo intensivo. Entre esta gama de aplicaciones, se incluye el procesado de señales de audio. No obstante, las transferencias de datos entre la CPU y la GPU y viceversa pusieron en duda la viabilidad de las GPUs para aplicaciones de audio en las que se requiere una interacción en tiempo real entre micrófonos y altavoces. Este es el caso de las aplicaciones basadas en filtrado adaptativo, donde el uso eficiente de la computación en paralelo no es sencillo. Por estas razones, hasta el comienzo de esta tesis, había muy pocas publicaciones que utilizaran la GPU para implementaciones en tiempo real de aplicaciones acústicas basadas en filtrado adaptativo. A pesar de todo, esta tesis pretende demostrar que las GPU son herramientas totalmente válidas para llevar a cabo aplicaciones de audio basadas en filtrado adaptativo que requieran elevados recursos computacionales. Con este fin, la presente tesis ha estudiado y desarrollado varias aplicaciones adaptativas de procesado de audio utilizando una GPU como procesador. Además, también analiza y resuelve las posibles limitaciones de cada aplicación tanto desde el punto de vista acústico como desde el punto de vista computacional. / [CA] El camp del processament de senyals d'àudio ha experimentat un desenvolupament important als últims anys. Tant el mercat de consum com el professional segueixen mostrant un creixement en aplicacions d'àudio, com ara: els sistemes d'àudio immersiu que ofereixen una experiència de so òptima, els sistemes intel·ligents de reducció de soroll en els cotxes o les millores en sistemes de teleconferència o en audiòfons. El desenvolupament d'aquestes aplicacions té un propòsit comú d'augmentar o millorar el nombre de canals d'àudio, la pròpia qualitat de l'àudio o la sofisticació dels algorismes que s'utilitzen. Això, sovint dóna lloc a sistemes d'alt cost computacional, fins i tot quan es fan servir algorismes comuns de processat de senyal. Això es deu principalment al fet que els algorismes se solen aplicar a sistemes multicanals amb requeriments de processat en temps real. El camp de la Computació d'Alt Rendiment basat en elements hardware de baix cost és el pont necessari entre els problemes de computació i els sistemes multimèdia que donen lloc a aplicacions d'usuari. En aquest sentit, aquesta tesi va un pas més enllà en el desenvolupament d'aquests sistemes mitjançant l'ús de la potència de càlcul de les Unitats de Processament Gràfic (GPU) en aplicacions de propòsit general. Amb això, s'aprofita la inherent capacitat de paral·lelització que posseeixen les GPUs per processar senyals d'àudio i obtenir aplicacions d'àudio multicanal.
L'augment de la capacitat computacional dels dispositius de processat ha estat històricament vinculada al nombre de transistors que hi havia en un xip. No obstant, avui en dia, les millores en la capacitat computacional es donen principalment per l'augment del nombre d'unitats de processat i el seu ús per al processament en paral·lel. Un exemple molt representatiu són les GPU, que avui en dia posseeixen milers de nuclis de computació. Tradicionalment, les GPUs s'han utilitzat per al processat de gràfics o imatges. No obstant, l'aparició d'entorns senzills de programació de la GPU com és CUDA, han permès l'ús de les GPUs per a aplicacions de processat general. D'aquesta manera, l'ús de les GPUs s'ha estès a una àmplia varietat d'aplicacions que requereixen càlcul intensiu. Entre aquesta gamma d'aplicacions, s'inclou el processat de senyals d'àudio. No obstant, les transferències de dades entre la CPU i la GPU i viceversa van posar en dubte la viabilitat de les GPUs per a aplicacions d'àudio en què es requereix la interacció en temps real de micròfons i altaveus. Aquest és el cas de les aplicacions basades en filtrat adaptatiu, on l'ús eficient de la computació en paral·lel no és senzilla. Per aquestes raons, fins al començament d'aquesta tesi, hi havia molt poques publicacions que utilitzessin la GPU per implementar en temps real aplicacions acústiques basades en filtrat adaptatiu. Malgrat tot, aquesta tesi pretén demostrar que les GPU són eines totalment vàlides per dur a terme aplicacions d'àudio basades en filtrat adaptatiu que requereixen alts recursos computacionals. Amb aquesta finalitat, en la present tesi s'han estudiat i desenvolupat diverses aplicacions adaptatives de processament d'àudio utilitzant una GPU com a processador. A més, aquest manuscrit també analitza i resol les possibles limitacions de cada aplicació, tant des del punt de vista acústic, com des del punt de vista computacional. / Lorente Giner, J. (2015). Adaptive signal processing for multichannel sound using high performance computing [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/58427
|
47 |
GPU acceleration of matrix-based methods in computational electromagneticsLezar, Evan 03 1900 (has links)
Thesis (PhD (Electrical and Electronic Engineering))--University of Stellenbosch, 2011. / ENGLISH ABSTRACT: This work considers the acceleration of matrix-based computational electromagnetic (CEM)
techniques using graphics processing units (GPUs). These massively parallel processors have
gained much support since late 2006, with software tools such as CUDA and OpenCL greatly
simplifying the process of harnessing the computational power of these devices. As with any
advances in computation, the use of these devices enables the modelling of more complex problems,
which in turn should give rise to better solutions to a number of global challenges faced
at present.
For the purpose of this dissertation, CUDA is used in an investigation of the acceleration
of two methods in CEM that are used to tackle a variety of problems. The first of these is the
Method of Moments (MOM) which is typically used to model radiation and scattering problems,
with the latter begin considered here. For the CUDA acceleration of the MOM presented here,
the assembly and subsequent solution of the matrix equation associated with the method are
considered. This is done for both single and double precision
oating point matrices.
For the solution of the matrix equation, general dense linear algebra techniques are used,
which allow for the use of a vast expanse of existing knowledge on the subject. This also means
that implementations developed here along with the results presented are immediately applicable
to the same wide array of applications where these methods are employed.
Both the assembly and solution of the matrix equation implementations presented result in
signi cant speedups over multi-core CPU implementations, with speedups of up to 300x and
10x, respectively, being measured. The implementations presented also overcome one of the
major limitations in the use of GPUs as accelerators (that of limited memory capacity) with
problems up to 16 times larger than would normally be possible being solved.
The second matrix-based technique considered is the Finite Element Method (FEM), which
allows for the accurate modelling of complex geometric structures including non-uniform dielectric
and magnetic properties of materials, and is particularly well suited to handling bounded
structures such as waveguide. In this work the CUDA acceleration of the cutoff and dispersion
analysis of three waveguide configurations is presented. The modelling of these problems using
an open-source software package, FEniCS, is also discussed.
Once again, the problem can be approached from a linear algebra perspective, with the
formulation in this case resulting in a generalised eigenvalue (GEV) problem. For the problems
considered, a total solution speedup of up to 7x is measured for the solution of the generalised
eigenvalue problem, with up to 22x being attained for the solution of the standard eigenvalue
problem that forms part of the GEV problem. / AFRIKAANSE OPSOMMING: In hierdie werkstuk word die versnelling van matriksmetodes in numeriese elektromagnetika
(NEM) deur die gebruik van grafiese verwerkingseenhede (GVEe) oorweeg. Die gebruik van
hierdie verwerkingseenhede is aansienlik vergemaklik in 2006 deur sagteware pakette soos CUDA
en OpenCL. Hierdie toestelle, soos ander verbeterings in verwerkings vermoe, maak dit moontlik
om meer komplekse probleme op te los. Hierdie stel wetenskaplikes weer in staat om globale
uitdagings beter aan te pak.
In hierdie proefskrif word CUDA gebruik om ondersoek in te stel na die versnelling van twee
metodes in NEM, naamlik die Moment Metode (MOM) en die Eindige Element Metode (EEM).
Die MOM word tipies gebruik om stralings- en weerkaatsingsprobleme op te los. Hier word slegs
na die weerkaatsingsprobleme gekyk. CUDA word gebruik om die opstel van die MOM matriks
en ook die daaropvolgende oplossing van die matriksvergelyking wat met die metode gepaard
gaan te bespoedig.
Algemene digte lineere algebra tegnieke word benut om die matriksvergelykings op te los.
Dit stel die magdom bestaande kennis in die vagebied beskikbaar vir die oplossing, en gee ook
aanleiding daartoe dat enige implementasies wat ontwikkel word en resultate wat verkry word
ook betrekking het tot 'n wye verskeidenheid probleme wat die lineere algebra metodes gebruik.
Daar is gevind dat beide die opstelling van die matriks en die oplossing van die matriksvergelyking
aansienlik vinniger is as veelverwerker SVE implementasies. 'n Verselling van tot 300x
en 10x onderkeidelik is gemeet vir die opstel en oplos fases. Die hoeveelheid geheue beskikbaar
tot die GVE is een van die belangrike beperkinge vir die gebruik van GVEe vir groot probleme.
Hierdie beperking word hierin oorkom en probleme wat selfs 16 keer groter is as die GVE se
beskikbare geheue word geakkommodeer en suksesvol opgelos.
Die Eindige Element Metode word op sy beurt gebruik om komplekse geometriee asook nieuniforme
materiaaleienskappe te modelleer. Die EEM is ook baie geskik om begrensde strukture
soos golfgeleiers te hanteer. Hier word CUDA gebruik of om die afsny- en dispersieanalise van
drie gol
eierkonfigurasies te versnel. Die implementasie van hierdie probleme word gedoen deur
'n versameling oopbronkode wat bekend staan as FEniCS, wat ook hierin bespreek word.
Die probleme wat ontstaan in die EEM kan weereens vanaf 'n lineere algebra uitganspunt
benader word. In hierdie geval lei die formulering tot 'n algemene eiewaardeprobleem. Vir die
gol
eier probleme wat ondersoek word is gevind dat die algemene eiewaardeprobleem met tot 7x
versnel word. Die standaard eiewaardeprobleem wat 'n stap is in die oplossing van die algemene
eiewaardeprobleem is met tot 22x versnel.
|
48 |
Modeling Multi-factor Financial Derivatives by a Partial Differential Equation Approach with Efficient Implementation on Graphics Processing UnitsDang, Duy Minh 15 November 2013 (has links)
This thesis develops efficient modeling frameworks via a Partial Differential Equation (PDE) approach for multi-factor financial derivatives, with emphasis on three-factor models, and studies highly efficient implementations of the numerical methods on novel high-performance computer architectures, with particular focus on Graphics Processing Units (GPUs) and multi-GPU platforms/clusters of GPUs. Two important classes of multi-factor financial instruments are considered: cross-currency/foreign exchange (FX) interest rate derivatives and multi-asset options. For cross-currency interest rate derivatives, the focus of the thesis is on Power Reverse Dual Currency (PRDC) swaps with three of the most popular exotic features, namely Bermudan cancelability, knockout, and FX Target Redemption. The modeling of PRDC swaps using one-factor Gaussian models for the domestic and foreign interest short rates, and a one-factor skew model for the spot FX rate results in a time-dependent parabolic PDE in three space dimensions. Our proposed PDE pricing framework is based on partitioning the pricing problem into several independent pricing subproblems over each time period of the swap's tenor structure, with possible communication at the end of the time period. Each of these subproblems requires a solution of the model PDE. We then develop a highly efficient GPU-based parallelization of the Alternating Direction Implicit (ADI) timestepping methods for solving the model PDE. To further handle the substantially increased computational requirements due to the exotic features, we extend the pricing procedures to multi-GPU platforms/clusters of GPUs to solve each of these independent subproblems on a separate GPU. Numerical results indicate that the proposed GPU-based parallel numerical methods are highly efficient and provide significant increase in performance over CPU-based methods when pricing PRDC swaps. An analysis of the impact of the FX volatility skew on the price of PRDC swaps is provided.
In the second part of the thesis, we develop efficient pricing algorithms for multi-asset options under the Black-Scholes-Merton framework, with strong emphasis on multi-asset American options. Our proposed pricing approach is built upon a combination of (i) a discrete penalty approach for the linear complementarity problem arising due to the free boundary and (ii) a GPU-based parallel ADI Approximate Factorization technique for the solution of the linear algebraic system arising from each penalty iteration. A timestep size selector implemented efficiently on GPUs is used to further increase the efficiency of the methods. We demonstrate the efficiency and accuracy of the proposed GPU-based parallel numerical methods by pricing American options written on three assets.
|
49 |
Accélérateurs logiciels et matériels pour l'algèbre linéaire creuse sur les corps finis / Hardware and Software Accelerators for Sparse Linear Algebra over Finite FieldsJeljeli, Hamza 16 July 2015 (has links)
Les primitives de la cryptographie à clé publique reposent sur la difficulté supposée de résoudre certains problèmes mathématiques. Dans ce travail, on s'intéresse à la cryptanalyse du problème du logarithme discret dans les sous-groupes multiplicatifs des corps finis. Les algorithmes de calcul d'index, utilisés dans ce contexte, nécessitent de résoudre de grands systèmes linéaires creux définis sur des corps finis de grande caractéristique. Cette algèbre linéaire représente dans beaucoup de cas le goulot d'étranglement qui empêche de cibler des tailles de corps plus grandes. L'objectif de cette thèse est d'explorer les éléments qui permettent d'accélérer cette algèbre linéaire sur des architectures pensées pour le calcul parallèle. On est amené à exploiter le parallélisme qui intervient dans différents niveaux algorithmiques et arithmétiques et à adapter les algorithmes classiques aux caractéristiques des architectures utilisées et aux spécificités du problème. Dans la première partie du manuscrit, on présente un rappel sur le contexte du logarithme discret et des architectures logicielles et matérielles utilisées. La seconde partie du manuscrit est consacrée à l'accélération de l'algèbre linéaire. Ce travail a donné lieu à deux implémentations de résolution de systèmes linéaires basées sur l'algorithme de Wiedemann par blocs : une implémentation adaptée à un cluster de GPU NVIDIA et une implémentation adaptée à un cluster de CPU multi-cœurs. Ces implémentations ont contribué à la réalisation de records de calcul de logarithme discret dans les corps binaires GF(2^{619}) et GF(2^{809} et dans le corps premier GF(p_{180}) / The security of public-key cryptographic primitives relies on the computational difficulty of solving some mathematical problems. In this work, we are interested in the cryptanalysis of the discrete logarithm problem over the multiplicative subgroups of finite fields. The index calculus algorithms, which are used in this context, require solving large sparse systems of linear equations over finite fields. This linear algebra represents a serious limiting factor when targeting larger fields. The object of this thesis is to explore all the elements that accelerate this linear algebra over parallel architectures. We need to exploit the different levels of parallelism provided by these computations and to adapt the state-of-the-art algorithms to the characteristics of the considered architectures and to the specificities of the problem. In the first part of the manuscript, we present an overview of the discrete logarithm context and an overview of the considered software and hardware architectures. The second part deals with accelerating the linear algebra. We developed two implementations of linear system solvers based on the block Wiedemann algorithm: an NVIDIA-GPU-based implementation and an implementation adapted to a cluster of multi-core CPU. These implementations contributed to solving the discrete logarithm problem in binary fields GF(2^{619}) et GF(2^{809}) and in the prime field GF(p_{180})
|
50 |
Efficient Compilation Of Stream Programs Onto Multi-cores With AcceleratorsUdupa, Abhishek 07 1900 (has links)
Over the past two decades, microprocessor manufacturers have typically relied on wider issue widths and deeper pipelines to obtain performance improvements for single threaded applications. However, in the recent years, with power dissipation and wire delays becoming primary design constraints, this approach can no longer be effectively used to yield performance improvements. Thus process designers and vendors are universally moving towards multi-core designs. Examples for these are the commodity general purpose multi-core processors, the CellBE accelerator from IBM and the Graphics Processing Units from NVIDIA and ATI. Although these many and multi-core architectures can provide enormous performance benefits, it is difficult to program for them due to the complexity of writing explicitly parallel code. The ubiquity of computationally intensive media processing applications makes it imperative to consider new programming frameworks and languages that can express parallelism in an easy, portable manner. The StreamIt programming language has been proposed to efficiently exploit parallelism at various levels on general purpose multi-core architectures and stream processors and allow media processing and DSP application to be developed in an easy and portable fashion. The StreamIt model allows programmers to specify a program as a set of filters connected by FIFO communication channels. The graphs thus specified by the StreamIt programs describe task, data and pipeline parallelism which can be potentially exploited on modern Graphics Processing Units (GPUs), which have emerged as powerful, commodity stream processors, which support abundant parallelism in hardware.
The first part of this thesis deals with the challenges in mapping StreamIt programs to GPUs and proposes an efficient technique to software pipeline the execution of stream Programs on GPUs. We formulate this problem—both scheduling and assignment of filters to processors—as an efficient Integer Linear Program(ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipeline parallelism. We have evaluated our approach on a platform equipped with an NVIDIA GeForce 8800 GTS 512 GPU and our approach yields a (geometric) mean speedup of 5.02X, with a maximum speedup of 36.83X across a set of StreamIt benchmarks, with the speedup measured relative to an optimized single threaded CPU execution.
While the approach of software pipelining the execution of stream programs on GPUs is efficient and performs well, it does not utilize the CPU cores to perform useful computation. Further, it does not support programs with stateful filters, which are essentially filters that are not data parallel owing to a dependence between each successive firing that is carried through the implicit state of the filter. The second part of the thesis aims at addressing these issues and describes a novel method to orchestrate the execution of a StreamIt program on the multiple cores of a system and GPUs in a synergistic manner. The proposed approach identifies, using profiling, the relative benefits of executing a task on the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers, the limited DMA bandwidth available and the required buffer layout transformations associated with the partitioning, as an integrated Integer Linear Program(ILP) which can then be solved by an ILP solver. Since solving an ILP is NP-Hard in the general case and may thus require a large amount of time, we also propose an efficient heuristic algorithm for the work partitioning between the CPU and the GPU, which provides solutions which are within 9.05% of the optimal solutions to the ILP formulation on an average across the benchmark suite, while requiring 2–3 orders of magnitude less time than the ILP approach. The partitioned tasks are then software pipelined to execute on the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with eight CPU cores, out of which four were used, and a GeForce 8800 GTS512 GPU show a(geometric) mean speed up of 6.84X with a maximum of 51.96X over a single threaded CPU execution across a set of StreamIt benchmarks.
|
Page generated in 0.0945 seconds