51

Support for Send-and-Receive Based Message-Passing for the Single-Chip Message-Passing Architecture

Lewis, Charles William Jr. 06 May 2004 (has links)
Arguably, from the programmer's perspective, the programming model is the most important characteristic of any computer system. Perhaps this explains why, after many decades of research, architects and programmers alike continue to debate the appropriate programming model for parallel computers. Though thousands of programming models have been developed, standards such as PVM and MPI have made send-and-receive based message-passing the most popular programming model for distributed memory architectures. This thesis explores modifying the Single-Chip Message-Passing (SCMP) architecture to more efficiently support send-and-receive based message-passing. The proposed system is compared, for performance and programmability, to the active messaging programming model currently used by SCMP. SCMP offers a unique platform for send-and-receive based message-passing. The SCMP design incorporates multiple multi-threaded processors, memory, and a network onto a single chip. This integration reduces the penalties of thread switching, memory access, and inter-process communication typically seen on more traditional distributed memory parallel machines. The mechanisms proposed in this thesis to support send-and-receive based message-passing on SCMP attempt to preserve and exploit these features as much as possible. / Master of Science
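
For illustration, the send-and-receive style the thesis targets is the one MPI standardizes: a blocking send on one process matched by a blocking receive on another. A minimal sketch using mpi4py (an assumption for illustration; the thesis targets SCMP hardware, not MPI on Python):

```python
# Minimal send-and-receive message-passing in the MPI style (illustrative;
# not the thesis's SCMP mechanism). Run with: mpiexec -n 2 python demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Sender: the message is matched by (source, tag) on the receiver side.
    comm.send({"payload": [1, 2, 3]}, dest=1, tag=42)
elif rank == 1:
    # Receiver blocks until a matching message arrives.
    msg = comm.recv(source=0, tag=42)
    print("received:", msg)
```

The contrast with active messages, SCMP's current model, is that here the receiver explicitly posts a receive rather than running a handler on message arrival.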
52

A CFD/CSD Interaction Methodology for Aircraft Wings

Bhardwaj, Manoj K. 15 October 1997 (has links)
With advanced subsonic transports and military aircraft operating in the transonic regime, it is becoming important to determine the effects of the coupling between aerodynamic loads and elastic forces. Since aeroelastic effects can contribute significantly to the design of these aircraft, there is a strong need in the aerospace industry to predict these aero-structure interactions computationally. To perform static aeroelastic analysis in the transonic regime, high fidelity computational fluid dynamics (CFD) analysis tools must be used in conjunction with high fidelity computational structural dynamics (CSD) analysis tools due to the nonlinear behavior of the aerodynamics in the transonic regime. There is also a need to be able to use a wide variety of CFD and CSD tools to predict these aeroelastic effects in the transonic regime. Because source codes are not always available, it is necessary to couple the CFD and CSD codes without alteration of the source codes. In this study, an aeroelastic coupling procedure is developed which will perform static aeroelastic analysis using any CFD and CSD code with little code integration. The aeroelastic coupling procedure is demonstrated on an F/A-18 Stabilator using NASTD (an in-house McDonnell Douglas CFD code) and NASTRAN. In addition, the Aeroelastic Research Wing (ARW-2) is used for demonstration of the aeroelastic coupling procedure by using ENSAERO (NASA Ames Research Center CFD code) and a finite element wing-box code (developed as a part of this research). The results obtained from the present study are compared with those available from an experimental study conducted at NASA Langley Research Center and a study conducted at NASA Ames Research Center using ENSAERO and modal superposition. The results compare well with experimental data. In addition, parallel computing power is used to investigate parallel static aeroelastic analysis because obtaining an aeroelastic solution using CFD/CSD methods is computationally intensive. A parallel finite element wing-box code is developed and coupled with an existing parallel Euler code to perform static aeroelastic analysis. A typical wing-body configuration is used to investigate the applicability of parallel computing to this analysis. Performance of the parallel aeroelastic analysis is shown to be poor; however, with advances being made in the arena of parallel computing, there is definitely a need to continue research in this area. / Ph. D.
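
The loosely coupled procedure described here can be pictured as a fixed-point iteration between two black-box solvers. The sketch below is a schematic only: `run_cfd`, `run_csd`, and the transfer operators are placeholder stubs, not the thesis's actual interfaces.

```python
import numpy as np

# Stubs standing in for the external solvers (hypothetical; a real run
# would invoke the CFD/CSD codes without touching their sources).
def run_cfd(shape):            return 0.1 * np.tanh(shape)   # fake surface loads
def loads_to_structure(loads): return loads                  # load transfer
def run_csd(loads):            return 0.5 * loads            # fake static FE solve
def displacements_to_aero(d):  return d                      # displacement transfer

def static_aeroelastic_coupling(shape0, tol=1e-8, max_iter=50):
    """Fixed-point loop: aero loads deform the wing, the deformed wing
    changes the loads, until the wing shape stops changing."""
    shape = shape0.copy()
    for _ in range(max_iter):
        displacements = run_csd(loads_to_structure(run_cfd(shape)))
        new_shape = shape0 + displacements_to_aero(displacements)
        if np.linalg.norm(new_shape - shape) < tol:   # converged
            return new_shape
        shape = new_shape
    return shape

print(static_aeroelastic_coupling(np.linspace(0.0, 1.0, 5)))
```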
53

Accelerating Computations in Theoretical Chemistry: The Example of Graphics Processors

Rubez, Gaëtan 06 December 2018 (has links)
In this research work we are interested in the use of the manycore technology of graphics cards in the framework of approaches from the field of theoretical chemistry. We argue that theoretical chemistry needs to be able to take advantage of this technology, and we show both the feasibility and the limits of using graphics cards in this field through the GPU porting of two molecular-modelling methods, which are later to be integrated into the molecular docking program AlgoGen. Both acceleration and energy performance were examined in this work. The first porting effort is based on the NCIplot program, distributed since 2011 by Julia CONTRERAS-GARCIA, which implements the NCI methodology published in 2010 for detecting and characterizing non-covalent interactions in a chemical system. The NCI approach proves to be an ideal candidate for graphics cards, as demonstrated by our analysis of the NCIplot program and by the performance achieved by our GPU implementations. Our best implementation (VHY) shows acceleration factors of up to 100 compared with the NCIplot program; we currently distribute it freely as the cuNCI program. The second GPU porting effort is based on GAMESS-US, a free competitor of GAUSSIAN and a complex software package of international reach that implements many quantum methods. We were interested in the combined DFTB/FMO/PCM method for the quantum computation of the potential energy of a complex, and worked on the part of the program that computes the solvent effect. This case proves less favourable to the use of graphics cards; nevertheless we were able to accelerate the part carried by two K20X graphics cards.
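
For context, the NCI methodology is built around the reduced density gradient, a pointwise function of the electron density ρ(r) (this is the standard definition from the 2010 NCI paper, not something specific to the GPU port):

```latex
s(\mathbf{r}) \;=\; \frac{\lvert \nabla\rho(\mathbf{r}) \rvert}{2\,(3\pi^{2})^{1/3}\,\rho(\mathbf{r})^{4/3}}
```

Since s is evaluated independently at every point of a dense 3-D grid, the computation is embarrassingly parallel, which helps explain why the method maps so well onto GPUs.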
54

Designing a Compiler for a Distributed Memory Parallel Computing System

Bennett, Sidney Page 22 January 2004 (has links)
The SCMP processor presents a unique approach to processor design: integrating multiple processors, a network, and memory onto a single chip. The benefits of this design include a reduction in the overhead incurred by synchronization, communication, and memory accesses. To properly determine its effectiveness, the SCMP architecture must be exercised under a wide variety of workloads, creating the need for a variety of applications. A compiler can reduce the time spent developing these applications by allowing the use of languages such as C and Fortran. However, compiler development is a research area in its own right, requiring extensive knowledge of the architecture to make good use of its resources. This thesis presents the design and implementation of a compiler for the SCMP architecture. The thesis includes an in-depth analysis of SCMP and the necessary design choices for an effective compiler using the SUIF and MachSUIF toolsets. Two optimization passes are included in the discussion: partial redundancy elimination and instruction scheduling. While these optimizations are not specific to parallel computing, architectural considerations must still be made to properly implement the algorithms within the SCMP compiler. These optimizations yield an overall reduction in execution time of 15-36%. / Master of Science
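
As a generic illustration of what partial redundancy elimination does (a textbook example, not taken from the SCMP compiler):

```python
# Generic illustration of partial redundancy elimination (PRE), not the
# SCMP compiler's internals: 'x * y' is computed on one branch and then
# unconditionally recomputed, so PRE hoists it into a temporary.

def before_pre(x, y, flag):
    if flag:
        a = x * y      # 'x * y' computed here...
    else:
        a = 0
    b = x * y          # ...and redundantly recomputed here
    return a + b

def after_pre(x, y, flag):
    t = x * y          # PRE: evaluate once for every path that needs it
    a = t if flag else 0
    b = t
    return a + b
```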
55

Application of multi-core and cluster computing to the Transmission Line Matrix method

Browne, Daniel R. January 2014 (has links)
The Transmission Line Matrix (TLM) method is an established mathematical method for conducting computational electromagnetic (CEM) simulations. TLM models Maxwell's equations by discretising the contiguous nature of an environment and its contents into individual small-scale elements, and it is a computationally intensive process. This thesis focusses on parallel processing optimisations to the TLM method when considering the opposing ends of the contemporary computing hardware spectrum, namely large-scale computing systems versus small-scale mobile computing devices. Theoretical aspects covered in this thesis are:
• The historical development and derivation of the TLM method.
• A discrete random variable (DRV) for rain-drop diameter, allowing generation of a rain-field with raindrops adhering to a Gaussian size distribution, as a case study for a 3-D TLM implementation.
• Investigations into parallel computing strategies for accelerating TLM on large and small-scale computing platforms.
Implementation aspects covered in this thesis are:
• A script for modelling rain-fields using free-to-use modelling software.
• The first known implementation of 2-D TLM on mobile computing devices.
• A 3-D TLM implementation designed for simulating the effects of rain-fields on extremely high frequency (EHF) band signals.
By optimising both TLM solver implementations for their respective platforms, new opportunities present themselves. Rain-field simulations containing individual rain-drop geometry can be simulated, which was previously impractical due to the lengthy computation times required. Also, computationally time-intensive methods such as TLM were previously impractical on mobile computing devices. Contemporary hardware features on these devices now provide the opportunity for CEM simulations at speeds that are acceptable to end users, as well as providing a new avenue for educating relevant user cohorts via dynamic presentations of EM phenomena.
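
For readers unfamiliar with the method, a single time step of 2-D shunt-node TLM consists of a scatter at every node followed by a connect that exchanges pulses with neighbouring nodes. A minimal numpy sketch, assuming a lossless shunt node and periodic boundaries (illustrative only, not the thesis's solver):

```python
import numpy as np

# One hundred time steps of 2-D shunt-node TLM on a periodic grid.
# V[k] holds the incident pulse on port k of every node: 0=N, 1=S, 2=E, 3=W.
nx, ny = 64, 64
V = np.zeros((4, nx, ny))
V[:, nx // 2, ny // 2] = 1.0          # impulse excitation at the centre

for _ in range(100):
    # Scatter: the reflected pulse on port k is half the port sum minus
    # the incident pulse on k (lossless 2-D shunt-node scattering).
    R = 0.5 * V.sum(axis=0) - V
    # Connect: a pulse leaving a node becomes the incident pulse on the
    # facing port of its neighbour; np.roll gives periodic boundaries.
    V = np.stack([
        np.roll(R[1], -1, axis=1),    # neighbour above sends south -> my north
        np.roll(R[0],  1, axis=1),    # neighbour below sends north -> my south
        np.roll(R[3], -1, axis=0),    # right neighbour sends west  -> my east
        np.roll(R[2],  1, axis=0),    # left neighbour sends east   -> my west
    ])

print("field energy proxy:", float((V ** 2).sum()))
```

Every node updates independently within a time step, which is what makes the method attractive for both GPU-class parallel hardware and the mobile devices discussed above.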
56

Mapping parallel programs to heterogeneous multi-core systems

Grewe, Dominik January 2014 (has links)
Heterogeneous computer systems are ubiquitous in all areas of computing, from mobile to high-performance computing. They promise to deliver increased performance at lower energy cost than purely homogeneous, CPU-based systems. In recent years GPU-based heterogeneous systems have become increasingly popular. They combine a programmable GPU with a multi-core CPU. GPUs have become flexible enough to not only handle graphics workloads but also various kinds of general-purpose algorithms. They are thus used as a coprocessor or accelerator alongside the CPU. Developing applications for GPU-based heterogeneous systems involves several challenges. Firstly, not all algorithms are equally suited for GPU computing. It is thus important to carefully map the tasks of an application to the most suitable processor in a system. Secondly, current frameworks for heterogeneous computing, such as OpenCL, are low-level, requiring a thorough understanding of the hardware by the programmer. This high barrier to entry could be lowered by automatically generating and tuning this code from a high-level and thus more user-friendly programming language. Both challenges are addressed in this thesis. For the task mapping problem a machine learning-based approach is presented in this thesis. It combines static features of the program code with runtime information on input sizes to predict the optimal mapping of OpenCL kernels. This approach is further extended to also take contention on the GPU into account. Both methods are able to outperform competing mapping approaches by a significant margin. Furthermore, this thesis develops a method for targeting GPU-based heterogeneous systems from OpenMP, a directive-based framework for parallel computing. OpenMP programs are translated to OpenCL and optimized for GPU performance. At runtime a predictive model decides whether to execute the original OpenMP code on the CPU or the generated OpenCL code on the GPU. This approach is shown to outperform both a competing approach as well as hand-tuned code.
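
A toy version of such a mapping predictor can be built from a handful of features; the sketch below uses a decision tree over two placeholder features (the feature set, training data, and choice of model are assumptions for illustration, not the thesis's actual predictor):

```python
# Toy CPU/GPU mapping predictor in the spirit described above.
from sklearn.tree import DecisionTreeClassifier

# Features per kernel: [compute-to-memory-access ratio, input size in MB].
X = [
    [0.5,   1], [0.8,   4], [0.6,   2],   # low arithmetic intensity, small inputs
    [8.0, 256], [6.5, 512], [9.0, 128],   # high arithmetic intensity, large inputs
]
y = ["CPU", "CPU", "CPU", "GPU", "GPU", "GPU"]  # best-performing device

model = DecisionTreeClassifier(max_depth=2).fit(X, y)

# At runtime, static code features are combined with the observed input
# size to pick a device for the OpenCL kernel.
print(model.predict([[7.2, 300]]))  # -> ['GPU'] on this toy data
```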
57

Shape-based cost analysis of skeletal parallel programs

Hayashi, Yasushi January 2001 (has links)
This work presents an automatic cost-analysis system for an implicitly parallel skeletal programming language. Although deducing interesting dynamic characteristics of parallel programs (and in particular, run time) is well known to be an intractable problem in the general case, it can be alleviated by placing restrictions upon the programs which can be expressed. By combining two research threads, the “skeletal” and “shapely” paradigms which take this route, we produce a completely automated, computation and communication sensitive cost analysis system. This builds on earlier work in the area by quantifying communication as well as computation costs, with the former being derived for the Bulk Synchronous Parallel (BSP) model. We present details of our shapely skeletal language and its BSP implementation strategy together with an account of the analysis mechanism by which program behaviour information (such as shape and cost) is statically deduced. This information can be used at compile-time to optimise a BSP implementation and to analyse computation and communication costs. The analysis has been implemented in Haskell. We consider different algorithms expressed in our language for some example problems and illustrate each BSP implementation, contrasting the analysis of their efficiency by traditional, intuitive methods with that achieved by our cost calculator. The accuracy of cost predictions by our cost calculator against the run time of real parallel programs is tested experimentally. Previous shape-based cost analysis required all elements of a vector (our nestable bulk data structure) to have the same shape. We partially relax this strict requirement on data structure regularity by introducing new shape expressions in our analysis framework. We demonstrate that this allows us to achieve the first automated analysis of a complete derivation, the well known maximum segment sum algorithm of Skillicorn and Cai.
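
The communication-sensitive analysis rests on the standard BSP cost model: a program is a sequence of supersteps, and the predicted run time is (textbook BSP formula; the thesis's own notation may differ):

```latex
T \;=\; \sum_{i} \left( w_i \;+\; g\,h_i \;+\; l \right)
```

where w_i is the maximum local computation in superstep i, h_i the maximum number of words any processor sends or receives in that superstep, g the communication cost per word, and l the barrier-synchronisation cost.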
58

Exploring the neural codes using parallel hardware

Baladron Pezoa, Javier 07 June 2013 (has links)
The aim of this thesis is to understand the dynamics of large interconnected populations of neurons. The method we use to reach this objective is a mixture of mesoscopic modeling and high performance computing: the first allows us to reduce the complexity of the network, and the second to perform large-scale simulations. In the first part of this thesis, a new mean-field approach for conductance-based neurons is used to study numerically the effects of noise on extremely large ensembles of neurons. The same approach is also used to create a model of one hypercolumn of the primary visual cortex in which the basic computational units are large populations of neurons instead of single cells. All of these simulations are done by solving a set of partial differential equations that describe the evolution of the probability density function of the network. In the second part of this thesis, a numerical study of two neural field models of the primary visual cortex is presented. The main focus in both cases is to determine how edge selection and continuation can be computed in the primary visual cortex. The difference between the two models is in how they represent the orientation preference of neurons: in one, orientation is a feature of the equations and the connectivity depends on it, while in the other there is an underlying orientation map which defines an input function. All the simulations are performed on a cluster of graphics processing units. The thesis proposes a set of techniques to simulate the models fast enough on this kind of hardware; the speedup obtained is equivalent to that of a very large standard cluster.
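
Generically, a PDE of the kind mentioned above is a Fokker-Planck (advection-diffusion) equation for the population's probability density p(v, t); in its simplest one-dimensional textbook form (the thesis's mean-field equations are richer than this):

```latex
\frac{\partial p(v,t)}{\partial t} \;=\; -\,\frac{\partial}{\partial v}\!\left[ f(v)\, p(v,t) \right] \;+\; \frac{\sigma^{2}}{2}\,\frac{\partial^{2} p(v,t)}{\partial v^{2}}
```

where f(v) is the deterministic drift of the membrane dynamics and σ the noise amplitude.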
59

Training and Optimizing Distributed Neural Networks Using a Genetic Algorithm

McMurtrey, Shannon Dale 01 January 2010 (has links)
Parallelizing neural networks is an active area of research. Current approaches surround the parallelization of the widely used back-propagation (BP) algorithm, which has a large amount of communication overhead, making it less than ideal for parallelization. An algorithm that does not depend on the calculation of derivatives, and the backward propagation of errors, better lends itself to a parallel implementation. One well known training algorithm for neural networks explicitly incorporates network structure in the objective function to be minimized, which yields simpler neural networks. Prior work has implemented this using a modified genetic algorithm in a serial fashion that is not scalable, thus limiting its usefulness. This dissertation created a parallel version of the algorithm. The performance of the proposed algorithm is compared against the existing algorithm using a variety of synthetic and real world problems. Computational experiments with benchmark datasets indicate that the parallel algorithm proposed in this research outperforms the serial version from prior research in finding better minima in the same time as well as identifying a simpler architecture.
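
A common pattern for parallelizing a GA is to farm out the expensive fitness evaluations to workers while keeping selection and variation serial. The sketch below illustrates that pattern with Python's multiprocessing module (an illustrative master/worker skeleton with a placeholder objective, not the dissertation's implementation):

```python
import random
from multiprocessing import Pool

def fitness(genome):
    # Placeholder objective: maximise the sum of genes.
    return sum(genome)

def mutate(genome, rate=0.1):
    return [g + random.uniform(-1, 1) if random.random() < rate else g
            for g in genome]

if __name__ == "__main__":
    pop = [[random.uniform(-1, 1) for _ in range(10)] for _ in range(40)]
    with Pool() as pool:
        for _ in range(20):
            scores = pool.map(fitness, pop)          # parallel evaluation
            ranked = [g for _, g in sorted(zip(scores, pop),
                                           key=lambda p: p[0], reverse=True)]
            parents = ranked[: len(pop) // 2]        # truncation selection
            pop = parents + [mutate(random.choice(parents))
                             for _ in range(len(pop) - len(parents))]
    print("best fitness:", max(map(fitness, pop)))
```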
60

A scalable data store and analytic platform for real-time monitoring of data-intensive scientific infrastructure

Suthakar, Uthayanath January 2017 (has links)
Monitoring data-intensive scientific infrastructures in real time (jobs, data transfers, hardware failures) is vital for efficient operation. Due to the high volume and velocity of the events produced, traditional methods are no longer optimal. Several techniques, as well as enabling architectures, are available to address the Big Data problem. In this respect, this thesis complements existing survey work by contributing an extensive literature review of both traditional and emerging Big Data architectures. Scalability, low latency, fault tolerance, and intelligence are key challenges for traditional architectures, whereas Big Data technologies and approaches have become increasingly popular for use cases that demand scalable, data-intensive (parallel) processing, fault tolerance (data replication), and support for low-latency computations. In the context of a scalable data store and analytics platform for monitoring data-intensive scientific infrastructure, the Lambda Architecture was adapted and evaluated on the Worldwide LHC Computing Grid, where it proved effective, especially for computationally and data-intensive use cases. In this thesis, an efficient strategy for the collection and storage of large volumes of data for computation is presented. Moving the transformation logic out of the data pipeline and into the analytics layers simplifies the architecture and the overall process: processing time is reduced, untampered raw data are kept at the storage level for fault tolerance, and the required transformations can be done when needed. An optimised Lambda Architecture (OLA) is presented, which models an efficient way of joining the batch layer and the streaming layer with minimal code duplication in order to support scalability, low latency, and fault tolerance. Several models were evaluated: a pure streaming layer, a pure batch layer, and the combination of both. Experimental results demonstrate that the OLA performed better than the traditional architecture as well as the Lambda Architecture. The OLA was also enhanced by adding an intelligence layer for predicting data access patterns. The intelligence layer actively adapts and updates the model built by the batch layer, which eliminates re-training time while providing a high level of accuracy using Deep Learning techniques. The fundamental contribution to knowledge is a scalable, low-latency, fault-tolerant, intelligent, and heterogeneous architecture for monitoring a data-intensive scientific infrastructure that can benefit from Big Data technologies and approaches.
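
The batch/speed-layer split at the heart of a Lambda Architecture can be shown in miniature. The sketch below uses in-memory stand-ins for the real batch store and stream (a schematic pattern, not the thesis's WLCG deployment):

```python
# Schematic Lambda Architecture: a batch layer recomputes views over all
# raw events, a speed layer keeps incremental views of recent events, and
# queries merge the two.
from collections import Counter

raw_events = []          # immutable master dataset (batch layer input)
batch_view = Counter()   # recomputed periodically from raw_events
speed_view = Counter()   # updated per event; covers data since last batch run

def ingest(event):
    raw_events.append(event)           # keep untampered raw data
    speed_view[event["site"]] += 1     # low-latency incremental update

def run_batch():
    global batch_view, speed_view
    batch_view = Counter(e["site"] for e in raw_events)  # full recompute
    speed_view = Counter()             # recent view is now folded in

def query(site):
    # Serving layer: merge the (stale) batch view with the recent view.
    return batch_view[site] + speed_view[site]

for s in ["CERN", "CERN", "FNAL"]:
    ingest({"site": s})
run_batch()
ingest({"site": "CERN"})               # arrives after the batch run
print(query("CERN"))                   # -> 3 (2 from batch + 1 from speed)
```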
