About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

61

Designing a Compiler for a Distributed Memory Parallel Computing System

Bennett, Sidney Page 22 January 2004 (has links)
The SCMP processor presents a unique approach to processor design: integrating multiple processors, a network, and memory onto a single chip. The benefits of this design include a reduction in the overhead incurred by synchronization, communication, and memory accesses. To properly determine its effectiveness, the SCMP architecture must be exercised under a wide variety of workloads, creating the need for a variety of applications. A compiler can reduce the time spent developing these applications by allowing the use of languages such as C and Fortran. However, compiler development is a research area in its own right, requiring extensive knowledge of the architecture to make good use of its resources. This thesis presents the design and implementation of a compiler for the SCMP architecture. The thesis includes an in-depth analysis of SCMP and the necessary design choices for an effective compiler using the SUIF and MachSUIF toolsets. Two optimization passes are included in the discussion: partial redundancy elimination and instruction scheduling. While these optimizations are not specific to parallel computing, architectural considerations must still be made to properly implement the algorithms within the SCMP compiler. These optimizations yield an overall reduction in execution time of 15-36%. / Master of Science
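To give a flavour of the instruction-scheduling pass mentioned above, here is a minimal greedy list scheduler over a dependency DAG. It is a generic sketch only: the instruction names, latencies, and single-issue assumption are illustrative and are not taken from the SCMP compiler.

```python
# Minimal greedy list scheduler over a dependency DAG (illustrative only;
# the instructions and latencies below are hypothetical, not SCMP-specific).
def list_schedule(instrs, deps, latency):
    """instrs: list of names; deps: dict name -> set of predecessors;
    latency: dict name -> cycles. Returns the issue cycle of each instruction."""
    remaining = {i: set(deps.get(i, ())) for i in instrs}
    finish = {}      # name -> cycle at which its result is ready
    issue = {}       # name -> cycle at which it is issued
    cycle = 0
    while remaining:
        # Instructions whose predecessors have all completed by this cycle.
        ready = [i for i, preds in remaining.items()
                 if all(finish.get(p, float("inf")) <= cycle for p in preds)]
        if ready:
            # Greedy heuristic: issue the ready instruction with the longest latency.
            pick = max(ready, key=lambda i: latency[i])
            issue[pick] = cycle
            finish[pick] = cycle + latency[pick]
            del remaining[pick]
        cycle += 1
    return issue

# Example: a tiny chain of three dependent operations.
print(list_schedule(["load", "mul", "add"],
                    {"mul": {"load"}, "add": {"mul"}},
                    {"load": 3, "mul": 2, "add": 1}))
```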
62

Application of multi-core and cluster computing to the Transmission Line Matrix method

Browne, Daniel R. January 2014 (has links)
The Transmission Line Matrix (TLM) method is an existing and established mathematical method for conducting computational electromagnetic (CEM) simulations. TLM models Maxwell's equations by discretising the contiguous nature of an environment and its contents into individual small-scale elements, and it is a computationally intensive process. This thesis focusses on parallel processing optimisations to the TLM method when considering the opposing ends of the contemporary computing hardware spectrum, namely large-scale computing systems versus small-scale mobile computing devices. Theoretical aspects covered in this thesis are: the historical development and derivation of the TLM method; a discrete random variable (DRV) for rain-drop diameter, allowing generation of a rain-field with raindrops adhering to a Gaussian size distribution, as a case study for a 3-D TLM implementation; and investigations into parallel computing strategies for accelerating TLM on large and small-scale computing platforms. Implementation aspects covered in this thesis are: a script for modelling rain-fields using free-to-use modelling software; the first known implementation of 2-D TLM on mobile computing devices; and a 3-D TLM implementation designed for simulating the effects of rain-fields on extremely high frequency (EHF) band signals. By optimising both TLM solver implementations for their respective platforms, new opportunities present themselves. Rain-field simulations containing individual rain-drop geometry can be simulated, which was previously impractical due to the lengthy computation times required. Also, computationally time-intensive methods such as TLM were previously impractical on mobile computing devices. Contemporary hardware features on these devices now provide the opportunity for CEM simulations at speeds that are acceptable to end users, as well as providing a new avenue for educating relevant user cohorts via dynamic presentations of EM phenomena.
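The basic TLM update can be made concrete with a short sketch of the scatter and connect steps on a 2-D shunt-node mesh. The mesh size, excitation, and simple reflecting boundaries below are assumptions for illustration, not details of the thesis implementations.

```python
import numpy as np

def tlm_step(V):
    """One scatter + connect step on a 2-D shunt-node TLM mesh.
    V has shape (4, ny, nx): incident pulses on ports N, E, S, W."""
    # Scatter: reflected pulse at each port k is 0.5 * sum(incident) - incident_k.
    total = V.sum(axis=0)
    R = 0.5 * total - V
    # Connect: a pulse leaving a port becomes the incident pulse on the facing
    # port of the neighbouring node.
    Vn = np.empty_like(V)
    Vn[0, 1:, :]  = R[2, :-1, :]   # south-going pulse from the node above -> north port
    Vn[2, :-1, :] = R[0, 1:, :]    # north-going pulse from the node below -> south port
    Vn[1, :, :-1] = R[3, :, 1:]    # west-going pulse from the right neighbour -> east port
    Vn[3, :, 1:]  = R[1, :, :-1]   # east-going pulse from the left neighbour -> west port
    # Simple reflecting boundaries: pulses at the mesh edge bounce straight back.
    Vn[0, 0, :]  = R[0, 0, :]
    Vn[2, -1, :] = R[2, -1, :]
    Vn[1, :, -1] = R[1, :, -1]
    Vn[3, :, 0]  = R[3, :, 0]
    return Vn

# Excite the centre of a small mesh and run a few time steps.
ny, nx = 64, 64
V = np.zeros((4, ny, nx))
V[:, ny // 2, nx // 2] = 1.0
for _ in range(100):
    V = tlm_step(V)
node_voltage = 0.5 * V.sum(axis=0)   # observable field at each node
```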
63

Mapping parallel programs to heterogeneous multi-core systems

Grewe, Dominik January 2014 (has links)
Heterogeneous computer systems are ubiquitous in all areas of computing, from mobile to high-performance computing. They promise to deliver increased performance at lower energy cost than purely homogeneous, CPU-based systems. In recent years GPU-based heterogeneous systems have become increasingly popular. They combine a programmable GPU with a multi-core CPU. GPUs have become flexible enough to not only handle graphics workloads but also various kinds of general-purpose algorithms. They are thus used as a coprocessor or accelerator alongside the CPU. Developing applications for GPU-based heterogeneous systems involves several challenges. Firstly, not all algorithms are equally suited for GPU computing. It is thus important to carefully map the tasks of an application to the most suitable processor in a system. Secondly, current frameworks for heterogeneous computing, such as OpenCL, are low-level, requiring a thorough understanding of the hardware by the programmer. This high barrier to entry could be lowered by automatically generating and tuning this code from a high-level and thus more user-friendly programming language. Both challenges are addressed in this thesis. For the task mapping problem a machine learning-based approach is presented in this thesis. It combines static features of the program code with runtime information on input sizes to predict the optimal mapping of OpenCL kernels. This approach is further extended to also take contention on the GPU into account. Both methods are able to outperform competing mapping approaches by a significant margin. Furthermore, this thesis develops a method for targeting GPU-based heterogeneous systems from OpenMP, a directive-based framework for parallel computing. OpenMP programs are translated to OpenCL and optimized for GPU performance. At runtime a predictive model decides whether to execute the original OpenMP code on the CPU or the generated OpenCL code on the GPU. This approach is shown to outperform both a competing approach as well as hand-tuned code.
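The kind of predictive mapping described above can be illustrated with a small sketch: a decision tree trained on static code features plus the runtime input size to choose between CPU and GPU execution. The feature set and training data here are invented for illustration and are not those used in the thesis.

```python
# Illustrative only: predict CPU vs GPU execution for a kernel from a few
# hypothetical features (the thesis uses a richer feature set and real data).
from sklearn.tree import DecisionTreeClassifier

# Features per kernel: [memory ops, compute ops, branches, input size (elements)]
X = [
    [120,  40, 10,     1_000],   # small, memory-bound kernel
    [ 30, 400,  2, 5_000_000],   # large, compute-heavy kernel
    [200,  50, 30,    10_000],
    [ 10, 600,  1, 2_000_000],
    [ 90,  80, 15,     5_000],
    [ 20, 300,  4, 8_000_000],
]
y = ["CPU", "GPU", "CPU", "GPU", "CPU", "GPU"]  # best device, measured offline

model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# At runtime, combine the kernel's static features with the actual input size.
new_kernel = [[25, 350, 3, 4_000_000]]
print(model.predict(new_kernel))   # -> likely "GPU"
```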
64

Shape-based cost analysis of skeletal parallel programs

Hayashi, Yasushi January 2001 (has links)
This work presents an automatic cost-analysis system for an implicitly parallel skeletal programming language. Although deducing interesting dynamic characteristics of parallel programs (and in particular, run time) is well known to be an intractable problem in the general case, it can be alleviated by placing restrictions upon the programs which can be expressed. By combining two research threads, the “skeletal” and “shapely” paradigms which take this route, we produce a completely automated, computation and communication sensitive cost analysis system. This builds on earlier work in the area by quantifying communication as well as computation costs, with the former being derived for the Bulk Synchronous Parallel (BSP) model. We present details of our shapely skeletal language and its BSP implementation strategy together with an account of the analysis mechanism by which program behaviour information (such as shape and cost) is statically deduced. This information can be used at compile-time to optimise a BSP implementation and to analyse computation and communication costs. The analysis has been implemented in Haskell. We consider different algorithms expressed in our language for some example problems and illustrate each BSP implementation, contrasting the analysis of their efficiency by traditional, intuitive methods with that achieved by our cost calculator. The accuracy of cost predictions by our cost calculator against the run time of real parallel programs is tested experimentally. Previous shape-based cost analysis required all elements of a vector (our nestable bulk data structure) to have the same shape. We partially relax this strict requirement on data structure regularity by introducing new shape expressions in our analysis framework. We demonstrate that this allows us to achieve the first automated analysis of a complete derivation, the well known maximum segment sum algorithm of Skillicorn and Cai.
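The BSP cost model that underlies the communication-sensitive analysis can be summarised in a few lines: each superstep costs the maximum local computation, plus h (the largest number of words sent or received by any processor) times the per-word gap g, plus the barrier latency l. The sketch below uses made-up machine parameters and is not the thesis's cost calculator.

```python
# Generic BSP superstep cost: cost = max_i(w_i) + h * g + l.
def superstep_cost(work, h, g, l):
    return max(work) + h * g + l

def program_cost(supersteps, g, l):
    """supersteps: list of (work_per_processor, h) pairs."""
    return sum(superstep_cost(w, h, g, l) for w, h in supersteps)

# Hypothetical machine parameters and a two-superstep program on 4 processors.
g, l = 4.0, 100.0
steps = [([1000, 950, 1020, 980], 64),   # compute, then exchange up to 64 words
         ([500, 500, 480, 510], 16)]
print(program_cost(steps, g, l))
```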
65

Exploring the neural codes using parallel hardware / Explorer les codes neuronaux utilisant des machines parallèles

Baladron Pezoa, Javier 07 June 2013 (has links)
The aim of this thesis is to understand the dynamics of large interconnected populations of neurons. The method we use to reach this objective is a mixture of mesoscopic modeling and high performance computing. The first allows us to reduce the complexity of the network and the second to perform large scale simulations. In the first part of this thesis a new mean field approach for conductance based neurons is used to study numerically the effects of noise on extremely large ensembles of neurons. Also, the same approach is used to create a model of one hypercolumn from the primary visual cortex where the basic computational units are large populations of neurons instead of simple cells. All of these simulations are done by solving a set of partial differential equations that describe the evolution of the probability density function of the network. In the second part of this thesis a numerical study of two neural field models of the primary visual cortex is presented. The main focus in both cases is to determine how edge selection and continuation can be computed in the primary visual cortex. The difference between the two models is in how they represent the orientation preference of neurons: in one this is a feature of the equations and the connectivity depends on it, while in the other there is an underlying map which defines an input function. All the simulations are performed on a Graphic Processing Unit cluster. The thesis proposes a set of techniques to simulate the models fast enough on this kind of hardware. The speedup obtained is equivalent to that of a huge standard cluster.
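As a much reduced illustration of evolving a population density rather than simulating individual cells, the sketch below integrates a one-dimensional drift-diffusion equation for a membrane-potential density with an explicit finite-difference scheme. The dynamics and parameters are toy assumptions, far simpler than the conductance-based mean-field model of the thesis, and nothing here runs on a GPU.

```python
import numpy as np

# Toy 1-D density evolution: dp/dt = -d(mu(v) p)/dv + D d^2p/dv^2.
nv, v_min, v_max = 200, -80.0, -40.0
v = np.linspace(v_min, v_max, nv)
dv = v[1] - v[0]
dt = 0.01
D = 0.5                          # diffusion (noise) coefficient, made up
mu = -0.1 * (v + 65.0)           # drift toward a resting potential of -65 mV

# Start with all probability mass near -70 mV.
p = np.exp(-0.5 * ((v + 70.0) / 1.0) ** 2)
p /= p.sum() * dv

for _ in range(5000):
    flux = mu * p                                  # advective flux
    dflux = np.gradient(flux, dv)                  # d(mu p)/dv
    lap = np.zeros_like(p)
    lap[1:-1] = (p[2:] - 2 * p[1:-1] + p[:-2]) / dv ** 2
    p = p + dt * (-dflux + D * lap)
    p = np.clip(p, 0.0, None)
    p /= p.sum() * dv                              # keep it a probability density

print("density peak now near", v[np.argmax(p)], "mV")
```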
66

Training and Optimizing Distributed Neural Networks Using a Genetic Algorithm

McMurtrey, Shannon Dale 01 January 2010 (has links)
Parallelizing neural networks is an active area of research. Current approaches surround the parallelization of the widely used back-propagation (BP) algorithm, which has a large amount of communication overhead, making it less than ideal for parallelization. An algorithm that does not depend on the calculation of derivatives and the backward propagation of errors better lends itself to a parallel implementation. One well known training algorithm for neural networks explicitly incorporates network structure in the objective function to be minimized, which yields simpler neural networks. Prior work has implemented this using a modified genetic algorithm in a serial fashion that is not scalable, thus limiting its usefulness. This dissertation created a parallel version of the algorithm. The performance of the proposed algorithm is compared against the existing algorithm using a variety of synthetic and real world problems. Computational experiments with benchmark datasets indicate that the parallel algorithm proposed in this research outperforms the serial version from prior research in finding better minima in the same time as well as identifying a simpler architecture.
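A stripped-down sketch of the underlying idea, evolving network weights with a genetic algorithm while evaluating the population's fitness in parallel, is given below. The tiny network, the XOR stand-in problem, and the GA settings are placeholders, not the algorithm developed in the dissertation.

```python
# Toy parallel GA for training a tiny 1-hidden-layer network (illustrative only).
import numpy as np
from multiprocessing import Pool

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])          # XOR, as a stand-in problem
HIDDEN = 4
N_W = 2 * HIDDEN + HIDDEN + HIDDEN + 1       # weights + biases of the tiny net

def fitness(w):
    """Negative mean-squared error of the network encoded by weight vector w."""
    W1 = w[:2 * HIDDEN].reshape(2, HIDDEN)
    b1 = w[2 * HIDDEN:3 * HIDDEN]
    W2 = w[3 * HIDDEN:4 * HIDDEN]
    b2 = w[-1]
    h = np.tanh(X @ W1 + b1)
    out = h @ W2 + b2
    return -np.mean((out - y) ** 2)

def evolve(pop_size=60, generations=200, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.normal(0, 1, size=(pop_size, N_W))
    with Pool() as pool:                     # evaluate each individual in parallel
        for _ in range(generations):
            fit = np.array(pool.map(fitness, list(pop)))
            order = np.argsort(fit)[::-1]
            parents = pop[order[:pop_size // 2]]
            # Crossover: average two random parents; mutation: small Gaussian noise.
            idx = rng.integers(0, len(parents), size=(pop_size, 2))
            pop = (parents[idx[:, 0]] + parents[idx[:, 1]]) / 2
            pop += rng.normal(0, 0.1, size=pop.shape)
            pop[0] = parents[0]              # elitism: keep the best individual
    return pop[0]

if __name__ == "__main__":
    best = evolve()
    print("best fitness:", fitness(best))
```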
67

A scalable data store and analytic platform for real-time monitoring of data-intensive scientific infrastructure

Suthakar, Uthayanath January 2017 (has links)
Real-time monitoring of data-intensive scientific infrastructures, covering jobs, data transfers, and hardware failures, is vital for efficient operation. Due to the high volume and velocity of the events produced, traditional methods are no longer optimal. Several techniques, as well as enabling architectures, are available to address the Big Data problem. In this respect, this thesis complements existing survey work by contributing an extensive literature review of both traditional and emerging Big Data architectures. Scalability, low latency, fault tolerance, and intelligence are key challenges for the traditional architecture. Big Data technologies and approaches, however, have become increasingly popular for use cases that demand scalable, data-intensive (parallel) processing, fault tolerance (through data replication), and support for low-latency computation. In the context of a scalable data store and analytics platform for monitoring data-intensive scientific infrastructure, the Lambda Architecture was adapted and evaluated on the Worldwide LHC Computing Grid, where it proved effective, especially for computationally and data-intensive use cases. In this thesis, an efficient strategy for the collection and storage of large volumes of data for computation is presented. Moving the transformation logic out of the data pipeline and into the analytics layers simplifies the architecture and the overall process: the time required is reduced, untampered raw data are kept at the storage level for fault tolerance, and the required transformation can be performed when needed. An optimised Lambda Architecture (OLA) is presented, which models an efficient way of joining the batch and streaming layers with minimal code duplication in order to support scalability, low latency, and fault tolerance. Several models were evaluated: a pure streaming layer, a pure batch layer, and the combination of both. Experimental results demonstrate that the OLA performed better than the traditional architecture as well as the standard Lambda Architecture. The OLA was further enhanced by adding an intelligence layer for predicting data access patterns. The intelligence layer actively adapts and updates the model built by the batch layer, which eliminates retraining time while providing a high level of accuracy using Deep Learning techniques. The fundamental contribution to knowledge is a scalable, low-latency, fault-tolerant, intelligent, and heterogeneous architecture for monitoring a data-intensive scientific infrastructure that can benefit from Big Data technologies and approaches.
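The essence of joining the batch and streaming layers can be shown in a few lines: a query against the serving layer merges a precomputed (but stale) batch view with the increments accumulated by the speed layer since the last batch run. This is a generic Lambda-style sketch with invented metric names, not the OLA implementation.

```python
# Generic Lambda-style serving query (illustrative; metric names are invented).
from collections import Counter

# Batch view: complete but stale aggregates recomputed from the raw data.
batch_view = Counter({"jobs_completed": 120_000, "transfer_errors": 340})

# Real-time view: increments seen by the streaming layer since the last batch run.
realtime_view = Counter({"jobs_completed": 1_850, "transfer_errors": 12})

def query(metric):
    """Merge the stale batch aggregate with the fresh streaming increments."""
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)

print(query("jobs_completed"))   # 121850
```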
68

Advancement of Computing on Large Datasets via Parallel Computing and Cyberinfrastructure

Yildirim, Ahmet Artu 01 May 2015 (has links)
Large datasets require efficient processing, storage, and management in order to extract useful information for innovation and decision-making. This dissertation demonstrates novel approaches and algorithms based on a virtual memory approach, parallel computing, and cyberinfrastructure. First, we introduce a tailored user-level virtual memory system for parallel algorithms that can process large raster data files in a desktop computer environment with limited memory. The application area for this portion of the study is the development of parallel terrain analysis algorithms that use multi-threading to take advantage of common multi-core processors for greater efficiency. Second, we present two novel parallel WaveCluster algorithms that perform cluster analysis by taking advantage of the discrete wavelet transform to reduce large datasets to coarser representations that are smaller and more easily managed than the original data in size and complexity. Finally, this dissertation demonstrates an HPC gateway service that abstracts away many of the details and complexities involved in the use of HPC systems, including authentication, authorization, and data and job management.
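The WaveCluster idea, downsampling a density grid with a wavelet-style low-pass transform and then labelling connected dense cells on the coarse grid, can be sketched as follows. The Haar-style 2x2 averaging, the threshold, and the synthetic data are simplifications for illustration, not the parallel algorithms from the dissertation.

```python
import numpy as np
from collections import deque

def wavecluster_sketch(points, grid=64, threshold=2.0):
    """Quantise points to a grid, Haar-average to a coarser grid, then label
    connected cells whose density exceeds the threshold."""
    # 1. Density grid over the unit square.
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=grid,
                                range=[[0, 1], [0, 1]])
    # 2. Haar-style approximation: average 2x2 blocks (low-pass, half resolution).
    coarse = hist.reshape(grid // 2, 2, grid // 2, 2).mean(axis=(1, 3))
    # 3. Connected-component labelling of dense cells (4-neighbourhood BFS).
    dense = coarse >= threshold
    labels = np.zeros_like(coarse, dtype=int)
    current = 0
    for i in range(dense.shape[0]):
        for j in range(dense.shape[1]):
            if dense[i, j] and labels[i, j] == 0:
                current += 1
                labels[i, j] = current
                queue = deque([(i, j)])
                while queue:
                    a, b = queue.popleft()
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        na, nb = a + da, b + db
                        if (0 <= na < dense.shape[0] and 0 <= nb < dense.shape[1]
                                and dense[na, nb] and labels[na, nb] == 0):
                            labels[na, nb] = current
                            queue.append((na, nb))
    return labels, current

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.25, 0.03, (500, 2)), rng.normal(0.7, 0.05, (500, 2))])
labels, n = wavecluster_sketch(np.clip(pts, 0, 1))
print("clusters found:", n)
```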
69

MATLAB*P 2.0: A unified parallel MATLAB

Choy, Ron, Edelman, Alan 01 1900 (has links)
MATLAB is one of the most widely used mathematical computing environments in technical computing. It is an interactive environment that provides high performance computational routines and an easy-to-use, C-like scripting language. Mathworks, the company that develops MATLAB, currently does not provide a version of MATLAB that can utilize parallel computing. This has led to academic and commercial efforts outside Mathworks to build a parallel MATLAB, using a variety of approaches. In a survey, 26 parallel MATLAB projects utilizing four different approaches have been identified. MATLAB*P is one of the 26 projects. It makes use of the backend support approach. This approach provides parallelism to MATLAB programs by relaying MATLAB commands to a parallel backend. The main difference between MATLAB*P and other projects that make use of the same approach is in its focus. MATLAB*P aims to provide a user-friendly supercomputing environment in which parallelism is achieved transparently through the use of object-oriented programming features in MATLAB. One advantage of this approach is that existing scripts can be run in parallel with no or minimal modifications. This paper describes MATLAB*P 2.0, which is a complete rewrite of MATLAB*P. This new version brings together the backend support approach with embarrassingly parallel and MPI approaches to provide the first complete parallel MATLAB framework. / Singapore-MIT Alliance (SMA)
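The backend support approach, in which operator overloading on a distributed object relays work to a parallel server so that existing scripts keep their familiar syntax, can be sketched generically as below. This is an illustration in Python with a simulated backend, not MATLAB*P's actual classes or protocol.

```python
# Generic sketch of the "backend support" idea: a proxy object forwards
# operations to a (here simulated) parallel backend, so user code keeps its
# ordinary matrix-algebra syntax. Purely illustrative.
import numpy as np

class Backend:
    """Stand-in for a parallel server; a real one would distribute the data."""
    def __init__(self):
        self.store = {}
        self.next_id = 0
    def alloc(self, array):
        self.next_id += 1
        self.store[self.next_id] = array
        return self.next_id
    def binop(self, op, a, b):
        return self.alloc(op(self.store[a], self.store[b]))
    def fetch(self, handle):
        return self.store[handle]

BACKEND = Backend()

class DMatrix:
    """Proxy whose operators relay work to the backend."""
    def __init__(self, handle):
        self.handle = handle
    @classmethod
    def from_array(cls, array):
        return cls(BACKEND.alloc(np.asarray(array, dtype=float)))
    def __add__(self, other):
        return DMatrix(BACKEND.binop(np.add, self.handle, other.handle))
    def __matmul__(self, other):
        return DMatrix(BACKEND.binop(np.matmul, self.handle, other.handle))
    def to_local(self):
        return BACKEND.fetch(self.handle)

a = DMatrix.from_array([[1, 2], [3, 4]])
b = DMatrix.from_array([[5, 6], [7, 8]])
print(((a + b) @ b).to_local())   # the user's script looks like ordinary algebra
```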
70

Mapping Unstructured Parallelism to Series-Parallel DAGs

Pan, Yan, Hsu, Wen Jing 01 1900 (has links)
Many parallel programming languages allow programmers to describe parallelism by using constructs such as fork/join. When executed, such programs can be modeled as directed graphs, with nodes representing computations and edges representing sequencing and dependency. However, because it does not impose any regularity on the computation, this general model is not amenable to efficient execution of the resulting program. Therefore, a more restrictive model called the Series-Parallel DAG (Directed Acyclic Graph) has been proposed and adopted by several major parallel languages. As reported by the Cilk developers, many parallel computations can be easily expressed in the series-parallel model, and there are provably efficient scheduling algorithms for SP DAGs. Nevertheless, it remains open how much inherent parallelism is lost when conforming to the model, because expressing a computation in the series-parallel model may also induce performance losses. We will show that any general DAG can be converted into an SP DAG without violating the original precedence relations; moreover, the conversion can be carried out in essentially linear time and space, and the resulting DAG exhibits little loss in parallelism. Since the resulting SP DAG can then be executed with high efficiency, this implies that languages based on SP DAGs are not as restrictive as they were thought to be. / Singapore-MIT Alliance (SMA)
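A series-parallel DAG is built from just two composition rules, series and parallel, which is what makes efficient scheduling and analysis possible. The sketch below shows the structure and how quantities such as total work and critical-path length fall out of it by recursion; it is a generic illustration, not the conversion algorithm of the paper.

```python
# Series-parallel DAG as a recursive structure (illustrative sketch only).
from dataclasses import dataclass

@dataclass
class Task:            # a single unit of computation
    cost: float

@dataclass
class Series:          # run left, then right
    left: object
    right: object

@dataclass
class Parallel:        # fork: run left and right concurrently, then join
    left: object
    right: object

def work(g):
    """Total work: sum of all task costs."""
    if isinstance(g, Task):
        return g.cost
    return work(g.left) + work(g.right)

def span(g):
    """Critical-path length: series adds, parallel takes the longer branch."""
    if isinstance(g, Task):
        return g.cost
    if isinstance(g, Series):
        return span(g.left) + span(g.right)
    return max(span(g.left), span(g.right))

# A fork/join program: do A, then B and C in parallel, then D.
prog = Series(Task(1), Series(Parallel(Task(4), Task(6)), Task(2)))
print(work(prog), span(prog))   # 13 and 9 -> parallelism is preserved in the structure
```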
