31.
Exploiting parallel features of modern computer architectures in bioinformatics: applications to genetics, structure comparison and large graph analysis. Chapuis, Guillaume. 18 December 2013.
The exponential growth in bioinformatics data generation, coupled with the stagnation of clock frequencies in modern processors, stresses the need for efficient implementations that fully exploit the parallel capabilities offered by modern computers. This thesis focuses on parallel algorithms and implementations for bioinformatics problems. Various types of parallelism are described and exploited. The thesis presents applications in genetics, with a GPU-parallel tool for QTL detection; in protein structure comparison, with a multicore-parallel tool for finding similar regions between proteins; and in large graph analysis, with a multi-GPU implementation of a novel algorithm for the All-Pairs Shortest Path problem.
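The abstract does not reproduce the thesis's algorithms; as a point of reference, a minimal Floyd-Warshall kernel — the classic starting point that GPU APSP implementations parallelize and block — might look like the following NumPy sketch. The function name and the tiny test graph are illustrative, not taken from the thesis.

```python
import numpy as np

def floyd_warshall(dist: np.ndarray) -> np.ndarray:
    """All-Pairs Shortest Path by Floyd-Warshall.

    dist: (n, n) matrix of edge weights, np.inf where no edge exists,
          0 on the diagonal. Returns the shortest-path distance matrix.
    """
    n = dist.shape[0]
    dist = dist.copy()
    for k in range(n):
        # Relax every pair (i, j) through intermediate vertex k.
        # The outer k-loop carries a dependency, so parallelism
        # (e.g. across GPUs) is extracted inside each iteration.
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist

# Tiny usage example on a 4-vertex graph.
INF = np.inf
d = np.array([[0, 3, INF, 7],
              [8, 0, 2, INF],
              [5, INF, 0, 1],
              [2, INF, INF, 0]], dtype=float)
print(floyd_warshall(d))
```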
32.
Profiling of RT-PICLS Code. Kelling, Jeffrey; Juckeland, Guido. January 2017.
It was observed that the RT-PICLS code run by FWKT on the hypnos cluster was producing an unusual amount of system load, according to Ganglia metrics. Since this may point to an I/O problem in the code, it was analyzed more closely.
33.
Tiling and Asynchronous Communication Optimizations for Stencil Computations. Malas, Tareq Majed Yasin. 07 December 2015.
The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. Most of the established work concentrates on updating separate cache blocks per thread, which works on all types of shared memory systems, regardless of whether there is a shared cache among the cores. This approach becomes memory-bandwidth limited when the cache space available to each thread is too small to provide sufficient in-cache data reuse.
We introduce a generalized multi-dimensional intra-tile parallelization scheme for shared-cache multicore processors that significantly reduces cache size requirements and saves substantial memory bandwidth compared to existing approaches. It also provides data access patterns that allow efficient hardware prefetching. Our parameterized thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the Central Processing Unit (CPU). We also introduce an efficient diamond tiling structure for both shared memory cache blocking and distributed memory relaxed-synchronization communication, demonstrated using one-dimensional domain decomposition. We describe the approach and the implementation details of our open-source testbed (called Girih), present performance results on contemporary Intel processors, and apply advanced performance modeling techniques to reconcile the observed performance with hardware capabilities. Furthermore, we conduct a comparison with the state-of-the-art stencil frameworks PLUTO and Pochoir in shared memory, using corner-case stencil operators. We study the impact of the diamond tile size on computational intensity, cache block size, and energy consumption. We investigate the impact of computational intensity on power dissipation in the CPU and in the DRAM, and show that DRAM power, which is strongly influenced by the computational intensity, is a decisive factor for energy consumption on the Intel Ivy Bridge processor. Moreover, we show that the highest performance does not necessarily lead to the lowest energy, even when the clock speed is fixed. We apply our approach to an electromagnetic simulation application for solar cell development, demonstrating a several-fold speedup compared to an efficient spatially blocked variant. Finally, we discuss the integration of our approach with other techniques for future High Performance Computing (HPC) systems, which are expected to be more memory-bandwidth-starved, with a deeper memory hierarchy.
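For readers unfamiliar with the baseline being optimized, a plain (untiled) 3-point 1D Jacobi sweep — the kind of kernel that temporal blocking schemes such as diamond tiling reorder for cache reuse — can be sketched as follows. The function name, coefficients, and boundary treatment are illustrative assumptions, not taken from Girih.

```python
import numpy as np

def jacobi_1d(u: np.ndarray, timesteps: int) -> np.ndarray:
    """Naive 3-point Jacobi sweeps: every time step streams the whole
    domain through memory, so performance is bound by memory bandwidth.
    Temporal blocking (e.g. diamond tiling) instead performs several
    time steps on a cache-sized tile before moving on."""
    u = u.copy()
    v = np.empty_like(u)
    for _ in range(timesteps):
        # Interior update: simple average of self and the two neighbors.
        v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
        v[0], v[-1] = u[0], u[-1]   # fixed (Dirichlet) boundaries
        u, v = v, u                 # swap buffers instead of copying
    return u

# Usage: smooth a spike over 100 time steps.
x = np.zeros(1000)
x[500] = 1.0
print(jacobi_1d(x, 100).max())
```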
34.
An Expanded Speedup Model for the Early Phases of High Performance Computing Cluster (HPCC) Design. Gabriel, Matthew Frederick. 15 May 2013.
The size and complexity of many scientific and enterprise-level applications require a high degree of parallelization in order to produce outputs within an acceptable period of time. This often necessitates the use of high performance computing clusters (HPCCs) and parallelized applications which are carefully designed and optimized. A myriad of papers study the various factors which influence performance and then attempt to quantify the maximum theoretical speedup that can be achieved by a cluster relative to a sequential processor.
These studies tend to investigate the influences in isolation, but in practice the factors are interdependent. It is the interaction, rather than any solitary influence, that normally creates the bounds of the design trade space. To address this disconnect, this thesis blends the studies into an expanded speedup model which captures the interplay. The model is intended to help the cluster engineer make initial estimates during the early phases of design, while the system is not yet mature enough for refinement using timing studies.
The model pulls together factors such as problem scaling, resource allocation, critical sections, and the problem's inherent parallelizability. The derivation was examined theoretically and then validated by timing studies on a physical HPCC. The validation studies found that the model was an adequate generic first approximation. However, it was also found that customizations may be needed in order to account for application-specific influences such as bandwidth limitations and communication delays which are not readily incorporated into a generic model. / Master of Science
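As a hedged illustration of the kind of closed-form estimate such a model produces, the sketch below extends Amdahl's law with a Gustafson-style scaled workload and a simple per-processor communication overhead term. This functional form is a generic textbook combination chosen for illustration, not the specific model derived in the thesis.

```python
def speedup(p, serial_frac, scale_exp=0.0, comm_overhead=0.0):
    """Rough speedup estimate on p processors.

    serial_frac   -- fraction of work that cannot be parallelized (Amdahl)
    scale_exp     -- problem-scaling exponent: parallel work grows as
                     p**scale_exp (0 = fixed size, 1 = Gustafson scaling)
    comm_overhead -- per-processor communication cost, as a fraction of
                     the original sequential runtime
    """
    parallel_work = (1.0 - serial_frac) * p ** scale_exp
    sequential_time = serial_frac + parallel_work
    parallel_time = serial_frac + parallel_work / p + comm_overhead * p
    return sequential_time / parallel_time

# Fixed-size problem, 5% serial: speedup saturates (Amdahl).
print(speedup(64, 0.05))                       # ~15.4
# Same serial fraction with communication costs: speedup peaks, then
# degrades as more processors are added -- the interplay the thesis
# argues must be captured jointly rather than in isolation.
print(speedup(64, 0.05, comm_overhead=0.002))  # ~5.2
```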
35.
NAS benchmark evaluation of HKU cluster of workstations. Mak, Chi-wah. January 1999.
Thesis (M. Phil.)--University of Hong Kong, 1999. / Includes bibliographical references (leaves 71-75).
36.
High-Performance Domain-Specific Systems for Graph and Machine Learning Workloads. Chen, Jingji. 09 July 2024.
Graph-structured data is prevalent because of its ability to capture relations between real-world entities. However, graph data analysis applications, including traditional and machine-learning-based approaches, are highly resource-demanding, necessitating massively parallel hardware such as distributed clusters. Domain-specific systems, which aim to hide the hardware complexity from application users, suffer from communication and computation efficiency problems.

This thesis tackles these problems with a set of novel specialized system designs, one for each category of workload. For graph analytics workloads, we propose to enforce precise loop-carried dependency propagation to reduce redundant communication and computation in our SympleGraph system. SympleGraph achieves up to 2.30x and 7.76x speedups over Gemini and D-Galois, two state-of-the-art systems. For graph pattern mining workloads, we propose to co-design the pattern decomposition algorithm and compilation techniques to improve computation efficiency, and to leverage application-characteristics-aware optimizations to reduce and hide communication overhead, in our DecoMine and Khuzdul systems respectively. Our extensive experiments show that DecoMine and Khuzdul significantly outperform previous state-of-the-art solutions. For graph neural network training, we propose pipelined model parallelism for deep model training to reduce the worst-case communication complexity by a factor of the model depth. With the proposed technique, our system, GNNPipe, reduces communication volume by up to 22.89x and speeds up training by up to 2.45x.
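As a hedged illustration of the loop-carried dependency that the abstract refers to, consider a neighbor loop whose sequential semantics allow early termination. In a distributed push-style graph system, propagating this break condition precisely avoids sending updates a remote machine would discard. The sketch below is a generic single-machine rendering of the idea, not SympleGraph's API.

```python
def has_active_neighbor(graph, v, active):
    """Sequential semantics: stop scanning as soon as the answer is known.
    A naive distributed implementation visits every neighbor (and sends a
    message per remote edge); enforcing this loop-carried dependency
    precisely is what saves communication and computation."""
    for u in graph[v]:      # loop-carried dependency: whether we reach
        if active[u]:       # iteration u depends on earlier iterations
            return True     # early exit: remaining neighbors never touched
    return False

# Tiny usage example on an adjacency-list graph.
g = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
act = {0: False, 1: False, 2: True, 3: False}
print(has_active_neighbor(g, 0, act))  # True, after scanning only 1 and 2
```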
37.
Computer simulations of polymers and gels. Wood, Dean. January 2013.
Computer simulations have become a vital tool in modern science. The ability to reliably move beyond the capabilities of experiment has allowed great insights into the nature of matter. To enable the study of a wide range of systems and properties, a plethora of simulation techniques have been developed and refined, allowing many aspects of complex systems to be demystified. I have used a range of these to study a variety of systems, utilising the latest technology in high performance computing (HPC) and novel, nanoscale models.

Monte Carlo (MC) simulation is a commonly used method to study the properties of a system using statistical mechanics, and I have made use of it in published work [1] to study the properties of ferrogels in homogeneous magnetic fields using a simple microscopic model. The main phenomena of interest concern the anisotropy and enhancement of the elastic moduli that result from applying uniform magnetic fields before and after the magnetic grains are locked into the polymer-gel matrix by cross-linking reactions. The positional organization of the magnetic grains is influenced by the application of a magnetic field during gel formation, leading to a pronounced anisotropy in the mechanical response of the ferrogel to an applied magnetic field. In particular, the elastic moduli can be enhanced to different degrees depending on the mutual orientation of the fields during and after ferrogel formation. Previously, no microscopic models had been produced to shed light on this effect, and the main purpose of the work presented here is to illuminate the microscopic behaviour. The model represents ferrogels by ensembles of dipolar spheres dispersed in elastic matrices. Experimental trends are shown to be reflected accurately in simulations of the microscopic model, while shedding light on the microscopic mechanisms causing these effects. These mechanisms are shown to be related to the behaviour of the dipoles during the production of the gels and caused by the chaining of dipoles in magnetic fields. Finally, simple relationships between the elastic moduli and the magnetization are proposed. If supplemented by the magnetization curve, these relationships yield the dependencies of the elastic moduli on the applied magnetic field, which are often measured directly in experiments.

While MC simulations are useful for statistical studies, it can be difficult to use them to gather information about the dynamics of a system. In this case, Molecular Dynamics (MD) is more widely used. MD generally utilises the classical equations of motion to simulate the evolution of a system. For large systems, which are often of interest, and multi-species polymers, the required computer power still poses a challenge and requires the use of HPC techniques. The most recent development in HPC is the use of Graphical Processing Units (GPUs) for the fast solution of data-parallel problems. In further published work [2], I have used a bespoke MD code utilising GPU acceleration to simulate large systems of block copolymers (BCs) in solvent over long timescales. I have studied thin films of BC solutions drying on a flat, smooth surface, which requires long timescales due to the 'slow' nature of the process. BCs display interesting self-organisation behaviour in bulk solution and near surfaces, and have a wide range of potential applications from semiconductors to self-constructing fabrics.
Previous studies have shown some unusual behaviour of PI-PEO diblock copolymers adsorbing to a freshly cleaved mica surface. These AFM studies showed polymers increasing in height over time and proposed the change of affinity of mica to water, and the loss of water layers on the surface, as a driver for this change. The MD simulation aimed to illuminate the processes involved in this phenomenon. The evaporation of water layers from a surface was successfully simulated and gave a good indication that solvent evaporation from the surface, together with the ingress of solvent beneath the adsorbed polymer, caused the increase in height seen in experiment.
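To make the Monte Carlo approach concrete, a bare-bones Metropolis trial move for a particle tethered elastically to the position where it was locked in at cross-linking might look like the sketch below. The harmonic tether and the placeholder energy term are one illustrative reading of "dipolar spheres dispersed in elastic matrices", not the thesis's actual model code.

```python
import math
import random

def metropolis_step(pos, tether, k_spring, beta, max_disp,
                    extra_energy=lambda r: 0.0):
    """One Metropolis MC trial move for a particle tethered by a harmonic
    spring to its cross-linking position. pos, tether: (x, y, z) tuples;
    k_spring: elastic constant; beta = 1/kT; extra_energy(r) stands in
    for dipolar and field terms. Returns the (possibly unchanged) position."""
    def energy(r):
        dr2 = sum((a - b) ** 2 for a, b in zip(r, tether))
        return 0.5 * k_spring * dr2 + extra_energy(r)

    trial = tuple(a + random.uniform(-max_disp, max_disp) for a in pos)
    dE = energy(trial) - energy(pos)
    # Accept downhill moves always, uphill moves with Boltzmann probability.
    if dE <= 0.0 or random.random() < math.exp(-beta * dE):
        return trial
    return pos

# Usage: a displaced particle relaxes toward, then fluctuates around,
# its tether point.
p = (1.0, 0.0, 0.0)
for _ in range(10000):
    p = metropolis_step(p, (0.0, 0.0, 0.0), k_spring=10.0,
                        beta=1.0, max_disp=0.2)
print(p)  # small displacement near the origin
```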
38.
Performance studies of high-speed communication on commodity cluster. 譚達俊 (Tam, Tat-chun, Anthony). January 2001.
Published or final version / Computer Science and Information Systems / Doctoral / Doctor of Philosophy
39.
Parallel finite element analysis. Margetts, Lee. January 2002.
Finite element analysis is versatile and used widely in a range of engineering and scientific disciplines. As time passes, the problems that engineers and designers are expected to solve are becoming more computationally demanding. Often the problems involve the interplay of two or more processes which are physically and therefore mathematically coupled. Although parallel computers have been available for about twenty years to satisfy this demand, finite element analysis is still largely executed on serial machines. Parallelisation appears to be difficult, even for the specialist. Parallel machines, programming languages, libraries and tools are used to parallelise old serial programs with mixed success. In some cases the serial algorithm is not naturally suitable for parallel computing. Some argue that rewriting the programs from scratch, using an entirely different solution strategy, is a better approach. Taking this point of view, using MPI for portability, a mesh-free element-by-element method for simple data distribution and the appropriate iterative solvers, a general parallel strategy for finite element analysis is developed and assessed.
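As a hedged sketch of what "element-by-element" means in practice, the conjugate gradient solver below never assembles a global stiffness matrix: each matrix-vector product loops over element matrices and scatters local contributions, which is what makes data distribution across processors simple. The toy 1D bar elements and the penalty boundary condition are illustrative assumptions; the thesis's actual formulation is not reproduced here.

```python
import numpy as np

def ebe_matvec(x, elements, k_el, fixed=(), penalty=1.0e6):
    """Matrix-free product y = K @ x: the global stiffness K is never
    assembled. Each element applies its small local matrix k_el to its
    own degrees of freedom (gather-compute-scatter). Dirichlet boundary
    conditions are imposed here with a simple penalty spring."""
    y = np.zeros_like(x)
    for dofs in elements:
        y[dofs] += k_el @ x[dofs]
    for d in fixed:
        y[d] += penalty * x[d]
    return y

def ebe_cg(rhs, elements, k_el, fixed=(), tol=1.0e-8, max_iter=200):
    """Unpreconditioned conjugate gradient built solely on ebe_matvec."""
    x = np.zeros_like(rhs)
    r = rhs - ebe_matvec(x, elements, k_el, fixed)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = ebe_matvec(p, elements, k_el, fixed)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy problem: a 1D bar of four two-node elements with unit stiffness,
# fixed at node 0, unit tip load. Expected displacements ~ [0, 1, 2, 3, 4].
k = np.array([[1.0, -1.0], [-1.0, 1.0]])
els = [np.array([i, i + 1]) for i in range(4)]
f = np.zeros(5)
f[-1] = 1.0
print(ebe_cg(f, els, k, fixed=(0,)))
```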
40.
Establishing Linux Clusters for high-performance computing (HPC) at NPS. Daillidis, Christos. 09 1900.
Approved for public release; distribution is unlimited. / Discrete Event Simulation (DES) often involves repeated, independent runs of the same models with different input parameters. A system which is able to run many replications quickly is more useful than one in which a single monolithic application runs quickly. A loosely coupled parallel system is indicated. Inexpensive commodity hardware, high-speed local area networking, and open source software have made it possible to build just such loosely coupled parallel systems. These systems are constructed from Linux-based computers and are called Beowulf clusters. This thesis presents an analysis of clusters in high-performance computing and establishes a testbed implementation at the MOVES Institute. It describes the steps necessary to create a cluster, the factors to consider in selecting hardware and software, and the process of creating applications that can run on the cluster. Monitoring the running cluster and system administration are also addressed. / Major, Hellenic Army
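The cluster use case described here — many independent DES replications — is embarrassingly parallel. A minimal single-node sketch of the pattern is shown below, using only the standard library and a hypothetical toy queueing model; on a Beowulf cluster the same pattern would typically be spread across nodes with MPI or a batch scheduler.

```python
import random
from concurrent.futures import ProcessPoolExecutor

def run_replication(seed: int) -> float:
    """One independent DES replication (hypothetical toy model): an
    M/M/1-style queue's average wait over 10,000 arrivals. Each
    replication gets its own seed, so runs are independent."""
    rng = random.Random(seed)
    t_arrive = t_free = total_wait = 0.0
    n = 10_000
    for _ in range(n):
        t_arrive += rng.expovariate(0.9)       # arrivals, rate 0.9
        start = max(t_arrive, t_free)          # wait for a free server
        total_wait += start - t_arrive
        t_free = start + rng.expovariate(1.0)  # service, rate 1.0
    return total_wait / n

if __name__ == "__main__":
    # Farm out 32 replications; on a cluster each would land on a node.
    with ProcessPoolExecutor() as pool:
        waits = list(pool.map(run_replication, range(32)))
    mean = sum(waits) / len(waits)
    print(f"mean wait over {len(waits)} replications: {mean:.2f}")
```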