41 |
Wavelet Compression for Visualization and Analysis on High Performance Computers
Li, Shaomeng 31 October 2018
As HPC systems move towards exascale, the discrepancy between computational power and I/O transfer rate is only growing larger. Lossy in situ compression is a promising solution to address this gap, since it alleviates I/O constraints while still enabling traditional post hoc analysis. This dissertation explores the viability of such a solution with respect to a specific kind of compressor: wavelets. We examine three aspects of concern in particular: 1) information loss after compression, 2) the compressor's capability to fit within in situ constraints, and 3) the compressor's capability to adapt to HPC architectural changes. Findings from this dissertation inform in situ use of wavelet compressors on HPC systems, demonstrate their viability, and argue that this viability will only increase as exascale computing becomes a reality.
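As a rough illustration of the kind of lossy wavelet compression the abstract refers to, the sketch below applies a one-dimensional orthonormal Haar transform and keeps only the largest coefficients. The Haar basis, the function names, and the "keep the k largest" rule are assumptions chosen for brevity; the dissertation does not say which wavelet or thresholding scheme it uses.

```python
import math


def haar_forward(signal):
    """Multi-level orthonormal Haar transform (length must be a power of two)."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = averages + details  # coarse part first, detail part after it
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward by rebuilding one level at a time."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / math.sqrt(2))
            merged.append((a - d) / math.sqrt(2))
        out[:2 * n] = merged
        n *= 2
    return out


def compress(signal, keep):
    """Lossy step (illustrative): zero all but the `keep` largest-magnitude coefficients."""
    coeffs = haar_forward(signal)
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(ranked[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
```

Because the transform is orthonormal, the squared reconstruction error equals the energy of the discarded coefficients, which is one simple way to reason about the "information loss after compression" that the abstract lists as its first concern.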
|
42 |
Large scale Dislocation Dynamics simulations: performance and reliability on parallel and distributed architectures
Durocher, Arnaud 19 December 2018
Dislocation dynamics simulations investigate the behavior of linear defects, called dislocations, in crystalline materials. They are an essential part of the multiscale modelling of materials, used for instance in the nuclear industry to characterize the behavior and aging of materials under irradiation. The ability of dislocations to multiply, annihilate and interact presents many challenges, for instance in terms of storage and access to data. This thesis addresses some challenges of dislocation dynamics simulation on parallel and distributed computers. In this thesis, I improve the Optidis simulator to open the way to more complex simulations. My contributions focus mainly on improving the reliability and performance of Optidis. A new interface to access simulation data is proposed to dissociate its implementation from the physical algorithms. This data structure allows better performance as well as better code maintainability, even with distributed data. A new fast and reliable collision detection and handling algorithm has been implemented. Collision detection techniques from the robotics and 3D animation industries are used to speed up the detection process. With the use of the new data structure and a more reliable design, this algorithm enables more precise collision handling and the use of a larger simulation timestep. The precision of the results has been measured by comparing Optidis to Numodis. The performance of the code has been studied on large-scale simulations with millions of segments and hundreds of CPU cores, demonstrating that such simulations can now be achieved in a reasonable time.
|
43 |
O2-tree: a shared memory resident index in multicore architectures
Ohene-Kwofie, Daniel 06 February 2013
Shared memory multicore computer architectures are now commonplace in computing. These can be found in modern desktops and workstation computers and also in High Performance Computing (HPC) systems. Recent advances in memory architecture and in 64-bit addressing allow such systems to have memory sizes of the order of hundreds of gigabytes and beyond. This now allows for realistic development of main memory resident database systems. Such systems still require a memory resident index, such as the T-Tree or the B+-Tree, for fast access to the data items.
This thesis proposes a new indexing structure, called the O2-Tree, which is essentially an augmented Red-Black Tree in which the leaf nodes are index data blocks that store multiple pairs of key and value, referred to as "key-value" pairs. The value is either the entire record associated with the key or a pointer to the location of the record. The internal nodes contain copies of the keys that split blocks of the leaf nodes in a manner similar to the B+-Tree. The O2-Tree structure has the advantage that it can be easily reconstructed by reading only the lowest value of the key of each leaf node page. The index is sufficiently small that it can be dumped and restored much faster.
Analysis and a comparative experimental study show that the performance of the O2-Tree is superior to other tree-based index structures with respect to various query operations for large datasets. We also present results which indicate that the O2-Tree outperforms popular key-value stores such as Berkeley DB and TreeDB of Kyoto Cabinet for various workloads. The thesis addresses various concurrent access techniques for the O2-Tree for shared memory multicore architectures and gives an analysis of the O2-Tree with respect to query operations, storage utilization, failover and recovery.
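The reconstruction property described above (rebuilding the index by reading only the lowest key of each leaf block) can be sketched as follows. This is a simplified stand-in, not the O2-Tree itself: a sorted list searched with `bisect` replaces the Red-Black Tree of internal nodes, and the class and method names are hypothetical.

```python
from bisect import bisect_right


class LeafIndex:
    """Toy index over leaf blocks of sorted (key, value) pairs.

    Assumption: a sorted list of separator keys stands in for the
    Red-Black Tree that the thesis uses for the internal nodes.
    """

    def __init__(self, leaf_blocks):
        # leaf_blocks: non-empty blocks, each sorted by key, blocks in key order
        self.leaves = leaf_blocks
        self.rebuild()

    def rebuild(self):
        # Only the lowest key of each leaf block is read to rebuild the index.
        self.separators = [block[0][0] for block in self.leaves]

    def get(self, key):
        # Locate the last block whose lowest key is <= key, then scan that block.
        pos = bisect_right(self.separators, key) - 1
        if pos < 0:
            return None
        for k, v in self.leaves[pos]:
            if k == key:
                return v
        return None
```

For example, `LeafIndex([[(1, "a"), (3, "b")], [(5, "c"), (8, "d")]])` keeps only the separators `[1, 5]` in its upper level, so persisting the leaf blocks is enough to restore the whole structure.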
|
44 |
Failure Prediction using Machine Learning in a Virtualized HPC System and application
Mohammed, Bashir, Awan, Irfan U., Ugail, Hassan, Muhammad, Y. January 2019
Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remains a challenging research problem. Traditional fault-tolerance strategies such as regular checkpointing and replication are not adequate because of the emerging complexities of high performance computing systems. This makes it necessary to have an effective, proactive failure management approach in place, aimed at minimizing the effect of failure within the system. With the advent of machine learning techniques, the ability to learn from past information to predict future patterns of behaviour makes it possible to predict potential system failure more accurately. Thus, in this paper, we explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison-based tests on the prediction accuracy. The primary algorithms we considered are the Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbors (KNN), Classification and Regression Trees (CART) and Linear Discriminant Analysis (LDA). Experimental results show that the average prediction accuracy of our model using SVM when predicting failure is 90%, which is accurate and effective compared to the other algorithms. This finding suggests that our method can effectively predict future system and application failures within the system. / The full-text of this article will be released for public view a year after publication.
|
45 |
Performance Analysis of Hybrid CPU/GPU Environments
Smith, Michael Shawn 01 January 2010
We present two metrics that help the performance analyst gain a unified view of application performance in a hybrid environment: GPU Computation Percentage and GPU Load Balance. We analyze the metrics using a matrix multiplication benchmark suite and a real scientific application. We also extend an experiment management system to support GPU performance data and to calculate and store our GPU Computation Percentage and GPU Load Balance metrics.
|
46 |
Enhancements to the scalable coherent interface cache protocol
Safranek, Robert J. 01 January 1999
As the number of NUMA systems whose cache coherency protocols are based on IEEE Std. 1596-1992, the Standard for Scalable Coherent Interface (SCI), increases, it is important to review this complex protocol to determine whether it can be enhanced in any way. This research provides two realizable extensions to the standard SCI cache protocol. Both of these extensions lie within the basic confines of the SCI architecture.
The first extension is a simplification to the SCI protocol in the area of prepending to a sharing list. In the standard protocol, the flow of events is distinctly different depending on whether the cache line is marked "Fresh" or "Gone". The guaranteed forward progress extension simplifies the protocol in this area, making the act of prepending to an existing sharing list independent of whether the line is in the "Fresh" or "Gone" state. In addition, this extension eliminates the need for SCI command, and distributes the resource requirements of supplying data of a shared line equally among all nodes of the sharing list. The second extension addresses the time to purge (or invalidate) an SCI sharing list. This extension provides a realizable solution that allows the node being invalidated to acknowledge the request prior to the completion of the invalidation while maintaining the memory consistency model of the processors of the system.
The resulting cache protocol was developed and implemented for the Sequent Computer Systems Inc. NUMA-Q system. The cache protocol was run on systems ranging from eight to sixty-four processors and provided between 7% and 20% reduction in the time to invalidate an SCI sharing list.
|
47 |
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Aberdeen, Douglas Alexander, doug.aberdeen@anu.edu.au January 2003
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms is the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs).
In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember.
Monte-Carlo policy-gradient approaches tend to produce gradient estimates with high variance. Two novel methods for reducing variance are introduced. The first uses high-order filters to replace the eligibility trace of the gradient estimator. The second uses a low-variance value-function method to learn a subset of the parameters and a policy-gradient method to learn the remainder.
The algorithms are applied to large domains including a simulated robot navigation scenario, a multi-agent scenario with 21,000 states, and the complex real-world task of large vocabulary continuous speech recognition. To the best of the author's knowledge, no other policy-gradient algorithms have performed well at such tasks.
The high variance of Monte-Carlo methods requires lengthy simulation and hence a super-computer to train agents within a reasonable time. The ANU "Bunyip" Linux cluster was built with such tasks in mind. It was used for several of the experimental results presented here. One chapter of this thesis describes an application written for the Bunyip cluster that won the international Gordon Bell Prize for price/performance in 2001.
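For readers unfamiliar with the eligibility trace mentioned in this abstract, the sketch below shows one common form of a Monte-Carlo policy-gradient estimate built on a discounted trace. It is an illustration under assumptions, not the thesis's algorithm: the update rule follows the widely used trace-based estimator (trace `z` decayed by `beta`, running average of `reward * z`), and the function name and input format are hypothetical. The thesis's first variance-reduction method replaces exactly this first-order trace with higher-order filters.

```python
def trace_gradient_estimate(steps, beta):
    """Running-average policy-gradient estimate using a discounted eligibility trace.

    steps: iterable of (grad_log_prob, reward) pairs, where grad_log_prob is the
           gradient of log pi(action | observation) with respect to the policy
           parameters, given as a list of floats (assumed to be supplied by the caller).
    beta:  trace discount in [0, 1); larger values lower bias but raise variance.
    """
    z = None      # eligibility trace
    delta = None  # running average of reward * trace
    count = 0
    for grad, reward in steps:
        if z is None:
            z = [0.0] * len(grad)
            delta = [0.0] * len(grad)
        # Decay the old trace and add the newest score-function term.
        z = [beta * zi + gi for zi, gi in zip(z, grad)]
        count += 1
        # Incremental mean of reward-weighted traces.
        delta = [di + (reward * zi - di) / count for di, zi in zip(delta, z)]
    return delta
```

The returned vector would typically be used as an ascent direction for the policy parameters; an empty trajectory yields `None`.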
|
48 |
Communication performance measurement and analysis on commodity clusters
Abdul Hamid, Nor Asilah Wati January 2008
Cluster computers have become the dominant architecture in high-performance computing. Parallel programs on these computers are mostly written using the Message Passing Interface (MPI) standard, so the communication performance of the MPI library for a cluster is very important. This thesis investigates several different aspects of performance analysis for MPI libraries, on both distributed memory clusters and shared memory parallel computers. The performance evaluation was done using MPIBench, a new MPI benchmark program that provides some useful new functionality compared to existing MPI benchmarks. Since there has been only limited previous use of MPIBench, some initial work was done on comparing MPIBench with other MPI benchmarks, and improving its functionality, reliability, portability and ease of use. This work included a detailed comparison of results from the Pallas MPI Benchmark (PMB), SKaMPI, Mpptest, MPBench and MPIBench on both distributed memory and shared memory parallel computers, which has not previously been done. This comparison showed that the results for some MPI routines were significantly different between the different benchmarks, particularly for the shared memory machine. A comparison was done between Myrinet and Ethernet network performance on the same machine, an IBM Linux cluster with 128 dual processor nodes, using the MPICH MPI library. The analysis focused mainly on the scalability and variability of communication times for the different networks, making use of the capability of MPIBench to generate distributions of MPI communication times. The analysis provided an improved understanding of the effects of TCP retransmission timeouts on Ethernet networks. This analysis showed anomalous results for some MPI routines. 
Further investigation showed that this is because MPICH uses different algorithms for small and large message sizes for some collective communication routines, and the message size where this changeover occurs is fixed, based on measurements using a cluster with a single processor per node. Experiments were done to measure the performance of the different algorithms, which demonstrated that for some MPI routines the optimal changeover points were very different between Myrinet and Ethernet networks and for 1 and 2 processors per node. Significant performance improvements can be made by allowing the changeover points to be tuned rather than fixed, particularly for commodity Ethernet networks and for clusters with more than 1 process per node. MPIBench was also used to analyse the MPI performance and scalability of a large ccNUMA shared memory machine, an SGI Altix 3000 with 160 processors. The results were compared with a high-end cluster, an AlphaServer SC with Quadrics QsNet interconnect. For most MPI routines the Altix showed significantly better performance, particularly when non-buffered copy was used. MPIBench proved to be a very capable tool for analyzing MPI performance in a variety of different situations. / http://proxy.library.adelaide.edu.au/login?url= http://library.adelaide.edu.au/cgi-bin/Pwebrecon.cgi?BBID=1331421 / Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 2008
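The point about changeover message sizes can be illustrated with a simple latency/bandwidth cost model. Everything in this sketch is an assumption made for illustration: the two broadcast algorithms (binomial tree versus scatter followed by ring allgather), their textbook cost formulas, and the network parameters are not taken from the thesis or from MPICH measurements.

```python
import math


def binomial_tree_time(m, p, alpha, beta):
    """Modelled broadcast time: ceil(log2 p) rounds, each sending the full m bytes."""
    return math.ceil(math.log2(p)) * (alpha + m * beta)


def scatter_allgather_time(m, p, alpha, beta):
    """Modelled broadcast time: scatter, then ring allgather (more messages, less data each)."""
    return (math.ceil(math.log2(p)) + p - 1) * alpha + 2 * (p - 1) / p * m * beta


def changeover_size(p, alpha, beta, sizes):
    """Smallest candidate size at which the large-message algorithm is modelled as faster."""
    for m in sizes:
        if scatter_allgather_time(m, p, alpha, beta) < binomial_tree_time(m, p, alpha, beta):
            return m
    return None


sizes = [2 ** k for k in range(8, 24)]  # candidate message sizes in bytes
# Hypothetical networks: alpha = per-message latency (s), beta = time per byte (s).
slow_net = changeover_size(16, alpha=50e-6, beta=1e-8, sizes=sizes)
fast_net = changeover_size(16, alpha=10e-6, beta=4e-9, sizes=sizes)
```

Under these made-up parameters the modelled changeover differs between the two networks, which is the qualitative effect the abstract reports: a single fixed threshold cannot be optimal for every interconnect, so a tunable one is preferable.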
|
49 |
Online Task Scheduling on Heterogeneous Clusters: An Experimental Study
Rosenvinge, Einar Magnus January 2004
We study the problem of scheduling applications composed of a large number of tasks on heterogeneous clusters. Tasks are identical, independent from each other, and can hence be computed in any order. The goal is to execute all the tasks as quickly as possible. We use the Master-Worker paradigm, where tasks are maintained by the master, which hands out batches of a variable number of tasks to requesting workers. We introduce a new scheduling strategy, the Monitor strategy, and compare it to other strategies suggested in the literature. An image filtering application, known as matched filtering, has been used to compare the different strategies. Our implementation involves data staging techniques in order to circumvent the possible bottleneck incurred by the master, and multi-threading to prevent possible processor idleness.
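The Master-Worker setting described above can be explored with a small event-driven simulation. This is a sketch under assumptions: it does not implement the Monitor strategy, whose details the abstract does not give. It compares a fixed batch size with a decreasing-batch rule (half of an even share of the remaining tasks), and all names and parameters are hypothetical.

```python
import heapq
import math


def simulate(num_tasks, task_times, batch_size):
    """Return the makespan when idle workers repeatedly request batches from the master.

    task_times: seconds per task for each worker (heterogeneous speeds).
    batch_size: function (remaining_tasks, num_workers) -> tasks to hand out.
    """
    remaining = num_tasks
    idle = [(0.0, w) for w in range(len(task_times))]  # (time worker becomes idle, id)
    heapq.heapify(idle)
    makespan = 0.0
    while remaining > 0:
        now, w = heapq.heappop(idle)  # next worker to ask for work
        batch = max(1, min(remaining, batch_size(remaining, len(task_times))))
        remaining -= batch
        finish = now + batch * task_times[w]
        makespan = max(makespan, finish)
        heapq.heappush(idle, (finish, w))
    return makespan


def fixed(size):
    """Always hand out the same number of tasks."""
    return lambda remaining, n: size


def guided(remaining, n):
    """Hand out half of an even share of what is left, so batches shrink over time."""
    return math.ceil(remaining / (2 * n))
```

With one fast and one slow worker, for instance `simulate(100, [1.0, 4.0], fixed(50))`, large fixed batches leave the fast worker idle while the slow one finishes, whereas shrinking batches let the fast worker absorb most of the remaining work.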
|
50 |
Fast Sorting on a Distributed-Memory Architecture
Cheng, David R., Shah, Viral, Gilbert, John R., Edelman, Alan 01 1900
We consider the often-studied problem of sorting, for a parallel computer. Given an input array distributed evenly over p processors, the task is to compute the sorted output array, also distributed over the p processors. Many existing algorithms take the approach of approximately load-balancing the output, leaving each processor with Θ(n/p) elements. However, in many cases, approximate load-balancing leads to inefficiencies in both the sorting itself and in further uses of the data after sorting. We provide a deterministic parallel sorting algorithm that uses parallel selection to produce any output distribution exactly, particularly one that is perfectly load-balanced. Furthermore, when using a comparison sort, this algorithm is 1-optimal in both computation and communication. We provide an empirical study that illustrates the efficiency of exact data splitting, and shows an improvement over two sample sort algorithms. / Singapore-MIT Alliance (SMA)
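The idea of exact splitting can be sketched in a single-process simulation, where each inner list plays the role of one processor's local data. This is an illustrative stand-in rather than the paper's algorithm: it assumes integer keys, finds each splitter by binary search over the key range using only per-processor counts (a simple substitute for the paper's parallel selection), and breaks ties among equal keys by processor order. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from heapq import merge


def split_counts(parts, k):
    """Per-processor prefix lengths that together hold exactly the k smallest keys."""
    if k == 0:
        return [0] * len(parts)
    lo = min(part[0] for part in parts if part)
    hi = max(part[-1] for part in parts if part)
    while lo < hi:  # smallest key v with at least k elements <= v overall
        mid = (lo + hi) // 2
        if sum(bisect_right(part, mid) for part in parts) >= k:
            hi = mid
        else:
            lo = mid + 1
    counts = [bisect_left(part, lo) for part in parts]  # everything strictly below v
    spare = k - sum(counts)
    for i, part in enumerate(parts):  # hand out copies of v in processor order
        take = min(spare, bisect_right(part, lo) - counts[i])
        counts[i] += take
        spare -= take
    return counts


def exact_sort(local_data):
    """Sort so that output processor j holds global ranks [j*n//p, (j+1)*n//p)."""
    p = len(local_data)
    parts = [sorted(part) for part in local_data]  # local sort on each processor
    n = sum(len(part) for part in parts)
    bounds = [split_counts(parts, (j * n) // p) for j in range(p + 1)]
    output = []
    for j in range(p):
        pieces = [part[bounds[j][i]:bounds[j + 1][i]] for i, part in enumerate(parts)]
        output.append(list(merge(*pieces)))  # p-way merge of the received pieces
    return output
```

When `n` is divisible by `p`, every output list has exactly `n / p` elements even if many keys are equal, which is the perfectly load-balanced distribution the abstract contrasts with approximate splitting.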
|