1 |
A parallel adaptive method for pseudo-arclength continuation
Dubitski, Alexander 01 August 2011
We parallelize the pseudo-arclength continuation method for solving nonlinear systems
of equations. Pseudo-arclength continuation is a predictor-corrector method where the
correction step consists of solving a linear system of algebraic equations. Our algorithm
parallelizes adaptive step-length selection and inexact prediction. Prior attempts to parallelize
pseudo-arclength continuation are typically based on parallelization of the linear
solver, which leads to completely solver-dependent software. In contrast, our method is
completely independent of the internal solver and therefore applicable to a large domain
of problems. Our software is easy to use and does not require the user to have extensive
prior experience with High Performance Computing; all the user needs to provide is the
implementation of the corrector step. When corrector steps are costly or continuation
curves are complicated, we observe speedups of up to sixty percent with a moderate number
of processors. We present results for a synthetic problem and a problem in turbulence. / UOIT
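The predictor-corrector structure described in this abstract can be illustrated with a minimal serial sketch. It is not the thesis's parallel algorithm or its software interface: the test problem (the unit circle, f(x, lam) = x^2 + lam^2 - 1 = 0), the function names and the fixed step length `ds` are all assumptions chosen for illustration. The predictor steps along the tangent, and the corrector runs Newton's method on the system augmented with the arclength constraint, which is what lets the method pass the folds at lam = ±1 where continuation in lam alone breaks down.

```python
import math


def f(x, lam):
    """Hypothetical synthetic problem: f(x, lam) = 0 is the unit circle."""
    return x * x + lam * lam - 1.0


def grad_f(x, lam):
    """Partial derivatives (df/dx, df/dlam)."""
    return 2.0 * x, 2.0 * lam


def unit_tangent(x, lam, previous):
    """Unit tangent to the curve at (x, lam), oriented like the previous tangent."""
    fx, fl = grad_f(x, lam)
    tx, tl = -fl, fx                      # perpendicular to the gradient
    norm = math.hypot(tx, tl)
    tx, tl = tx / norm, tl / norm
    if tx * previous[0] + tl * previous[1] < 0.0:
        tx, tl = -tx, -tl                 # keep moving in the same direction
    return tx, tl


def corrector(x0, lam0, t, ds, tol=1e-12, max_iter=20):
    """Newton iteration on the augmented system
         f(x, lam) = 0
         t . ((x, lam) - (x0, lam0)) - ds = 0
    started from the tangent predictor."""
    x, lam = x0 + ds * t[0], lam0 + ds * t[1]   # predictor step
    for _ in range(max_iter):
        r1 = f(x, lam)
        r2 = t[0] * (x - x0) + t[1] * (lam - lam0) - ds
        if abs(r1) < tol and abs(r2) < tol:
            return x, lam
        fx, fl = grad_f(x, lam)
        det = fx * t[1] - fl * t[0]             # determinant of the 2x2 Jacobian
        x += (-r1 * t[1] + fl * r2) / det       # Cramer's rule for J d = -r
        lam += (r1 * t[0] - fx * r2) / det
    raise RuntimeError("corrector did not converge; try a smaller ds")


def trace_curve(x, lam, t, ds, n_steps):
    """Follow the solution curve for n_steps predictor-corrector steps."""
    points = [(x, lam)]
    for _ in range(n_steps):
        x, lam = corrector(x, lam, t, ds)
        t = unit_tangent(x, lam, t)
        points.append((x, lam))
    return points
```

Calling `trace_curve(1.0, 0.0, (0.0, 1.0), 0.1, 70)` walks once around the circle. In the setting the abstract describes, `corrector` is the piece the user would supply, and a parallel scheme could evaluate several candidate step lengths concurrently instead of the single fixed `ds` used here.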
|
2 |
Scalability-Driven Approaches to Key Aspects of the Message Passing Interface for Next Generation Supercomputing
Zounmevo, Ayi Judicael 23 May 2014
The Message Passing Interface (MPI), which dominates the supercomputing programming environment,
is used to orchestrate and fulfill communication in High Performance Computing (HPC).
How far HPC programs can scale depends in large part
on the ability to achieve fast communication and to overlap communication with computation or
communication with communication.
This dissertation proposes a new asynchronous solution to the nonblocking Rendezvous protocol used
between pairs of processes to transfer large payloads. On top of enforcing communication/computation
overlapping in a comprehensive way, the proposal trumps existing network device-agnostic
asynchronous solutions by being memory-scalable and by avoiding brute force strategies.
Achieving overlap between communication and computation is important, but each
communication is also expected to generate minimal latency. In that respect, the
processing of the queues meant to hold messages pending reception inside the MPI middleware is
expected to be fast. Currently though, that processing slows down when program scales grow.
This research presents a novel scalability-driven message queue whose processing skips
altogether large portions of queue items that are deterministically guaranteed to lead to
unfruitful searches. Because it is largely insensitive to program size, the proposed message
queue maintains very good performance
while displaying a low and flattening memory footprint growth pattern.
Due to the blocking nature of its required synchronizations, the one-sided
communication model of MPI creates both communication/computation and communication/communication
serializations. This research fixes these issues and latency-related inefficiencies documented for
MPI one-sided communications by proposing completely nonblocking and non-serializing versions for
those synchronizations. The improvements, meant for consideration in a future MPI standard,
also allow new classes of programs to be more efficiently expressed in MPI.
Finally, a persistent distributed service is designed over MPI to show its impacts
at large scales beyond communication-only activities.
MPI is analyzed in situations of resource exhaustion, partial failure and heavy use of internal
objects for communicating and non-communicating routines. Important scalability issues are revealed
and solution approaches are put forth. / Thesis (Ph.D, Electrical & Computer Engineering) -- Queen's University, 2014-05-23 15:08:58.56
|
3 |
Parallelisierung des Wellenfrontrekonstruktionsalgorithmus auf Multicore-Prozessoren / Parallelization of the Wavefront Reconstruction Algorithm on Multicore Processors
Schenke, Jonas 10 July 2018
The goal of this thesis was to accelerate the wavefront reconstruction algorithm implemented in Python by Elena-Ruxandra Cojocaru and Sébastien Bérujon. This algorithm computes, pixel by pixel, the wavefronts of the electromagnetic wave of an X-ray laser from two images of a sample. The images are captured by two highly sensitive X-ray CCD sensors positioned at a fixed distance from each other and from the sample. When the refracted X-ray beam hits these detectors, a distortion image is generated from which the wavefront can be reconstructed, which in turn allows conclusions to be drawn about the structure of the sample. Based on performance analyses of the Python implementation, optimizations and parallelization opportunities for the critical program sections were identified, implemented and evaluated. The fastest proposed solution distributes the image pairs across several groups of CPU cores and parallelizes the computation of each image pair within its group, which allows scaling across multiple nodes. Combined with the use of optimized libraries and compilation of the Python code, this yielded a speedup of up to four over the reference implementation at the same core count. With 120 cores, a speedup of up to 133 over the reference implementation running on a single core was achieved. The reference datasets were recorded at Beamline BM05 of the European Synchrotron Radiation Facility.
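The first level of the decomposition described above, handing image pairs to core groups, might look like the following sketch. The function name `split_into_groups` and the contiguous, near-equal chunking policy are assumptions made for illustration; the abstract does not say how the thesis assigns pairs to groups.

```python
def split_into_groups(items, n_groups):
    """Split items into n_groups contiguous chunks whose sizes differ by at most one.

    Hypothetical helper: each chunk would be handed to one group of cores.
    """
    base, extra = divmod(len(items), n_groups)
    chunks, start = [], 0
    for group in range(n_groups):
        size = base + (1 if group < extra else 0)  # first `extra` groups get one more
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

For example, ten image pairs split across three groups give chunks of four, three and three pairs. The second level, parallelizing the computation of a single pair within its group, is not shown here.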
|
4 |
Lazy Fault Recovery for Redundant MPI
Saliba, Elie 01 June 2019
Distributed systems (DS), in which multiple computers share a workload across a network, are used everywhere, from data-intensive computations to storage and machine learning. DS provide a relatively cheap and efficient solution that offers stability and improved performance for computationally intensive applications. In a DS, faults and failures are the norm, not the exception. Data corruption can occur at any moment, especially since a DS usually consists of hundreds to thousands of units of commodity hardware. The large number and the quality of the components guarantee, by probability, that at any given time some components will not be working and some will not recover from failure. DS can experience problems caused by application bugs, operating system bugs, and failures of disks, memory, connectors, networking, power supplies, and other components; therefore, constant monitoring and failure detection are fundamental, and automatic recovery must be integral to the system. One of the most commonly used programming models for DS is the Message Passing Interface (MPI). Unfortunately, MPI does not support fault detection or recovery. In this thesis, we build a recovery mechanism based on replicas that works on top of the asynchronous fault detection implemented in previous work. Results show that our recovery implementation is successful and that the overhead in execution time is minimal.
|
5 |
Distribuované zpracování zachycené síťové komunikace / Captured Communication Processing on Distributed System
Hvězda, Matěj January 2016
When you need to assess or troubleshoot a network by analysing a capture file, you want it done as fast as possible, and you do not always have a high-performance computer. This is where a distributed system comes in, offering high computing power and a large amount of available memory. I introduce a distributed application that is scalable, extensible and capable of processing captured network communication. It is developed for the Windows platform, which provides technologies such as Microsoft HPC Pack and Windows Communication Foundation. The application supports multiple capture formats. A database within the parallel system (cluster) stores statistics and data from the captured communication. This saves memory on the user's computer, so the client application can run on low-performance machines, and it makes the data available to clients after distributed processing.
|
6 |
The performance evaluation of workstation clusters
Melas, Panagiotis January 2000
No description available.
|
7 |
Software support for advanced applications on distributed memory multiprocessor systems
Chapman, Barbara Mary January 1998
No description available.
|
8 |
Performance Models For Distributed Memory HPC Systems And Deep Neural Networks
Cardwell, David 12 1900
Indiana University-Purdue University Indianapolis (IUPUI) / Performance models are useful as mathematical models to reason about the behavior of different computer systems while running various applications. In this thesis, we aim to provide two distinct performance models: one for distributed-memory high performance computing systems with network communication, and one for deep neural networks. Our main goal for the first model is insight and simplicity, while for the second we aim for accuracy in prediction. The first model is generalized for networked multi-core computer systems, while the second is specific to deep neural networks on a shared-memory system.
|
9 |
Data-flow vs control-flow for extreme level computing
Evripidou, P., Kyriacou, Costas January 2013
This paper challenges the current thinking for building High Performance Computing (HPC) systems, which is based on sequential computing, also known as the von Neumann model, by proposing the use of novel systems based on the Dynamic Data-Flow model of computation. The switch to multi-core chips has brought parallel processing into the mainstream. The computing industry and research community were forced to make this switch because they hit the power and memory walls. Will the same happen with HPC? In 2007 the United States, through its DARPA agency, commissioned a study to determine what kind of technologies would be needed to build an Exaflop computer. The head of the study was very pessimistic about the possibility of having an Exaflop computer in the foreseeable future. We believe that many of the findings behind this pessimistic outlook were due to the limitations of the sequential model. A paradigm shift might be needed in order to achieve affordable Exascale-class supercomputers.
|
10 |
Radar Signal Processing with Graphics Processors (GPUS)
Pettersson, Jimmy, Wainwright, Ian January 2010
No description available.
|