161 |
Characterization of FPGA-based High Performance Computers
Pimenta Pereira, Karl Savio 02 September 2011
As CPU clock frequencies plateau and the doubling of CPU cores per processor exacerbates the memory wall, hybrid-core computing, which augments CPUs with FPGAs and/or GPUs, holds the promise of addressing high-performance computing demands, particularly with respect to performance, power, and productivity. Traditional benchmarks for high-performance computers, such as SPEC, take an architecture-based approach and do not completely express the parallelism that exists in FPGA and GPU accelerators. This thesis follows an application-centric approach, comparing the sustained performance of two key computational idioms with respect to performance, power, and productivity. Specifically, a complex, single-precision, floating-point, 1D Fast Fourier Transform (FFT) and a molecular dynamics modeling application are implemented on state-of-the-art FPGA and GPU accelerators. The results show that FPGA floating-point FFT performance is highly sensitive to the mix of dedicated FPGA resources, in particular DSP48E slices, block RAMs, and FPGA I/O banks. Estimated results show that, for the floating-point FFT benchmark on FPGAs, these resources are the performance-limiting factor. Fixed-point FFTs are important in many high-performance embedded applications. For an integer-point FFT, FPGAs exploit a flexible datapath width to trade off circuit cost against speed of computation, improving performance and resource utilization. GPUs, having a fixed data-width architecture, cannot fully take advantage of this. For the molecular dynamics application, FPGAs benefit from the flexibility to create a custom, tightly pipelined datapath and a highly optimized memory subsystem of the accelerator. This can provide a 250-fold improvement over an optimized CPU implementation and a 2-fold improvement over an optimized GPU implementation, along with massive power savings.
Finally, extracting the maximum performance from the FPGA requires each implementation to balance the formulation of the algorithm on the platform, the optimum use of available external memory bandwidth, and the availability of computational resources, at the expense of greater programming effort. / Master of Science
|
162 |
FPGA-Based Accelerator Development for Non-Engineers
Uliana, David Christopher 02 June 2014
In today's world of big-data computing, access to massive, complex data sets has reached an unprecedented level, and the task of intelligently processing such data into useful information has become a growing concern to the high-performance computing community.
However, domain experts, who are the brains behind this processing, typically lack the skills required to build FPGA-based hardware accelerators ideal for their applications, as traditional development flows targeting such hardware require digital design expertise.
This work proposes a usable, end-to-end accelerator development methodology that attempts to bridge this gap between domain experts and the vast computational capacity of FPGA-based heterogeneous platforms.
To accomplish this, two development flows were assembled, both targeting the Convey Hybrid-Core HC-1 heterogeneous platform and utilizing existing graphical design environments for design entry.
Furthermore, incremental implementation techniques were applied to one of the flows to accelerate bitstream compilation, improving design productivity.
The efficacy of these flows in extending FPGA-based acceleration to non-engineers in the life sciences was informally tested at two separate instances of an NSF-funded summer workshop, organized and hosted by the Virginia Bioinformatics Institute at Virginia Tech.
In both workshops, groups of four or five non-engineer participants made significant modifications to a bare-bones Smith-Waterman accelerator, extending functionality and improving performance. / Master of Science
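For readers unfamiliar with the kernel that the workshop participants modified, the following is a minimal software sketch of Smith-Waterman local-alignment scoring. It is not the accelerator design from the thesis: the function name `smith_waterman_score` and the scoring parameters (match +2, mismatch -1, linear gap -1) are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between sequences a and b.

    Scoring values are illustrative assumptions, not taken from the thesis.
    Only two rows of the dynamic-programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells along an anti-diagonal are independent of one another, which is the property that hardware implementations of this algorithm commonly exploit.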
|
163 |
On the Use of Containers in High Performance Computing
Abraham, Subil 09 July 2020
The lightweight, portable, and flexible nature of containers is driving their widespread adoption in cloud solutions. Data analysis and deep learning applications have especially benefited from containerized solutions. As such data analysis is also being utilized in the high performance computing (HPC) domain, the need for container support in HPC has become paramount. However, container adoption in HPC faces crucial performance and I/O challenges. One obstacle is that, while container solutions for HPC exist, they have not been thoroughly investigated, especially with respect to their impact on the crucial I/O throughput needs of HPC. To this end, this paper provides a first-of-its-kind empirical analysis of representative state-of-the-art container solutions (Docker, Podman, Singularity, and Charliecloud) in HPC environments, especially how containers interact with HPC storage systems. We present the design of an analysis framework that is deployed on all nodes in an HPC environment and captures CPU, memory, network, and file I/O statistics from the nodes and the storage system. Our analysis yields key insights: for example, Charliecloud outperforms the other container solutions in container start-up time, while Singularity and Charliecloud are equivalent in I/O throughput. This comes at a cost, however, as Charliecloud invokes the most metadata and I/O operations on the underlying Lustre file system. By identifying such optimization opportunities, we can enhance the performance of containers atop HPC and help the aforementioned applications. / Master of Science / Containers are a technology that allows applications to be packaged along with their ideal environment, all the way down to their preferred operating system. This allows an application to run anywhere that supports containers without a large hit to application performance. Hence containers have seen wide adoption in the cloud.
These qualities have also made them very appealing for scientific research in national labs. Modern research relies heavily on computing to model, simulate, and test the behavior of real-world entities, often making use of large amounts of data and utilizing machine learning and deep learning. Doing this often requires the high performance computing power found in supercomputers. In most cases, scientists want to write their code and expect it to just work. Their applications might depend on other source code that forms part of their standard toolkit and that they expect to be installed in the supercomputing environment. This may not always be the case, which takes the scientists' focus away from their work in order to ensure their requirements are set up in the supercomputing environment, something that might require extensive cooperation with the operations team responsible for the supercomputers. Containers solve this problem because they can package everything together. However, the use of containers in these environments has not been extensively tested, especially for applications that are very heavy on the analysis of large quantities of data. To fill this gap, this work analyzes the performance of several state-of-the-art container technologies (Docker, Podman, Singularity, Charliecloud), with a particular focus on their interaction with the Lustre data storage systems widely used in supercomputing environments. As part of this work, we design an analysis setup that captures the behavior of various aspects of the high performance computing environment, such as CPU, memory, network usage, and data movement, while using containers to run data-heavy applications. We garner important insights about their performance that can help inform the best choice of container technology for a given environment and kind of application.
|
164 |
Using Task Parallelism for Distributed Parallel Skeleton Programming : Implementing a StarPU Back-End to SkePU 2 / Distribuerade parallellprogrammeringsskelett genom uppgiftsparallellism : Implementation av en StarPU-baserad SkePU 2 backend
Henrik, Henriksson January 2024
We extended the parallel skeleton programming framework SkePU 2 with a new back-end utilizing StarPU, a task programming framework for hybrid and distributed architectures. The aim was to allow SkePU to run on distributed clusters, using MPI through StarPU. The implemented back-end distributes data and work across participating ranks. While we did not implement the full SkePU API, the Map and Reduce1D skeletons were successfully implemented. During the implementation, we discovered some differences in API design between SkePU and StarPU. We combine the type-safe templates used in the SkePU API with the C-style, void*-heavy API of StarPU. This requires the implementation to use more complex templates than normally desired. While we could preserve most of the SkePU 2 API when moving to a distributed-memory setting, some parts had to change. In particular, we needed to change the semantics of SkePU 2 containers with regard to iterators and random access. We benchmarked the performance of the implemented back-end against an MPI+OpenMP reference implementation on two problems, n-body and a simple reduction. While the n-body problem demonstrates promising scaling properties, reductions do not scale well to larger numbers of ranks. A performance comparison against the MPI+OpenMP reference implementation reveals that, aside from the higher communication overhead, there may also be some overhead in the work performed between communications, potentially performing at below 60-70% of the reference. In most cases, the new back-end to SkePU exhibits significantly lower performance than the reference. Extending the implemented solution to cover the full API and improving performance could provide a high-level interface to distributed programming for application programmers. Indeed, subsequent developments of SkePU 3 extend and improve our StarPU back-end.
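The idea of distributing a Map followed by a Reduce across ranks can be pictured with a small single-process simulation. This is not SkePU or StarPU code and makes no claim about how the thesis partitions data: `split_blocks`, `distributed_map_reduce`, and the block partitioning scheme are hypothetical names and choices made purely for illustration.

```python
from functools import reduce

def split_blocks(data, ranks):
    """Partition data into `ranks` contiguous blocks of near-equal size."""
    base, extra = divmod(len(data), ranks)
    blocks, start = [], 0
    for r in range(ranks):
        size = base + (1 if r < extra else 0)  # first `extra` ranks get one more item
        blocks.append(data[start:start + size])
        start += size
    return blocks

def distributed_map_reduce(data, map_fn, reduce_fn, identity, ranks=4):
    """Simulate a Map skeleton followed by a 1D Reduce over several ranks.

    reduce_fn is assumed to be associative, with `identity` as its neutral
    element; otherwise the partitioned result may differ from a serial one.
    """
    blocks = split_blocks(data, ranks)
    # Each simulated rank maps and reduces only its local block.
    partials = [reduce(reduce_fn, map(map_fn, block), identity) for block in blocks]
    # Combining the partial results stands in for the communication step.
    return reduce(reduce_fn, partials, identity)
```

For example, `distributed_map_reduce(list(range(10)), lambda x: x * x, lambda a, b: a + b, 0, ranks=3)` sums the squares of 0 through 9. The final combining step is the only part that needs data from every rank, which gives some intuition for why a reduction's communication cost can come to dominate as the number of ranks grows.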
|
165 |
Failure Analysis Modelling in an Infrastructure as a Service (Iaas) Environment
Mohammed, Bashir, Modu, Babagana, Maiyama, Kabiru M., Ugail, Hassan, Awan, Irfan U., Kiran, Mariam 30 October 2018
Failure prediction has long been known to be a challenging problem. With the evolving trend of technology and the growing complexity of high-performance cloud data centre infrastructure, focusing on failure becomes vital, particularly when designing systems for the next generation. Traditional runtime fault-tolerance (FT) techniques, such as data replication and periodic checkpointing, are not very effective at handling current state-of-the-art emerging computing systems. This has created an urgent need for a robust system with an in-depth understanding of system and component failures, as well as the ability to accurately predict potential future system failures. In this paper, we studied data on in-production faults recorded over a five-year period at the National Energy Research Scientific Computing Center (NERSC). Using the data collected from the Computer Failure Data Repository (CFDR), we developed an effective failure prediction model focusing on high-performance cloud data centre infrastructure. Using the Auto-Regressive Moving Average (ARMA), our model was able to predict potential future failures in the system. Our results also show a failure prediction accuracy of 95%.
|
166 |
Event Sequence Identification and Deep Learning Classification for Anomaly Detection and Predication on High-Performance Computing Systems
Li, Zongze 12 1900
High-performance computing (HPC) systems continue growing in both scale and complexity. These large-scale, heterogeneous systems generate tens of millions of log messages every day. Effective log analysis for understanding system behaviors and identifying system anomalies and failures is highly challenging. Existing log analysis approaches use line-by-line message processing. They are not effective for discovering subtle behavior patterns and their transitions, and thus may overlook some critical anomalies. In this dissertation research, I propose a system log event block detection (SLEBD) method which can accurately and automatically extract the log messages that belong to a component or system event into an event block (EB). At the event level, we can discover new event patterns, the evolution of system behavior, and the interaction among different system components. To find critical event sequences, existing sequence mining methods are mostly based on the Apriori algorithm, which is compute-intensive and runs for a long time. I develop a novel, topology-aware sequence mining (TSM) algorithm which efficiently generates sequence patterns from the extracted event block lists. I also train a long short-term memory (LSTM) model to cluster sequences before specific events. With the generated sequence patterns and the trained LSTM model, we can predict whether an event is going to occur normally or not. To accelerate such predictions, I propose a design flow by which we can convert recurrent neural network (RNN) designs into register-transfer level (RTL) implementations which are deployed on FPGAs. According to our experimental results, the FPGA, owing to its high parallelism and low power, achieves a greater speedup and better energy efficiency than the CPU and GPU.
|
167 |
A live imaging paradigm for studying Drosophila development and evolution
Schmied, Christopher 30 March 2016
Proper metazoan development requires that genes are expressed in a spatiotemporally controlled manner, with tightly regulated levels. Altering the expression of genes that govern development leads mostly to aberrations. However, alterations can also be beneficial, leading to the formation of new phenotypes, which contributes to the astounding diversity of animal forms. In the past, the expression of developmental genes has been studied mostly in fixed tissues, an approach that cannot visualize these highly dynamic processes. We combine genomic fosmid transgenes, expressing genes of interest close to endogenous conditions, with Selective Plane Illumination Microscopy (SPIM) to image the expression of genes live with high temporal resolution and at single-cell level in the entire embryo.
In an effort to expand the toolkit for studying Drosophila development we have characterized the global expression patterns of various developmentally important genes in the whole embryo. To process the large datasets generated by SPIM, we have developed an automated workflow for processing on a High Performance Computing (HPC) cluster.
In a parallel project, we wanted to understand how spatiotemporally regulated gene expression patterns and levels lead to different morphologies across Drosophila species. To this end, we used SPIM to compare the expression of transcription factors (TFs) encoded by Drosophila melanogaster fosmids with that of their orthologous Drosophila pseudoobscura counterparts by expressing both fosmids in D. melanogaster. Here, we present an analysis of the divergence of expression of orthologous genes compared A) directly, by expressing the fosmids, tagged with different fluorophores, in the same D. melanogaster embryo, or B) indirectly, by expressing the fosmids, tagged with the same fluorophore, in separate D. melanogaster embryos.
Our workflow provides a powerful methodology for the study of gene expression patterns and levels during development; such knowledge is a basis for understanding both their evolutionary relevance and their developmental function.
|
168 |
Contributions à la modélisation mathématique et à l'algorithmique parallèle pour l'optimisation d'un propagateur d'ondes élastiques en milieu anisotrope / Contributions to the mathematical modeling and to the parallel algorithmic for the optimization of an elastic wave propagator in anisotropic media
Boillot, Lionel 12 December 2014
The most widespread imaging method in the oil industry is RTM (Reverse Time Migration), which relies on simulating wave propagation in the subsurface. We focused on a 3D elastic wave propagator in anisotropic media, more precisely TTI (Tilted Transverse Isotropic) media.
We worked directly in the Total research code DIVA (Depth Imaging Velocity Analysis), which is based on a discretization by the Discontinuous Galerkin method and the Leap-Frog scheme, and was developed for intensive parallel computing, or HPC (High Performance Computing). We chose to target two contributions in particular. Although they require very different skills, they share the same goal: to reduce the computational cost of the simulation. On the one hand, classical boundary conditions such as PML (Perfectly Matched Layers) are unstable in TTI media. We have proposed a formulation of a stable ABC (Absorbing Boundary Condition) for anisotropic media. The construction is based on the properties of slowness curves, which gives our approach its originality. On the other hand, the initial parallelism, which is based on domain decomposition and message-passing communication through the MPI library, leads to load imbalance and therefore poor parallel efficiency. We fixed this issue by replacing the parallel paradigm with task-based programming on top of a runtime system. This PhD thesis was carried out in the framework of the research action DIP (Depth Imaging Partnership) between the Total oil company and Inria.
|
169 |
Jobzentrisches Monitoring in Verteilten Heterogenen Umgebungen mit Hilfe Innovativer Skalierbarer Methoden
Hilbrich, Marcus 24 June 2015
An increasing number of program executions (jobs) per user is an ongoing trend in scientific computing. Increasing numbers of available compute cores and lower access barriers, based on portal systems, workflow systems, or services, drive this trend. At the same time, the abstraction layers that enable grid and cloud solutions pose challenges in observing job behaviour. Thus, observation and monitoring capabilities for large numbers of jobs are lacking. Job-centric monitoring offers a solution by presenting job executions in a transparent manner.
This dissertation presents methods for scalable infrastructures that handle monitoring data of jobs in grid, cloud, and HPC (High Performance Computing) solutions. A layer-based organisation of servers with a distributed storage scheme enables a division of tasks that respects network bandwidths and storage capacities. Additionally, three proposed automatic analysis techniques enable an evaluation of huge data quantities.
One of the developed algorithms is based on cross-correlation and uses a tree-based optimisation strategy to decrease both runtime and memory usage. These three methods are able to significantly reduce the number of jobs for manual analysis from many thousands to a few interesting jobs that exhibit outlier behaviour during job execution. Contributions of this thesis include a design, a prototype implementation, and an evaluation, through measurements and theoretical analysis, of methods that analyse large amounts of job data, as well as of the scalable storage concept for such data.
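As a rough illustration of how cross-correlation can separate ordinary jobs from outliers, the sketch below compares each job's metric time series against a reference series and flags jobs whose best normalized correlation over a few lags stays below a threshold. It is not the algorithm from the dissertation and omits its tree-based optimisation; `normalized_cross_correlation`, `find_outliers`, `max_lag`, and `threshold` are hypothetical names and parameters chosen for this example.

```python
import math

def normalized_cross_correlation(x, y, lag=0):
    """Correlation coefficient between x and y, with x shifted by `lag` samples."""
    if lag >= 0:
        xs, ys = x[lag:], y[:len(y) - lag]
    else:
        xs, ys = x[:lag], y[-lag:]
    n = min(len(xs), len(ys))
    if n < 2:
        return 0.0
    xs, ys = xs[:n], ys[:n]
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = math.sqrt(sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys))
    return num / den if den else 0.0

def find_outliers(jobs, reference, max_lag=2, threshold=0.8):
    """Return sorted names of jobs that never correlate well with the reference.

    `jobs` maps a job name to its metric series; the threshold is an
    assumption and would need tuning on real monitoring data.
    """
    outliers = []
    for name, series in jobs.items():
        best = max(normalized_cross_correlation(series, reference, lag)
                   for lag in range(-max_lag, max_lag + 1))
        if best < threshold:
            outliers.append(name)
    return sorted(outliers)
```

Searching over a small range of lags lets a job that behaves like the reference, but starts slightly earlier or later, still count as normal. A pairwise comparison of this kind is expensive for many thousands of jobs, which is the sort of cost that an optimisation strategy such as the one described above would aim to reduce.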
|
170 |
System Profiling and Green Capabilities for Large Scale and Distributed Infrastructures
Tsafack Chetsa, Ghislain Landry 03 December 2013
Nowadays, reducing the energy consumption of large scale and distributed infrastructures has truly become a challenge for both industry and academia. This is corroborated by the many efforts aiming to reduce the energy consumption of those systems. Initiatives for reducing the energy consumption of large scale and distributed infrastructures can, without loss of generality, be broken into hardware and software initiatives. Unlike their hardware counterparts, software solutions to the energy reduction problem in large scale and distributed infrastructures rarely result in real deployments. On the one hand, this can be explained by the fact that they are application oriented. On the other hand, their failure can be attributed to their complex nature, which often requires vast technical knowledge of the proposed solutions and/or a thorough understanding of the applications at hand. This restricts their use to a limited number of experts, because users usually lack adequate skills. In addition, although subsystems including the memory are becoming more and more power hungry, current software energy reduction techniques fail to take them into account. This thesis proposes a methodology for reducing the energy consumption of large scale and distributed infrastructures. Broken into three steps known as (i) phase identification, (ii) phase characterization, and (iii) phase identification and system reconfiguration, our methodology abstracts away from any individual application: it focuses on the infrastructure, analyses its runtime behaviour, and takes reconfiguration decisions accordingly. The proposed methodology is implemented and evaluated in high performance computing (HPC) clusters of varied sizes through a Multi-Resource Energy Efficient Framework (MREEF). MREEF implements the proposed energy reduction methodology so as to leave users with the choice of implementing their own system reconfiguration decisions depending on their needs.
Experimental results show that our methodology reduces the energy consumption of the overall infrastructure by up to 24% with less than 7% performance degradation. By taking into account all subsystems, our experiments demonstrate that the energy reduction problem in large scale and distributed infrastructures can benefit from more than "the traditional" processor frequency scaling. Experiments in clusters of varied sizes demonstrate that MREEF, and therefore our methodology, can easily be extended to a large number of energy-aware clusters. The extension of MREEF to virtualized environments such as clouds shows that the proposed methodology goes beyond HPC systems and can be used in many other computing environments.
|