• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 8
  • 3
  • 2
  • 2
  • 1
  • Tagged with
  • 20
  • 20
  • 8
  • 8
  • 7
  • 6
  • 5
  • 5
  • 4
  • 4
  • 4
  • 4
  • 4
  • 4
  • 3
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Popcorn Linux: A Compiler and Runtime for Execution Migration Between Heterogeneous-ISA Architectures

Lyerly, Robert Frantz 25 April 2019 (has links)
In recent years there has been a proliferation of parallel and heterogeneous architectures. As chip designers have hit fundamental limits in traditional processor scaling, they have begun rethinking processor architecture from the ground up. In addition to creating new classes of processors, chip designers have revisited CPU microarchitecture in order to target different computing contexts. CPUs have been optimized for low-power smartphones and extended for high-performance computing in order to achieve better performance energy efficiency for heavy computational tasks. Although heterogeneity adds significant complexity to both hardware and software, recent works have shown tremendous power and performance benefits obtainable through specialization. It is clear that emerging systems will be increasingly heterogeneous. Many of these emerging systems couple together cores of different instruction set architectures (ISA), due to both market forces and the potential performance and power benefits in optimizing application execution. However, differently from symmetric multiprocessors or even asymmetric single-ISA multiprocessors, natively compiled applications cannot freely migrate between heterogeneous-ISA processors. This is due to the fact that applications are compiled to an instruction set architecture-specific format which is incompatible on other instruction set architectures. This creates serious limitations, as execution migration is a fundamental mechanism used by schedulers to reach performance or fairness goals, allows applications to migrate between heterogeneous-ISA CPUs in order to accelerate parallel applications or even leverage ISA-heterogeneity for security benefits. This dissertation describes system software for automatically migrating natively compiled applications across heterogeneous-ISA processors. This dissertation describes the implementation and evaluation of a complete software stack on commodity scale heterogeneous-ISA CPUs, emulating datacenters with heterogeneous-ISA systems or future systems that tightly integrate heterogeneous-ISA CPUs via point-to-point interconnect. This dissertation describes a compiler which builds applications for heterogeneous-ISA execution migration. The compiler generates machine code for every architecture in the system and lays out the application's code and data in a common format. In addition, the compiler generates metadata used by a state transformation runtime to dynamically transform thread execution state between ISA-specific formats, allowing application threads to migrate between different ISAs. The compiler and runtime is evaluated in conjunction with a replicated-kernel operating system, which provides thread migration and distributed shared virtual memory across heterogeneous-ISA processors. This redesigned software stack is evaluated on a setup containing and ARM and an x86 processor interconnected via point-to-point interconnect over PCIe. This dissertation shows that sub-millisecond state transformation is achievable. Additionally, it shows that for a datacenter-like workload using benchmarks from the NAS Parallel Benchmark suite, the system can trade some performance for up to a 66% reduction in energy and up to an 11% reduction in energy-delay product. This dissertation then describes an exploration into using hardware transactional memory (HTM) to maximize scheduling flexibility. Because applications can only migrate between ISAs at program locations with specific properties, there may be a significant delay between when the scheduler wishes to migrate an application and when the application can respond to the migration request. In order to reduce this migration response time, this dissertation describes compiler instrumentation which uses HTM to allow the scheduler to force applications to roll back to the most recently encountered program location suitable for migration. This is evaluated both in terms of overhead and responsiveness to migration requests. In addition to showing the viability of the infrastructure for optimizing workload placement in a heterogeneous-ISA datacenter, this dissertation also demonstrates utilizing the infrastructure to accelerate multithreaded applications. This dissertation describes a new OpenMP runtime named libopenpop that is optimized for executing applications in heterogeneous- ISA systems with distributed shared virtual memory. The runtime utilizes synchronization primitives that enable scale-out execution across rack-scale systems and new work distribution mechanisms that predict the best partitioning of parallel work across CPUs with diverse architectural characteristics. libopenpop demonstrates sizable improvements over a na¨ıve OpenMP implementation – a 38x improvement in multi-server barrier latency, a 5.4x improvement in multi-server data reductions and a geometric mean speedup of 4.04x for scalable applications in an 8-node x86-64 cluster. For a heterogeneous system composed of a highly-clocked x86 server and a highly-parallel ARM server, libopenpop delivers up to a 4.7x speedup and a geometric mean speedup of 41% across benchmarks from several benchmark suites versus the best single-node homogeneous execution. Finally, this dissertation describes leveraging the compiler and state transformation runtime to provide enhanced security for applications. Because the compiler provides detailed information about the stack layout of applications, it can be leveraged to defend against exploits such as stack smashing attacks and return-oriented programming attacks. This dissertation describes Chameleon, a runtime which uses the compiler and state transformation infrastructure to continuously re-randomize the stack layout and code of vulnerable applications to thwart attackers. Chameleon attaches to applications using existing operating system interfaces and periodically switches the application to new randomized stack layouts and code by rewriting the stack. Chameleon enhances security with little overhead – it disrupts a geometric mean 76.32% of code gadgets in benchmark binaries, randomizes stack element locations with geometric mean 3 potential randomized locations, and has 1.1% overhead when re-randomizing every 50 milliseconds, making it extremely difficult for attackers to exploit target applications. / Doctor of Philosophy / Computer processors have experienced unprecedented performance improvements over the past 50 years. However, due to physical limitations of how processors execute, in recent years this performance growth has started to slow. In order to continue scaling performance, chip designers have begun diversifying processor designs to meet different performance and power consumption targets. Processors specialized for different contexts use various instruction set architectures (ISAs), the operations made available for use by the hardware. Programs built for one instruction set architecture are not compatible with others, requiring developers to build complex applications to manually bridge the gap. This leads to brittle applications and prevents the system software managing the processors from adapting workloads to match processor characteristics. This dissertation presents the Popcorn Linux system software which provides transparent support for running applications across computers composed of processors of multiple ISAs. Popcorn Linux provides the ability to migrate applications between these processors without requiring developers to add any application instrumentation – the system software manages all the details of building and migrating applications. An evaluation of Popcorn Linux shows that transparently migrating applications between diverse processors provides power and performance benefits in a variety of scenarios. Additionally, this dissertation describes leveraging the Popcorn Linux software infrastructure to harden applications against attackers seeking to hijack applications for malicious purposes.
2

Leveraging Processor-diversity For Improved Performance In Heterogeneous-ISA Systems

Pang, Yihan 05 November 2019 (has links)
The purpose of this thesis is to investigate the effectiveness of executing High Performance Computing (HPC) workloads on multiprocessors with heterogeneous Instruction Set Architecture (ISA) cores. ISA-heterogeneity in processor designs provides a unique dimension for researchers to explore performance benefits through diversity in design choices. Additionally, each application has a natural preference to one processor in a selected group of processors (we defined this term as processor-preference), and processor-preference is highly affected by processor design choices. Thus, a system with heterogeneous-ISA cores offers an intriguing design perspective, packing heterogeneous-ISA cores in the same processor or system that compensate each other in dynamic workload scenarios. This thesis considers dynamic migrating applications with different processor-preferences across ISA-different cores to exploit the potential of this idea. With SIMD instructions getting more attention from chip designers, this thesis also presents the necessary modifications for a general compiler/run-time infrastructure to transform the dynamic program state of SIMD regions at run-time from one ISA format to another for cross-ISA migration and execution. Lastly, this thesis presents a processor-preference-aware scheduling policy that makes dynamic cross-ISA migration decisions that improve overall system throughput compared to homogeneous-ISA systems. This thesis prototypes a heterogeneous-ISA system using an Intel Xeon Gold 5118 x86-64 server and a Cavium ThunderX ARMv8 server and evaluates the effectiveness of our infrastructure and scheduling policy. Our results reveal that heterogeneous-ISA systems that are processor-preference-aware and with cross-ISA execution migration capability can yield throughput gains up to 36\% compared to traditional homogeneous ISA systems. / Master of Science / The author of this thesis has a family full of non-engineers. To persuade family members that the work of this thesis is meaningful, aka the author is not procrastinating in school, the author decided to draw an analogy between processors and cars. Suppose in an alternative universe, cars (systems) can be powered by engines (processors) that uses two different fuel-sources (ISAs): gasoline or electric (single-ISA) processors but not both (heterogeneous-ISA). Car manufacturers (chip designers) can build engines with different design choices (processors with varying design options): engines combined with turbochargers for gasoline-powered cars, high-performance batteries combined with energy-efficient batteries for electric-powered cars (added extended instruction sets, CPU designs that target vastly different use cases, etc.). However, each design choice is limited to improving performance for a specific type of fuel-source based engine. For example, having battery alternatives has no performance impact on gasoline-powered engines. As time passes by, car manufacturers have exhausted options to make a drastic improvement to their existing engine designs (limited performance gains in recent chips). To tackle this problem, in this thesis, the author first examined the usage of cars: driving on the road (running applications). The author's study found that no single engine is suitable for all routes (no single processor is good for all workloads), and cars powered by different fuel-source based engines showed a significant diversity in performance (application performance varies drastically between systems with processors built on different ISAs). Gasoline-powered cars perform well on high-speed roads, whereas electric-powered cars perform well on low-speed roads. Unfortunately, in real life, a person's commute (a workload of applications) consists of a mixture of high-speed roads and low-speed roads, and one cannot know the exact percentage of each kind of path they travel (exact application composition in a workload) beforehand. Therefore it is challenging for a person to make the correct car selection for the incoming commute (choose the right system for a workload). This thesis tries to solve this commuting problem by building a car that has multiple engines fitted to suit different road needs (systems with processors that have vastly different use cases). This thesis looks at a particular dimension of combining various fuel-powered engines in the same car (a system with heterogeneous-ISA processors). The author believes that adding diversity in fuel-powered engine selections provide an exciting dimension in car design choices (adding ISA-heterogeneity in processors provide a unique dimension in system design). Thus, this thesis focuses on estimating a theoretical multi fuel-powered car's performance by combining two different fuel-powered cars into a single mega-car using some framework (Popcorn Linux). This framework allows this mega-car to be driven by a combined fuel source with fuel intake freely transfer between fuel-sources (cross-ISA migration and execution) based on road conditions (application encountered). Based on the evaluation of this new prototype, the author finds that in a real-life scenario (workload with mixed application combination), cars with multiple fuel-source based engines have better performance than two single fuel-source based cars (systems with heterogeneous-ISAs processors perform better than systems with homogeneous-ISAs processors). The author hopes that this study can help build the foundation for the development of hybrid cars (system with heterogeneous-ISAs in the same processor) in the future as well as the consideration of modifying existing car into a mega-car with multiple engines suited for different road needs for improved commute performance for now. Ultimately, this thesis is not about cars. The author hopes that by explaining the research done in this paper through cars, general audiences can understand what this work is trying to investigate and what solution they have provided. In this work, we investigate the potential of a system with heterogeneous-ISA processors. This thesis prototypes one such system and finds that heterogeneous-ISA systems have performance benefits than traditional homogeneous-ISA systems over a series of experiment evaluations.
3

Mapping a Dataflow Programming Model onto Heterogeneous Architectures

Sbirlea, Alina 06 September 2012 (has links)
This thesis describes and evaluates how extending Intel's Concurrent Collections (CnC) programming model can address the problem of hybrid programming with high performance and low energy consumption, while retaining the ease of use of data-flow programming. The CnC model is a declarative, dynamic light-weight task based parallel programming model and is implicitly deterministic by enforcing the single assignment rule, properties which ensure that problems are modelled in an intuitive way. CnC offers a separation of concerns by allowing algorithms to be expressed as a two stage process: first by decomposing a problem into components and specifying how components interact with each other, and second by providing an implementation for each component. By facilitating the separation between a domain expert, who can provide an accurate problem specification at a high level, and a tuning expert, who can tune the individual components for better performance, we ensure that tuning and future development, such as replacement of a subcomponent with a more efficient algorithm, become straightforward. A recent trend in mainstream desktop systems is the use of graphics processor units (GPUs) to obtain order-of-magnitude performance improvements relative to general-purpose CPUs. In addition, the use of FPGAs has seen a significant increase for applications that can take advantage of such dedicated hardware. We see that computing is evolving from using many core CPUs to ``co-processing" on the CPU, GPU and FPGA, however hybrid programming models that support the interaction between multiple heterogeneous components are not widely accessible to mainstream programmers and domain experts who have a real need for such resources. We propose a C-based implementation of the CnC model for enabling parallelism across heterogeneous processor components in a flexible way, with high resource utilization and high programmability. We use the task-parallel HabaneroC language (HC) as the platform for implementing CnC-HabaneroC (CnC-HC), a language also used to implement the computation steps in CnC-HC, for interaction with GPU or FPGA steps and which offers the desired flexibility and extensibility of interacting with any other C based language. First, we extend the CnC model with tag functions and ranges to enable automatic code generation of high level operations for inter-task communication. This improves programmability and also makes the code more analysable, opening the door for future optimizations. Secondly, we introduce a way to specify steps that are data parallel and thus are fit to execute on the GPU, and the notion of task affinity, a tuning annotation in the specification language. Affinity is used by the runtime during scheduling and can be fine-tuned based on application needs to achieve better (faster, lower power, etc.) results. Thirdly, we introduce and develop a novel, data-driven runtime for the CnC model, using HabaneroC (HC) as a base language. In addition, we also create an implementation of the previous runtime approach and conduct a study to compare the performance. Next, we expand the HabaneroC dynamic work-stealing runtime to allow cross-device stealing based on task affinity. Cross-device dynamic work-stealing is used to achieve load balancing across heterogeneous platforms for improved performance. Finally, we implement and use a series of benchmarks for testing the model in different scenarios and show that our proposed approach can yield significant performance benefits and low power usage when using a hybrid execution.
4

Towards Enhancing Performance, Programmability, and Portability in Heterogeneous Computing

Krommydas, Konstantinos 03 May 2017 (has links)
The proliferation of a diverse set of heterogeneous computing platforms in conjunction with the plethora of programming languages and optimization techniques on each language for each underlying architecture exacerbate widespread adoption of such platforms. This is especially true for novice programmers and the non-technical-savvy masses that are largely precluded from enjoying the advantages of high-performance computing. Moreover, different groups within the heterogeneous computing community (e.g., hardware architects, tool developers, and programmers) are presented with new challenges with respect to performance, programmability, and portability (or the three P's) of heterogeneous computing. In this work we discuss such challenges and identify benchmarking techniques based on computation and communication patterns as an appropriate means for the systematic evaluation of heterogeneous computing with respect to the three P's. Our proposed approach is based on OpenCL implementations of the Berkeley dwarfs. We use our benchmark suite (OpenDwarfs) in characterizing performance of state-of-the-art parallel architectures, and as the main component of a methodology (Telescoping Architectures) for identifying trends in future heterogeneous architectures. Furthermore, we employ OpenDwarfs in a multi-faceted study on the gaps between the three P's in the context of the modern heterogeneous computing landscape. Our case-study spans a variety of compilers, languages, optimizations, and target architectures, including the CPU, GPU, MIC, and FPGA. Based on our insights, and extending aspects of prior research (e.g., in compilers, programming languages, and auto-tuning), we propose the introduction of grid-based data structures as the basis of programming frameworks and present a prototype unified framework (GLAF) that encompasses a novel visual programming environment with code generation, auto-parallelization, and auto-tuning capabilities. Our results, which span scientific domains, indicate that our holistic approach constitutes a viable alternative towards enhancing the three P's and further democratizing heterogeneous, parallel computing for non-programming-savvy audiences, and especially domain scientists. / Ph. D.
5

Automatic Scheduling of Compute Kernels Across Heterogeneous Architectures

Lyerly, Robert Frantz 24 June 2014 (has links)
The world of high-performance computing has shifted from increasing single-core performance to extracting performance from heterogeneous multi- and many-core processors due to the power, memory and instruction-level parallelism walls. All trends point towards increased processor heterogeneity as a means for increasing application performance, from smartphones to servers. These various architectures are designed for different types of applications — traditional "big" CPUs (like the Intel Xeon) are optimized for low latency while other architectures (such as the NVidia Tesla K20x) are optimized for high-throughput. These architectures have different tradeoffs and different performance profiles, meaning fantastic performance gains for the right types of applications. However applications that are ill-suited for a given architecture may experience significant slowdown; therefore, it is imperative that applications are scheduled onto the correct processor. In order to perform this scheduling, applications must be analyzed to determine their execution characteristics. Traditionally this application-to-hardware mapping was determined statically by the programmer. However, this requires intimate knowledge of the application and underlying architecture, and precludes load-balancing by the system. We demonstrate and empirically evaluate a system for automatically scheduling compute kernels by extracting program characteristics and applying machine learning techniques. We develop a machine learning process that is system-agnostic, and works for a variety of contexts (e.g. embedded, desktop/workstation, server). Finally, we perform scheduling in a workload-aware and workload-adaptive manner for these compute kernels. / Master of Science
6

Can my chip behave like my brain?

George, Suma 27 May 2016 (has links)
Many decades ago, Carver Mead established the foundations of neuromorphic systems. Neuromorphic systems are analog circuits that emulate biology. These circuits utilize subthreshold dynamics of CMOS transistors to mimic the behavior of neurons. The objective is to not only simulate the human brain, but also to build useful applications using these bio-inspired circuits for ultra low power speech processing, image processing, and robotics. This can be achieved using reconfigurable hardware, like field programmable analog arrays (FPAAs), which enable configuring different applications on a cross platform system. As digital systems saturate in terms of power efficiency, this alternate approach has the potential to improve computational efficiency by approximately eight orders of magnitude. These systems, which include analog, digital, and neuromorphic elements combine to result in a very powerful reconfigurable processing machine.
7

Performance optimization of geophysics stencils on HPC architectures / Optimização de desempenho de estênceis geofísicos sobre arquiteturas HPC

Abaunza, Víctor Eduardo Martínez January 2018 (has links)
A simulação de propagação de onda é uma ferramenta crucial na pesquisa de geofísica (para análise eficiente dos terremotos, mitigação de riscos e a exploração de petróleo e gáz). Devido à sua simplicidade e sua eficiência numérica, o método de diferenças finitas é uma das técnicas implementadas para resolver as equações da propagação das ondas. Estas aplicações são conhecidas como estênceis porque consistem num padrão que replica a mesma computação num domínio multidimensional de dados. A Computação de Alto Desempenho é requerida para solucionar este tipo de problemas, como consequência do grande número de pontos envolvidos nas simulações tridimensionais do subsolo. A optimização do desempenho dos estênceis é um desafio e depende do arquitetura usada. Neste contexto, focamos nosso trabalho em duas partes. Primeiro, desenvolvemos nossa pesquisa nas arquiteturas multicore; analisamos a implementação padrão em OpenMP dos modelos numéricos da transferência de calor (um estêncil Jacobi de 7 pontos), e o aplicativo Ondes3D (um simulador sísmico desenvolvido pela Bureau de Recherches Géologiques et Minières); usamos dois algoritmos conhecidos (nativo, e bloqueio espacial) para encontrar correlações entre os parâmetros da configuração de entrada, na execução, e o desempenho computacional; depois, propusemos um modelo baseado no Aprendizado de Máquina para avaliar, predizer e melhorar o desempenho dos modelos estênceis na arquitetura usada; também usamos um modelo de propagação da onda acústica fornecido pela empresa Petrobras; e predizemos o desempenho com uma alta precisão (até 99%) nas arquiteturas multicore. Segundo, orientamos nossa pesquisa nas arquiteturas heterogêneas, analisamos uma implementação padrão do modelo de propagação de ondas em CUDA, para encontrar os fatores que afetam o desempenho quando o número de aceleradores é aumentado; então, propusemos uma implementação baseada em tarefas para amelhorar o desempenho, de acordo com um conjunto de configuração no tempo de execução (algoritmo de escalonamento, tamanho e número de tarefas), e comparamos o desempenho obtido com as versões de só CPU ou só GPU e o impacto no desempenho das arquiteturas heterogêneas; nossos resultados demostram um speedup significativo (até 25) em comparação com a melhor implementação disponível para arquiteturas multicore. / Wave modeling is a crucial tool in geophysics, for efficient strong motion analysis, risk mitigation and oil & gas exploration. Due to its simplicity and numerical efficiency, the finite-difference method is one of the standard techniques implemented to solve the wave propagation equations. This kind of applications is known as stencils because they consist in a pattern that replicates the same computation on a multi-dimensional domain. High Performance Computing is required to solve this class of problems, as a consequence of a large number of grid points involved in three-dimensional simulations of the underground. The performance optimization of stencil computations is a challenge and strongly depends on the underlying architecture. In this context, this work was directed toward a twofold aim. Firstly, we have led our research on multicore architectures and we have analyzed the standard OpenMP implementation of numerical kernels from the 3D heat transfer model (a 7-point Jacobi stencil) and the Ondes3D code (a full-fledged application developed by the French Geological Survey). We have considered two well-known implementations (naïve, and space blocking) to find correlations between parameters from the input configuration at runtime and the computing performance; thus, we have proposed a Machine Learning-based approach to evaluate, to predict, and to improve the performance of these stencil models on the underlying architecture. We have also used an acoustic wave propagation model provided by the Petrobras company and we have predicted the performance with high accuracy on multicore architectures. Secondly, we have oriented our research on heterogeneous architectures, we have analyzed the standard implementation for seismic wave propagation model in CUDA, to find which factors affect the performance; then, we have proposed a task-based implementation to improve the performance, according to the runtime configuration set (scheduling algorithm, size, and number of tasks), and we have compared the performance obtained with the classical CPU or GPU only versions with the results obtained on heterogeneous architectures.
8

Performance optimization of geophysics stencils on HPC architectures / Optimização de desempenho de estênceis geofísicos sobre arquiteturas HPC

Abaunza, Víctor Eduardo Martínez January 2018 (has links)
A simulação de propagação de onda é uma ferramenta crucial na pesquisa de geofísica (para análise eficiente dos terremotos, mitigação de riscos e a exploração de petróleo e gáz). Devido à sua simplicidade e sua eficiência numérica, o método de diferenças finitas é uma das técnicas implementadas para resolver as equações da propagação das ondas. Estas aplicações são conhecidas como estênceis porque consistem num padrão que replica a mesma computação num domínio multidimensional de dados. A Computação de Alto Desempenho é requerida para solucionar este tipo de problemas, como consequência do grande número de pontos envolvidos nas simulações tridimensionais do subsolo. A optimização do desempenho dos estênceis é um desafio e depende do arquitetura usada. Neste contexto, focamos nosso trabalho em duas partes. Primeiro, desenvolvemos nossa pesquisa nas arquiteturas multicore; analisamos a implementação padrão em OpenMP dos modelos numéricos da transferência de calor (um estêncil Jacobi de 7 pontos), e o aplicativo Ondes3D (um simulador sísmico desenvolvido pela Bureau de Recherches Géologiques et Minières); usamos dois algoritmos conhecidos (nativo, e bloqueio espacial) para encontrar correlações entre os parâmetros da configuração de entrada, na execução, e o desempenho computacional; depois, propusemos um modelo baseado no Aprendizado de Máquina para avaliar, predizer e melhorar o desempenho dos modelos estênceis na arquitetura usada; também usamos um modelo de propagação da onda acústica fornecido pela empresa Petrobras; e predizemos o desempenho com uma alta precisão (até 99%) nas arquiteturas multicore. Segundo, orientamos nossa pesquisa nas arquiteturas heterogêneas, analisamos uma implementação padrão do modelo de propagação de ondas em CUDA, para encontrar os fatores que afetam o desempenho quando o número de aceleradores é aumentado; então, propusemos uma implementação baseada em tarefas para amelhorar o desempenho, de acordo com um conjunto de configuração no tempo de execução (algoritmo de escalonamento, tamanho e número de tarefas), e comparamos o desempenho obtido com as versões de só CPU ou só GPU e o impacto no desempenho das arquiteturas heterogêneas; nossos resultados demostram um speedup significativo (até 25) em comparação com a melhor implementação disponível para arquiteturas multicore. / Wave modeling is a crucial tool in geophysics, for efficient strong motion analysis, risk mitigation and oil & gas exploration. Due to its simplicity and numerical efficiency, the finite-difference method is one of the standard techniques implemented to solve the wave propagation equations. This kind of applications is known as stencils because they consist in a pattern that replicates the same computation on a multi-dimensional domain. High Performance Computing is required to solve this class of problems, as a consequence of a large number of grid points involved in three-dimensional simulations of the underground. The performance optimization of stencil computations is a challenge and strongly depends on the underlying architecture. In this context, this work was directed toward a twofold aim. Firstly, we have led our research on multicore architectures and we have analyzed the standard OpenMP implementation of numerical kernels from the 3D heat transfer model (a 7-point Jacobi stencil) and the Ondes3D code (a full-fledged application developed by the French Geological Survey). We have considered two well-known implementations (naïve, and space blocking) to find correlations between parameters from the input configuration at runtime and the computing performance; thus, we have proposed a Machine Learning-based approach to evaluate, to predict, and to improve the performance of these stencil models on the underlying architecture. We have also used an acoustic wave propagation model provided by the Petrobras company and we have predicted the performance with high accuracy on multicore architectures. Secondly, we have oriented our research on heterogeneous architectures, we have analyzed the standard implementation for seismic wave propagation model in CUDA, to find which factors affect the performance; then, we have proposed a task-based implementation to improve the performance, according to the runtime configuration set (scheduling algorithm, size, and number of tasks), and we have compared the performance obtained with the classical CPU or GPU only versions with the results obtained on heterogeneous architectures.
9

[en] SUPPORT FOR CODE PORTABILITY IN HIGH PERFORMANCE COMPUTING APPLICATIONS / [pt] AUXÍLIO A PORTABILIDADE DE CÓDIGO EM APLICAÇÕES DE ALTO DESEMPENHO

PAULO ROBERTO PEREIRA DE SOUZA FILHO 16 January 2017 (has links)
[pt] Atualmente na computação de alto desempenho existem diversas opções de arquiteturas de diversos fabricantes, algumas sendo heterogêneas como por exemplo CPU mais GPU. Este trabalho tem como objetivo implementar maneiras de codificar aplicações de alto desempenho contemplando alguns tipos de arquiteturas, incluindo algumas heterogêneas, garantindo a portabilidade em uma grande porção do código mas mantendo o desempenho e a capacidade de fazer otimizações específicas a cada arquitetura. Implementamos a biblioteca HLIB que gerencia as primitivas de arquiteturas heterogêneas do tipo CPU mais GPU, APU e CPU mais Phi e que também funciona em arquiteturas homogêneas tradicionais. Implementamos o OpenVec, uma ferramenta para gerar, de forma portável, código vetorial explícito. Contemplando as principais arquiteturas SIMD dos últimos 17 anos, tais como ARM Neon, Intel SSE até AVX-512 e IBM VSX. Demonstramos o uso combinado dessas duas ferramentas com aplicações de alto desempenho, que demandam mais de um petaflop. / [en] Today s platforms are becoming increasingly heterogeneous. A given platform may have many different computing elements in it: CPUs, coprocessors and GPUs of various kinds. This work propose a way too keep some portion of code portable without compromising the performance along different heterogeneous platforms. We implemented the HLIB library that handles the preparation code needed by heterogeneous computing, also this library transparently supports the traditional homogeneous platform. To address multiple SIMD architectures we implemented the OpenVec, a tool to help compiler to enable SIMD instructions. This tool provides a set of portable SIMD intrinsics and C plus plus operators to get a portable explicit vectorization, covering SIMD architectures from the last 17 years like ARM Neon, Intel SSE to AVX-512 and IBM Power8 Altivec plus VSX. We demonstrated the combination use of this strategy using both tools with petaflop HPC applications.
10

Unified system of code transformation and execution for heterogeneous multi-core architectures. / Système unifié de transformation de code et d'éxécution pour un passage aux architectures multi-coeurs hétérogènes

Li, Pei 17 December 2015 (has links)
Architectures hétérogènes sont largement utilisées dans le domaine de calcul haute performance. Cependant, le développement d'applications sur des architectures hétérogènes est indéniablement fastidieuse et sujette à erreur pour un programmeur même expérimenté. Pour passer une application aux architectures multi-cœurs hétérogènes, les développeurs doivent décomposer les données de l'entrée, gérer les échanges de valeur intermédiaire au moment d’exécution et garantir l'équilibre de charge de système. L'objectif de cette thèse est de proposer une solution de programmation parallèle pour les programmeurs novices, qui permet de faciliter le processus de codage et garantir la qualité de code. Nous avons comparé et analysé les défauts de solutions existantes, puis nous proposons un nouvel outil de programmation STEPOCL avec un nouveau langage de domaine spécifique qui est conçu pour simplifier la programmation sur les architectures hétérogènes. Nous avons évalué la performance de STEPOCL sur trois cas d'application classiques : un stencil 2D, une multiplication de matrices et un problème à N corps. Le résultat montre que : (i) avec l'aide de STEPOCL, la performance d'application varie linéairement selon le nombre d'accélérateurs, (ii) la performance de code généré par STEPOCL est comparable à celle de la version manuscrite. (iii) les charges de travail, qui sont trop grandes pour la mémoire d'un seul accélérateur, peuvent être exécutées en utilisant plusieurs accélérateurs. (iv) grâce à STEPOCL, le nombre de lignes de code manuscrite est considérablement réduit. / Heterogeneous architectures have been widely used in the domain of high performance computing. However developing applications on heterogeneous architectures is time consuming and error-prone because going from a single accelerator to multiple ones indeed requires to deal with potentially non-uniform domain decomposition, inter-accelerator data movements, and dynamic load balancing. The aim of this thesis is to propose a solution of parallel programming for novice developers, to ease the complex coding process and guarantee the quality of code. We lighted and analysed the shortcomings of existing solutions and proposed a new programming tool called STEPOCL along with a new domain specific language designed to simplify the development of an application for heterogeneous architectures. We evaluated both the performance and the usefulness of STEPOCL. The result show that: (i) the performance of an application written with STEPOCL scales linearly with the number of accelerators, (ii) the performance of an application written using STEPOCL competes with an handwritten version, (iii) larger workloads run on multiple devices that do not fit in the memory of a single device, (iv) thanks to STEPOCL, the number of lines of code required to write an application for multiple accelerators is roughly divided by ten.

Page generated in 0.4889 seconds