Global ETD Search

31	Analyse et transformation de programmes: du modèle polyédrique aux langages formels Cohen, Albert 21 December 1999 (has links) (PDF) Les microprocesseurs et les architectures parallèles d'aujourd'hui lancent de nouveaux défis aux techniques de compilation. En présence de parallélisme, les optimisations deviennent trop spécifiques et complexes pour être laissées au soin du programmeur. Les techniques de parallélisation automatique dépassent le cadre traditionnel des applications numériques et abordent de nouveaux modèles de programmes, tels que les nids de boucles non affines, les appels récursifs et les structures de données dynamiques. Des analyses précises sont au c{\oe}ur de la détection du parallélisme, elles rassemblent des informations à la compilation sur les propriétés des programmes à l'exécution. Ces informations valident des transformations utiles pour l'extraction du parallélisme et la génération de code parallèle. Cette thèse aborde principalement des analyses et des transformations avec une vision par instances, c'est-à-dire considérant les propriétés individuelles de chaque instance d'une instruction à l'exécution. Une nouvelle formalisation à l'aide de langages formels nous permet tout d'abord d'étudier une analyse de dépendances et de définitions visibles par instances pour programmes récursifs. L'application de cette analyse à l'expansion et la parallélisation de programmes récursifs dévoile des résultats encourageants. Les nids de boucles quelconques font l'objet de la deuxième partie de ce travail. Une nouvelle étude des techniques de parallélisation fondées sur l'expansion nous permet de proposer des solutions à des problèmes d'optimisation cruciaux. automatic parallelization recursive programs non-affine loop nests dependence analysis reaching definition analysis memory expansion
32	Contributions à la conception de systèmes à hautes performances, programmables et sûrs: principes, interfaces, algorithmes et outils Cohen, Albert 23 March 2007 (has links) (PDF) La loi de Moore sur semi-conducteurs approche de sa fin. L'evolution de l'architecture de von Neumann à travers les 40 ans d'histoire du microprocesseur a conduit à des circuits d'une insoutenable complexité, à un très faible rendement de calcul par transistor, et une forte consommation énergetique. D'autre-part, le monde du calcul parallèle ne supporte pas la comparaison avec les niveaux de portabilité, d'accessibilité, de productivité et de fiabilité de l'ingénérie du logiciel séquentiel. Ce dangereux fossé se traduit par des défis passionnants pour la recherche en compilation et en langages de programmation pour le calcul à hautes performances, généraliste ou embarqué. Cette thèse motive notre piste pour relever ces défis, introduit nos principales directions de travail, et établit des perspectives de recherche. automatic parallelization polyhedral compilation parallel programming data-flow synchronous languages iterative optimization
33	Formalisation et automatisation de YAO, générateur de code pour l’assimilation variationnelle de données Nardi, Luigi 08 March 2011 (has links) L’assimilation variationnelle de données 4D-Var est une technique très utilisée en géophysique, notamment en météorologie et océanographie. Elle consiste à estimer des paramètres d’un modèle numérique direct, en minimisant une fonction de coût mesurant l’écart entre les sorties du modèle et les mesures observées. La minimisation, qui est basée sur une méthode de gradient, nécessite le calcul du modèle adjoint (produit de la transposée de la matrice jacobienne avec le vecteur dérivé de la fonction de coût aux points d’observation). Lors de la mise en œuvre de l’AD 4D-Var, il faut faire face à des problèmes d’implémentation informatique complexes, notamment concernant le modèle adjoint, la parallélisation du code et la gestion efficace de la mémoire. Aﬁn d’aider au développement d’applications d’AD 4D-Var, le logiciel YAO qui a été développé au LOCEAN, propose de modéliser le modèle direct sous la forme d’un graphe de ﬂot de calcul appelé graphe modulaire. Les modules représentent des unités de calcul et les arcs décrivent les transferts des données entre ces modules. YAO est doté de directives de description qui permettent à un utilisateur de décrire son modèle direct, ce qui lui permet de générer ensuite le graphe modulaire associé à ce modèle. Deux algorithmes, le premier de type propagation sur le graphe et le second de type rétropropagation sur le graphe permettent, respectivement, de calculer les sorties du modèle direct ainsi que celles de son modèle adjoint. YAO génère alors le code du modèle direct et de son adjoint. En plus, il permet d’implémenter divers scénarios pour la mise en œuvre de sessions d’assimilation.Au cours de cette thèse, un travail de recherche en informatique a été entrepris dans le cadre du logiciel YAO. Nous avons d’abord formalisé d’une manière plus générale les spécifications deYAO. Par la suite, des algorithmes permettant l’automatisation de certaines tâches importantes ont été proposés tels que la génération automatique d’un parcours “optimal” de l’ordre des calculs et la parallélisation automatique en mémoire partagée du code généré en utilisant des directives OpenMP. L’objectif à moyen terme, des résultats de cette thèse, est d’établir les bases permettant de faire évoluer YAO vers une plateforme générale et opérationnelle pour l’assimilation de données 4D-Var, capable de traiter des applications réelles et de grandes tailles. / Variational data assimilation 4D-Var is a well-known technique used in geophysics, and in particular in meteorology and oceanography. This technique consists in estimating the control parameters of a direct numerical model, by minimizing a cost function which measures the misﬁt between the forecast values and some actual observations. The minimization, which is based on a gradient method, requires the computation of the adjoint model (product of the transpose Jacobian matrix and the derivative vector of the cost function at the observation points). In order to perform the 4DVar technique, we have to cope with complex program implementations, in particular concerning the adjoint model, the parallelization of the code and an efﬁcient memory management. To address these difﬁculties and to facilitate the implementation of 4D-Var applications, LOCEAN is developing the YAO framework. YAO proposes to represent a direct model with a computation ﬂow graph called modular graph. Modules depict computation units and edges between modules represent data transfer. Description directives proper to YAO allow a user to describe its direct model and to generate the modular graph associated to this model. YAO contains two core algorithms. The ﬁrst one is a forward propagation algorithm on the graph that computes the output of the numerical model; the second one is a back propagation algorithm on the graph that computes the adjoint model. The main advantage of the YAO framework, is that the direct and adjoint model programming codes are automatically generated once the modular graph has been conceived by the user. Moreover, YAO allows to cope with many scenarios for running different data assimilation sessions.This thesis introduces a computer science research on the YAO framework. In a ﬁrst step, we have formalized in a more general way the existing YAO speciﬁcations. Then algorithms allowing the automatization of some tasks have been proposed such as the automatic generation of an “optimal” computational ordering and the automatic parallelization of the generated code on shared memory architectures using OpenMP directives. This thesis permits to lay the foundations which, at medium term, will make of YAO a general and operational platform for data assimilation 4D-Var, allowing to process applications of high dimensions. Assimilation variationnelle de données Modèle numérique Modèle adjoint Génération automatique Parallélisation automatique Mémoire partagée OpenMP Variational data assimilation Numerical model Adjoint model Automatic generation Automatic parallelization Shared memory OpenMP
34	Automated Reasoning Support for Invasive Interactive Parallelization Moshir Moghaddam, Kianosh January 2012 (has links) To parallelize a sequential source code, a parallelization strategy must be defined that transforms the sequential source code into an equivalent parallel version. Since parallelizing compilers can sometimes transform sequential loops and other well-structured codes into parallel ones automatically, we are interested in finding a solution to parallelize semi-automatically codes that compilers are not able to parallelize automatically, mostly because of weakness of classical data and control dependence analysis, in order to simplify the process of transforming the codes for programmers.Invasive Interactive Parallelization (IIP) hypothesizes that by using anintelligent system that guides the user through an interactive process one can boost parallelization in the above direction. The intelligent system's guidance relies on a classical code analysis and pre-defined parallelizing transformation sequences. To support its main hypothesis, IIP suggests to encode parallelizing transformation sequences in terms of IIP parallelization strategies that dictate default ways to parallelize various code patterns by using facts which have been obtained both from classical source code analysis and directly from the user.In this project, we investigate how automated reasoning can supportthe IIP method in order to parallelize a sequential code with an acceptable performance but faster than manual parallelization. We have looked at two special problem areas: Divide and conquer algorithms and loops in the source codes. Our focus is on parallelizing four sequential legacy C programs such as: Quick sort, Merge sort, Jacobi method and Matrix multipliation and summation for both OpenMP and MPI environment by developing an interactive parallelizing assistance tool that provides users with the assistanceneeded for parallelizing a sequential source code. Multi-processors Dependence Analysis Code parallelization Semi-automatic parallelization IIP ISC OpenMP MPI Artificial Intelligence Reasoning Decision Tree Divide and Conquer (D&C) algorithms
35	Contributions to parallel stochastic simulation : application of good software engineering practices to the distribution of pseudorandom streams in hybrid Monte Carlo simulations / Contributions à la simulation stochastique parallèle : architectures logicielles pour la distribution de flux pseudo-aléatoires dans les simulations Monte Carlo sur CPU/GPU Passerat-Palmbach, Jonathan 11 October 2013 (has links) Résumé non disponible / The race to computing power increases every day in the simulation community. A few years ago, scientists have started to harness the computing power of Graphics Processing Units (GPUs) to parallelize their simulations. As with any parallel architecture, not only the simulation model implementation has to be ported to the new parallel platform, but all the tools must be reimplemented as well. In the particular case of stochastic simulations, one of the major element of the implementation is the pseudorandom numbers source. Employing pseudorandom numbers in parallel applications is not a straightforward task, and it has to be done with caution in order not to introduce biases in the results of the simulation. This problematic has been studied since parallel architectures are available and is called pseudorandom stream distribution. While the literature is full of solutions to handle pseudorandom stream distribution on CPU-based parallel platforms, the young GPU programming community cannot display the same experience yet.In this thesis, we study how to correctly distribute pseudorandom streams on GPU. From the existing solutions, we identified a need for good software engineering solutions, coupled to sound theoretical choices in the implementation. We propose a set of guidelines to follow when a PRNG has to be ported to GPU, and put these advice into practice in a software library called ShoveRand. This library is used in a stochastic Polymer Folding model that we have implemented in C++/CUDA. Pseudorandom streams distribution on manycore architectures is also one of our concerns. It resulted in a contribution named TaskLocalRandom, which targets parallel Java applications using pseudorandom numbers and task frameworks.Eventually, we share a reflection on the methods to choose the right parallel platform for a given application. In this way, we propose to automatically build prototypes of the parallel application running on a wide set of architectures. This approach relies on existing software engineering tools from the Java and Scala community, most of them generating OpenCL source code from a high-level abstraction layer. Pseudorandom Number Generation (PRNG) High Performance Computing (HPC) Software Engineering Stochastic Simulation Graphics Processing Units (GPUs) GPU Programming Automatic Parallelization
36	Efficient search-based strategies for polyhedral compilation : algorithms and experience in a production compiler / Stratégies exploratoires efficaces pour la compilation polyédrique : algorithmes et expérience dans un compilateur de production Trifunovic, Konrad 04 July 2011 (has links) Une pression accrue s'exerce sur les compilateurs pour mettre en œuvre des transformations de programmes de plus en plus complexes délivrant le potentiel de performance des processeurs multicœurs et des accélérateurs hétérogènes. L'espace de recherche des optimisations de programmes possibles est gigantesque est manque de structure. La recherche de la meilleure transformation, qui inclut la prédiction des gains estimés de performance offerts par cette transformation, constitue le problème le plus difficiles pour les compilateurs optimisants modernes. Nous avons choisi de nous concentrer sur les transformations de boucles et sur leur automatisation, exprimées dans le modèle polyédrique. Les méthodes d'optimisation de programmes dans le modèle polyédrique se répartissent grossièrement en deux classes. La première repose sur l'optimisation linéaire d'une fonction de analytique de coût. La deuxième classe de méthodes met en œuvre une recherche itérative. La première approche est rapide, mais elle est facilement mise en défaut en ce qui concerne la découverte de la solution optimale. L'approche itérative est plus précise, mais le temps de compilation peut devenir prohibitif. Cette thèse contribue une approche nouvelle de la recherche itérative de transformations de programmes dans le modèle polyédrique. La nouvelle méthode proposée possède la précision et la capacité effective à extraire des transformations profitables des méthodes itératives, tout en en minimisant les faiblesses. Notre approche repose sur l'évaluation systématique d'une fonction de coût et de prédiction de performances non-linéaire. Par ailleurs, la parallélisation automatique dans le modèle polyédrique est actuellement dominée par des outils de compilation source-à-source. Nous avons choisi au contraire d'implémenter nos techniques dans la plateforme GCC, en opérant sur une représentation de code de bas niveau, à trois adresses. Nous montrons que le niveau d'abstraction de la représentation intermédiaire choisie engendre des difficultés de passage à l'échelle, et nous montrons comment les surmonter. À l'inverse, nous montrons qu'une représentation intermédiaire de bas niveau ouvre de nouveaux degrés de liberté, bénéficiant à notre stratégie itérative de recherche de transformations, et à la compilation polyédrique de manière générale. / In order to take the performance advantages of the current multicore and heterogeneous architectures the compilers are required to perform more and more complex program transformations. The search space of the possible program optimizations is huge and unstructured. Selecting the best transformation and predicting the potential performance benefits of that transformation is the major problem in today's optimizing compilers. The promising approach to handling the program optimizations is to focus on the automatic loop optimizations expressed in the polyhedral model. The current approaches for optimizing programs in the polyhedral model broadly fall into two classes. The first class of the methods is based on the linear optimization of the analytical cost function. The second class is based on the exhaustive iterative search. While the first approach is fast, it can easily miss the optimal solution. The iterative approach is more precise, but its running time might be prohibitively expensive. In this thesis we present a novel search-based approach to program transformations in the polyhedral model. The new method combines the benefits - effectiveness and precision - of the current approaches, while it tries to minimize their drawbacks. Our approach is based on enumerating the evaluations of the precise, nonlinear performance predicting cost-function. The current practice is to use the polyhedral model in the context of source-to-source compilers. We have implemented our techniques in a GCC framework that is based on the low level three address code representation. We show that the chosen level of abstraction for the intermediate representation poses scalability challenges, and we show the ways to overcome those problems. On the other hand, it is shown that the low level IR abstraction opens new degrees of freedom that are beneficial for the search-based transformation strategies and for the polyhedral compilation in general. Compilateurs Langages de programmation Modèle polyédrique Transformations de programmes Transformations de boucles La parallélisation automatique Représentation intermédiaire Compilers Programming languages Polyhedral model Program transformations Loop transformations Automatic parallelization Intermediate representation
37	Formalisation et automatisation de YAO, générateur de code pour l’assimilation variationnelle de données / Formalisation and automation of YAO, code generator for variational data assimilation Nardi, Luigi 08 March 2011 (has links) L’assimilation variationnelle de données 4D-Var est une technique très utilisée en géophysique, notamment en météorologie et océanographie. Elle consiste à estimer des paramètres d’un modèle numérique direct, en minimisant une fonction de coût mesurant l’écart entre les sorties du modèle et les mesures observées. La minimisation, qui est basée sur une méthode de gradient, nécessite le calcul du modèle adjoint (produit de la transposée de la matrice jacobienne avec le vecteur dérivé de la fonction de coût aux points d’observation). Lors de la mise en œuvre de l’AD 4D-Var, il faut faire face à des problèmes d’implémentation informatique complexes, notamment concernant le modèle adjoint, la parallélisation du code et la gestion efficace de la mémoire. Aﬁn d’aider au développement d’applications d’AD 4D-Var, le logiciel YAO qui a été développé au LOCEAN, propose de modéliser le modèle direct sous la forme d’un graphe de ﬂot de calcul appelé graphe modulaire. Les modules représentent des unités de calcul et les arcs décrivent les transferts des données entre ces modules. YAO est doté de directives de description qui permettent à un utilisateur de décrire son modèle direct, ce qui lui permet de générer ensuite le graphe modulaire associé à ce modèle. Deux algorithmes, le premier de type propagation sur le graphe et le second de type rétropropagation sur le graphe permettent, respectivement, de calculer les sorties du modèle direct ainsi que celles de son modèle adjoint. YAO génère alors le code du modèle direct et de son adjoint. En plus, il permet d’implémenter divers scénarios pour la mise en œuvre de sessions d’assimilation.Au cours de cette thèse, un travail de recherche en informatique a été entrepris dans le cadre du logiciel YAO. Nous avons d’abord formalisé d’une manière plus générale les spécifications deYAO. Par la suite, des algorithmes permettant l’automatisation de certaines tâches importantes ont été proposés tels que la génération automatique d’un parcours “optimal” de l’ordre des calculs et la parallélisation automatique en mémoire partagée du code généré en utilisant des directives OpenMP. L’objectif à moyen terme, des résultats de cette thèse, est d’établir les bases permettant de faire évoluer YAO vers une plateforme générale et opérationnelle pour l’assimilation de données 4D-Var, capable de traiter des applications réelles et de grandes tailles. / Variational data assimilation 4D-Var is a well-known technique used in geophysics, and in particular in meteorology and oceanography. This technique consists in estimating the control parameters of a direct numerical model, by minimizing a cost function which measures the misﬁt between the forecast values and some actual observations. The minimization, which is based on a gradient method, requires the computation of the adjoint model (product of the transpose Jacobian matrix and the derivative vector of the cost function at the observation points). In order to perform the 4DVar technique, we have to cope with complex program implementations, in particular concerning the adjoint model, the parallelization of the code and an efﬁcient memory management. To address these difﬁculties and to facilitate the implementation of 4D-Var applications, LOCEAN is developing the YAO framework. YAO proposes to represent a direct model with a computation ﬂow graph called modular graph. Modules depict computation units and edges between modules represent data transfer. Description directives proper to YAO allow a user to describe its direct model and to generate the modular graph associated to this model. YAO contains two core algorithms. The ﬁrst one is a forward propagation algorithm on the graph that computes the output of the numerical model; the second one is a back propagation algorithm on the graph that computes the adjoint model. The main advantage of the YAO framework, is that the direct and adjoint model programming codes are automatically generated once the modular graph has been conceived by the user. Moreover, YAO allows to cope with many scenarios for running different data assimilation sessions.This thesis introduces a computer science research on the YAO framework. In a ﬁrst step, we have formalized in a more general way the existing YAO speciﬁcations. Then algorithms allowing the automatization of some tasks have been proposed such as the automatic generation of an “optimal” computational ordering and the automatic parallelization of the generated code on shared memory architectures using OpenMP directives. This thesis permits to lay the foundations which, at medium term, will make of YAO a general and operational platform for data assimilation 4D-Var, allowing to process applications of high dimensions. Assimilation variationnelle de données Modèle numérique Modèle adjoint Génération automatique Parallélisation automatique Mémoire partagée OpenMP Variational data assimilation Numerical model Adjoint model Automatic generation Automatic parallelization Shared memory OpenMP
38	EMPIRICALLY EXAMINING THE ROADBLOCKS TO THE AUTOMATIC PARALLELIZATION AND ANALYSIS OF OPEN SOURCE SOFTWARE SYSTEMS Alnaeli, Saleh M. 20 April 2015 (has links) No description available. Computer Science parallelization inhibitors data dependency function calls function pointers virtual functions empirical study static analysis inter-procedural analysis automatic parallelization
39	Extracting Parallelism from Legacy Sequential Code Using Transactional Memory Saad Ibrahim, Mohamed Mohamed 26 July 2016 (has links) Increasing the number of processors has become the mainstream for the modern chip design approaches. However, most applications are designed or written for single core processors; so they do not benefit from the numerous underlying computation resources. Moreover, there exists a large base of legacy software which requires an immense effort and cost of rewriting and re-engineering to be made parallel. In the past decades, there has been a growing interest in automatic parallelization. This is to relieve programmers from the painful and error-prone manual parallelization process, and to cope with new architecture trend of multi-core and many-core CPUs. Automatic parallelization techniques vary in properties such as: the level of paraellism (e.g., instructions, loops, traces, tasks); the need for custom hardware support; using optimistic execution or relying on conservative decisions; online, offline or both; and the level of source code exposure. Transactional Memory (TM) has emerged as a powerful concurrency control abstraction. TM simplifies parallel programming to the level of coarse-grained locking while achieving fine-grained locking performance. This dissertation exploits TM as an optimistic execution approach for transforming a sequential application into parallel. The design and the implementation of two frameworks that support automatic parallelization: Lerna and HydraVM, are proposed, along with a number of algorithmic optimizations to make the parallelization effective. HydraVM is a virtual machine that automatically extracts parallelism from legacy sequential code (at the bytecode level) through a set of techniques including code profiling, data dependency analysis, and execution analysis. HydraVM is built by extending the Jikes RVM and modifying its baseline compiler. Correctness of the program is preserved through exploiting Software Transactional Memory (STM) to manage concurrent and out-of-order memory accesses. Our experiments show that HydraVM achieves speedup between 2×-5× on a set of benchmark applications. Lerna is a compiler framework that automatically and transparently detects and extracts parallelism from sequential code through a set of techniques including code profiling, instrumentation, and adaptive execution. Lerna is cross-platform and independent of the programming language. The parallel execution exploits memory transactions to manage concurrent and out-of-order memory accesses. This scheme makes Lerna very effective for sequential applications with data sharing. This thesis introduces the general conditions for embedding any transactional memory algorithm into Lerna. In addition, the ordered version of four state-of-art algorithms have been integrated and evaluated using multiple benchmarks including RSTM micro benchmarks, STAMP and PARSEC. Lerna showed great results with average 2.7× (and up to 18×) speedup over the original (sequential) code. While prior research shows that transactions must commit in order to preserve program semantics, placing the ordering enforces scalability constraints at large number of cores. In this dissertation, we eliminates the need for commit transactions sequentially without affecting program consistency. This is achieved by building a cooperation mechanism in which transactions can forward some changes safely. This approach eliminates some of the false conflicts and increases the concurrency level of the parallel application. This thesis proposes a set of commit order algorithms that follow the aforementioned approach. Interestingly, using the proposed commit-order algorithms the peak gain over the sequential non-instrumented execution in RSTM micro benchmarks is 10× and 16.5× in STAMP. Another main contribution is to enhance the concurrency and the performance of TM in general, and its usage for parallelization in particular, by extending TM primitives. The extended TM primitives extracts the embedded low level application semantics without affecting TM abstraction. Furthermore, as the proposed extensions capture common code patterns, it is possible to be handled automatically through the compilation process. In this work, that was done through modifying the GCC compiler to support our TM extensions. Results showed speedups of up to 4× on different applications including micro benchmarks and STAMP. Our final contribution is supporting the commit-order through Hardware Transactional Memory (HTM). HTM contention manager cannot be modified because it is implemented inside the hardware. Given such constraint, we exploit HTM to reduce the transactional execution overhead by proposing two novel commit order algorithms, and a hybrid reduced hardware algorithm. The use of HTM improves the performance by up to 20% speedup. / Ph. D. Transaction Memory Automatic Parallelization Low-Level Virtual Machine Optimistic Concurrency Speculative Execution Legacy Systems Age Commitment Order Low-Level TM Semantics TM Friendly Semantics
40	Effective Automatic Computation Placement and Data Allocation for Parallelization of Regular Programs Chandan, G January 2014 (has links) (PDF) Scientiﬁc applications that operate on large data sets require huge amount of computation power and memory. These applications are typically run on High Performance Computing (HPC) systems that consist of multiple compute nodes, connected over an network interconnect such as InﬁniBand. Each compute node has its own memory and does not share the address space with other nodes. A signiﬁcant amount of work has been done in past two decades on parallelizing for distributed-memory architectures. A majority of this work was done in developing compiler technologies such as high performance Fortran (HPF) and partitioned global address space (PGAS). However, several steps involved in achieving good performance remained manual. Hence, the approach currently used to obtain the best performance is to rely on highly tuned libraries such as ScaLAPACK. The objective of this work is to improve automatic compiler and runtime support for distributed-memory clusters for regular programs. Regular programs typically use arrays as their main data structure and array accesses are afﬁne functions of outer loop indices and program parameters. A lot of scientiﬁc applications such as linear-algebra kernels, stencils, partial differential equation solvers, data-mining applications and dynamic programming codes fall in this category. In this work, we propose techniques for ﬁnding computation mapping and data allocation when compiling regular programs for distributed-memory clusters. Techniques for transformation and detection of parallelism, relying on the polyhedral framework already exist. We propose automatic techniques to determine computation placements for identiﬁed parallelism and allocation of data. We model the problem of ﬁnding good computation placement as a graph partitioning problem with the constraints to minimize both communication volume and load imbalance for entire program. We show that our approach for computation mapping is more effective than those that can be developed using vendor-supplied libraries. Our approach for data allocation is driven by tiling of data spaces along with a compiler assisted runtime scheme to allocate and deallocate tiles on-demand and reuse them. Experimental results on some sequences of BLAS calls demonstrate a mean speedup of 1.82× over versions written with ScaLAPACK. Besides enabling weak scaling for distributed memory, data tiling also improves locality for shared-memory parallelization. Experimental results on a 32-core shared-memory SMP system shows a mean speedup of 2.67× over code that is not data tiled. High Performance Computing Systems Computer Placement Data Analysis and Management Distributed-memory Automatic Parallelization Data-distribution Polyhedral Model Computation Placement Hyperplanes and Polyhedra HPC Systems Data Allocation and Management Data Movement Code Generation Computer Science

Search results