1. Formal verification-driven parallelisation synthesis. Botinčan, Matko. January 2018.
Concurrency is often an optimisation, rather than intrinsic to the functional behaviour of a program: a concurrent program is usually intended to achieve the same effect as a simpler sequential counterpart, just faster. Error-free concurrent programming remains a tricky problem, beyond the capabilities of most programmers. Consequently, an attractive alternative to manually developing a concurrent program is to synthesise one automatically. This dissertation presents two novel formal verification-based methods for safely transforming a sequential program into a concurrent one. The first method, an instance of proof-directed synthesis, takes as input a sequential program, its safety proof, and annotations on where to parallelise, and produces a correctly synchronised parallelised program along with a proof of that program. The method uses the sequential proof to guide the insertion of synchronisation barriers that ensure the parallelised program has the same behaviour as the original sequential version. The sequential proof, written in separation logic, need only capture shape properties, meaning we can parallelise complex heap-manipulating programs without verifying every aspect of their behaviour. The second method proposes specification-directed synthesis: given a sequential program, we extract a rich, stateful specification compactly summarising program behaviour, and use that specification for parallelisation. At the heart of the method is a learning algorithm which combines dynamic and static analysis. In particular, dynamic symbolic execution and the computational learning technique of grammar induction are used to conjecture input-output specifications, and counterexample-guided abstraction refinement is used to confirm or refute the equivalence between the conjectured specification and the original program. Once equivalence checking succeeds, from the inferred specifications we synthesise code that executes speculatively in parallel, enabling automated parallelisation of irregular loops that are not necessarily polyhedral, disjoint, or equipped with a static pipeline structure.
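As a rough illustration only (not taken from the dissertation), the C sketch below shows the shape of output such a proof-directed method targets: a sequential two-phase computation split across two threads, with a synchronisation barrier inserted so that the second phase never reads an element before the first phase has produced it. The names, data and partitioning scheme are invented for the example.

```c
/* Hypothetical illustration (not the dissertation's tool output): a sequential
 * two-phase computation split across threads, with a barrier inserted so
 * that phase 2 never reads an element before phase 1 has produced it. */
#include <pthread.h>
#include <stdio.h>

#define N 8
#define THREADS 2

static double a[N], b[N];
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / THREADS), hi = lo + (N / THREADS);

    for (long i = lo; i < hi; i++)        /* phase 1: produce a[] */
        a[i] = 2.0 * i;

    pthread_barrier_wait(&barrier);       /* synchronisation inserted here */

    for (long i = lo; i < hi; i++)        /* phase 2: consume a[] (reversed, so it crosses partitions) */
        b[i] = a[N - 1 - i] + 1.0;
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    pthread_barrier_init(&barrier, NULL, THREADS);
    for (long id = 0; id < THREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (long id = 0; id < THREADS; id++)
        pthread_join(t[id], NULL);
    for (int i = 0; i < N; i++)
        printf("b[%d] = %g\n", i, b[i]);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```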
2. Profile-driven parallelisation of sequential programs. Tournavitis, Georgios. January 2011.
Traditional parallelism detection in compilers is performed by means of static analysis, more specifically data and control dependence analysis. The information available at compile time, however, is inherently limited and therefore restricts the parallelisation opportunities. Furthermore, applications written in C, which represent the majority of today's scientific, embedded and system software, utilise many low-level features and an intricate programming style that forces the compiler into even more conservative assumptions. Despite the numerous proposals to handle this uncertainty at compile time using speculative optimisation and parallelisation, the software industry still lacks a pragmatic approach that extracts coarse-grain parallelism to exploit the multiple processing units of modern commodity hardware. This thesis introduces a novel approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C. We utilise profiling information to overcome the limitations of static data and control-flow analysis, enabling more aggressive parallelisation. Profiling is performed using an instrumentation scheme operating at the Intermediate Representation (IR) level of the compiler. In contrast to existing approaches that depend on low-level binary tools and debugging information, IR profiling provides precise and direct correlation of profiling information back to the IR structures of the compiler. Additionally, our approach is orthogonal to existing automatic parallelisation approaches, and additional fine-grain parallelism may be exploited. We demonstrate the applicability and versatility of the proposed methodology using two studies that target different forms of parallelism. First, we focus on the exploitation of loop-level parallelism that is abundant in many scientific and embedded applications. We evaluate our parallelisation strategy against the NAS and SPEC FP benchmarks on two different multi-core platforms (a shared-memory Intel Xeon SMP and a heterogeneous distributed-memory IBM Cell blade). Empirical evaluation shows that our approach not only yields significant improvements when compared with state-of-the-art parallelising compilers, but comes close to, and sometimes exceeds, the performance of manually parallelised codes. On average, our methodology achieves 96% of the performance of the hand-tuned parallel benchmarks on the Intel Xeon platform, and a significant speedup on the Cell platform. The second study addresses the problem of partially sequential loops, typically found in implementations of multimedia codecs. We develop a more powerful whole-program representation based on the Program Dependence Graph (PDG) that supports profiling, partitioning and code generation for pipeline parallelism. In addition, we demonstrate how this enhances conventional pipeline parallelisation by incorporating support for multi-level loops and pipeline stage replication in a uniform and automatic way. Experimental results using a set of complex multimedia and stream processing benchmarks confirm the effectiveness of the proposed methodology, which yields speedups of up to 4.7 on an eight-core Intel Xeon machine.
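To make the end goal concrete, here is a hedged C/OpenMP sketch (mine, not the thesis infrastructure): a loop whose independence static dependence analysis could not prove because of indirect array accesses, but which profiling could report as free of cross-iteration dependences and therefore annotate for coarse-grain parallel execution. The array names and the pragma placement are illustrative assumptions.

```c
/* Hypothetical sketch: a loop whose independence a static analysis might not
 * prove (indirect accesses through idx[]), but which profiling can report as
 * free of cross-iteration dependences, making it a candidate for
 * coarse-grain parallelisation. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];
    static int idx[N];

    for (int i = 0; i < N; i++) { x[i] = i; idx[i] = i; }  /* idx is a permutation */

    /* Annotation a profile-driven paralleliser could insert once the profile
       shows idx[] never maps two iterations to the same element. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        y[idx[i]] = 2.0 * x[i] + 1.0;

    printf("y[42] = %g\n", y[42]);
    return 0;
}
```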
3. Parallel solution of linear programs. Smith, Edmund. January 2013.
The factors limiting the performance of computer software periodically undergo sudden shifts, resulting from technological progress, and these shifts can have profound implications for the design of high performance codes. At the present time, the speed with which hardware can execute a single stream of instructions has reached a plateau. It is now the number of instruction streams that may be executed concurrently which underpins estimates of compute power, and with this change, a critical limitation on the performance of software has come to be the degree to which it can be parallelised. The research in this thesis is concerned with the means by which codes for linear programming may be adapted to this new hardware. For the most part, it is codes implementing the simplex method which will be discussed, though these typically have lower performance for single solves than codes implementing interior point methods. However, the ability of the simplex method to rapidly re-solve a problem makes it at present indispensable as a subroutine for mixed integer programming. The long history of the simplex method as a practical technique, with applications in many industries and in government, has led to such codes reaching a great level of sophistication. It would be unexpected in a research project such as this one to match the performance of top commercial codes with many years of development behind them. The simplex codes described in this thesis are, however, able to solve real problems of small to moderate size, rather than being confined to random or otherwise artificially generated instances. The remainder of this thesis is structured as follows. The rest of this chapter gives a brief overview of the essential elements of modern parallel hardware and of the linear programming problem. Both the simplex method and interior point methods are discussed, along with some of the key algorithmic enhancements required for such systems to solve real-world problems. Some background on the parallelisation of both types of code is given. The next chapter describes two standard simplex codes designed to exploit the current generation of hardware. i6 is a parallel standard simplex solver capable of being applied to a range of real problems, and shows exceptional performance for dense, square programs. i8 is also a parallel standard simplex solver, but implemented for graphics processing units (GPUs).
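As a hedged illustration of why the standard simplex method parallelises well on dense problems (not code from i6 or i8), the C/OpenMP sketch below parallelises one step, Dantzig pricing, where each thread scans a slice of the reduced costs for the most negative entry. Function and variable names are invented.

```c
/* Hypothetical sketch of one step that parallelises naturally in a dense
 * standard simplex code: Dantzig pricing, i.e. scanning the reduced costs to
 * pick the entering column.  Each thread finds a local best and the results
 * are combined in a critical section. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int price(const double *reduced_cost, int n) {
    int best = -1;            /* -1 means optimal: no negative reduced cost */
    double best_val = -1e-9;  /* tolerance for treating a cost as negative */

    #pragma omp parallel
    {
        int local_best = -1;
        double local_val = -1e-9;
        #pragma omp for nowait
        for (int j = 0; j < n; j++)
            if (reduced_cost[j] < local_val) {
                local_val = reduced_cost[j];
                local_best = j;
            }
        #pragma omp critical
        if (local_val < best_val) { best_val = local_val; best = local_best; }
    }
    return best;
}

int main(void) {
    int n = 1 << 20;
    double *d = malloc(n * sizeof *d);
    for (int j = 0; j < n; j++)                      /* synthetic reduced costs */
        d[j] = (j == 12345) ? -7.0 : (j % 97) - 3.5; /* unique minimum at 12345 */
    printf("entering column: %d\n", price(d, n));
    free(d);
    return 0;
}
```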
4. A Parallel Navier Stokes Solver for Natural Convection and Free Surface Flow. Norris, Stuart Edward. January 2001.
A parallel numerical method has been implemented for solving the Navier-Stokes equations on Cartesian and non-orthogonal meshes. To ensure the accuracy of the code, first-, second- and third-order differencing schemes, with and without flux limiters, have been implemented and tested. The most computationally expensive task in the code is the solution of linear equations, and a number of linear solvers have been tested to determine the most efficient. Krylov-space, incomplete-factorisation and other iterative and direct solvers from the literature have been implemented, and have been compared with a novel black-box multigrid linear solver that has been developed both as a solver and as a preconditioner for the Krylov-space methods. To further reduce execution time, the code was parallelised after a series of experiments comparing the suitability of different parallelisation techniques and computer architectures for the Navier-Stokes solver. The code has been applied to the solution of two classes of problem. Two natural convection flows were studied, with an initial study of two-dimensional Rayleigh-Bénard convection being followed by a study of a transient three-dimensional flow, in both cases with the results being compared with experiment. The second class of problems modelled were free surface flows. A two-dimensional free-surface driven cavity and a two-dimensional flume flow were modelled, the latter being compared with analytic theory. Finally, a three-dimensional ship flow was modelled, with the flow about a Wigley hull being simulated for a range of Reynolds and Froude numbers.
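For a concrete feel of the kind of kernel such a linear solver and its multigrid preconditioner are built from, here is a minimal C/OpenMP sketch (not the thesis code) of one weighted-Jacobi smoothing sweep for the 2D Poisson equation. Grid size, relaxation weight and names are illustrative assumptions.

```c
/* Hypothetical sketch (not the thesis code): one weighted-Jacobi smoothing
 * sweep for the 2D Poisson equation on a uniform grid, the kind of kernel
 * that both a multigrid solver and a Krylov preconditioner are built from. */
#include <stdio.h>
#include <string.h>

#define N 256                          /* interior points per dimension */
#define IDX(i, j) ((i) * (N + 2) + (j))

void jacobi_sweep(double *u, const double *u_old, const double *f,
                  double h, double omega) {
    #pragma omp parallel for
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) {
            double jac = 0.25 * (u_old[IDX(i - 1, j)] + u_old[IDX(i + 1, j)] +
                                 u_old[IDX(i, j - 1)] + u_old[IDX(i, j + 1)] +
                                 h * h * f[IDX(i, j)]);
            u[IDX(i, j)] = (1.0 - omega) * u_old[IDX(i, j)] + omega * jac;
        }
}

int main(void) {
    static double u[(N + 2) * (N + 2)], u_old[(N + 2) * (N + 2)],
                  f[(N + 2) * (N + 2)];
    double h = 1.0 / (N + 1);
    for (int k = 0; k < (N + 2) * (N + 2); k++) f[k] = 1.0;  /* constant source */

    for (int sweep = 0; sweep < 100; sweep++) {
        memcpy(u_old, u, sizeof u);
        jacobi_sweep(u, u_old, f, h, 2.0 / 3.0);             /* omega = 2/3 */
    }
    printf("centre value after 100 sweeps: %g\n", u[IDX(N / 2, N / 2)]);
    return 0;
}
```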
5. MDRIP: A Hybrid Approach to Parallelisation of Discrete Event Simulation. Chao, Daphne (Yu Fen). January 2006.
The research project reported in this thesis considers Multiple Distributed Replications in Parallel (MDRIP), a hybrid approach to the parallelisation of quantitative stochastic discrete-event simulation. Parallel Discrete-Event Simulation (PDES) generally covers distributed simulation or simulation with replicated trials. Distributed simulation requires model partitioning and synchronisation among submodels. Simulation with replicated trials can be executed on-line by applying Multiple Replications in Parallel (MRIP). MDRIP has been proposed for overcoming problems related to the large size and complexity of simulated models, as well as the problem of controlling the accuracy of the final simulation results. A survey of PDES investigates several primary issues which are directly related to the parallelisation of DES, together with a secondary issue related to implementation efficiency. Statistical analysis, as a supporting issue, is also described; the AKAROA2 package is an implementation that makes such statistical analysis effortless. Existing solutions proposed for PDES have focused exclusively on collecting output data during simulation and analysing these data once the simulation is finished. Such off-line statistical analysis of output data offers no control over the statistical errors of the final estimates. On-line control of statistical errors during simulation has been successfully implemented in AKAROA2, an automated controller of output data analysis during simulation executed in MRIP. However, AKAROA2 cannot be applied directly to distributed simulation. This thesis reports the results of a research project aimed at employing AKAROA2 for launching multiple replications of distributed simulation models and for on-line sequential control of statistical errors associated with a distributed performance measure, i.e. a performance measure which depends on output data generated by a number of submodels of a distributed simulation. We report the changes required in the architecture of AKAROA2 to make MDRIP possible. A new MDRIP-related component of AKAROA2, a distributed simulation engine called mdrip engine, is introduced. Stochastic simulation in its MDRIP version, as implemented in AKAROA2, has been tested in a number of simulation scenarios. We discuss two specific simulation models employed in our tests: (i) a model consisting of independent queues, and (ii) a queueing network consisting of a tandem connection of queueing systems. In the first case, we look at the correctness of message orderings from the distributed messages. In the second case, we look at the correctness of output data analysis when the analysed performance measures require data from all submodels of a given (distributed) simulation model. Our tests confirm the correctness of our mdrip engine design in the cases considered, i.e. in models in which causality errors do not occur. However, we argue that the same design principles should be applicable in the case of distributed simulation models with (potential) causality errors.
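As a rough sketch of the MRIP idea that MDRIP builds on (not AKAROA2 itself), the C/OpenMP example below runs independent replications of an M/M/1 queue in parallel, each with its own random stream, and combines their means into an approximate confidence interval. The model, seeds and the fixed number of replications are invented for illustration; AKAROA2's actual sequential stopping control is not shown.

```c
/* Hypothetical sketch of the MRIP idea (not AKAROA2): independent
 * replications of the same stochastic model run in parallel, each with its
 * own random stream, and their means combined into a confidence interval. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 16
#define CUSTOMERS 200000

/* One replication: mean waiting time in an M/M/1 queue via Lindley's recursion. */
double replication(unsigned seed, double lambda, double mu) {
    unsigned s = seed;
    double wait = 0.0, total = 0.0;
    for (int i = 0; i < CUSTOMERS; i++) {
        double u1 = (rand_r(&s) + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand_r(&s) + 1.0) / ((double)RAND_MAX + 2.0);
        double interarrival = -log(u1) / lambda;
        double service = -log(u2) / mu;
        wait = fmax(0.0, wait + service - interarrival);  /* wait of next customer */
        total += wait;
    }
    return total / CUSTOMERS;
}

int main(void) {
    double est[REPS];
    #pragma omp parallel for
    for (int r = 0; r < REPS; r++)
        est[r] = replication(12345u + 1000u * r, 0.8, 1.0);  /* distinct seeds */

    double mean = 0.0, var = 0.0;
    for (int r = 0; r < REPS; r++) mean += est[r] / REPS;
    for (int r = 0; r < REPS; r++) var += (est[r] - mean) * (est[r] - mean) / (REPS - 1);
    double half = 1.96 * sqrt(var / REPS);                   /* rough 95% half-width */
    printf("mean wait ~ %.3f +/- %.3f (theory Wq = 4.0)\n", mean, half);
    return 0;
}
```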
6. Parallelising High Order Transform of Point Spread Function and Template Subtraction for Astronomic Image Subtraction: The implementation of BACH. Wång, Annie; Lells, Victor. January 2023.
This thesis explores possible improvements, using parallel computing, to the PSF-alignment and image subtraction algorithm found in HOTPANTS. In time-domain astronomy the PSF-alignment and image subtraction algorithm OIS is used to identify transient events. HOTPANTS is a software package based on OIS, the software package ISIS, and other subsequent research done to improve OIS. A parallel GPU implementation of the algorithm from HOTPANTS, henceforth known as BACH, was created for this thesis. The goal of this thesis is to answer two questions: "what parts of HOTPANTS are most suited for parallelisation?" and "how does BACH perform compared to HOTPANTS and SFFT?", the latter being another PSF-alignment and image subtraction tool. The authors found that the parts most amenable to parallelisation were the convolution and subtraction steps, although the subtraction did not show a significant improvement over its sequential counterpart. The other parts of HOTPANTS were deemed too complex to implement in parallel on the GPU; however, some parts could probably either be partly parallelised on the GPU or parallelised using the CPU. BACH was always at least as fast as HOTPANTS; it was generally 2 times faster, and up to 4.5 times faster in some test cases. It was also faster than SFFT, but this result was not equivalent to the result presented in [15], which is why the authors of this thesis believe something was wrong with either the installation of SFFT or the hardware used to test it.
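The sketch below is a hedged CPU-side illustration (C with OpenMP rather than the GPU code developed for BACH) of the two steps identified as most parallelisable: convolving the reference image with a kernel and subtracting the result from the science image. The box-blur kernel stands in for a real PSF-matching kernel, and all names and sizes are assumptions.

```c
/* Hypothetical sketch of the convolution and subtraction steps, parallelised
 * per image row with OpenMP (a CPU stand-in for the GPU implementation). */
#include <stdio.h>
#include <stdlib.h>

#define W 1024
#define H 1024
#define K 5            /* kernel width (odd) */

void convolve_and_subtract(const float *ref, const float *sci,
                           const float *kernel, float *diff) {
    int half = K / 2;
    #pragma omp parallel for
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            float acc = 0.0f;
            for (int ky = -half; ky <= half; ky++)
                for (int kx = -half; kx <= half; kx++) {
                    int yy = y + ky, xx = x + kx;
                    if (yy >= 0 && yy < H && xx >= 0 && xx < W)
                        acc += ref[yy * W + xx] * kernel[(ky + half) * K + kx + half];
                }
            diff[y * W + x] = sci[y * W + x] - acc;   /* subtraction step */
        }
}

int main(void) {
    float *ref = calloc(W * H, sizeof *ref), *sci = calloc(W * H, sizeof *sci);
    float *diff = calloc(W * H, sizeof *diff);
    float kernel[K * K];
    for (int i = 0; i < K * K; i++) kernel[i] = 1.0f / (K * K); /* box blur as PSF stand-in */
    for (int i = 0; i < W * H; i++) { ref[i] = 1.0f; sci[i] = 1.2f; }

    convolve_and_subtract(ref, sci, kernel, diff);
    printf("diff at centre: %f\n", diff[(H / 2) * W + W / 2]);
    free(ref); free(sci); free(diff);
    return 0;
}
```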
7. Fuzzy Gradual Pattern Mining Based on Multi-Core Architectures. Quintero Flores, Perfecto Malaquias. 19 March 2013.
Gradual patterns aim at describing co-variations in data, such as "the older, the higher the salary". They have been increasingly studied from the data mining point of view in recent years, leading to several ways of defining their meaning and several algorithms to extract them automatically. These approaches consider that data can be ordered with respect to the values taken on the attributes (e.g. the age and the salary). However, in many application domains it is hardly possible to consider that data values are crisply ordered. For instance, when considering gene expression, it is not meaningful from the biological point of view to say that Gene 1 is more expressed than Gene 2 if their levels of expression only differ at the tenth decimal. This thesis therefore considers fuzzy orderings and proposes both formal definitions and algorithms to extract gradual patterns under fuzzy orderings. As these algorithms are both time and memory consuming, we propose optimisations based on an efficient storage of the fuzzy ordering information coupled with parallel algorithms. Experimental results on synthetic and real databases show the interest of our proposal.
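As an illustration only (not the thesis's exact definitions), the C sketch below shows one possible fuzzy ordering: a membership degree for "x is greater than y" that grows with the difference up to a tolerance, used to score the gradual pattern "the older, the higher the salary" over all ordered pairs of a toy data set with a min t-norm. The tolerances and support formula are invented assumptions.

```c
/* Hypothetical sketch of the fuzzy-ordering idea: instead of a crisp test
 * x > y, a membership degree in [0,1] that grows with the difference, so
 * near-equal values only weakly support "greater than". */
#include <stdio.h>

/* Degree to which x is "clearly greater" than y, with tolerance t:
   0 when x <= y, rising linearly, and 1 once x exceeds y by at least t. */
double fuzzy_greater(double x, double y, double t) {
    double d = x - y;
    if (d <= 0.0) return 0.0;
    if (d >= t)   return 1.0;
    return d / t;
}

int main(void) {
    /* Toy records (age, salary): degree to which each ordered pair supports
       the gradual pattern "the older, the higher the salary". */
    double age[]    = {25, 30, 30.05, 47};
    double salary[] = {1800, 2300, 2290, 3900};
    int n = 4;
    double support = 0.0, pairs = 0.0;

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j) {
                double d_age = fuzzy_greater(age[j], age[i], 1.0);        /* tolerance: 1 year */
                double d_sal = fuzzy_greater(salary[j], salary[i], 100);  /* tolerance: 100 */
                support += (d_age < d_sal ? d_age : d_sal);               /* min t-norm */
                pairs += 1.0;
            }
    printf("fuzzy support of the pattern: %.3f\n", support / pairs);
    return 0;
}
```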
8. Monte Carlo parallel simulations applied to High Energy Physics for manycore and multicore platforms: development, optimisation, reproducibility. Schweitzer, Pierre. 19 October 2015.
During this thesis, we focused on High Performance Computing, specifically on Monte Carlo simulations applied to High Energy Physics. We worked on simulations dedicated to the propagation of particles through matter. Monte Carlo simulations require significant CPU time and memory footprint. Our first Monte Carlo simulation was taking more time to simulate the physical phenomenon than the said phenomenon required to happen in the experimental conditions. It raised a real performance issue. The minimal technical aim of the thesis was to have a simulation requiring as much time as the real observed phenomenon; our maximal target was to have a much faster simulation. Indeed, these simulations are critical to assess our correct understanding of what is observed during experimentation. The more simulated statistical samples we have, the better our results. This initial state of our simulation allowed numerous perspectives regarding optimisation and high performance computing. Furthermore, in our case, increasing the performance of the simulation was pointless if it came at the cost of losing the reproducibility of results. The numerical reproducibility of the simulation was therefore an aspect we had to take into account. In this manuscript, after a state of the art about profiling, optimisation and reproducibility, we proposed several strategies to gain more performance in our simulations. In each case, the proposed optimisations followed a profiling step: one never optimises without having profiled first. Then, we looked at the design of a parallel profiler using aspect-oriented programming for our specific needs. Finally, we took a new look at the issues raised by our Monte Carlo simulations: instead of optimising existing simulations, we proposed methods for developing a new simulation from scratch, having in mind that it is for High Performance Computing and that it has to be statistically sound, reproducible and scalable. In all our proposals, we looked at both multicore and manycore architectures from Intel to benchmark the performance on a server-oriented architecture and a High Performance Computing oriented architecture. Through the implementation of our proposals, we were able to optimise one of the Monte Carlo simulations, permitting us to achieve a 400X speedup once optimised and parallelised on a computing node with 32 physical cores. We were also able to implement a profiler with aspects, able to deal with the parallelism both of the machine on which it is executed and of the application it profiles. Moreover, because it relies on aspects, it is portable and not tied to any specific architecture. Finally, we implemented the simulation designed to be reproducible, scalable and to have statistically sound results. We observed that these goals could be achieved whatever the target architecture for execution. This enabled us to validate our method for checking the numerical reproducibility of a simulation.
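As a hedged sketch of one way to obtain the numerical reproducibility discussed above (not the thesis's implementation), the C/OpenMP example below makes a parallel Monte Carlo estimate of pi reproducible regardless of the thread count: each sample draws from a counter-based stream derived from (seed, sample index), and the tally is an integer reduction, which is exact. The mixing function and constants are illustrative.

```c
/* Hypothetical sketch of reproducible parallel Monte Carlo: every sample i
 * draws from its own counter-based stream derived from (seed, i), and the
 * per-thread tallies are integers, so the result is identical whatever the
 * number of threads or the scheduling. */
#include <omp.h>
#include <stdint.h>
#include <stdio.h>

/* SplitMix64-style mix: a deterministic 64-bit hash of the sample index. */
static uint64_t mix(uint64_t z) {
    z += 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

int main(void) {
    const uint64_t seed = 42;
    const long samples = 50000000L;
    long hits = 0;

    #pragma omp parallel for reduction(+ : hits) schedule(static)
    for (long i = 0; i < samples; i++) {
        uint64_t r = mix(seed ^ (uint64_t)i);
        double x = (r >> 11) * (1.0 / 9007199254740992.0);      /* 53-bit uniform in [0,1) */
        double y = (mix(r) >> 11) * (1.0 / 9007199254740992.0);
        if (x * x + y * y < 1.0) hits++;                        /* integer tally: exact reduction */
    }
    printf("pi ~ %.6f (threads available: %d)\n",
           4.0 * (double)hits / samples, omp_get_max_threads());
    return 0;
}
```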
9. A Study of the Performance Benefits of Controlling Parallel Asynchronous Iterative Applications. Joseph, P J. 09 1900.
High-performance networks of workstations are becoming an increasingly popular parallel computing platform because of their lower cost. Both message passing and software distributed shared memory (DSM) programming paradigms have been developed and employed on such distributed hardware platforms. An important performance bottleneck in these systems is the effective data transmission latency, which is poorer than in high-speed parallel computer interconnection networks.
Iterative algorithms are used in a large class of applications, such as the solution of partial differential equations, optimization problems, solutions to systems of linear equations, and so on. These can be parallelized in a straightforward fashion, with each node computing a part of the data set and synchronizing and exchanging data with the other nodes as required. But these synchronous versions perform poorly when message transmission delays are high, as is the case in networks of workstations. The asynchronous parallel versions of these algorithms provide an additional degree of freedom to address large data transmission latencies. These algorithms do not synchronize, and behave correctly in the presence of losses and delays in the propagation of updates. Thus, in shared memory systems they do not synchronize accesses to shared data, and they will work correctly even in the presence of delays and losses in updates. This gives asynchronous algorithms a performance advantage over their synchronous counterparts, since synchronization costs are avoided and further computation can be overlapped with communication.
The message generation rate of asynchronous algorithms is, however, greater than that of their synchronous counterparts. Uncontrolled asynchronous runs can create a large network load resulting in large queuing delays, which in turn can increase the message generation of the asynchronous algorithms. This is especially likely in lower-bandwidth networks such as networks of workstations. Such a positive feedback loop leads to unstable network conditions.
Recent work has tried to improve the performance of asynchronous algorithms on a distributed shared memory (DSM) system by adaptively buffering shared memory updates depending on the network load, and transmitting multiple updates together. This reduces congestion and message transmission overhead, but can still result in slow convergence, since nothing is guaranteed about the update propagation delay. Also, although adaptive throttling of messages will kick in once the network gets heavily loaded, it cannot prevent the initial flooding. Furthermore, the processor is not freed when computation with the available values does not result in much further convergence.
In this thesis we present an alternative method of controlling iterative algorithms and present performance results for the same. We propose a new system-supported blocking read primitive, termed Global-Read, that is guaranteed to return a value of acceptable age for the specified location in a DSM system. The main idea is to enforce an upper bound on the age of shared updates seen by a node, in a manner visible to the underlying system (DSM). Information about processes being blocked can be used for adapting the underlying system, especially the network, towards better performance. A reading process is throttled until its Global-Read is satisfied, thus implementing program-level flow control and also freeing the processor. The Global-Read can also help in communication-based scheduling of processes.
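A minimal sketch of the Global-Read idea follows, under my own assumptions about its interface rather than the thesis's actual primitive: a shared cell carries the iteration number of its last update, and a reader at iteration k blocks until the value it sees is at most MAX_AGE iterations old, i.e. until the writer has reached at least iteration k - MAX_AGE.

```c
/* Hypothetical sketch of a Global-Read-style bounded-staleness read, using a
 * writer thread and a blocking reader on a single shared cell. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define ITERS 20
#define MAX_AGE 3

typedef struct {
    double value;
    int iter;                               /* iteration that produced value */
    pthread_mutex_t lock;
    pthread_cond_t fresh;
} cell_t;

static cell_t cell = { 0.0, 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

void cell_write(double v, int iter) {
    pthread_mutex_lock(&cell.lock);
    cell.value = v;
    cell.iter = iter;
    pthread_cond_broadcast(&cell.fresh);    /* wake readers waiting for freshness */
    pthread_mutex_unlock(&cell.lock);
}

double global_read(int reader_iter) {       /* blocks until age <= MAX_AGE */
    pthread_mutex_lock(&cell.lock);
    while (cell.iter < reader_iter - MAX_AGE)
        pthread_cond_wait(&cell.fresh, &cell.lock);
    double v = cell.value;
    pthread_mutex_unlock(&cell.lock);
    return v;
}

void *writer(void *arg) {
    (void)arg;
    for (int k = 1; k <= ITERS; k++) {
        usleep(20000);                      /* pretend to compute */
        cell_write(k * 1.5, k);
    }
    return NULL;
}

int main(void) {
    pthread_t w;
    pthread_create(&w, NULL, writer, NULL);
    for (int k = 1; k <= ITERS; k++) {      /* reader runs its own iterations */
        double v = global_read(k);          /* throttled if the value is too stale */
        printf("reader iteration %2d saw value %5.1f\n", k, v);
    }
    pthread_join(w, NULL);
    return 0;
}
```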
Performance evaluation using a benchmark from Cray, on a network of workstations and on the IBM SP2 parallel computer, showed good performance improvements. We also present the results of a systematic study wherein we implement parallel code for different iterative techniques for the solution of the Laplace equation using PVM, and characterize when controlled asynchrony is beneficial. We studied the improvements in computation time and analyzed the sources of this improvement, using various inputs and degrees of parallelism, on both the IBM SP2 and a network of workstations. We find significant performance improvements from controlling asynchrony when the traffic generated by the asynchronous algorithm becomes more than what can be sustained by the network. But we found that the scalability of the applications is limited by the software overhead for messages. Systems which have reduced software overhead will show very good scalable performance for controlled asynchrony.