31 |
Tools and Methods for Analysis, Debugging, and Performance Improvement of Equation-Based Models
Sjölund, Martin January 2015
Equation-based object-oriented (EOO) modeling languages such as Modelica provide a convenient, declarative method for describing models of cyber-physical systems. Because EOO languages are easy to use, large and complex models can be built with limited effort. However, current state-of-the-art tools do not give the user enough information when errors appear or simulation results are wrong. Such tools must give the user enough information to correct errors or to locate the problems that lead to wrong simulation results. However, understanding the model translation process of an EOO compiler is a daunting task that requires knowledge not only of the numerical algorithms that the tool executes during simulation, but also of the complex symbolic transformations being performed. As part of this work, methods have been developed and explored in which the EOO tool, an enhanced Modelica compiler, records the transformations during the translation process in order to provide better diagnostics, explanations, and analysis. This information is used to generate better error messages during translation. It is also used to provide better debugging for a simulation that produces unexpected results or in which numerical methods fail. Meeting deadlines is particularly important for real-time applications. It is usually essential to identify possible bottlenecks and either simplify the model or give hints to the compiler that enable it to generate faster code. When profiling and measuring execution times of parts of the model, the recorded information can also be used to find out why a particular system model executes slowly. Combined with debugging information, it is possible to find out why a given system of equations is slow to solve, which helps in understanding what can be done to simplify the model. A tool with a graphical user interface has been developed to make debugging and performance profiling easier. Both debugging and profiling have been combined into a single view so that performance metrics are mapped to equations, which are mapped to debugging information. The algorithmic part of Modelica was extended with meta-modeling constructs (MetaModelica) for language modeling. In this context, a quite general approach to debugging and compilation from (extended) Modelica to C code was developed. That makes it possible to use the same executable format for simulation executables as for compiler bootstrapping, when the compiler written in MetaModelica compiles itself. Finally, a method and tool prototype suitable for speeding up simulations has been developed. It works by partitioning the model at appropriate places and compiling a simulation executable for a suitable parallel platform.
|
32 |
Modeling of Out-of-Order Execution Processors
Ήλκος, Ιωάννης 25 February 2010
In recent years, advances in computer architecture and semiconductor technology have made microprocessor design a very complex and difficult process. Traditionally, designers have used full cycle-by-cycle simulation to assess the efficiency of the system under development. Unfortunately, this process is complex, time-consuming, and provides no insight into the interactions between the building blocks of a modern processor.
In this thesis, we present a generic design of a superscalar out-of-order processor. Based on this design, we build an analytical performance model derived from the parallelism of the code to be executed and the processor design parameters. The foundation of this model is that a well-designed superscalar processor maintains a steady performance level at all times, the sole exception being the occurrence of miss events (cache misses, branch mispredictions). Therefore, we present a steady-state performance model, and we model each type of miss event and its impact in isolation. Finally, we assess the overall performance of a generic out-of-order processor.
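The additive structure described in this abstract (a steady-state performance level plus a separate contribution from each type of miss event) can be sketched roughly as follows. This is a minimal illustration, assuming that each miss event type adds an independent penalty on top of the steady-state cycles per instruction. The function name, the parameters, and all numbers are hypothetical and are not taken from the thesis.

```python
def estimate_cpi(steady_state_ipc, miss_events):
    """Estimate cycles per instruction as steady-state CPI plus miss-event penalties.

    miss_events maps an event name to a pair
    (events_per_instruction, penalty_cycles_per_event).
    """
    base_cpi = 1.0 / steady_state_ipc
    # Assumption: penalties of different miss events simply add up.
    penalty_cpi = sum(rate * penalty for rate, penalty in miss_events.values())
    return base_cpi + penalty_cpi


# Hypothetical numbers, for illustration only.
example_events = {
    "branch_misprediction": (0.01, 10.0),
    "l2_cache_miss": (0.001, 200.0),
}
example_cpi = estimate_cpi(4.0, example_events)
```

Treating each event type in isolation keeps the model easy to inspect: the contribution of branch mispredictions or cache misses to the total can be read off term by term.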
|
33 |
In-Situ and In-Transit Methods: Toward a Continuum Between Interactive and Offline Applications at Large Scale
Dreher, Matthieu 25 February 2015
Parallel simulations have become a powerful tool in several scientific areas. To simulate complex phenomena, these simulations run on large parallel machines. The computational power available on those machines has increased considerably in recent years, making it possible to simulate very large models. Unfortunately, the I/O capabilities needed to save the data produced by simulations have not grown at the same pace. It is already difficult to save all the needed data and to have enough computational power to analyze them afterwards. At exascale, it is expected that less than 1% of the total data produced by simulations will be saved. Yet these data may lead to major discoveries. In-situ analytics are a promising solution to this problem. The idea is to process the data while the simulation is still running and the data are in memory. This way, the I/O bottleneck is avoided and the computational power available on parallel machines can also be used for analytics. In this thesis, we propose to use the dataflow paradigm to enable the construction of complex in-situ applications. We rely on the FlowVR middleware, which is designed to couple heterogeneous parallel codes by creating communication channels between them to form a graph. FlowVR is flexible enough to allow several placement strategies for analysis processes: on simulation nodes, on dedicated cores, or on dedicated nodes. Moreover, in-situ analytics can be executed asynchronously, leading to a low impact on simulation performance. To demonstrate the flexibility of our approach, we used Gromacs, a commonly used parallel molecular dynamics simulation package that scales to several thousand cores, as the target application. With the help of biology experts, we have built several realistic applications. The first one allows a user to steer a molecular simulation toward a desired state. To do so, we have coupled Gromacs with a live viewer and a haptic device. The user can then apply forces to drive molecular systems of more than 1 million atoms. Our second application focuses on long simulations running in batch mode on supercomputers. We replace the native writing method of Gromacs with two methods in our infrastructure. We also propose a parallel rendering algorithm that adapts to various placement strategies. Finally, we study how biologists might use in-situ applications. We propose a unified framework able to run treatments on interactive simulations, on long simulations, and in post-processing.
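The idea of analyzing data asynchronously while the simulation is still running can be sketched as below. This is only an illustration using Python threads and a bounded queue as a stand-in for a communication channel. It does not use or imitate the FlowVR API, and the function `run_in_situ`, its parameters, and the toy "simulation step" are all hypothetical.

```python
import queue
import threading


def run_in_situ(num_steps, analyse, channel_size=4):
    """Run a toy simulation and analyse its frames while it is still running."""
    channel = queue.Queue(maxsize=channel_size)
    results = []
    dropped = 0

    def analysis_worker():
        while True:
            frame = channel.get()
            if frame is None:  # sentinel: the simulation has finished
                break
            results.append(analyse(frame))

    worker = threading.Thread(target=analysis_worker)
    worker.start()

    state = 0.0
    for step in range(num_steps):
        state += 0.5  # stand-in for one simulation time step
        try:
            channel.put_nowait((step, state))  # never block the simulation
        except queue.Full:
            dropped += 1  # analysis is lagging: skip this frame

    channel.put(None)  # blocking put, so the sentinel is always delivered
    worker.join()
    return results, dropped
```

In this sketch the simulation loop never waits for the analysis: when the channel is full, the frame is dropped. That is one possible policy among several, chosen here only to show how asynchronous coupling can keep the impact on the simulation low.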
|
34 |
Parallel programming in Go and Scala: A performance comparison
Johnell, Carl January 2015
This thesis provides a performance comparison of parallel programming in Go and Scala. Go supports concurrency through goroutines and channels. Scala has parallel collections, futures, and actors that can be used for concurrent and parallel programming. The experiment used two different types of algorithms to compare the performance of Go and Scala. Parallel versions of matrix multiplication and matrix chain multiplication were implemented with goroutines and channels in Go. Matrix multiplication was implemented with parallel collections and futures in Scala, and chain multiplication was implemented with actors. The results of the study show that Scala has better performance than Go; parallel matrix multiplication was about 3x faster in Scala. However, goroutines and channels are more efficient than actors. Go performed better than Scala when the number of goroutines and actors increased in the benchmark for parallel chain multiplication. Both Go and Scala have features that make parallel programming easier, but I found Go easier to learn and understand than Scala. I recommend that anyone interested in Go try it out because of its ease of use.
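The decomposition behind a parallel matrix multiplication can be sketched as follows: each row of the result is an independent task. This is a language-neutral illustration written in Python with a thread pool. It is not the Go or Scala code from the thesis, and the function names are hypothetical. In standard CPython builds the global interpreter lock usually prevents pure-Python arithmetic from speeding up with threads, so the sketch shows the task structure only, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_row(row, b):
    """Compute one row of the product: row (1 x n) times b (n x p)."""
    return [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]


def parallel_matmul(a, b, workers=4):
    """Multiply matrices a and b, computing each output row as a separate task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so rows stay in place.
        return list(pool.map(lambda row: multiply_row(row, b), a))
```

Rows are a natural unit of work here because no task writes to data that another task reads, so no synchronization is needed beyond collecting the results.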
|
35 |
Parallel Reservoir Simulations with Sparse Grid Techniques and Applications to Wormhole Propagation
Wu, Yuanqing 08 September 2015
In this work, two topics in reservoir simulation are discussed. The first topic is two-phase compositional flow simulation in hydrocarbon reservoirs. The major obstacle that limits the applicability of the simulation code is its long run time, so speeding up the code is necessary. Two means are demonstrated to address the problem: parallelism in physical space and the application of sparse grids in parameter space. The parallel code achieves satisfactory scalability, and the sparse grids remove the bottleneck of flash calculations. Instead of carrying out the flash calculation in each time step of the simulation, a sparse grid approximation of all possible results of the flash calculation is generated before the simulation. The constructed surrogate model is then evaluated to approximate the flash calculation results during the simulation. The second topic is wormhole propagation simulation in carbonate reservoirs. Unlike the traditional simulation technique, which relies on the Darcy framework, we propose a new framework, called the Darcy-Brinkman-Forchheimer (DBF) framework, to simulate wormhole propagation. Furthermore, to process the large number of cells in the simulation grid and shorten the long simulation time of the traditional serial code, standard domain-based parallelism is employed, using the Hypre multigrid library. In addition, a new technique called the “experimenting field approach” is introduced to set coefficients in the model equations. In the 2D dissolution experiments, different configurations of wormholes and a series of properties simulated by both frameworks are compared. We conclude that the numerical results of the DBF framework resemble wormholes more closely and are more stable than those of the Darcy framework, which demonstrates the advantages of the DBF framework. The scalability of the parallel code is also evaluated, and good scalability can be achieved. Finally, a mixed finite element scheme is proposed for the wormhole simulation.
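The surrogate idea from the first topic (tabulate an expensive calculation before the simulation, then evaluate a cheap approximation during it) can be sketched as below. Note the substitution: this sketch uses a regular one-dimensional grid with linear interpolation, which is much simpler than the sparse grids used in the thesis. The function `expensive_flash` is a hypothetical stand-in and is not a real flash calculation.

```python
import bisect
import math


def expensive_flash(pressure):
    """Hypothetical stand-in for a costly flash calculation."""
    return math.exp(-pressure)


def build_surrogate(func, lower, upper, num_points):
    """Tabulate func on a regular grid once, before the simulation starts."""
    step = (upper - lower) / (num_points - 1)
    xs = [lower + i * step for i in range(num_points)]
    ys = [func(x) for x in xs]
    return xs, ys


def evaluate_surrogate(table, x):
    """Approximate func(x) by linear interpolation between tabulated points."""
    xs, ys = table
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the tabulated range")
    # Index of the grid interval that contains x, clamped to a valid interval.
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])
```

The trade-off is the usual one for surrogates: the up-front cost of `build_surrogate` is paid once, while every time step only pays for a table lookup and one interpolation.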
|
36 |
Concurrency Optimization for Integrative Network Analysis
Barnes, Robert Otto II 12 June 2013
Virginia Tech's Computational Bioinformatics and Bio-imaging Laboratory (CBIL) is exploring integrative network analysis techniques to identify subnetworks or genetic pathways that contribute to various cancers. Chen et al. developed a bagging Markov random field (BMRF)-based approach, which examines gene expression data with prior biological information to reliably identify significant genes and proteins. Random resampling with replacement (bootstrapping or bagging) is essential to obtaining confident results but is computationally demanding, as multiple iterations of the network identification (by simulated annealing) are required. The MATLAB implementation is computationally demanding, employs limited concurrency, and is thus time prohibitive. Using strong software development discipline, we optimize BMRF using algorithmic, compiler, and concurrency techniques (including Nvidia GPUs) to reduce the wall clock time needed for analysis of large-scale genomic data. In particular, we decompose the BMRF algorithm into functional blocks, implement the algorithm in C/C++, and further explore the C/C++ implementation with concurrency optimization. Experiments are conducted with simulation and real data to demonstrate that a significant speedup of BMRF can be achieved by exploiting concurrency opportunities. We believe that the experience gained from this research will help pave the way for us to develop computationally efficient algorithms leveraging concurrency, enabling researchers to efficiently analyze larger-scale data sets essential for furthering cancer research. / Master of Science
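The resampling step mentioned in this abstract (random resampling with replacement, repeated over many rounds) can be sketched as follows. This is a generic bootstrap illustration, not the BMRF algorithm: the selector `top_gene` is a hypothetical placeholder for the network identification step, and all names and data shapes are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw len(samples) items from samples with replacement."""
    return [rng.choice(samples) for _ in samples]


def selection_frequency(samples, select_genes, num_rounds, seed=0):
    """Fraction of bootstrap rounds in which each gene is selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_rounds):
        # Each round works on its own resample and is independent of the others.
        counts.update(set(select_genes(bootstrap_sample(samples, rng))))
    return {gene: n / num_rounds for gene, n in counts.items()}


def top_gene(resample):
    """Hypothetical selector: the gene with the highest mean expression."""
    genes = resample[0].keys()
    means = {g: sum(s[g] for s in resample) / len(resample) for g in genes}
    return [max(means, key=means.get)]
```

Because the rounds do not depend on one another, a loop of this shape is a natural place to look for concurrency, which fits the optimization goal the abstract describes.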
|
37 |
Excessive Parallelism in Protein Evolution of Lake Baikal Amphipod Species Flock
Burskaia, Valentina, Naumenko, Sergey, Schelkunov, Mikhail, Bedulina, Daria, Neretina, Tatyana, Kondrashov, Alexey, Yampolsky, Lev, Bazykin, Georgii A. 01 September 2020
Repeated emergence of similar adaptations is often explained by parallel evolution of underlying genes. However, evidence of parallel evolution at the amino acid level is limited. When the analyzed species are highly divergent, this can be due to epistatic interactions underlying the dynamic nature of the amino acid preferences: the same amino acid substitution may have different phenotypic effects on different genetic backgrounds. Distantly related species also often inhabit radically different environments, which makes the emergence of parallel adaptations less likely. Here, we hypothesize that parallel molecular adaptations are more prevalent between closely related species. We analyze the rate of parallel evolution in genome-size sets of orthologous genes in three groups of species with widely ranging levels of divergence: 46 species of the relatively recent Lake Baikal amphipod radiation, a species flock of very closely related cichlids, and a set of significantly more divergent vertebrates. Strikingly, in genes of amphipods, the rate of parallel substitutions at nonsynonymous sites exceeded that at synonymous sites, suggesting rampant selection driving parallel adaptation. At sites of parallel substitutions, the intraspecies polymorphism is low, suggesting that parallelism has been driven by positive selection and is therefore adaptive. By contrast, in cichlids, the rate of nonsynonymous parallel evolution was similar to that at synonymous sites, whereas in vertebrates, this rate was lower than that at synonymous sites, indicating that in these groups of species, parallel substitutions are mainly fixed by drift.
|
38 |
FleXilicon: a New Coarse-grained Reconfigurable Architecture for Multimedia and Wireless Communications
Lee, Jong-Suk Mark 23 March 2010
High computing power and flexibility are important design factors for multimedia and wireless communication applications due to the demand for high quality services and frequent evolution of standards. The ASIC (Application Specific Integrated Circuit) approach provides an area efficient, high performance solution, but is inflexible. In contrast, the general purpose processor approach is flexible, but often fails to provide sufficient computing power. Reconfigurable architectures, which have been introduced as a compromise between the two extreme solutions, have been applied successfully for multimedia and wireless communication applications.
In this thesis, we investigated a new coarse-grained reconfigurable architecture called FleXilicon, which is designed to execute critical loops efficiently and is embedded in an SOC with a host processor. FleXilicon improves resource utilization and achieves a high degree of loop level parallelism (LLP). The proposed architecture aims to mitigate major shortcomings of existing architectures through the adoption of three schemes: (i) wider memory bandwidth, (ii) a reconfigurable controller, and (iii) flexible word-length support. Increased memory bandwidth satisfies the memory access requirements of LLP execution. The new design of the reconfigurable controller minimizes reconfiguration overhead and improves area efficiency. Flexible word-length support improves LLP by increasing the number of processing elements that can execute. The simulation results indicate that FleXilicon reduces the number of clock cycles and increases the speed for all five applications simulated. The speedup ratios compared with conventional architectures are as large as two orders of magnitude for some applications. VLSI implementation of FleXilicon in a 65 nm CMOS process indicates that the proposed architecture can operate at a high frequency, up to 1 GHz, with moderate silicon area. / Ph. D.
|
39 |
Exploring Performance Portability for Accelerators via High-level Parallel Patterns
Hou, Kaixi 27 August 2018
Nowadays, parallel accelerators have become prominent and ubiquitous, e.g., multi-core CPUs, many-core GPUs (Graphics Processing Units) and Intel Xeon Phi. The performance gains from them can be as high as many orders of magnitude, attracting extensive interest from many scientific domains. However, the gains are closely followed by two main problems: (1) A complete redesign of existing codes might be required if a new parallel platform is used, leading to a nightmare for developers. (2) Parallel codes that execute efficiently on one platform might be either inefficient or even non-executable for another platform, causing portability issues.
To handle these problems, in this dissertation, we propose a general approach using parallel patterns, an effective and abstracted layer that eases the generation of efficient parallel code for given algorithms and across architectures. From algorithms to parallel patterns, we exploit domain expertise to analyze the computational and communication patterns in the core computations and represent them in a DSL (Domain Specific Language) or as algorithmic skeletons. This preserves the essential information, such as data dependencies and types, for subsequent parallelization and optimization. From parallel patterns to actual code, we use a series of automation frameworks and transformations to determine which levels of parallelism can be used, what the optimal instruction sequences are, how the implementation changes to match different architectures, and so on. We evaluate our approaches by investigating several important computational kernels, including sort (and segmented sort), sequence alignment, and stencils, across various parallel platforms (CPUs, GPUs, Intel Xeon Phi). / Ph. D.
|
40 |
Analysis and Abstraction of Parallel Sequence Search
Goddard, Christopher Joseph 03 October 2007
The ability to compare two biological sequences is extremely valuable, as matches can suggest evolutionary origins of genes or the purposes of particular amino acids. Results of such comparisons can be used in the creation of drugs, can help combat newly discovered viruses, or can assist in treating diseases.
Unfortunately, the rate of sequence acquisition is outpacing our ability to compute on these data. Further, traditional dynamic programming algorithms are too slow to meet the needs of biologists, who wish to compare millions of sequences daily. While heuristic algorithms improve upon the performance of these dated applications, they still cannot keep up with the steadily expanding search space.
Parallel sequence search implementations were developed to address this issue. By partitioning databases into work units for distributed computation, applications like mpiBLAST are able to achieve super-linear speedup over their sequential counterparts. However, such implementations are limited to clusters and require significant effort to work in a grid environment. Further, their parallelization strategies are typically specific to the target sequence search, so future applications require a reimplementation if they wish to run in parallel.
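The partitioning strategy described above (split the database into work units, search each unit independently, merge the hits) can be sketched as below. This is a toy illustration only. Substring matching stands in for a real sequence alignment, threads on one machine stand in for distributed workers, and none of the names correspond to the mpiBLAST or gridRuby interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(database, num_units):
    """Split the database into num_units contiguous work units of near-equal size."""
    size, extra = divmod(len(database), num_units)
    units, start = [], 0
    for i in range(num_units):
        end = start + size + (1 if i < extra else 0)
        units.append(database[start:end])
        start = end
    return units


def search_unit(unit, query):
    """Toy search: return the sequences in one work unit that contain the query."""
    return [seq for seq in unit if query in seq]


def parallel_search(database, query, num_units=4):
    """Search every work unit independently, then merge hits in database order."""
    units = partition(database, num_units)
    with ThreadPoolExecutor(max_workers=num_units) as pool:
        per_unit_hits = pool.map(lambda unit: search_unit(unit, query), units)
        return [hit for hits in per_unit_hits for hit in hits]
```

The only coordination in this sketch is the final merge, which is in line with the observation in this entry that such searches are dominated by search time and not by communication.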
This thesis analyzes the performance of two versions of mpiBLAST, noting trends as well as differences between them. Results suggest that these embarrassingly parallel applications are dominated by the time required to search vast amounts of data, and not by the communication necessary to support such searches. Consequently, a framework named gridRuby is introduced which alleviates two main issues with current parallel sequence search applications; namely, the requirement of a tightly knit computing environment and the specific, hand-crafted nature of parallelization. Results show that gridRuby can parallelize an application across a set of machines through minimal implementation effort, and can still exhibit super-linear speedup. / Master of Science
|