271

Evaluation of publicly available Barrier-Algorithms and Improvement of the Barrier-Operation for large-scale Cluster-Systems with special Attention on InfiniBand Networks

Hoefler, Torsten 01 April 2005
The MPI_Barrier collective operation, part of the MPI-1.1 standard, is extremely important for all parallel applications that use it. The latency of this operation adds to the application run time and cannot be overlapped. Thus, overall MPI performance can suffer from unsatisfactory barrier latency. The main goals of this work are to lower the barrier latency for InfiniBand networks by analyzing well-known barrier algorithms with regard to their suitability for InfiniBand networks, to enhance the barrier operation by using standard InfiniBand operations as much as possible, and to design a constant-time barrier for InfiniBand with special hardware support. This partition into three main steps is retained throughout the thesis. The first part evaluates publicly known models and proposes a new, more accurate model (LoP) for InfiniBand. All barrier algorithms are evaluated within the well-known LogP model and this new model. Two new algorithms that promise better performance have been developed. The hardware section proposes a constant-time barrier integrated into InfiniBand as well as a cheap separate barrier network. All results have been implemented inside the Open MPI framework. This work led to three new Open MPI collective modules. The first implements different barrier algorithms, which are dynamically benchmarked and selected during the startup phase to maximize performance. The second offers a special barrier implementation for InfiniBand with RDMA and performs up to 40% better than the best solution published so far. The third offers a constant-time barrier in a separate network, leveraging commodity components, with a latency of only 2.5 microseconds. Each component has its own strengths and can be used to improve barrier performance significantly.
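As an illustration of the kind of barrier algorithm such an evaluation typically covers, the following is a minimal sketch of the dissemination barrier. The abstract does not name the algorithms it studies, so the choice of this algorithm and all function names here are assumptions for illustration only. The simulation tracks which ranks each process has heard from, directly or transitively, after each communication round.

```python
def dissemination_rounds(num_procs):
    """Rounds needed by a dissemination barrier: ceil(log2(num_procs))."""
    return (num_procs - 1).bit_length()


def simulate_dissemination_barrier(num_procs):
    """Return, for every rank, the set of ranks it has heard from after all rounds.

    In round k, rank i notifies rank (i + 2**k) % num_procs, so rank i
    receives a notification from rank (i - 2**k) % num_procs.
    """
    known = [{rank} for rank in range(num_procs)]
    for k in range(dissemination_rounds(num_procs)):
        distance = 1 << k
        # Messages of one round are exchanged concurrently: read from a snapshot.
        snapshot = [set(s) for s in known]
        for rank in range(num_procs):
            sender = (rank - distance) % num_procs
            known[rank] |= snapshot[sender]
    return known
```

A process may leave the barrier only once it knows that every other process has arrived, which in this simulation means its set contains all ranks. The round count grows logarithmically with the number of processes, which is the behavior a constant-time hardware barrier avoids.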
272

Optimizing Point-to-Point Ethernet Cluster Communication

Reinhardt, Mirko 28 February 2006
This work covers the implementation of a raw Ethernet communication module for the Open MPI message passing library. It focuses on both reducing the communication latency for small messages and achieving the greatest possible compatibility. In particular, it avoids the need for specific network devices, adapted network device drivers, or kernel patches. The work is divided into three major parts. First, the networking subsystem of the version 2.6 Linux kernel is analyzed. Second, an Ethernet protocol family is implemented as a loadable kernel module, consisting of a basic datagram protocol (EDP), which provides connectionless and unreliable datagram transport, and a streaming protocol (ESP), which provides connection-oriented, sequenced, and reliable byte streams. The protocols use the standard device driver interface of the Linux kernel for data transmission and reception. Their services are made available to user-space applications through the standard socket interface. Last, the existing Open MPI TCP communication module is ported to run on top of ESP. With bare EDP/ESP sockets, a message latency of about 30 µs was achieved for small messages, a reduction of 25% compared to the TCP latency of about 40 µs.
273

Efficient Broadcast for Multicast-Capable Interconnection Networks

Siebert, Christian 30 September 2006
The broadcast function MPI_Bcast() from the MPI-1.1 standard is one of the most heavily used collective operations in the message passing programming paradigm. This diploma thesis makes use of a feature called "multicast", which is supported by several network technologies (such as Ethernet or InfiniBand), to create an efficient MPI_Bcast() implementation, especially for large communicators and small messages. A preceding analysis of existing real-world applications leads to an algorithm that not only performs well in synthetic benchmarks but performs even better for a wide class of parallel applications. The resulting broadcast has been implemented for the open source MPI library "Open MPI" using IP multicast. The results show that the new broadcast usually outperforms existing point-to-point implementations as soon as the number of MPI processes exceeds 8 nodes. The performance gain reaches a factor of 4.9 on 342 nodes, because the new algorithm scales practically independently of the number of processes involved.
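To make the scaling argument concrete, here is a small sketch that counts the communication rounds of a binomial-tree broadcast built from point-to-point messages. The abstract does not say which point-to-point algorithms were used for comparison, so the binomial tree and the function name are assumptions chosen for illustration.

```python
def binomial_broadcast_rounds(num_procs):
    """Rounds a binomial-tree broadcast needs to reach num_procs processes.

    In every round, each process that already holds the message forwards it
    to one process that does not, so the number of holders at most doubles.
    """
    holders = 1  # only the root holds the message at the start
    rounds = 0
    while holders < num_procs:
        holders = min(2 * holders, num_procs)
        rounds += 1
    return rounds


# Round counts for the process counts mentioned in the abstract.
rounds_for_8 = binomial_broadcast_rounds(8)      # 3 rounds
rounds_for_342 = binomial_broadcast_rounds(342)  # 9 rounds
```

An idealized multicast delivers the message to all receivers with a single send, regardless of the process count. This ignores packet loss and any acknowledgment handling, which a real implementation on top of unreliable IP multicast has to address.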
274

ESPGOAL: A Dependency Driven Communication Framework

Schneider, Timo, Eckelmann, Sven 01 June 2011
Optimized implementations of blocking and nonblocking collective operations are most important for scalable high-performance applications. Offloading such collective operations into the communication layer can improve performance and asynchronous progression of the operations. However, it is most important that such offloading schemes remain flexible in order to support user-defined (sparse neighbor) collective communications. In this work, we describe an operating system kernel-based architecture for implementing an interpreter for the flexible Group Operation Assembly Language (GOAL) framework to offload collective communications. We describe an optimized scheme to store the schedules that define the collective operations and show an extension to profile the performance of the kernel layer. Our microbenchmarks demonstrate the effectiveness of the approach and we show performance improvements over traditional progression in user-space. We also discuss complications with the design and offloading strategies in general.
Contents:
1 Introduction
  1.1 Related Work
2 The GOAL API
  2.1 API Conventions
  2.2 Basic GOAL Functionality
    2.2.1 Initialization
    2.2.2 Graph Creation
    2.2.3 Adding Operations
    2.2.4 Adding Dependencies
    2.2.5 Scratchpad Buffer
    2.2.6 Schedule Compilation
    2.2.7 Schedule Execution
  2.3 GOAL-Extensions
3 ESP Transport Layer
  3.1 Receive Handling
  3.2 Transfer Management
    3.2.1 Known Problems
4 The Architecture of ESPGOAL
  4.1 Control Flow
    4.1.1 Loading the Kernel Module
    4.1.2 Adding a Communicator
    4.1.3 Starting a Schedule
    4.1.4 Schedule Progression
    4.1.5 Progression by ESP
    4.1.6 Unloading the Kernel Module
  4.2 Data Structures
    4.2.1 Starting a Schedule
    4.2.2 Transfer Management
    4.2.3 Stack Overflow Avoidance
  4.3 Interpreting a GOAL Schedule
5 Implementing Collectives in GOAL
  5.1 Recursive Doubling
  5.2 Bruck's Algorithm
  5.3 Binomial Trees
  5.4 MPI_Barrier
  5.5 MPI_Gather
6 Benchmarks
  6.1 Testbed
  6.2 Interrupt coalescing parameters
  6.3 Benchmarking Point to Point Latency
  6.4 Benchmarking Local Operations
  6.5 Benchmarking Collective Communication Latency
  6.6 Benchmarking Collective Communication Host Overhead
  6.7 Comparing Different Ways to use Ethernet NICs
7 Conclusions and Future Work
8 Acknowledgments
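The idea of a dependency-driven schedule can be sketched in a few lines. This is not the GOAL API: `run_schedule` and the operation names are hypothetical, and the sketch runs operations sequentially in user space, whereas the work described above interprets schedules inside the kernel. It only shows how a collective can be expressed as operations plus "must finish first" dependencies.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_schedule(operations, dependencies):
    """Run every operation after all operations it depends on.

    operations:   dict mapping an operation name to a zero-argument callable
    dependencies: dict mapping an operation name to the names it waits for
    Returns the order in which the operations were executed.
    """
    graph = {name: set(dependencies.get(name, ())) for name in operations}
    order = []
    for name in TopologicalSorter(graph).static_order():
        operations[name]()
        order.append(name)
    return order


# Hypothetical schedule for an inner node of a gather tree:
# receive from two children, pack the data, then send to the parent.
log = []
ops = {
    "recv_child_a": lambda: log.append("recv a"),
    "recv_child_b": lambda: log.append("recv b"),
    "pack": lambda: log.append("pack"),
    "send_parent": lambda: log.append("send"),
}
deps = {
    "pack": {"recv_child_a", "recv_child_b"},
    "send_parent": {"pack"},
}
executed = run_schedule(ops, deps)
```

The two receives have no dependency on each other, so an offloading layer is free to progress them in whichever order the messages arrive; only the pack and send steps are constrained.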
275

Scalable critical-path analysis and optimization guidance for hybrid MPI-CUDA applications

Schmitt, Felix, Dietrich, Robert, Juckeland, Guido 29 October 2019
The use of accelerators in heterogeneous systems is an established approach in designing petascale applications. Today, Compute Unified Device Architecture (CUDA) offers a rich programming interface for GPU accelerators but requires developers to incorporate several layers of parallelism on both the CPU and the GPU. This increasing program complexity creates the need for sophisticated performance tools. This work contributes by analyzing hybrid MPI-CUDA programs for properties based on wait states, such as the critical path, a metric proven to identify application bottlenecks effectively. We developed a tool that constructs a dependency graph from an execution trace and the inherent dependencies of the programming models CUDA and Message Passing Interface (MPI). It then detects wait states and attributes blame to the responsible activities. Together with the property of being on the critical path, this lets us identify the activities that are the most promising targets for optimization. To evaluate the global impact of optimizing critical activities, we predict the program execution using a graph-based performance projection. The developed approach has been demonstrated with suitable examples to be both scalable and correct. Furthermore, we establish a new categorization of CUDA inefficiency patterns arising from the dependencies between CUDA activities.
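The critical path of a dependency graph is its longest path by accumulated duration. The sketch below computes it for a small hand-written graph; it is the textbook longest-path computation on a directed acyclic graph, not the tool described above, which works on execution traces and also models wait states. The activity names and durations are invented for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def critical_path(durations, predecessors):
    """Return (length, path) of the longest-duration path through a DAG.

    durations:    dict mapping an activity to its duration
    predecessors: dict mapping an activity to the activities it waits for
    """
    graph = {a: set(predecessors.get(a, ())) for a in durations}
    finish = {}      # earliest finish time of each activity
    latest_pred = {} # predecessor that determines the start time
    for activity in TopologicalSorter(graph).static_order():
        prev = max(graph[activity], key=lambda p: finish[p], default=None)
        start = finish[prev] if prev is not None else 0.0
        finish[activity] = start + durations[activity]
        latest_pred[activity] = prev
    # Walk back from the activity that finishes last.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = latest_pred[node]
    path.reverse()
    return length, path


# Hypothetical activities of one MPI rank that drives a GPU.
durations = {
    "cpu_prepare": 2.0, "h2d_copy": 1.0, "kernel": 5.0,
    "cpu_other": 3.0, "d2h_copy": 1.0, "mpi_send": 0.5,
}
predecessors = {
    "h2d_copy": {"cpu_prepare"},
    "kernel": {"h2d_copy"},
    "cpu_other": {"cpu_prepare"},
    "d2h_copy": {"kernel"},
    "mpi_send": {"d2h_copy", "cpu_other"},
}
length, path = critical_path(durations, predecessors)
```

In this example the path through the kernel determines the total run time, so shortening `cpu_other` would not help, while shortening `kernel` would. That distinction is what makes the critical path useful as optimization guidance.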
276

Optimization of checkpointing and execution model for an implementation of OpenMP on distributed memory architectures / Optimisation des checkpoints et du modèle d'exécution pour une implémentation de OpenMP sur architectures à mémoire distribuée

Tran, Van Long 14 November 2018
OpenMP and MPI have become the standard tools for developing parallel programs on shared-memory and distributed-memory architectures, respectively. Compared to MPI, OpenMP is easier to use, because OpenMP automatically executes code in parallel and synchronizes results using its directives, clauses, and runtime functions, whereas MPI requires programmers to do all of this manually. Therefore, some efforts have been made to port OpenMP to distributed-memory architectures. However, apart from CAPE, no solution has met both of the following requirements: 1) full compliance with the OpenMP standard and 2) high performance. CAPE stands for Checkpointing-Aided Parallel Execution. It is a framework that automatically translates OpenMP programs and provides runtime functions to execute them on distributed-memory architectures using checkpointing techniques. To execute an OpenMP program on a distributed-memory system, CAPE uses a set of templates to translate the OpenMP source code into CAPE source code, which is then compiled by a C/C++ compiler. This code can be executed on distributed-memory systems with the support of the CAPE framework. The basic idea of CAPE is the following: the program first runs on a set of nodes of the system, each node being executed as a process. Whenever the program reaches a parallel section, the master distributes the jobs to the slave processes using Discontinuous Incremental Checkpoints (DICKPT). After sending the checkpoints, the master waits for the results returned by the slaves. The next step on the master is to receive and merge the resulting checkpoints before injecting them into its memory. The slave nodes receive their checkpoints and inject them into their memory to compute their share of the work. The result is sent back to the master using DICKPT. At the end of the parallel region, the master sends the resulting checkpoint to every slave to synchronize the memory space of the program as a whole. In some experiments, CAPE has shown high performance on distributed-memory systems and is a viable solution that is fully compatible with OpenMP. However, CAPE is still under development. Its checkpoint mechanism and execution model need to be optimized in order to improve performance, capability, and reliability. This thesis presents the approaches proposed to optimize and improve checkpoints, to design and implement a new execution model, and to improve the capability of CAPE. First, we propose an arithmetic on checkpoints, which models the data structure of checkpoints and its operations. This modeling helps to optimize checkpoint size and reduce merging time, and improves the capability of checkpoints. Second, we developed TICKPT (Time-stamp Incremental Checkpointing) as an instance of the arithmetic on checkpoints. TICKPT is an improvement on DICKPT: it adds a timestamp to checkpoints to identify their order. Analysis and comparative experiments show that TICKPT checkpoints are not only smaller than DICKPT checkpoints but also have less impact on the performance of the program using checkpointing. Third, we designed and implemented a new execution model and new prototypes for CAPE based on TICKPT. The new execution model allows CAPE to use resources efficiently, avoid the risk of bottlenecks, and overcome the requirement of satisfying Bernstein's conditions. As a result, these approaches improve the performance, capability, and reliability of CAPE. Fourth, OpenMP data-sharing attributes are implemented on CAPE based on the arithmetic on checkpoints and TICKPT. This also confirms the direction we took and makes CAPE more complete.
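The role of timestamps when merging incremental checkpoints can be shown with a toy model. This is an assumption-laden sketch, not the DICKPT or TICKPT data structure: memory is modeled as a dictionary from address to value, a checkpoint records only the changed cells, and every function name is hypothetical.

```python
def diff_checkpoint(before, after, timestamp):
    """Incremental checkpoint: only cells that changed, tagged with a timestamp."""
    return {
        addr: (timestamp, value)
        for addr, value in after.items()
        if addr not in before or before[addr] != value
    }


def merge_checkpoints(*checkpoints):
    """Merge checkpoints; for each address the entry with the newest timestamp wins."""
    merged = {}
    for checkpoint in checkpoints:
        for addr, (timestamp, value) in checkpoint.items():
            if addr not in merged or timestamp > merged[addr][0]:
                merged[addr] = (timestamp, value)
    return merged


def inject(memory, checkpoint):
    """Return a copy of memory with the checkpointed cells applied."""
    updated = dict(memory)
    for addr, (_, value) in checkpoint.items():
        updated[addr] = value
    return updated


# Toy run: two workers start from the same memory image.
base = {0: 0, 1: 0, 2: 0, 3: 0}
from_worker_a = diff_checkpoint(base, {0: 10, 1: 11, 2: 0, 3: 0}, timestamp=1)
from_worker_b = diff_checkpoint(base, {0: 0, 1: 99, 2: 20, 3: 0}, timestamp=2)
merged = merge_checkpoints(from_worker_a, from_worker_b)
final_memory = inject(base, merged)
```

Both workers wrote address 1, and the timestamp decides which write survives the merge. Without an ordering, the result would depend on the order in which checkpoints happen to be merged.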
277

Message Passing Interface parallelization of a multi-block structured numerical solver. Application to the numerical simulation of various typical Electro-Hydro-Dynamic flows / Parallélisation d'un solver multi-blocs structurés avec la librairie Message Passing Interface. Application à la simulation numérique de divers écoulements électro-hydro-dynamiques typiques

Seth, Umesh Kumar 29 March 2019
Several intricately coupled applications in modern industry fall under the multidisciplinary domain of Electrohydrodynamics (EHD), where the interactions among charged and neutral particles are studied in the combined context of fluid dynamics and electrostatics. Charged particles in fluids are generated by various physical mechanisms, and they move under the influence of the external electric field and the fluid velocity. Generally, with sufficient electric force magnitudes, momentum is also transferred from the charged species to the neutral particles. This coupled system is solved with the Maxwell equations, charge transport equations, and Navier-Stokes equations simulated sequentially in a common time loop. The charge transport is solved considering convection, diffusion, source terms, and other relevant mechanisms for the species. Then, the bulk fluid motion is simulated with the induced electric force as a source term in the Navier-Stokes equations, thus coupling the electrostatic system with the fluid. In this thesis, we numerically investigated several EHD phenomena: unipolar injection, conduction in weakly conducting liquids, and flow control with dielectric barrier discharge (DBD) plasma actuators. Solving such complex physical systems numerically requires high-end computing resources and parallel CFD solvers, as these large EHD models are mathematically stiff and highly time consuming due to the range of time and length scales involved. This thesis contributes to advancing the capability of the numerical simulations carried out within the EFD group at Institut Pprime by developing a high-performance parallel solver with advanced EHD models. The Message Passing Interface (MPI), the most popular technology developed specifically for distributed-memory platforms, was used to parallelize our multi-block structured EHD solver. In the first part, our numerical EHD solver is parallelized with advanced MPI features such as Cartesian topologies and inter-communicators. In particular, a specific strategy has been designed and detailed to account for the multi-block structured grids of the code. The parallel code has been fully validated through several benchmarks, and scalability tests carried out on up to 1200 cores of our local cluster showed excellent parallel speed-ups with our approach. A trustworthy database containing all these validation tests carried out on multiple cores is provided to assist future developments. The second part of this thesis deals with the numerical simulation of several typical EHD flows. We examined three-dimensional electroconvection induced by unipolar injection between two planar parallel electrodes. Unsteady hexagonal cells were observed in our study. The 3D flow with electroconvective plumes was also studied in the blade-plane electrode configuration, considering both autonomous and non-autonomous injection laws. The conduction mechanism based on the dissociation of neutral molecules of a weakly conductive liquid has been successfully simulated. Our results have been validated against numerical computations performed with the commercial code Comsol. The physical implications of the Robin boundary condition and the Onsager effect on the charged species were highlighted for electro-conduction in a rectangular channel. Finally, flow control using a dielectric barrier discharge plasma actuator has been simulated using the Suzen-Huang model. The effects of dielectric thickness, the gap between the electrodes, and the frequency and waveform of the applied voltage, among others, on the maximum induced ionic wind velocity and the average body force were investigated. Flow control simulations with a backward-facing step showed that laminar flow separation could be drastically controlled by placing the actuator at the tip of the step with both electrodes perpendicular to each other.
278

Efficient state space exploration for parallel test generation

Ramasamy Kandasamy, Manimozhian 03 September 2009
Automating the generation of test cases for software is an active area of research. Specification-based test generation is an approach in which a formal representation of a method is analyzed to generate valid test cases. Constraint solving and state space exploration are important aspects of specification-based test generation. One problem with specification-based testing is that the size of the state space explodes when the approach is applied to code of practical size. Hence, finding ways to reduce the number of candidates to explore within the state space is important to make this approach practical in industry. Korat is a tool that generates test cases for Java programs based on predicates that validate the inputs to the method. Ongoing research aims to increase the tool's effectiveness in handling large state spaces. Parallelizing Korat and minimizing the exploration of invalid candidates are the active research directions. This report surveys the basic algorithms of Korat, PKorat, and Fast Korat. PKorat is a parallel version of Korat that aims to take advantage of available multiprocessor and multicore systems. Fast Korat implements four optimizations that reduce the number of candidates explored to generate valid candidates and reduce the time required to explore each candidate. This report also presents execution time results for generating test candidates for binary trees, doubly linked lists, and sorted singly linked lists from their respective predicates.
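The state space problem can be illustrated with a naive generator that enumerates every candidate input within a bound and keeps those accepted by a validity predicate. This brute-force sketch is deliberately not Korat's algorithm, which prunes the search instead of visiting every candidate, and it is written in Python rather than Java. The function names are hypothetical, and a sorted list is modeled as a tuple of values.

```python
from itertools import product


def is_sorted(values):
    """Validity predicate: the sequence is in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


def generate_valid(max_len, domain, predicate):
    """Enumerate all sequences up to max_len over domain; keep the valid ones.

    Returns (valid_candidates, number_of_candidates_explored).
    """
    valid = []
    explored = 0
    for length in range(max_len + 1):
        for candidate in product(domain, repeat=length):
            explored += 1
            if predicate(candidate):
                valid.append(candidate)
    return valid, explored


valid, explored = generate_valid(max_len=3, domain=(1, 2, 3), predicate=is_sorted)
```

Even at this tiny bound, half of the 40 explored candidates are invalid, and the ratio worsens quickly as the bound grows. Reducing the number of invalid candidates that are ever visited is exactly the goal of the optimizations surveyed above.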
279

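An Analysis of Operating Efficiency and Productivity in the Banking Industry (銀行產業經營效率與生產力分析)

王凱平 Unknown Date
Banks play a very important role in an economy: they allocate funds efficiently to where they are needed and keep capital circulating within the economic system. The performance of the banking industry therefore has a major influence on economic development. When measuring the operating performance of banks, differences in the external environment and in operating conditions may affect managers' decisions in the production process and their ability to control inputs, and thus affect a bank's true efficiency value, which we call managerial efficiency. In addition, non-performing loans are an unavoidable undesirable output of the banking production process and need to be taken into account in the measurement. We adopt the four-stage DEA approach and the concept of Seiford (2002), treating the non-performing loan ratio as an undesirable output. We analyze four years of data on domestic commercial banks, from ROC year 88 to 91 (1999 to 2002). The objectives of this study are as follows:
1. Examine the operating performance of the banks, determine whether their inefficiency stems from pure technical efficiency or from scale efficiency, and suggest whether banks should expand, maintain, or reduce their scale of operations to increase operating efficiency.
2. Compare whether the efficiency values differ significantly once the undesirable output is included.
3. Estimate the managerial efficiency of each bank and explore the relationship between the efficiency performance of the banking industry and managerial efficiency.
The empirical results are as follows:
1. The main cause of inefficiency in Taiwan's government-owned banks is scale inefficiency. The main cause of inefficiency in the other, privately owned banks is technical inefficiency.
2. The managerial efficiency of the banking industry is affected by business cycle fluctuations. Differences in the exogenous operating environment lead to different performance when the industry as a whole faces a negative shock.
3. During economic downturns, banks' managerial efficiency tends to be underestimated; during booms, it tends to be overestimated.
4. Once the non-performing loan ratio is included as an undesirable output, the banking industry as a whole is affected by fluctuations in the economic environment.
5. Including the non-performing loan ratio and adjusting for the exogenous environment gives a more realistic picture of banks' operating performance.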
280

High performance computing for the discontinuous Galerkin methods

Mukhamedov, Farukh January 2018
Discontinuous Galerkin methods form a class of numerical methods for solving partial differential equations that combine features of finite element and finite volume methods. The methods are defined using a weak form of a particular model problem, allowing for discontinuities in the discrete trial and test spaces. Using a discontinuous discrete space provides flexibility and a compact discretisation pattern, allowing multidomain and multiphysics simulation. Discontinuous Galerkin methods with a higher polynomial approximation order, the so-called p-version, perform better in terms of convergence rate than the low-order h-version with smaller element sizes and a bigger mesh. However, the condition number of the Galerkin system grows accordingly. This causes a surge in the required storage, the computational complexity, and the computation time. We use the following three approaches to keep the advantages and eliminate the disadvantages. The first approach is a specific choice of basis functions, which we call C1 polynomials. These ensure that the majority of integrals over the edges of the mesh elements vanish. This reduces the total number of non-zero elements in the resulting system and decreases the computational complexity without loss of precision. Compared to other choices of basis functions, this approach does not affect the number of iterations required by the chosen Conjugate Gradients method, and it decreases the total number of algebraic operations performed. The second approach is the introduction of suitable preconditioners. In our case, the Additive two-layer Schwarz method, developed in [4], is considered for the iterative Conjugate Gradients method. This directly affects the spectral condition number of the system matrix and decreases the number of iterations required for the computation. This approach, however, increases the total number of algebraic operations and might require more computation time. To tackle the rise in the number of algebraic operations, we introduced a modified Additive two-layer non-overlapping Schwarz method with a Multigrid process, which uses a fixed low-order polynomial approximation degree on a coarse grid. We show that this approach is spectrally equivalent to the first preconditioner and requires less computation time. The third approach is the development of an efficient mathematical framework for a distributed data structure. This allows a high-performance, massively parallel implementation of the discontinuous Galerkin method. We demonstrate that it is possible to exploit properties of the system matrix and of C1 polynomials as basis functions to optimize the parallel structures. This parallel data structure allows us to parallelize both the matrix-vector multiplication routines for the Conjugate Gradients method and the preconditioner routines at the solver level. This minimizes the transfer ratio across the distributed system. Finally, we combined all three approaches into a framework that allowed us to successfully implement all of the above.
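The place where a preconditioner enters the Conjugate Gradients iteration can be seen in a short sketch. The example below uses a simple Jacobi (diagonal) preconditioner as a stand-in, since the additive Schwarz preconditioners discussed above are far more involved; the function names and the 2x2 test system are assumptions for illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def preconditioned_cg(A, b, apply_preconditioner, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    apply_preconditioner(r) should return an approximation of A^{-1} r.
    Returns (x, iterations_used).
    """
    x = [0.0] * len(b)
    r = list(b)                    # residual b - A x for the initial guess x = 0
    z = apply_preconditioner(r)    # preconditioned residual
    p = list(z)                    # first search direction
    rz = dot(r, z)
    for iteration in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        z = apply_preconditioner(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small symmetric positive definite test system with exact solution (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
jacobi = lambda r: [ri / A[i][i] for i, ri in enumerate(r)]
x, iterations = preconditioned_cg(A, b, jacobi)
```

The only two operations that touch the matrix are `matvec` and `apply_preconditioner`, which is why the abstract singles out the matrix-vector multiplication and the preconditioner routines as the parts to parallelize.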
