21

ChipCflow - uma ferramenta para execução de algoritmos utilizando o modelo a fluxo de dados dinâmico em hardware reconfigurável / ChipCflow - a tool to executing algorithms using dynamic dataflow architecture in FPGA

Joelmir José Lopes 29 June 2012 (has links)
Driven by the complexity of applications and the growing demand for systems that use millions of transistors and increasingly complex hardware, tools that convert C into a Hardware Description Language (HDL), such as VHDL or Verilog, have been developed. In this context, this thesis presents the ChipCflow project, which uses a dataflow architecture to implement high-performance logic on a Field Programmable Gate Array (FPGA). Dataflow machines are programmable computers whose hardware is optimized for fine-grained, data-driven parallel computation: program execution is determined by data availability, so parallelism is intrinsic to these systems. At the same time, with advances in microelectronics, FPGAs have been adopted mainly for their flexibility, their ease of implementing complex systems, and their intrinsic parallelism. One of the challenges is to create tools that let programmers write in a high-level language (HLL) such as C and produce hardware directly. Such tools should combine the programmers' experience, the parallelism of the dynamic dataflow architecture, and the flexibility and parallelism of the FPGA to produce efficient hardware optimized for high performance and low power consumption. ChipCflow is a tool that converts application programs written in C into VHDL, based on the dynamic dataflow architecture. The main goal of this thesis is to define and implement the ChipCflow operators using the dynamic dataflow architecture on an FPGA. These operators use tagged tokens to identify data according to the operator instance they belong to; both the operators and their instances use an asynchronous implementation model on the FPGA to achieve higher speed and lower power consumption.
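The tagged-token firing rule at the heart of the dynamic dataflow model can be illustrated with a short sketch (in Python rather than VHDL, with hypothetical names, and not the ChipCflow operators themselves): an operator instance fires only when operand tokens carrying the same tag have arrived on all of its inputs, so independent iterations identified by different tags may execute in any order.

```python
from collections import defaultdict

class DataflowOperator:
    """Minimal two-input dynamic dataflow operator (illustrative sketch only).

    Tokens are (tag, value) pairs; the tag identifies the operator instance
    (e.g., a loop iteration). The operator fires as soon as both operands
    with the same tag are present, regardless of arrival order.
    """

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.waiting = defaultdict(dict)  # tag -> {port: value}

    def receive(self, port, tag, value):
        slot = self.waiting[tag]
        slot[port] = value
        if len(slot) == 2:                # both operands matched on this tag
            result = self.func(slot["a"], slot["b"])
            del self.waiting[tag]         # this instance is consumed
            return (tag, result)          # output token carries the same tag
        return None                       # keep waiting; no firing yet

adder = DataflowOperator("add", lambda a, b: a + b)
# Tokens for two loop iterations arrive interleaved and out of order.
print(adder.receive("a", tag=1, value=10))   # None, still waiting
print(adder.receive("b", tag=2, value=7))    # None, different tag
print(adder.receive("b", tag=1, value=5))    # (1, 15): fires for tag 1
print(adder.receive("a", tag=2, value=3))    # (2, 10): fires for tag 2
```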
22

Optimizing Data Accesses for Scaling Data-intensive Scientific Applications

Yeom, Jae-seung 30 May 2014 (has links)
Data-intensive scientific applications often process an enormous amount of data. The scalability of such applications depends critically on how the locality of data is managed. Our study explores two common types of applications that differ vastly in memory access patterns and workload variation. One includes applications with multi-stride accesses in regular nested parallel loops. The other concerns processing large-scale irregular social network graphs. In the former case, the memory location or data item accessed in a loop is predictable, and the load of processing a unit of work (an array element) is relatively uniform, with no significant variation. In the latter case, the data access per unit of work (a vertex) is highly irregular in both the number of accesses and the locations being accessed. This property is further tied to the load and presents significant challenges for the scalability of application performance. Designing platforms that support extreme performance scaling requires understanding how application-specific information can be used to control locality and improve performance. Such insights are necessary to determine which controls and abstractions to provide for interfacing an underlying system with an application, as well as for designing a new system. Our goal is to expose common requirements of data-intensive scientific applications for scalability. For the former type of applications, those with regular accesses and uniform workload, we contribute new methods to improve the temporal locality of software-managed local memories and to optimize the critical path of scheduling data transfers for multi-dimensional arrays in nested loops. In particular, we provide a runtime framework that allows transparent optimization by source-to-source compilers or automatic fine-tuning by programmers, and we demonstrate the effectiveness of the approach by comparing against a state-of-the-art language-based framework. For the latter type, those with irregular accesses and non-uniform workload, we analyze how the heavy-tailed property of input graphs limits the scalability of the application. We then introduce an application-specific workload model as well as a decomposition method that allows us to optimize locality under the custom load-balancing constraints of the application. Finally, we demonstrate unprecedented strong scaling of a contagion simulation on two state-of-the-art high performance computing platforms. / Ph. D.
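As a rough illustration of the second problem class, the sketch below (an assumption-laden toy, not the decomposition method developed in the thesis) balances per-vertex work that is taken to be proportional to vertex degree, which is exactly the quantity that becomes skewed under a heavy-tailed degree distribution.

```python
import random

def degree_aware_partition(degrees, num_parts):
    """Greedy longest-processing-time partitioning of vertices.

    `degrees` maps vertex -> degree; per-vertex work is assumed proportional
    to degree, so a heavy-tailed graph concentrates work on a few hubs and a
    naive equal-count split becomes imbalanced.
    """
    parts = [[] for _ in range(num_parts)]
    loads = [0] * num_parts
    # Place the heaviest vertices first, always on the least-loaded part.
    for v, d in sorted(degrees.items(), key=lambda kv: -kv[1]):
        target = loads.index(min(loads))
        parts[target].append(v)
        loads[target] += d
    return parts, loads

# Synthetic heavy-tailed degrees: a few hubs, many low-degree vertices.
random.seed(0)
degrees = {v: max(1, int(random.paretovariate(1.5))) for v in range(1000)}
parts, loads = degree_aware_partition(degrees, num_parts=8)

# Compare against a naive split of 125 consecutive vertex ids per part.
naive_loads = [sum(degrees[v] for v in range(p * 125, (p + 1) * 125))
               for p in range(8)]
avg = sum(degrees.values()) / 8
print("naive split imbalance  (max/avg):", round(max(naive_loads) / avg, 2))
print("degree-aware imbalance (max/avg):", round(max(loads) / avg, 2))
```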
23

A Parallel Programming Language

Cox, Richard D. 05 1900 (has links)
The problem of programming a parallel processor is discussed. Previous methods of programming a parallel processor and of analyzing a program for parallel paths are reviewed, along with special language features. Graph theory is used to define the three basic programming constructs: choice, sequence and repetition. The concept of mechanized programming is expanded to allow total separation of the control and computational sections of a program. A language definition is presented which provides for this separation, and a method for developing the program graph is discussed. The control graph and data graph are developed separately; the two graphs illustrate the control and data predecessor relationships used in determining the parallel elements of a program.
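A small sketch of the idea of deriving parallelism from predecessor relationships (a hypothetical example, not the language defined in the thesis): build a data-dependence graph over statements and observe that statements with no dependence path between them may execute in parallel.

```python
def data_predecessors(statements):
    """Build a data-dependence graph: statement j depends on statement i
    if j reads a variable that i wrote earlier (flow dependence only,
    for simplicity)."""
    deps = {j: set() for j in range(len(statements))}
    for j, (_, reads_j) in enumerate(statements):
        for i in range(j):
            writes_i, _ = statements[i]
            if writes_i in reads_j:
                deps[j].add(i)
    return deps

def parallel_pairs(deps, n):
    """Statements with no predecessor path between them can run in parallel."""
    reach = {i: set(deps[i]) for i in range(n)}
    changed = True
    while changed:                       # transitive closure of dependences
        changed = False
        for i in range(n):
            extra = set().union(*(reach[p] for p in reach[i])) if reach[i] else set()
            if not extra <= reach[i]:
                reach[i] |= extra
                changed = True
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if i not in reach[j] and j not in reach[i]]

# Each statement: (variable written, set of variables read).
program = [("a", {"x"}),       # S0: a = f(x)
           ("b", {"x"}),       # S1: b = g(x)
           ("c", {"a", "b"})]  # S2: c = a + b
deps = data_predecessors(program)
print(parallel_pairs(deps, len(program)))   # [(0, 1)]: S0 and S1 are independent
```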
24

Fouille et classement d'ensembles fermés dans des données transactionnelles de grande échelle / Mining and ranking closed itemsets from large-scale transactional datasets

Kirchgessner, Martin 26 September 2016 (has links)
The recent increase in data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of item sets, for example supermarket receipts) containing at least a million transactions over hundreds of thousands of items. Datasets of this size usually follow a "long tail" distribution: a few items are very frequent, while most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (usually the most frequent ones). Existing methods therefore fail to provide concise, relevant insights on large datasets. We introduce a new semantics that is more intuitive for the analyst: browsing associations per item, for any item, and fewer than a hundred associations at a time.

To address the item-coverage challenge, our first contribution is the item-centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing that item. We present TopPI, an algorithm that solves it. We show that TopPI efficiently computes interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in run time and in result completeness. We also show, and empirically validate, how TopPI can be parallelized on multi-core machines and on Hadoop clusters in order to speed up computation on large-scale datasets.

Our second contribution is CAPA, a framework for studying which existing measures of association-rule quality are relevant for ranking results. This applies to results obtained from TopPI as well as from jLCM, our implementation of a state-of-the-art frequent closed itemset mining algorithm (LCM). Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study to discover which of the 5 families highlights the most interesting associations for their domain.

Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to present extensive experiments on real, nation-wide supermarket data. We describe a complete analytics workflow addressing this use case. We also experiment on Web data; thanks to the genericity of transactional datasets, our contributions are relevant in various other fields.

Altogether, our contributions allow analysts to discover associations of interest in modern datasets. They pave the way for more reactive discovery of item associations in large-scale datasets, whether on highly dynamic data or in interactive exploration systems.
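A naive sketch of the item-centric semantics (not the TopPI algorithm: this toy enumerates itemsets exhaustively and ignores closedness, so it only works on tiny datasets): for each item, keep the k most frequent itemsets that contain it.

```python
from itertools import combinations
from collections import Counter, defaultdict

def item_centric_topk(transactions, k=3, max_size=3):
    """For each item, return the k most frequent itemsets containing it.

    Brute-force enumeration of itemsets up to `max_size`; an illustrative
    stand-in for item-centric mining, not a scalable algorithm.
    """
    support = Counter()
    for t in transactions:
        t = frozenset(t)
        for size in range(1, min(max_size, len(t)) + 1):
            for itemset in combinations(sorted(t), size):
                support[itemset] += 1
    per_item = defaultdict(list)
    for itemset, count in support.items():
        for item in itemset:
            per_item[item].append((count, itemset))
    return {item: sorted(cands, reverse=True)[:k]
            for item, cands in per_item.items()}

tickets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk", "cereal"}]
for item, tops in sorted(item_centric_topk(tickets).items()):
    print(item, tops)   # even the rare item "cereal" gets its own top-k list
```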
25

Pokrytelnosti pro paralelní programy / Coverability for Parallel Programs

Turoňová, Lenka January 2015 (has links)
This work focuses on automatic verification of systems with parallel running processes. We discuss the existing methods and certain possibilities for optimizing them. Existing techniques are essentially based on finding an inductive invariant (for instance, using a variant of counterexample-guided abstraction refinement (CEGAR)). The effectiveness of these methods depends on the size of the invariant. In this thesis, we explore the possibility of improving the methods by focusing on finding invariants of minimal size. We implemented a tool that facilitates exploring the space of invariants of the system under scrutiny. Our experimental results show that many practical existing systems indeed have invariants that are much smaller than those found by the existing methods. The conjectures and results of this work will serve as a basis for future research on an efficient method for finding small invariants of parallel systems.
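A toy illustration of what a "small inductive invariant" means (a hypothetical lock protocol, not the systems or the tool from the thesis): an invariant is inductive if it holds initially and is preserved by every transition, and a compact description can be inductive even when the full reachable-state set is large.

```python
from itertools import product

N = 5  # number of parallel processes in the toy protocol

def transitions(state):
    """Toy lock protocol: state (lock, c) with c processes in the critical section."""
    lock, c = state
    succs = []
    if lock == 0 and c < N:          # acquire: take the lock, enter
        succs.append((1, c + 1))
    if c > 0:                        # release: leave, free the lock
        succs.append((0, c - 1))
    return succs

STATES = list(product((0, 1), range(N + 1)))
INIT = (0, 0)
SAFE = lambda s: s[1] <= 1           # at most one process is critical

def is_inductive(inv):
    """inv is a predicate over states: it must hold initially, imply safety,
    and be preserved by every transition out of an invariant state."""
    if not inv(INIT):
        return False
    for s in STATES:
        if inv(s):
            if not SAFE(s):
                return False
            if any(not inv(t) for t in transitions(s)):
                return False
    return True

# The exact reachable-state set works, but a much smaller description
# ("the critical count never exceeds the lock bit") is already inductive.
reachable_only = lambda s: s in {(0, 0), (1, 1)}
small_invariant = lambda s: s[1] <= s[0]
print(is_inductive(reachable_only))   # True
print(is_inductive(small_invariant))  # True, and far more compact to express
```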
26

PhD_ShunjiangTao_May2023.pdf

Shunjiang Tao (15209053) 12 April 2023 (has links)
The broad adoption of three-dimensional, full-core modeling with pin-resolved detail for computational simulation and analysis of nuclear reactors highlights the importance of accuracy and efficiency in simulation codes. The primary objective of this dissertation is to develop a high-fidelity code capable of solving time-dependent neutron transport problems with 3D whole-core pin-resolved detail in nuclear reactor cores, and to optimize the code's parallelism to enhance its computational efficiency. To reduce the computational intensity of a direct 3D solution of the neutron transport equation, a high-fidelity neutron transport code called PANDAS-MOC is developed using the 2D/1D approach: the 2D radial solution is obtained with the Method of Characteristics (MOC), the axial 1D solution is determined through the Nodal Expansion Method (NEM), and the two solutions are coupled through transverse leakages to obtain the 3D solution. Convergence of the iterative scheme is accelerated with the multi-level coarse-mesh finite difference (ML-CMFD) technique. The code is validated and verified against the C5G7-TD benchmark exercises.

The significant and innovative aspect of this work is the parallelization and optimization of the PANDAS-MOC code. Three parallel models are developed and evaluated on distributed-memory and shared-memory architectures: an MPI parallel model (PMPI), a Segment OpenMP threading hybrid model (SGP), and a Whole-code OpenMP threading hybrid model (WCP). When computing the steady state of the C5G7 3D core with the same resources, the speedup relationship between the three models is PMPI > WCP > SGP, while the WCP model consumes only 60% of the memory of the PMPI model. The hybrid reduction in the ML-CMFD solver and the parallel design of the MOC sweep are the main factors limiting the speedup of WCP, so this study also addresses further optimization of these two modules.

Concerning MOC parallelism, two improvements are discussed: a no-atomic schedule and Additional Axial Decomposition (AAD) parallelism. The no-atomic schedule distributes the workload evenly among threads and removes the omp atomic clause from the code by predefining the MOC calculation sequence for each OpenMP thread while keeping the parallel environment thread-safe; it significantly reduces calculation time and improves parallel efficiency. AAD divides the axial layers and the OpenMP threads into multiple groups and restricts each thread to the layers assigned to its group.

Meanwhile, a Flag-Save-Update reduction is designed to increase the efficiency of the hybrid MPI/OpenMP reduction operations in the ML-CMFD module. It uses global arrays and status flags, organizes all threads into a tree, and contains no implicit or explicit barriers. For the C5G7 3D core, the parallel efficiency of the MOC solver is about 0.872 when using 32 threads (= #MPI × #OpenMP), and the Flag-Save-Update reduction yields better speedup than the traditional hybrid MPI/OpenMP reduction, with its advantage growing as more OpenMP threads are used. As a result, the WCP model outperforms the PMPI model for the overall steady-state calculation.

This research also investigates parallelizable preconditioners to accelerate the convergence of the generalized minimal residual method (GMRES) in the CMFD solver. Preconditioners such as incomplete LU factorization (ILU), symmetric successive over-relaxation (SOR), and reduced symmetric successive over-relaxation (RSOR) are implemented in PANDAS-MOC. Except for RSOR, these are unsuitable for hybrid MPI/OpenMP machines because of their inherently sequential nature and dependence on computation order. Their counterparts based on the red-black ordering algorithm, namely RB-SOR, RB-RSOR, and RB-ILU, are formulated and examined on benchmark reactors such as TWIGL-2D, C5G7-2D, and C5G7-3D and their corresponding subplane models (TWIGL-2D(5S), C5G7-2D(5S), C5G7-3D(5S)), with relaxed convergence criteria (10^-3). Results show that all preconditioners significantly reduce the number of iterations required to converge the GMRES solutions, and RB-SOR is the best choice for most reactors. For C5G7-3D(5S), the preconditioners exhibit similar sublinear speedup but different runtimes across all tests for both MG-GMRES and 1G-GMRES; the speedups in 1G-GMRES are more than twice those in MG-GMRES. RB-RSOR has an optimal efficiency of 0.6967 at (4,8), while RB-SOR and RB-ILU have optimal efficiencies of 0.6855 and 0.7275 at (32,1), respectively.
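The appeal of red-black ordering for parallel preconditioning can be shown with a small sketch (a generic 2D Poisson smoother in Python/NumPy, assumed for illustration and unrelated to the PANDAS-MOC implementation): all "red" points depend only on "black" neighbors and vice versa, so each half-sweep updates its points with no ordering dependence and can be vectorized or threaded freely, unlike a lexicographic SOR sweep.

```python
import numpy as np

def redblack_sor_sweep(u, f, h, omega=1.5):
    """One red-black SOR sweep for the 2D Poisson problem -lap(u) = f.

    Grid points are colored like a checkerboard; within a color, updates are
    independent (each red point reads only black neighbors), which is why
    the sweep parallelizes where plain lexicographic SOR does not.
    """
    for color in (0, 1):                        # 0 = red, 1 = black
        for i in range(1, u.shape[0] - 1):
            j0 = 2 - ((i + color) % 2)          # first interior column of this color
            gs = 0.25 * (u[i - 1, j0:-1:2] + u[i + 1, j0:-1:2] +
                         u[i, j0 - 1:-2:2] + u[i, j0 + 1::2] +
                         h * h * f[i, j0:-1:2])
            u[i, j0:-1:2] = (1 - omega) * u[i, j0:-1:2] + omega * gs
    return u

n, h = 33, 1.0 / 32
f = np.ones((n, n))
u = np.zeros((n, n))                            # zero Dirichlet boundary
for sweep in range(200):
    redblack_sor_sweep(u, f, h)
print("max of solution:", float(u.max()))       # ~0.07 for this model problem
```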
27

Approches anytime et distribuées pour l'appariment de graphes / Anytime and distributed approaches for graph matching

Abu-Aisheh, Zeina 25 May 2016 (has links)
Due to the inherent genericity of graph-based representations, and thanks to the improvement of computer capacities, structural representations have become more and more popular in the field of Pattern Recognition (PR). In a graph-based representation, vertices and their attributes describe objects (or parts of them) while edges represent the interrelationships between those objects. Representing objects by graphs turns the problem of object comparison into graph matching (GM), where correspondences between the vertices and edges of two graphs have to be found. Over the last decade, researchers working on graph matching have paid particular attention to the graph edit distance (GED), notably for its ability to handle different types of graphs; GED has been applied to problems ranging from molecule recognition to image classification.
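To make the notion of graph edit distance concrete, here is a sketch of a common bipartite approximation (an upper-bounding heuristic built on an assignment solver over node substitution/insertion/deletion costs, ignoring edges; it is not the anytime or distributed methods contributed by this thesis).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def approx_ged(labels1, labels2, c_sub=1.0, c_ins=1.0, c_del=1.0):
    """Bipartite upper bound on graph edit distance from node labels only.

    Builds the classical (n1+n2) x (n2+n1) cost matrix of substitutions,
    deletions and insertions, then solves it as an assignment problem.
    Edge costs are ignored here, so this is a coarse illustrative bound.
    """
    n1, n2 = len(labels1), len(labels2)
    big = 1e9                                   # forbid impossible pairings
    cost = np.full((n1 + n2, n2 + n1), big)
    for i in range(n1):
        for j in range(n2):                     # substitution block
            cost[i, j] = 0.0 if labels1[i] == labels2[j] else c_sub
        cost[i, n2 + i] = c_del                 # deletion of node i
    for j in range(n2):
        cost[n1 + j, j] = c_ins                 # insertion of node j
    cost[n1:, n2:] = 0.0                        # dummy-to-dummy pairs are free
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

# Two small molecule-like graphs described only by their node labels.
g1 = ["C", "C", "O", "H"]
g2 = ["C", "O", "O"]
print(approx_ged(g1, g2))   # 2.0: e.g. substitute one C by O and delete the H
```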
28

Adaptive Fault Tolerance Strategies for Large Scale Systems

George, Cijo January 2012 (has links) (PDF)
Exascale systems of the future are predicted to have a mean time between node failures (MTBF) of less than one hour. At such a low MTBF, the number of processors available to a long-running application can vary widely throughout its execution. Employing traditional fault tolerance strategies like periodic checkpointing in these highly dynamic environments may not be effective because of the high number of application failures, resulting in a large amount of work lost to rollbacks in addition to increased recovery overheads. In this context, fault tolerance strategies are needed that can adapt to changing node availability and also help avoid a significant number of application failures. In this thesis, we present two adaptive fault tolerance strategies that use node failure prediction mechanisms to provide proactive fault tolerance for long-running parallel applications on large scale systems. The first part of the thesis deals with an adaptive fault tolerance strategy for malleable applications. We present ADFT, an adaptive fault tolerance framework for long-running malleable applications that maximizes application performance in the presence of failures. We first develop cost models that consider factors such as the accuracy of node failure predictions and application scalability in order to evaluate the benefits of various fault tolerance actions, including checkpointing, live migration and rescheduling. Our adaptive framework then uses the cost models to make runtime decisions, dynamically selecting fault tolerance actions at different points of application execution to minimize application failures and maximize performance. Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in work done by the application in the presence of failures, and is effective even for petascale and exascale systems. In the second part of the thesis, we present a fault tolerance strategy using adaptive process replication that provides fault tolerance through partial replication of a set of application processes. This framework adaptively changes the set of replicated processes (the replicated set) periodically, based on node failure predictions, to avoid application failures. We have developed an MPI prototype implementation, PAREP-MPI, that allows the replicated set of processes to be changed dynamically for MPI applications. Experiments with real scientific applications on real systems show that the overhead of PAREP-MPI is minimal. Using simulations with real and synthetic failure traces, we show that our adaptive process replication strategy significantly outperforms existing mechanisms, providing up to 20% improvement in application efficiency even for exascale systems. We also make observations that can drive future research on fault tolerance for large and very large scale systems.
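A highly simplified sketch of the kind of cost reasoning involved (using the classical Young/Daly checkpoint interval and a made-up action-selection rule with hypothetical costs; the ADFT cost models in the thesis are considerably more detailed):

```python
import math

def daly_interval(mtbf_s, ckpt_cost_s):
    """Young/Daly approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * mtbf_s * ckpt_cost_s)

def choose_action(p_fail, precision, ckpt_cost_s, migrate_cost_s, interval_s):
    """Pick the cheapest proactive action for the next interval.

    p_fail:    predicted probability that this node fails in the interval
    precision: fraction of failure predictions that are correct
    Expected lost work on an unhandled failure is ~half the interval.
    (Illustrative decision rule only, not the ADFT policy.)
    """
    p_true = p_fail * precision
    cost_do_nothing = p_true * (interval_s / 2.0)
    cost_checkpoint = ckpt_cost_s          # little work lost after a fresh checkpoint
    cost_migrate = migrate_cost_s          # assumed to avoid the failure entirely
    return min((cost_do_nothing, "continue"),
               (cost_checkpoint, "checkpoint"),
               (cost_migrate, "live-migrate"))

mtbf = 3600.0            # 1 hour MTBF, as projected for exascale nodes
ckpt = 60.0              # seconds to write a checkpoint (hypothetical)
interval = daly_interval(mtbf, ckpt)
print("checkpoint interval: %.0f s" % interval)
for p in (0.05, 0.5, 0.9):
    cost, action = choose_action(p, precision=0.8, ckpt_cost_s=ckpt,
                                 migrate_cost_s=30.0, interval_s=interval)
    print("p_fail=%.2f -> %s (expected cost %.0f s)" % (p, action, cost))
```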
29

Interconnect Planning for Physical Design of 3D Integrated Circuits / Planung von Verbindungsstrukturen in 3D-Integrierten Schaltkreisen

Knechtel, Johann 03 July 2014 (has links) (PDF)
Vertical stacking of multiple 2D chips, based on modern manufacturing and integration technologies, enables three-dimensional integrated circuits (3D ICs). Exploiting the third dimension is generally expected to provide higher packing densities, heterogeneous integration, shorter interconnects, reduced power consumption, increased data bandwidth, and highly parallel systems in a single device. However, the commercial acceptance of 3D ICs is currently behind expectations, mainly due to challenges regarding manufacturing and integration technologies as well as design automation. This work addresses three selected, practically relevant design challenges: (i) increasing the constrained reusability of proven, reliable 2D intellectual property blocks, (ii) planning different types of (comparatively large) through-silicon vias with a focus on their impact on design quality, and (iii) structural planning of massively parallel, 3D-IC-specific interconnect structures during 3D floorplanning. A key concept of this work is to account for interconnect structures and their properties during early design phases in order to support effective, high-quality 3D-IC design flows. To tackle the challenges listed above, modular design-flow extensions and methodologies have been developed. Experimental investigations reveal the effectiveness and efficiency of the proposed techniques, and provide findings on 3D integration with particular focus on interconnect structures. We suggest considering these findings when formulating guidelines for successful 3D-IC design automation.
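As a toy illustration of why interconnect structures deserve attention during early 3D floorplanning (a hypothetical cost model, not the planning methodologies developed in this work): a net spanning several tiers pays both a 2D wirelength cost and one through-silicon via per tier boundary it crosses.

```python
def net_cost_3d(pins, tsv_cost=5.0):
    """Toy interconnect estimate for one net in a stacked 3D IC.

    pins: list of (x, y, tier) pin positions.
    Cost = half-perimeter wirelength of the 2D bounding box
           + tsv_cost per tier boundary the net has to cross.
    """
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    tiers = [p[2] for p in pins]
    hpwl = (max(xs) - min(xs)) + (max(ys) - min(ys))
    tsvs = max(tiers) - min(tiers)          # vertical span in tier boundaries
    return hpwl + tsv_cost * tsvs, tsvs

# The same logical net, placed flat in 2D vs. folded onto two tiers.
net_2d = [(0, 0, 0), (8, 1, 0), (1, 9, 0)]
net_3d = [(0, 0, 0), (2, 1, 1), (1, 3, 1)]
for name, net in (("2D placement", net_2d), ("3D placement", net_3d)):
    cost, tsvs = net_cost_3d(net)
    print(f"{name}: cost={cost:.1f}, TSVs={tsvs}")
# Folding shortens the wirelength but charges a TSV for the tier crossing.
```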
30

Verification of Branching-Time and Alternating-Time Properties for Exogenous Coordination Models

Klüppelholz, Sascha 24 April 2012 (has links) (PDF)
Information and communication systems enter an increasing number of areas of daily life. Our reliance on the functioning of such systems is growing rapidly, together with the costs and impact of system failures. At the same time, the complexity of hardware and software systems extends to new limits as modern hardware architectures become more and more parallel, dynamic and heterogeneous. These trends demand a closer integration of formal methods and system engineering to show the correctness of complex systems within the design phase of large projects. The goal of this thesis is to introduce a formal holistic approach for modeling, analysis and synthesis of parallel systems that potentially addresses complex system behavior at any layer of the hardware/software stack. Due to the complexity of modern hardware and software systems, we aim for a hierarchical modeling framework that allows the behavior of a parallel system to be specified at various levels of abstraction and that facilitates designing complex systems in an iterative refinement procedure, in which more detailed behavior is added successively to the system description. In this context, the major challenge is to provide modeling formalisms that are expressive enough to address all of the above issues and at the same time remain amenable to formal methods for proving that the system behavior conforms to its specification. In particular, we are interested in specification formalisms for which formal verification techniques can be applied such that the underlying model checking problems remain decidable within reasonable time and space bounds. The presented work relies on an exogenous modeling approach that allows a clear separation of coordination and computation and provides an operational semantic model to which formal methods such as model checking are well suited. The channel-based exogenous coordination language Reo is used as the modeling formalism, as it supports hierarchical modeling in an iterative top-down refinement procedure. It facilitates reusability, exchangeability and heterogeneity of components, and it has a clear formal semantics based on automata, which serves as the foundation for applying formal methods such as model checking. This thesis presents new modeling languages for specifying complex systems in terms of Reo and automata models, which form the basis of a holistic approach to modeling, verification and synthesis of parallel systems. The second main contribution of this thesis consists of tailored branching-time and alternating-time temporal logics together with corresponding model checking algorithms. The thesis includes results on the theoretical complexity of the underlying model checking problems as well as practical results. For the latter, the presented approach has been implemented in the symbolic verification tool set Vereofy. The implementation within Vereofy and the evaluation of the branching-time and alternating-time model checker constitute the third main contribution of this thesis.
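The automata-based semantics mentioned above can be sketched briefly (a hand-written, constraint-automaton-like model of a FIFO1 channel and a simple reachability check, assumed purely for illustration; Vereofy's input languages, logics and algorithms are considerably richer).

```python
from collections import deque

# A FIFO1 channel in a constraint-automaton style: states record whether the
# buffer is empty or full, and each transition is labeled with the set of
# channel ends that fire synchronously.
INITIAL = "empty"
TRANSITIONS = {
    ("empty", frozenset({"A"})): "full",   # a write on end A fills the buffer
    ("full", frozenset({"B"})): "empty",   # a read on end B drains it
}

def reachable(target, init=INITIAL, transitions=TRANSITIONS):
    """Breadth-first reachability: a stand-in for the simplest branching-time
    query (EF target) that a model checker would answer."""
    seen, queue = {init}, deque([init])
    while queue:
        s = queue.popleft()
        if s == target:
            return True
        for (src, _ports), dst in transitions.items():
            if src == s and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return False

print(reachable("full"))    # True: the buffer can be filled
print(reachable("error"))   # False: no transition leads to an error state
```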
