131. HDArray: Parallel Array Interface for Distributed Heterogeneous Devices
Hyun Dok Cho (18620491), 30 May 2024
Heterogeneous clusters with nodes containing one or more accelerators, such as GPUs, have become common. While MPI provides inter-address-space communication and OpenCL gives a process access to heterogeneous computational resources, programmers are forced to write hybrid programs that manage the interaction of both systems. This paper describes an array programming interface that provides users with automatic and manual distributions of data and work. Using the work distribution and kernel def-use information, communication among processes, and among devices within a process, is performed automatically. By providing a unified programming model to the user, program development is simplified.
132. Pic-Vert: a particle-in-cell implementation for multi-core architectures
Barsamian, Yann, 31 October 2018
In this thesis, we are interested in solving the Vlasov–Poisson system of equations (used in plasma physics, for example within the ITER project) with the classical Particle-in-Cell (PIC) and semi-Lagrangian methods. The main contribution of this thesis is an efficient implementation of the PIC method for multi-core architectures, written in C and called Pic-Vert. Our implementation (a) achieves a close-to-minimal number of transfers with main memory, (b) exploits SIMD instructions for the numerical computations, and (c) exposes a high degree of shared-memory parallelism. To put our work in perspective with the state of the art, we propose a metric for comparing the efficiency of different PIC implementations on different multi-core architectures. Our implementation is 3 times faster than other recent implementations on the same architecture (Intel Haswell).
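As a rough illustration of the kind of inner loop such a PIC code optimizes (a generic sketch under simplified assumptions, not Pic-Vert's actual data layout or kernels), a 1-D push-and-deposit step in C with OpenMP might look like this:

```c
/* Generic 1-D particle-in-cell push/deposit sketch (not Pic-Vert's code).
 * It shows the two ingredients the abstract alludes to: a structure-of-arrays
 * layout whose arithmetic vectorizes well, and shared-memory parallelism over
 * particles. Real PIC codes use higher-order interpolation and per-thread
 * deposition buffers to avoid the atomic update. */
typedef struct { double *x, *v; } particles_t;   /* structure of arrays */

void push_and_deposit(particles_t *p, long np, const double *E, double *rho,
                      long ncells, double dt, double dx) {
    double box = ncells * dx;
    #pragma omp parallel for
    for (long i = 0; i < np; i++) {
        long c = (long)(p->x[i] / dx);           /* cell holding particle i */
        p->v[i] += dt * E[c];                    /* push with nearest-grid-point field */
        p->x[i] += dt * p->v[i];                 /* move */
        if (p->x[i] < 0.0)  p->x[i] += box;      /* periodic domain */
        if (p->x[i] >= box) p->x[i] -= box;
        long nc = (long)(p->x[i] / dx);
        /* deposit one unit of charge in the new cell */
        #pragma omp atomic
        rho[nc] += 1.0;
    }
}
```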
133. Cascaded All-Optical Shared-Memory Architecture Packet Switches Using Channel Grouping Under Bursty Traffic
Shell, Michael David, 01 December 2004
This work develops an exact logical operation model to predict the performance of the all-optical shared-memory architecture (OSMA) class of packet switches and provides a means to obtain a reasonable approximation of OSMA switch performance within certain types of networks, including the Banyan family.
All-optical packet switches have the potential to far exceed the bandwidth capability of their current electronic counterparts. However, all-optical switching technology is currently not mature. Consequently, all-optical switch fabrics and buffers are more constrained in size and can cost several orders of magnitude more than those of electronic switches. The use of shared-memory buffers and/or links with multiple parallel channels (channel grouping) has been suggested as a way to maximize switch performance with buffers of limited size. However, analysis of shared-memory switches is far more difficult than for other commonly used buffering strategies. Obtaining packet loss performance by simulation is often not a viable alternative to modeling if low loss rates or large networks are encountered. Published models of electronic shared-memory packet switches (ESMP) have primarily been approximate, to allow analysis of switches with a large number of ports and/or buffer cells. Because most ESMP models become inaccurate for small switches, and OSMA switches, unlike ESMP switches, do not buffer packets unless contention occurs, existing ESMP models cannot be applied to OSMA switches. Previous models of OSMA switches were confined to isolated (non-networked), symmetric OSMA switches using channel grouping under random traffic. This work is far more general in that it also encompasses OSMA switches that (1) are subjected to bursty traffic and/or have input links with arbitrary occupancy probability distributions, (2) are interconnected to form a network, and (3) are asymmetric.
134. Tuned and asynchronous stencil kernels for CPU/GPU systems
Venkatasubramanian, Sundaresan, 18 May 2009
We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi's iterative method for the 2-D Poisson equation on a structured grid, in both single and double precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on an NVIDIA C1060. Motivated to find a still faster implementation, we further consider "wildly asynchronous" implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on the principle of chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations, thereby potentially trading off more flops (via more iterations to converge) for a higher degree of asynchronous parallelism. Our relaxed-synchronization implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly "fast-and-loose" algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs.
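The "wildly asynchronous" idea is easy to see on a CPU analogue (a sketch in C with OpenMP, not the authors' CUDA code): a conventional sweep puts a barrier between iterations, while the relaxed version drops it and lets threads read whichever neighbour values happen to be in memory, in the spirit of chaotic relaxation.

```c
/* CPU/OpenMP analogue of the relaxed-synchronization idea (not the authors'
 * CUDA code). The "nowait" clause removes the barrier between sweeps, so a
 * thread may start its next sweep while others are still finishing, and a
 * stencil read may see a value from the previous or the current sweep. The
 * unsynchronized accesses are deliberate: this is the "fast-and-loose"
 * trade-off described above, justified by chaotic-relaxation theory
 * (Chazan and Miranker, 1969), usually at the cost of extra iterations. */
#define N 1024
#define IDX(i, j) ((i) * N + (j))

void poisson_jacobi_relaxed(double *u, const double *f, double h2, int sweeps) {
    #pragma omp parallel
    for (int s = 0; s < sweeps; s++) {
        #pragma omp for nowait               /* no inter-sweep synchronization */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                u[IDX(i, j)] = 0.25 * (u[IDX(i - 1, j)] + u[IDX(i + 1, j)]
                                     + u[IDX(i, j - 1)] + u[IDX(i, j + 1)]
                                     - h2 * f[IDX(i, j)]);
    }
}
```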
135. Iterative and Adaptive PDE Solvers for Shared Memory Architectures
Löf, Henrik, January 2006
Scientific computing is used frequently in an increasing number of disciplines to accelerate scientific discovery. Many such computing problems involve the numerical solution of partial differential equations (PDE). In this thesis we explore and develop methodology for high-performance implementations of PDE solvers for shared-memory multiprocessor architectures. We consider three realistic PDE settings: solution of the Maxwell equations in 3D using an unstructured grid and the method of conjugate gradients, solution of the Poisson equation in 3D using a geometric multigrid method, and solution of an advection equation in 2D using structured adaptive mesh refinement. We apply software optimization techniques to increase both parallel efficiency and the degree of data locality. In our evaluation we use several different shared-memory architectures ranging from symmetric multiprocessors and distributed shared-memory architectures to chip-multiprocessors. For distributed shared-memory systems we explore methods of data distribution to increase the amount of geographical locality. We evaluate automatic and transparent page migration based on runtime sampling, user-initiated page migration using a directive with an affinity-on-next-touch semantic, and algorithmic optimizations for page-placement policies. Our results show that page migration increases the amount of geographical locality and that the parallel overhead related to page migration can be amortized over the iterations needed to reach convergence. This is especially true for the affinity-on-next-touch methodology whereby page migration can be initiated at an early stage in the algorithms. We also develop and explore methodology for other forms of data locality and conclude that the effect on performance is significant and that this effect will increase for future shared-memory architectures. Our overall conclusion is that, if the involved locality issues are addressed, the shared-memory programming model provides an efficient and productive environment for solving many important PDE problems.
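One concrete form of the geographical-locality issue can be sketched without any special directives (this is generic C/OpenMP, not code from the thesis): with a first-touch page-placement policy, initializing data with the same parallel schedule as the solver places each page on the node of the thread that will use it, and migration schemes such as the affinity-on-next-touch approach studied here handle the cases, like adaptive mesh refinement, where one static placement is not enough.

```c
/* First-touch placement sketch (illustrative, not code from the thesis).
 * On many cc-NUMA / distributed shared-memory systems, a page is physically
 * allocated on the node of the thread that first writes it. Initializing with
 * the same static schedule as the compute loop therefore gives each thread
 * mostly local pages; when the access pattern later changes, page migration
 * (e.g. affinity-on-next-touch) restores locality. */
#include <stdlib.h>

double *alloc_first_touch(long n) {
    double *x = malloc(n * sizeof *x);
    #pragma omp parallel for schedule(static)   /* same schedule as the solver */
    for (long i = 0; i < n; i++)
        x[i] = 0.0;                             /* first touch places the page */
    return x;
}

void daxpy(long n, double a, const double *x, double *y) {
    #pragma omp parallel for schedule(static)   /* threads reuse "their" pages */
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}
```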
136. Numerical Quality and High Performance in Interval Linear Algebra on Multi-Core Processors
Theveny, Philippe, 31 October 2014
This work aims at determining suitable scopes for several algorithms for interval matrix multiplication. First, we quantify the numerical quality. Former error analyses of interval matrix products establish bounds on the radius overestimation while neglecting the roundoff error. We discuss several possible measures of the approximation error between two intervals, then bound the roundoff error and compare this bound experimentally with the global error distribution on several random data sets. This approach highlights the relative importance of the roundoff and arithmetic errors depending on the value and homogeneity of the relative accuracies of the inputs, on the matrix dimension, and on the working precision. It also leads to a new algorithm that is cheaper yet as accurate as previous ones under well-identified conditions. Second, we exploit the parallelism of linear algebra. Previous implementations use calls to BLAS routines on numerical matrices. We show that this may lead to wrong interval results and also restricts the scalability of the performance as the core count increases. To overcome these problems, we implement a blocked version with OpenMP threads executing block kernels that use vector instructions. Timings on a machine with four octo-core processors show that this implementation scales better than the BLAS-based one and that the costs of numerical and interval matrix products are comparable.
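A minimal sketch of the blocked approach, assuming an inf-sup interval representation (this is not the thesis's implementation, and it deliberately ignores directed rounding, which a rigorous enclosure requires): OpenMP threads work on tiles of interval data directly instead of calling BLAS on point matrices.

```c
/* Blocked interval matrix product sketch (inf-sup intervals). It only
 * illustrates the structure discussed above: OpenMP parallelism over tiles of
 * interval data rather than calls to BLAS on floating-point matrices.
 * Rounding is ignored here; a rigorous implementation rounds lower bounds
 * down and upper bounds up, which is part of the error analysis in the work. */
#include <stddef.h>

typedef struct { double lo, hi; } interval;

static interval imul_add(interval acc, interval a, interval b) {
    double p1 = a.lo * b.lo, p2 = a.lo * b.hi, p3 = a.hi * b.lo, p4 = a.hi * b.hi;
    double lo = p1, hi = p1;
    if (p2 < lo) lo = p2;  if (p2 > hi) hi = p2;
    if (p3 < lo) lo = p3;  if (p3 > hi) hi = p3;
    if (p4 < lo) lo = p4;  if (p4 > hi) hi = p4;
    acc.lo += lo;
    acc.hi += hi;
    return acc;
}

/* C = A * B for n x n interval matrices; each (ii, jj) tile of C is owned by
 * one thread, so the accumulation over kk needs no synchronization. */
void imatmul(size_t n, const interval *A, const interval *B, interval *C) {
    const size_t BS = 64;
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t i = ii; i < ii + BS && i < n; i++)
                    for (size_t j = jj; j < jj + BS && j < n; j++) {
                        interval c = (kk == 0) ? (interval){0.0, 0.0}
                                               : C[i * n + j];
                        for (size_t k = kk; k < kk + BS && k < n; k++)
                            c = imul_add(c, A[i * n + k], B[k * n + j]);
                        C[i * n + j] = c;
                    }
}
```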
137. Targeted Client Synthesis for Detecting Concurrency Bugs
Samak, Malavika, January 2016
Detecting concurrency bugs can be challenging due to the intricacies associated with their manifestation. These intricacies correspond to identifying the methods that need to be invoked concurrently, the inputs passed to these methods, and the interleaving of the threads that causes the erroneous behavior. Neither fuzzing-based testing techniques nor over-approximate static analyses are well positioned to detect subtle concurrency defects while retaining high accuracy and satisfactory coverage. While dynamic analysis techniques have been proposed to overcome some of the challenges in detecting concurrency bugs, we observe that their success is critically dependent on the availability of effective multithreaded clients. Without a priori knowledge of the defects, manually constructing defect-revealing multithreaded clients is non-trivial.
In this thesis, we design an approach to address the problem of automatically generating clients for detecting concurrency bugs in multithreaded libraries. The key insight underlying our design is that a subset of the properties observed when the defects manifest in a concurrent execution can also be observed in a sequential execution. The input to our approach is a library implementation and a sequential testsuite, and the output is a set of multithreaded clients that can be used to reveal defects in the input library implementation. Dynamic defect detectors can execute the clients and analyze the resulting traces to report various kinds of defects, including deadlocks, data races and atomicity violations. Furthermore, the clients can also be used by testing frameworks to report assertion violations.
We propose two variants of our design: (a) path-agnostic client generation, and (b) path-aware client generation. The path-agnostic client generation process helps in detecting potential bugs present in the paths executed by the input sequential testsuite. It does not attempt to explore new paths by satisfying path conditions, either by modifying the input or by scheduling the threads appropriately. The generated clients are used to expose deadlocks, data races and atomicity violations. Our analysis examines the execution traces obtained from executing the input sequential clients and produces a concurrent client program that drives shared objects, via library method calls, to states conducive to triggering deadlocks, data races or atomicity violations.
For path-aware client generation, our approach explores new paths that are not covered by the input sequential testsuite. For this purpose, we design a directed, iterative and scalable engine that combines the strengths of static and dynamic analysis to help synthesize both multithreaded clients and schedules that violate complex correctness conditions expressed by the developer. Apart from the library implementation and the sequential testsuite, this engine also accepts a specification of correctness as input. It then iteratively refines each client from the input sequential testsuite to generate an execution that can break the input specification. Each step of the iterative process includes statically identifying sub-goals towards the goal of failing the specification, generating a plan for meeting these goals, and merging the paths traversed dynamically with the statically computed plan, via constraint solving, to generate a new client. The engine reports full reproduction scenarios, guaranteed to be true, for the bugs it finds.
We have implemented prototypes that incorporate the aforementioned ideas and validated them by applying them to 29 well-tested concurrent classes from popular Java libraries, including the latest version of the JDK. We are able to automatically generate clients that helped expose more than 300 concurrency bugs, including deadlocks, data races, atomicity violations and assertion violations. We reported many previously unknown bugs to the developers of these libraries, resulting in either fixes to the code or changes to the documentation pertaining to the thread-safe behavior of the relevant classes. On average, the time taken to analyze a class and generate clients for it is less than two minutes. We believe that the demonstrated effectiveness of our prototypes in helping expose deep bugs in popular Java libraries makes the design proposed in this thesis a vital cog in the future development and deployment of dynamic concurrency bug detectors.
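The thesis works on Java libraries, but the shape of a defect-revealing client is language-independent; the hand-written C/pthreads sketch below (purely illustrative, with a toy "library" object) shows the kind of client the approach synthesizes automatically: two threads driving one shared object through ordinary API calls so that a particular interleaving violates an atomicity property that a dynamic detector or an assertion can flag.

```c
/* Hand-written illustration (C/pthreads) of the kind of multithreaded client
 * the approach generates automatically for Java libraries: two threads drive
 * one shared object through ordinary API calls, and a particular interleaving
 * of the check and the update violates the intended "no double insert"
 * atomicity, which a dynamic detector or an assertion can then flag. */
#include <assert.h>
#include <pthread.h>

typedef struct { int key_present; int inserts; } table_t;  /* toy "library" object */

static int  contains(table_t *t, int key) { (void)key; return t->key_present; }
static void insert(table_t *t, int key)   { (void)key; t->key_present = 1; t->inserts++; }

static table_t shared_table;                /* the shared object under test */

static void *client_thread(void *arg) {
    (void)arg;
    if (!contains(&shared_table, 42))       /* check ... */
        insert(&shared_table, 42);          /* ... then act: not atomic together */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, client_thread, NULL);
    pthread_create(&t2, NULL, client_thread, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    assert(shared_table.inserts <= 1);      /* fails under the bad interleaving */
    return 0;
}
```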
138. Contributions to efficiency of total order broadcast protocols and to RFID-based delay tolerant networks
Simatic, Michel, 04 October 2012
In asynchronous distributed systems, logical clocks and vector clocks are two core tools for managing data communication and data sharing between the entities of these systems. The goal of this PhD thesis is to exploit these tools from an implementation viewpoint. In the first part of this thesis, we focus on data communication and contribute to the total order broadcast domain. We propose the trains protocol: tokens (called trains) circulate in parallel among participating processes distributed on a virtual ring. Each train carries a logical clock used to recover lost trains in case of process failure. We prove that the trains protocol is a uniform totally ordered broadcast protocol. We then introduce a new metric, throughput efficiency, with which we show that, from a throughput point of view, the trains protocol performs better than the protocols presented in the literature. Moreover, this metric gives the maximal theoretical throughput that can be reached when implementing a given protocol, making it possible to evaluate the quality of a protocol implementation. Thanks to its throughput performance, in particular for small messages, the trains protocol is a strong candidate for data sharing between the cores of a processor; thanks to its low network overhead, it is also attractive for data replication between servers in the cloud. Part of this work was implemented in a control-command and supervision system deployed across several dozen industrial sites. In the second part of this thesis, we focus on data sharing and contribute to the RFID domain. We propose a distributed shared memory based on RFID tags that avoids the need for a global computer network: it uses vector clocks and relies on the network formed by the mobile users of the distributed application, allowing users to read the contents of remote RFID tags. Our RFID-based distributed shared memory is an alternative to the three RFID-based architectures available in the literature, and was implemented in a pervasive game tested by one thousand users.
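Vector clocks, the second tool the abstract builds on, are compact enough to sketch generically (this is not the thesis's implementation): each participant keeps one counter per participant, ticks its own entry on a local write, and takes an entry-wise maximum when it reads a remote tag, which is enough to tell whether two versions of a tag's contents are causally ordered or concurrent.

```c
/* Generic vector-clock sketch (not the thesis's code): the mechanism an
 * RFID-based distributed shared memory can use to decide whether one copy of
 * a tag's contents is newer than, older than, or concurrent with another. */
#define NPROC 8                          /* number of participants, fixed here */

typedef struct { unsigned long c[NPROC]; } vclock_t;

void vc_local_event(vclock_t *v, int self) {      /* local write: tick own entry */
    v->c[self]++;
}

void vc_merge(vclock_t *dst, const vclock_t *src) {   /* on reading a remote tag */
    for (int i = 0; i < NPROC; i++)
        if (src->c[i] > dst->c[i])
            dst->c[i] = src->c[i];
}

/* Returns 1 if a happened-before b, -1 if b happened-before a,
 * 0 if the two clocks are concurrent or equal. */
int vc_compare(const vclock_t *a, const vclock_t *b) {
    int a_le_b = 1, b_le_a = 1;
    for (int i = 0; i < NPROC; i++) {
        if (a->c[i] > b->c[i]) a_le_b = 0;
        if (b->c[i] > a->c[i]) b_le_a = 0;
    }
    if (a_le_b && !b_le_a) return 1;
    if (b_le_a && !a_le_b) return -1;
    return 0;
}
```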
139. Code Optimization on GPUs
Hong, Changwan, 30 October 2019
No description available.
140. Secure Communication in a Multi-OS-Environment
Bathe, Shivraj Gajanan, 25 January 2016
The current trend in the automotive industry is towards adopting multicore microcontrollers in Electronic Control Units (ECUs). Multicore microcontrollers make it possible to run a number of separate, dedicated operating systems on a single ECU. When two heterogeneous operating systems run in parallel in a multicore environment, inter-OS communication becomes the key factor in overall performance. This thesis studies inter-OS communication based on shared memory, in a setup with two operating systems: EB Autocore OS, which is based on the AUTomotive Open System ARchitecture (AUTOSAR) standard, and Android. Since Android is the gateway to the internet, its open nature and the increased connectivity features of a connected car introduce many attack surfaces into the system. As safety and security go hand in hand, the security aspects of the communication channel are taken into account. A portable prototype for multi-OS communication based on shared memory, with security considerations, is developed as a plugin for EB tresos Studio.
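The core mechanism, a channel in a memory region that both operating systems map, can be sketched generically (this is not the thesis's EB tresos plugin): a single-producer/single-consumer ring buffer in which each side writes only its own index, with release/acquire ordering making the payload visible before the index update; the security measures the thesis discusses, such as message authentication and input validation, would be layered on top.

```c
/* Generic single-producer/single-consumer ring buffer sketch for a shared
 * memory region visible to both operating systems (not the thesis's actual
 * EB tresos plugin). Each side writes only its own index; C11 release/acquire
 * ordering makes the payload visible before the index update. Security layers
 * (message authentication, input validation) would sit on top of this. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SLOTS    64u                /* power of two */
#define SLOT_SZ  256u

typedef struct {
    _Atomic uint32_t head;          /* written only by the producer OS */
    _Atomic uint32_t tail;          /* written only by the consumer OS */
    uint8_t slot[SLOTS][SLOT_SZ];
} shm_channel_t;                    /* placed in the shared memory region */

bool shm_send(shm_channel_t *ch, const void *msg, uint32_t len) {
    uint32_t head = atomic_load_explicit(&ch->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&ch->tail, memory_order_acquire);
    if (head - tail == SLOTS || len > SLOT_SZ)
        return false;                              /* full, or message too big */
    memcpy(ch->slot[head % SLOTS], msg, len);
    atomic_store_explicit(&ch->head, head + 1, memory_order_release);
    return true;
}

bool shm_recv(shm_channel_t *ch, void *msg, uint32_t len) {
    uint32_t tail = atomic_load_explicit(&ch->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&ch->head, memory_order_acquire);
    if (head == tail || len > SLOT_SZ)
        return false;                              /* empty */
    memcpy(msg, ch->slot[tail % SLOTS], len);
    atomic_store_explicit(&ch->tail, tail + 1, memory_order_release);
    return true;
}
```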