101 |
A distributed kernel summation framework for machine learning and scientific applications. Lee, Dong Ryeol, 11 May 2012.
The class of computational problems I consider in
this thesis shares the common trait of requiring
consideration of pairs (or higher-order tuples)
of data points. I focus on kernel summation
operations, which are ubiquitous in many data
mining and scientific algorithms.
In machine learning, kernel summations appear in
popular kernel methods which can model nonlinear
structures in data. Kernel methods include many
non-parametric methods such as kernel density
estimation, kernel regression, Gaussian process
regression, kernel PCA, and kernel support vector
machines (SVM). In computational physics,
kernel summations occur inside the classical
N-body problem for simulating positions of a set
of celestial bodies or atoms.
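Concretely, the basic operation is, for each query point, a weighted sum of kernel values against all reference points. A minimal sketch of the naive O(N·M) computation, assuming a Gaussian kernel for illustration (the thesis framework supports other kernels and replaces this brute-force loop with tree-based approximations):

```python
import numpy as np

def gaussian_kernel(sq_dists, bandwidth):
    """Gaussian (RBF) kernel evaluated on squared distances."""
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def naive_kernel_summation(queries, references, weights, bandwidth=1.0):
    """For every query point, sum kernel values against all reference points.

    O(N * M) work; the thesis is about approximating this much faster.
    """
    results = np.empty(len(queries))
    for i, q in enumerate(queries):
        sq_dists = np.sum((references - q) ** 2, axis=1)
        results[i] = np.dot(gaussian_kernel(sq_dists, bandwidth), weights)
    return results

# Toy usage: kernel-density-style sums over random points.
rng = np.random.default_rng(0)
refs = rng.normal(size=(1000, 3))
qs = rng.normal(size=(10, 3))
print(naive_kernel_summation(qs, refs, np.ones(len(refs)) / len(refs)))
```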
This thesis attempts to marry, for the first
time, the best relevant techniques from parallel
computing, where kernel summations arise in low
dimensions, with the best general-dimension
algorithms from the machine learning literature.
We provide a unified, efficient parallel
kernel summation framework that can utilize:
(1) various types of deterministic and
probabilistic approximations that may be
suitable for both low- and high-dimensional
problems with a large number of data points;
(2) data indexing with any multi-dimensional
binary tree, under both distributed-memory (MPI)
and shared-memory (OpenMP/Intel TBB) parallelism;
(3) a dynamic load balancing scheme to adjust
work imbalances during the computation.
I will first summarize my previous research in
serial kernel summation algorithms. This work
started from Greengard/Rokhlin's earlier work on
fast multipole methods for the purpose of
approximating potential sums of many particles.
The contributions of this part of the thesis
include the following: (1) a reinterpretation of
Greengard/Rokhlin's work for the computer science
community; (2) the extension of the algorithms to
a larger class of approximation strategies,
i.e. probabilistic error bounds via Monte Carlo
techniques; (3) the multibody series expansion:
a generalization of the theory of fast
multipole methods to handle interactions of more
than two entities; (4) the first O(N) proof of
the batch approximate kernel summation using a
notion of intrinsic dimensionality. I then move
on to the parallelization of kernel summations
and to scaling two other kernel methods,
Gaussian process regression (kernel matrix
inversion) and kernel PCA (kernel matrix
eigendecomposition).
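As an illustration of the Monte Carlo idea in contribution (2), one can estimate each kernel sum from a random subsample of the reference points and attach a confidence-interval-style error bound. A minimal sketch under that interpretation (the thesis embeds such estimates inside tree-based algorithms rather than using them in this standalone form):

```python
import numpy as np

def mc_kernel_sum(query, references, bandwidth=1.0, n_samples=200, z=1.96):
    """Monte Carlo estimate of sum_j exp(-||q - r_j||^2 / (2 h^2)).

    Returns the estimate and a ~95% half-width, illustrating a
    probabilistic (rather than deterministic) error bound.
    """
    n = len(references)
    idx = np.random.default_rng(1).choice(n, size=n_samples, replace=True)
    sq_dists = np.sum((references[idx] - query) ** 2, axis=1)
    contribs = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    estimate = n * contribs.mean()
    half_width = z * n * contribs.std(ddof=1) / np.sqrt(n_samples)
    return estimate, half_width

refs = np.random.default_rng(2).normal(size=(50_000, 3))
est, err = mc_kernel_sum(np.zeros(3), refs)
print(f"estimate = {est:.1f} +/- {err:.1f}")
```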
The artifact of this thesis has contributed to an
open-source machine learning package called
MLPACK, which was first demonstrated at NIPS 2008
and subsequently at the NIPS 2011 Big Learning
Workshop. Completing a portion of this thesis
involved the use of high-performance computing
resources at XSEDE (eXtreme Science and
Engineering Discovery Environment) and NERSC
(National Energy Research Scientific Computing
Center).
|
102 |
Checkpointing Algorithms for Parallel Computers. Kalaiselvi, S, 02 1900.
Checkpointing is a technique widely used in parallel/distributed computers for rollback error recovery. Checkpointing is defined as the coordinated saving of process state information at specified time instances. Checkpoints help in restoring the computation from the latest saved state, in case of failure. In addition to fault recovery, checkpointing has applications in fault detection, distributed debugging and process migration.
Checkpointing in uniprocessor systems is easy because there is a single clock and events occur with respect to this clock: there is a clear demarcation between events that happen before a checkpoint and events that happen after it. In a parallel computer, a large number of processors coordinate to solve a single problem. Since there may be multiple streams of execution, checkpoints have to be introduced along all these streams simultaneously. The absence of a global clock necessitates explicit coordination to obtain a consistent global state.
Events occurring in a distributed system can be partially ordered using Lamport's happens-before relation. The happens-before relation -> is a partial ordering that identifies dependent and concurrent events in a distributed system.
It is defined as follows:
·If two events a and b happen in the same process, and a happens before b, then a -> b.
·If a is the sending event of a message and b is the receiving event of the same message, then a -> b.
·If neither a -> b nor b -> a, then a and b are said to be concurrent.
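One standard way to track the happens-before relation at run time, not specific to this thesis, is with vector clocks; a minimal sketch in which happened_before(a, b) is true exactly when a -> b:

```python
def happened_before(vc_a, vc_b):
    """True iff the event with vector clock vc_a happens-before the event with vc_b."""
    return all(x <= y for x, y in zip(vc_a, vc_b)) and vc_a != vc_b

def concurrent(vc_a, vc_b):
    """True iff neither event happens-before the other."""
    return not happened_before(vc_a, vc_b) and not happened_before(vc_b, vc_a)

# Process 0 sends a message after its second local event; process 1 receives it.
send_event = [2, 0]    # clock on process 0 at the send
recv_event = [2, 1]    # clock on process 1 after the receive
other_event = [0, 1]   # an unrelated earlier event on process 1

print(happened_before(send_event, recv_event))   # True
print(concurrent(send_event, other_event))       # True
```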
A consistent global state may have concurrent checkpoints. In the first chapter of the thesis we discuss issues regarding ordering of events in a parallel computer, need for coordination among checkpoints and other aspects related to checkpointing. Checkpointing locations can either be identified statically or dynamically. The static approach assumes that a representation of a program to be checkpointed is available with information that enables a programmer to specify the places where checkpoints are to be taken. The dynamic approach identifies the checkpointing locations at run time. In this thesis, we have proposed algorithms for both static and dynamic checkpointing. The main contributions of this thesis are as follows:
1. Parallel computers that are being built now have faster communication and hence more efficient clock synchronisation compared to those built a few years ago. Based on efficient clock synchronisation protocols, the clock drift in current machines can be maintained within a few microseconds. We have proposed a dynamic checkpointing algorithm for parallel computers assuming bounded clock drifts.
2. The shared memory paradigm is convenient for programming, while the message passing paradigm is easy to scale. Distributed Shared Memory (DSM) systems combine the advantages of both paradigms and can be visualized easily on top of a network of workstations. IEEE has recently proposed an interconnect standard called Scalable Coherent Interface (SCI) to configure computers as a Distributed Shared Memory system. A periodic dynamic checkpointing algorithm has been proposed in the thesis for a DSM system which uses the SCI standard.
3. When information about a parallel program is available one can make use of this knowledge to perform efficient checkpointing. A static checkpointing approach based on task graphs is proposed for parallel programs. The proposed task graph based static checkpointing approach has been implemented on a Parallel Virtual Machine (PVM) platform.
We now give a gist of the various chapters of the thesis. Chapter 2 gives a classification of existing checkpointing algorithms and surveys algorithms that have been reported in the literature for checkpointing parallel/distributed systems. A point to be noted is that most of the algorithms published for checkpointing message passing systems are based on the seminal article by Chandy & Lamport. A large number of checkpointing algorithms have been published by relaxing the assumptions made in that article and by extending its features to minimise the overheads of coordination and context saving.
Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory, and all of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems but also include methods for storing the status of distributed memory in stable storage. Chapter 2 concludes with brief comments on the desirable features of a checkpointing algorithm.
In Chapter 3, we develop a dynamic checkpointing algorithm for message passing systems assuming that the clock drift of processors in the system is bounded. Efficient clock synchronisation protocols have been implemented on recent parallel computers owing to the fact that communication between processors is very fast. Based on efficient clock synchronisation protocols, clock skew can be limited to a few microseconds. The algorithm proposed in the thesis uses clocks for checkpoint coordination and vector counts for identifying messages to be logged. The algorithm is a periodic, distributed algorithm. We prove correctness of the algorithm and compare it with similar clock based algorithms.
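The coordination idea can be illustrated with a toy rule (a deliberately simplified sketch with assumed parameter values, not the algorithm proposed in the thesis): if every process checkpoints when its local clock passes a multiple of a period T, and clocks differ by at most a bound EPS, then only messages that may cross a checkpoint line need to be logged, and the receiver can decide this locally from the sender's piggybacked timestamp.

```python
PERIOD_T = 60.0   # assumed seconds between checkpoint lines
EPS = 0.001       # assumed bound on clock drift between any two processes

def checkpoint_index(local_time):
    """Index of the last checkpoint line passed at this local time."""
    return int(local_time // PERIOD_T)

def must_log(send_time, recv_time):
    """Conservatively decide whether a received message must be logged.

    A message is dangerous if it may have been sent before the sender's
    k-th checkpoint but is received after the receiver's k-th checkpoint;
    the EPS margin accounts for the bounded clock drift.
    """
    k = checkpoint_index(recv_time)
    return send_time < k * PERIOD_T + EPS and k > checkpoint_index(send_time - EPS)

print(must_log(send_time=59.9995, recv_time=60.2))   # crosses the line -> True
print(must_log(send_time=60.5, recv_time=60.8))      # both after the line -> False
```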
Distributed Shared Memory (DSM) systems provide the benefit of ease of programming in a scalable system. The recently proposed IEEE Scalable Coherent Interface (SCI) standard, facilitates the construction of scalable coherent systems. In Chapter 4 we discuss a checkpointing algorithm for an SCI based DSM system. SCI maintains cache coherence in hardware using a distributed cache directory which scales with the number of processors in the system. SCI recommends a two phase transaction protocol for communication. Our algorithm is a two phase centralised coordinated algorithm. Phase one initiates checkpoints and the checkpointing activity is completed in phase two. The correctness of the algorithm is established theoretically. The chapter concludes with the discussion of the features of SCI exploited by the checkpointing algorithm proposed in the thesis.
In Chapter 5, a static checkpointing algorithm is developed assuming that the program to be executed on a parallel computer is given as a directed acyclic task graph. We assume that estimates of the time to execute each task in the task graph are given. Given the times at which checkpoints are to be taken, the algorithm identifies a set of edges where checkpointing tasks can be placed, ensuring that they form a consistent global checkpoint. The proposed algorithm eliminates coordination overhead at run time and significantly reduces the context saving overhead by taking checkpoints along edges of the task graph. The algorithm is used as a preprocessing step before scheduling the tasks to processors. Its complexity is O(km), where m is the number of edges in the graph and k the maximum number of global checkpoints to be taken.
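The flavour of such a placement can be sketched as follows (a toy simplification with hypothetical timings; it is valid only when the requested checkpoint time falls in a gap on every source-to-sink path, whereas the thesis algorithm handles the general case): for a checkpoint at time t, select the edges whose producer task finishes at or before t and whose consumer task starts after t.

```python
def checkpoint_edges(edges, finish, start, checkpoint_times):
    """For each requested checkpoint time, pick the DAG edges crossing it.

    edges            : list of (producer, consumer) task pairs
    finish, start    : dicts mapping task -> estimated finish / start time
    checkpoint_times : the k global checkpoint times
    Runs in O(k * m) for m edges, matching the stated complexity.
    """
    placement = {}
    for t in checkpoint_times:
        placement[t] = [(u, v) for (u, v) in edges
                        if finish[u] <= t < start[v]]
    return placement

# Toy task graph: A -> B, A -> C, B -> D, C -> D (hypothetical timings).
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
start = {"A": 0, "B": 10, "C": 10, "D": 30}
finish = {"A": 10, "B": 25, "C": 20, "D": 40}
print(checkpoint_edges(edges, finish, start, checkpoint_times=[27]))
# {27: [('B', 'D'), ('C', 'D')]}  -- data on these edges is saved at t = 27
```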
The static algorithm is implemented on a parallel computer with a PVM environment as it is widely available and portable. The task graph of a program can be constructed manually or through program development tools. Our implementation is a collection of preprocessing and run time routines. The preprocessing routines operate on the task graph information to generate a set of edges to be checkpointed for each global checkpoint and write the information on disk. The run time routines save the context along the marked edges. In case of recovery, the recovery algorithms read the information from stable storage and reconstruct the context. The limitation of our static checkpointing algorithm is that it can operate only on deterministic task graphs. To demonstrate the practical feasibility of the proposed approach, case studies of checkpointing some parallel programs are included in the thesis.
We conclude the thesis with a summary of proposed algorithms and possible directions to continue research in the area of checkpointing.
|
103 |
Formalisation et automatisation de YAO, générateur de code pour l’assimilation variationnelle de données (Formalisation and automation of YAO, a code generator for variational data assimilation). Nardi, Luigi, 08 March 2011.
Variational data assimilation (4D-Var) is a widely used technique in geophysics, in particular in meteorology and oceanography. It consists in estimating the control parameters of a direct numerical model by minimizing a cost function that measures the misfit between the model outputs and the observed measurements. The minimization, which is based on a gradient method, requires computing the adjoint model (the product of the transposed Jacobian matrix with the derivative vector of the cost function at the observation points). Implementing 4D-Var raises complex programming issues, in particular concerning the adjoint model, the parallelization of the code, and efficient memory management. To address these difficulties and facilitate the development of 4D-Var applications, the YAO framework, developed at LOCEAN, represents the direct model as a computation flow graph called a modular graph. Modules represent computation units and the arcs describe data transfers between these modules. Description directives allow a user to describe the direct model and then to generate the modular graph associated with it. Two core algorithms operate on this graph: a forward propagation algorithm computes the outputs of the direct model, and a back propagation algorithm computes those of its adjoint. YAO then automatically generates the code of the direct model and of its adjoint once the modular graph has been conceived by the user, and it also supports various scenarios for running data assimilation sessions. This thesis presents computer science research carried out within the YAO framework. We first formalized the YAO specifications in a more general way. We then proposed algorithms that automate several important tasks, such as the automatic generation of an "optimal" ordering of the computations and the automatic shared-memory parallelization of the generated code using OpenMP directives. The medium-term objective of these results is to lay the foundations for evolving YAO into a general and operational platform for 4D-Var data assimilation, able to handle real, large-scale applications.
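The adjoint-by-back-propagation idea that YAO automates can be illustrated on a tiny hand-written computation graph (a generic reverse-mode sketch, not code generated by YAO):

```python
import math

# Direct model: y = sin(a * b); nodes are evaluated in topological order.
def forward(a, b):
    u = a * b
    y = math.sin(u)
    return u, y

def adjoint(a, b, u, dJ_dy):
    """Back propagation over the same graph: given dJ/dy at the output,
    return dJ/da and dJ/db (the adjoint model applied to dJ_dy)."""
    dJ_du = dJ_dy * math.cos(u)    # through y = sin(u)
    dJ_da = dJ_du * b              # through u = a * b
    dJ_db = dJ_du * a
    return dJ_da, dJ_db

a, b = 0.3, 2.0
u, y = forward(a, b)
grads = adjoint(a, b, u, dJ_dy=1.0)
print(y, grads)   # gradient of the output w.r.t. the control parameters a and b
```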
|
104 |
Design by transformation: from domain knowledge to optimized program generation. Marker, Bryan Andrew, 20 June 2014.
Expert design knowledge is essential to develop a library of high-performance software. This includes how to implement and parallelize domain operations, how to optimize implementations, and estimates of which implementation choices are best. An expert repeatedly applies his knowledge, often in a rote and tedious way, to develop all of the related functionality expected from a domain-specific library. Expert knowledge is hard to gain and is easily lost over time when an expert forgets or when a new engineer starts developing code. The domain of dense linear algebra (DLA) is a prime example with software that is so well designed that much of experts' important work has become tediously rote in many ways. In this dissertation, we demonstrate how one can encode design knowledge for DLA so it can be automatically applied to generate code as an expert would or to generate better code. Further, the knowledge is encoded for perpetuity, so it can be reused to make implementing functionality on new hardware easier or it can be used to teach how software is designed to a non-expert. We call this approach to software engineering (encoding expert knowledge and automatically applying it) Design by Transformation (DxT). We present our vision, the methodology, a prototype code generation system, and possibilities when applying DxT to the domain of dense linear algebra.
|
105 |
Υλοποίηση συστήματος κοινής ιδεατής μνήμης για συστάδες πολυεπεξεργαστικών συστημάτων / Software distributed shared memory for clusters of multiprocessors. Τουρναβίτης, Γεώργιος, 16 May 2007.
Clusters of computers are a widely used and highly competitive architecture for building high-performance computing systems at low cost. At the same time, the wide commercial availability of small-scale symmetric multiprocessor (SMP) systems allows them to be combined into hybrid cluster-of-multiprocessor configurations, with the potential for bridging the performance-cost gap between low-end SMPs and high-end Distributed Shared Memory (DSM) systems. Despite the flexibility this offers, the need to use distributed programming models significantly increases the complexity of application development. Software Distributed Shared Memory (SDSM) systems are an alternative: they provide applications running on different nodes of the cluster with access to a shared address space, hiding the underlying distributed architecture. A major limitation of most existing implementations is the lack of multithreading support, which leads to poor utilization of modern multiprocessor nodes, since neither the application nor the mechanisms that maintain the consistency of the shared memory execute in parallel. This thesis presents the design and the main architectural choices of an SDSM platform implemented entirely in software, using a hybrid software and hardware coherency model, that targets more efficient utilization of the SMP nodes of a cluster by supporting multithreaded execution of the application on each node. Both the consistency protocol of the distributed memory and the communication subsystem were redesigned to use multiple threads of execution. The implementation was developed on top of Pthreads and the TCP/IP network protocol, employing a simple yet efficient design. Alternative hierarchical synchronization algorithms, which exploit the hybrid organization of the cluster, are also presented and evaluated. Finally, the performance of the multithreaded SDSM platform is evaluated and analyzed using a wide set of benchmark applications.
|
106 |
Software Techniques for Distributed Shared Memory. Radovic, Zoran, January 2005.
In large multiprocessors, the access to shared memory is often nonuniform, and may vary as much as ten times for some distributed shared-memory architectures (DSMs). This dissertation identifies another important nonuniform property of DSM systems: nonuniform communication architecture, NUCA. High-end hardware-coherent machines built from large nodes, or from chip multiprocessors, are typical NUCA systems, since they have a lower penalty for reading recently written data from a neighbor's cache than from a remote cache. This dissertation identifies node affinity as an important property for scalable general-purpose locks. Several software-based hierarchical lock implementations exploiting NUCAs are presented and evaluated. NUCA-aware locks are shown to be almost twice as efficient for contended critical sections compared to traditional lock implementations. The shared-memory “illusion” provided by some large DSM systems may be implemented using either hardware, software or a combination thereof. A software-based implementation can enable cheap cluster hardware to be used, but typically suffers from poor and unpredictable performance characteristics. This dissertation advocates a new software-hardware trade-off design point based on a new combination of techniques. The two low-level techniques, fine-grain deterministic coherence and synchronous protocol execution, as well as profile-guided protocol flexibility, are evaluated in isolation as well as in a combined setting using all-software implementations. Finally, a minimum of hardware trap support is suggested to further improve the performance of coherence protocols across cluster nodes. It is shown that all these techniques combined could result in a fairly stable performance on par with hardware-based coherence.
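The hierarchical-lock idea can be sketched roughly as follows (a simplified two-level illustration in the spirit of NUCA-aware locks, not the dissertation's actual lock implementations): a thread first acquires a lock local to its node and only then competes for the global lock, so the global lock sees at most one contender per node; real NUCA-aware locks additionally prefer handing the lock to waiters on the same node.

```python
import threading

class HierarchicalLock:
    """Two-level lock: one per-node local lock plus one global lock.

    Threads on the same node serialize on their local lock first, so the
    global lock is contended by at most one thread per node, which reduces
    cross-node lock traffic.
    """
    def __init__(self, num_nodes):
        self.global_lock = threading.Lock()
        self.local_locks = [threading.Lock() for _ in range(num_nodes)]

    def acquire(self, node_id):
        self.local_locks[node_id].acquire()
        self.global_lock.acquire()

    def release(self, node_id):
        self.global_lock.release()
        self.local_locks[node_id].release()

# Toy usage: 4 threads on 2 "nodes" incrementing a shared counter.
lock, counter = HierarchicalLock(num_nodes=2), [0]

def worker(node_id):
    for _ in range(10_000):
        lock.acquire(node_id)
        counter[0] += 1
        lock.release(node_id)

threads = [threading.Thread(target=worker, args=(i % 2,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])   # 40000
```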
|
107 |
Methods for Creating and Exploiting Data Locality. Wallin, Dan, January 2006.
The gap between processor speed and memory latency has led to the use of caches in the memory systems of modern computers. Programs must use the caches efficiently and exploit data locality for maximum performance. Multiprocessors, built from many processing units, are becoming commonplace not only in large servers but also in smaller systems such as personal computers. Multiprocessors require careful data locality optimizations since accesses from other processors can lead to invalidations and false sharing cache misses. This thesis explores hardware and software approaches for creating and exploiting temporal and spatial locality in multiprocessors. We propose the capacity prefetching technique, which efficiently reduces the number of cache misses but avoids false sharing by distinguishing cache lines involved in communication from non-communicating cache lines at run time. Prefetching techniques often lead to increased coherence and data traffic. The new bundling technique avoids one of these drawbacks and reduces the coherence traffic in multiprocessor prefetchers. This is especially important in snoop-based systems where the coherence bandwidth is a scarce resource. Most of the studies have been performed on advanced scientific algorithms. This thesis demonstrates that a cc-NUMA multiprocessor, with hardware data migration and replication optimizations, efficiently exploits the temporal locality in such codes. We further present a method of parallelizing a multigrid Gauss-Seidel partial differential equation solver, which creates temporal locality at the expense of increased communication. Our conclusion is that on modern chip multiprocessors, it is more important to optimize algorithms for data locality than to avoid communication, since communication can take place using a shared cache.
|
108 |
Eidolon: adapting distributed applications to their environment. Potts, Daniel Paul, Computer Science & Engineering, Faculty of Engineering, UNSW, January 2008.
Grids, multi-clusters, NUMA systems, and ad-hoc collections of distributed computing devices all present diverse environments in which distributed computing applications can be run. Due to the diversity of features provided by these environments, a distributed application that is to perform well must be specifically designed and optimised for the environment in which it is deployed. Such optimisations generally affect the application's communication structure, its consistency protocols, and its communication protocols. This thesis explores approaches to improving the ability of distributed applications to share consistent data efficiently and with improved functionality over wide-area and diverse environments. We identify a fundamental separation of concerns for distributed applications. This is used to propose a new model, called the view model, which is a hybrid, cost-conscious approach to remote data sharing. It provides the necessary mechanisms and interconnects to improve the flexibility and functionality of data sharing without defining new programming models or protocols. We employ the view model to adapt distributed applications to their run-time environment without modifying the application or inventing new consistency or communication protocols. We explore the use of view model properties on several programming models and their consistency protocols. In particular, we focus on programming models used in distributed-shared-memory middleware and applications, as these can benefit significantly from the properties of the view model. Our evaluation demonstrates the benefits, side effects and potential shortcomings of the view model by comparing our model with traditional models when running distributed applications across several multi-cluster scenarios. In particular, we show that the view model improves the performance of distributed applications while reducing resource usage and communication overheads.
|
109 |
Arquitetura de uma rede de interconexão com memória compartilhada baseada na topologia crossbar / Architecture of an interconnection network with shared memory based on the crossbar topology. Fábio Gonçalves Pessanha, 22 March 2013.
A Multi-Processor System-on-Chip (MPSoC) has multiple processors in a single chip. Multiple applications can be executed in parallel, or a parallelizable application can be partitioned and allocated to the processors in order to accelerate its execution. One problem in MPSoCs is the communication between the processors required by these applications. In this work, we propose the architecture of an interconnection network based on the crossbar topology, with shared memory. The architecture is parameterizable, having N processors and N memory modules. The exchange of information between processors is done via shared memory. In this type of implementation each processor executes the application stored in its own memory module. Through the network, all processors have full access to their own memory modules simultaneously, allowing the applications to run concurrently. Moreover, a processor can access other memory modules whenever it needs data generated by another processor. The proposed architecture is modelled in VHDL and its performance is analysed by executing an application in parallel and comparing it with its sequential execution. The chosen application consists of optimizing objective functions with the Particle Swarm Optimization (PSO) method. In this method, a swarm of particles is distributed evenly among the processors of the network and, at the end of each iteration, a processor accesses the memory module of another processor in order to obtain the best position found by the swarm allocated there. The communication between processors is based on three strategies: ring, neighbourhood and broadcast. This application was chosen because it is computationally intensive and, therefore, a strong candidate for parallelization.
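The parallel decomposition described above can be mimicked in a small sequential sketch (hypothetical objective function and PSO parameters; in the thesis each sub-swarm runs on a hardware processor and the neighbour's best position is read from its memory module over the crossbar):

```python
import numpy as np

def sphere(x):                      # toy objective: minimum at the origin
    return float(np.sum(x ** 2))

rng = np.random.default_rng(3)
n_proc, swarm_per_proc, dim, iters = 4, 8, 2, 50
pos = rng.uniform(-5, 5, (n_proc, swarm_per_proc, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()                                            # personal bests
gbest = np.array([p[np.argmin([sphere(x) for x in p])] for p in pbest])  # best per "processor"

for _ in range(iters):
    for p in range(n_proc):
        # Ring strategy: read the best position found by the neighbouring processor.
        neighbour_best = gbest[(p - 1) % n_proc]
        r1, r2 = rng.random(2)
        vel[p] = (0.7 * vel[p]
                  + 1.5 * r1 * (pbest[p] - pos[p])
                  + 1.5 * r2 * (neighbour_best - pos[p]))
        pos[p] += vel[p]
        better = np.array([sphere(a) < sphere(b) for a, b in zip(pos[p], pbest[p])])
        pbest[p][better] = pos[p][better]
        gbest[p] = pbest[p][np.argmin([sphere(x) for x in pbest[p]])]

print(min(sphere(g) for g in gbest))   # should approach 0
```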
|
110 |
Um ambiente de execução para suporte à programação paralela com variáveis compartilhadas em sistemas distribuídos heterogêneos / A runtime system for parallel programming with the shared-memory paradigm over heterogeneous distributed systems. Gisele da Silva Craveiro, 31 October 2003.
Advances in hardware technology are making small SMP machines (2 to 8 processors) available at an ever-lower cost, so that adding such machines to a cluster of PCs, or even building a cluster entirely of SMPs, is an increasingly viable alternative for high-performance computing. The challenge is to exploit the computational resources these platforms provide. One alternative is a hybrid programming paradigm that uses the shared-memory architecture within a node through multithreading and the message-passing model for communication between nodes. However, programming in such a paradigm is an arduous and unproductive task for the application programmer. This thesis presents CPAR-Cluster, a runtime system that provides a shared-memory abstraction on top of a cluster composed of mono- and multiprocessor nodes. It is implemented at the library level and does not require special resources such as dedicated hardware or operating system modifications. Its models, strategies and implementation aspects are presented, together with the results of tests performed with the tool, which showed the expected behaviour.
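The hybrid paradigm that CPAR-Cluster is designed to hide can be illustrated with a small sketch, using standard Python libraries to stand in for intra-node threads and inter-node message passing (this is not CPAR-Cluster's API):

```python
import threading
import numpy as np
from multiprocessing import Process, Queue

def node(node_id, data, out_queue, n_threads=2):
    """One cluster node: threads share memory inside the node,
    message passing (the queue) is used between nodes."""
    partial = np.zeros(1)
    lock = threading.Lock()

    def worker(chunk):
        s = chunk.sum()              # work on a private slice
        with lock:                   # shared-memory update inside the node
            partial[0] += s

    threads = [threading.Thread(target=worker, args=(c,))
               for c in np.array_split(data, n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    out_queue.put((node_id, partial[0]))   # inter-node "message"

if __name__ == "__main__":
    data = np.arange(1_000_000, dtype=np.float64)
    queue = Queue()
    nodes = [Process(target=node, args=(i, chunk, queue))
             for i, chunk in enumerate(np.array_split(data, 2))]
    for p in nodes:
        p.start()
    total = sum(queue.get()[1] for _ in nodes)
    for p in nodes:
        p.join()
    print(total)    # 499999500000.0
```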
|
Page generated in 0.0599 seconds