41

Contech: a shared memory parallel program analysis framework

Vassenkov, Phillip 13 January 2014
We are in the era of multicore machines, where we must exploit thread-level parallelism for programs to run better, smarter, faster, and more efficiently. To increase instruction-level parallelism, processors and compilers perform heavy dataflow analyses between instructions; comparatively little work, however, has addressed inter-thread dataflow analysis. To pave the way and find new ways to conserve resources across a variety of domains (execution speed, chip die area, power efficiency, and computational throughput), we propose a novel framework, termed Contech, that facilitates the analysis of multithreaded programs in terms of their communication and execution patterns. We restrict the scope to shared memory programs rather than message passing programs, since the communication and execution patterns of shared memory programs are more difficult to analyze. Discovering these patterns has the potential to let general purpose computing machines turn architectural tricks on or off according to application-specific features. The design of Contech is modular, so a large variety of information can be gleaned from an architecturally independent representation of the program under examination.
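
To make the instrumentation idea concrete, here is a minimal C++ sketch of the kind of per-thread memory-event tracing on which such a framework rests: record every shared access, then mine the merged trace for producer-consumer pairs. The names and the merge-based analysis are illustrative assumptions, not Contech's actual interface.

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <mutex>
    #include <vector>

    struct MemEvent {
        std::uint64_t addr;      // accessed address
        bool          is_write;
        int           tid;       // logical thread id
    };

    std::vector<MemEvent> g_trace;   // merged event trace
    std::mutex            g_trace_mu;

    // Instrumentation hook: a compiler pass would insert calls like this
    // around every shared load and store of the program under analysis.
    void record_access(int tid, const void* p, bool is_write) {
        std::lock_guard<std::mutex> lk(g_trace_mu);
        g_trace.push_back({reinterpret_cast<std::uint64_t>(p), is_write, tid});
    }

    // Offline analysis: a read of an address last written by a different
    // thread indicates inter-thread communication.
    void report_sharing() {
        std::map<std::uint64_t, int> last_writer;
        for (const MemEvent& e : g_trace) {
            if (e.is_write) {
                last_writer[e.addr] = e.tid;
            } else {
                auto it = last_writer.find(e.addr);
                if (it != last_writer.end() && it->second != e.tid)
                    std::cout << "thread " << it->second << " -> thread "
                              << e.tid << " at " << std::hex << e.addr
                              << std::dec << '\n';
            }
        }
    }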
42

Eidolon: adapting distributed applications to their environment.

Potts, Daniel Paul, Computer Science & Engineering, Faculty of Engineering, UNSW January 2008
Grids, multi-clusters, NUMA systems, and ad-hoc collections of distributed computing devices all present diverse environments in which distributed computing applications can be run. Because of the diversity of features these environments provide, a distributed application must be specifically designed and optimised for the environment in which it is deployed if it is to perform well. Such optimisations generally affect the application's communication structure, its consistency protocols, and its communication protocols. This thesis explores approaches to improving the ability of distributed applications to share consistent data efficiently, and with improved functionality, over wide-area and diverse environments. We identify a fundamental separation of concerns for distributed applications and use it to propose a new model, called the view model, a hybrid, cost-conscious approach to remote data sharing. It provides the mechanisms and interconnects necessary to improve the flexibility and functionality of data sharing without defining new programming models or protocols. We employ the view model to adapt distributed applications to their run-time environment without modifying the application or inventing new consistency or communication protocols, and we explore the use of view-model properties on several programming models and their consistency protocols. In particular, we focus on programming models used in distributed-shared-memory middleware and applications, as these can benefit significantly from the properties of the view model. Our evaluation demonstrates the benefits, side effects, and potential shortcomings of the view model by comparing it with traditional models when running distributed applications across several multi-cluster scenarios. In particular, we show that the view model improves the performance of distributed applications while reducing resource usage and communication overheads.
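
The separation of concerns the abstract describes can be pictured as decoupling the shared data from both the consistency protocol and the transport, so that either can be swapped to match the deployment environment. The C++ sketch below is one reading of that idea; the interfaces and names are hypothetical, not the thesis's actual view-model API.

    #include <cassert>
    #include <cstddef>
    #include <cstring>

    // Transport abstraction: TCP across a grid, native shared memory
    // within a NUMA box, and so on.
    struct Transport {
        virtual void broadcast(const void* buf, std::size_t len) = 0;
        virtual ~Transport() = default;
    };

    // Consistency abstraction: decides when and to whom updates propagate.
    struct ConsistencyProtocol {
        virtual void on_local_write(const void* data, std::size_t len) = 0;
        virtual ~ConsistencyProtocol() = default;
    };

    // One concrete pairing: eagerly push every write over whatever
    // transport the environment provides.
    struct EagerUpdate : ConsistencyProtocol {
        explicit EagerUpdate(Transport& t) : net(t) {}
        void on_local_write(const void* data, std::size_t len) override {
            net.broadcast(data, len);   // cost depends on chosen transport
        }
        Transport& net;
    };

    // A "view" binds application data to a protocol; swapping the protocol
    // or the transport adapts the application without modifying its code.
    class View {
    public:
        View(void* region, std::size_t len, ConsistencyProtocol& p)
            : region_(static_cast<char*>(region)), len_(len), proto_(p) {}
        void write(std::size_t off, const void* src, std::size_t n) {
            assert(off + n <= len_);
            std::memcpy(region_ + off, src, n);
            proto_.on_local_write(region_ + off, n);
        }
    private:
        char* region_;
        std::size_t len_;
        ConsistencyProtocol& proto_;
    };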
43

A runtime system for parallel programming with a shared-memory paradigm on heterogeneous distributed systems

Gisele da Silva Craveiro 31 October 2003
Advances in hardware technology are making small SMP machines (2 to 8 processors) available at ever lower cost, so incorporating such machines into clusters of PCs, or even building clusters composed entirely of SMPs, are increasingly viable alternatives for high-performance computing. The great challenge is to exploit the potential that such a collection of machines offers. One alternative is a hybrid programming paradigm that exploits the shared-memory architecture through multithreading and uses the message passing model for inter-node communication. This strategy, however, imposes an arduous and unproductive task on the application programmer. This work presents CPAR-Cluster, a runtime system that provides a shared-memory abstraction on top of a cluster composed of mono- and multiprocessor nodes. The system is implemented at the library level and requires no special resources such as dedicated hardware or operating-system modifications. The models, strategies, and implementation aspects are presented, along with results from tests of the tool, which showed the expected behavior.
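
A rough C++ sketch of what a library-level shared-variable abstraction of this kind might expose to the programmer, so an application sees one shared array instead of explicit threads-plus-messages code. All names and placeholder policies are assumptions for illustration, not CPAR-Cluster's real interface.

    #include <cstddef>
    #include <vector>

    template <typename T>
    class ClusterShared {
    public:
        explicit ClusterShared(std::size_t n) : local_(n) {}

        // Accesses look like ordinary shared-memory reads and writes; the
        // runtime decides whether they are served from local memory (same
        // SMP node) or involve messages to a remote node.
        T read(std::size_t i) {
            if (!is_local(i)) fetch_from_owner(i);   // hidden messaging
            return local_[i];
        }
        void write(std::size_t i, const T& v) {
            local_[i] = v;
            if (!is_local(i)) push_to_owner(i);      // hidden messaging
        }

    private:
        bool is_local(std::size_t) const { return true; }  // placeholder policy
        void fetch_from_owner(std::size_t) {}              // placeholder comms
        void push_to_owner(std::size_t) {}
        std::vector<T> local_;
    };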
44

Towards Low-Complexity Scalable Shared-Memory Architectures

Zeffer, Håkan January 2006
Plentiful research has addressed low-complexity software-based shared-memory systems since the idea was first introduced more than two decades ago. However, software-coherent systems have not been very successful in the commercial marketplace. We believe there are two main reasons for this: lack of performance and/or lack of binary compatibility.

This thesis studies multiple aspects of how to design future binary-compatible high-performance scalable shared-memory servers while keeping the hardware complexity at a minimum. It starts with a software-based distributed shared-memory system relying on no specific hardware support and gradually moves towards architectures with simple hardware support.

The evaluation is made in a modern chip-multiprocessor environment with both high-performance compute workloads and commercial applications. It shows that implementing the coherence-violation detection in hardware while solving the inter-chip coherence in software allows for high-performing binary-compatible systems with very low hardware complexity. Our second-generation hardware-software hybrid performs on par with, and often better than, traditional hardware-only designs.

Based on our results, we conclude that it is not only possible to design simple systems while maintaining performance and the binary-compatibility envelope; it is often possible to get better performance than in traditional and more complex designs.

We also explore two new techniques for evaluating a new shared-memory design throughout this work: adjustable simulation fidelity and statistical multiprocessor cache modeling.
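
A minimal all-software C++ sketch of the access-check mechanism such designs build on: each memory block carries a local coherence state, and an access whose state is insufficient traps to a software handler that runs the inter-chip protocol. In the thesis's hybrid designs the check itself is performed by simple hardware; everything here, including the names, is an illustrative assumption.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    enum class BlockState : std::uint8_t { Invalid, Shared, Modified };

    constexpr std::size_t kBlockShift = 6;     // 64-byte coherence blocks
    constexpr std::size_t kBlocks     = 1024;

    std::array<BlockState, kBlocks> state{};               // all Invalid
    std::array<std::uint8_t, kBlocks << kBlockShift> mem{};

    // Runs the inter-chip protocol in software: fetch the block or
    // invalidate remote copies over the interconnect (omitted), then
    // update the local state so subsequent accesses pass the check.
    void coherence_handler(std::size_t blk, bool for_write) {
        state[blk] = for_write ? BlockState::Modified : BlockState::Shared;
    }

    std::uint8_t checked_load(std::size_t addr) {
        std::size_t blk = (addr >> kBlockShift) % kBlocks;
        if (state[blk] == BlockState::Invalid)   // coherence violation:
            coherence_handler(blk, false);       // trap to software
        return mem[addr % mem.size()];
    }

    void checked_store(std::size_t addr, std::uint8_t v) {
        std::size_t blk = (addr >> kBlockShift) % kBlocks;
        if (state[blk] != BlockState::Modified)  // need exclusive ownership
            coherence_handler(blk, true);
        mem[addr % mem.size()] = v;
    }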
46

Iterative and Adaptive PDE Solvers for Shared Memory Architectures

Löf, Henrik January 2006
Scientific computing is used frequently in an increasing number of disciplines to accelerate scientific discovery. Many such computing problems involve the numerical solution of partial differential equations (PDE). In this thesis we explore and develop methodology for high-performance implementations of PDE solvers for shared-memory multiprocessor architectures. We consider three realistic PDE settings: solution of the Maxwell equations in 3D using an unstructured grid and the method of conjugate gradients, solution of the Poisson equation in 3D using a geometric multigrid method, and solution of an advection equation in 2D using structured adaptive mesh refinement. We apply software optimization techniques to increase both parallel efficiency and the degree of data locality. In our evaluation we use several different shared-memory architectures ranging from symmetric multiprocessors and distributed shared-memory architectures to chip-multiprocessors. For distributed shared-memory systems we explore methods of data distribution to increase the amount of geographical locality. We evaluate automatic and transparent page migration based on runtime sampling, user-initiated page migration using a directive with an affinity-on-next-touch semantic, and algorithmic optimizations for page-placement policies. Our results show that page migration increases the amount of geographical locality and that the parallel overhead related to page migration can be amortized over the iterations needed to reach convergence. This is especially true for the affinity-on-next-touch methodology whereby page migration can be initiated at an early stage in the algorithms. We also develop and explore methodology for other forms of data locality and conclude that the effect on performance is significant and that this effect will increase for future shared-memory architectures. Our overall conclusion is that, if the involved locality issues are addressed, the shared-memory programming model provides an efficient and productive environment for solving many important PDE problems.
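
The first-touch placement idiom gives a concrete picture of the geographical-locality issue the abstract addresses: on many NUMA systems a page is placed on the node of the thread that first writes it, so initializing data with the same parallel schedule as the solver co-locates pages with the threads that later use them. The C++/OpenMP sketch below is a generic NUMA example, not code from the thesis, whose affinity-on-next-touch directive goes further by migrating pages at run time.

    #include <memory>

    void solve(double* u, const double* f, long n) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            u[i] += 0.5 * f[i];              // same partitioning as the init
    }

    int main() {
        const long n = 1L << 24;
        // Allocate without value-initialization: a std::vector would touch
        // every page on the master thread during construction, defeating
        // first-touch placement.
        std::unique_ptr<double[]> u(new double[n]);
        std::unique_ptr<double[]> f(new double[n]);
        // First touch in parallel, with the same static schedule the solver
        // uses, so each thread's pages land on its own NUMA node.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i) { u[i] = 0.0; f[i] = 1.0; }
        for (int iter = 0; iter < 100; ++iter) solve(u.get(), f.get(), n);
    }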
47

Contributions to the efficiency of total order broadcast protocols and to RFID-based delay-tolerant networks

Simatic, Michel 04 October 2012
In asynchronous distributed systems, the logical clock and the vector clock are two fundamental tools for managing communication and data sharing between the entities that make up such systems. The goal of this thesis is to exploit these tools from an implementation perspective. In the first part, we focus on data communication and contribute to the field of total order broadcast. We propose the trains protocol: tokens (called trains) circulate in parallel among participating processes arranged on a virtual ring, and each train carries a logical clock used to recover lost trains in the event of process failures. We prove that the trains protocol is a uniform total order broadcast protocol. We then define a new metric, throughput efficiency, and use it to show that the trains protocol outperforms, in throughput terms, the best protocols presented in the literature. The metric also provides a theoretical bound on the maximum throughput attainable by an implementation of a given broadcast protocol, making it possible to assess the quality of an implementation. Its throughput, especially for small messages, makes the trains protocol a remarkable candidate for data sharing between the cores of a processor, while its frugal use of the network makes it well suited to data replication between servers in the cloud. Part of this work was implemented in a control-command and supervision system deployed across several dozen industrial sites. In the second part, we focus on data sharing and contribute to the field of RFID. We propose a distributed shared memory based on RFID tags that dispenses with a global computer network: it relies on vector clocks and exploits the network formed by the mobile users of the distributed application, enabling those users to read the contents of remote RFID tags. Our RFID-based distributed shared memory provides an alternative to the three RFID-based architectures available in the literature, and it was implemented in a pervasive game played by about a thousand people.
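
As a sketch of the vector-clock bookkeeping such an RFID-based shared memory needs, the C++ fragment below reconciles a tag's stored entry with a mobile user's cached copy when the two meet. The structure and names are assumptions for illustration, not the thesis's actual design.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // One counter slot per writer; clocks are assumed to be equal length.
    using VectorClock = std::vector<std::uint32_t>;

    // a dominates b when it is component-wise >=, i.e. b holds no
    // information that a lacks.
    bool dominates(const VectorClock& a, const VectorClock& b) {
        for (std::size_t i = 0; i < a.size(); ++i)
            if (a[i] < b[i]) return false;
        return true;
    }

    struct TagEntry {
        std::string value;   // payload stored on the tag or cached locally
        VectorClock clock;   // version of that payload
    };

    // On scanning a physical tag, reconcile it with the user's cached
    // copy: the version with the dominating clock wins in either direction.
    // Concurrent (incomparable) versions would need an application-level
    // merge, omitted here.
    void reconcile(TagEntry& cached, TagEntry& on_tag) {
        if (dominates(on_tag.clock, cached.clock))
            cached = on_tag;             // tag is fresher: refresh the cache
        else if (dominates(cached.clock, on_tag.clock))
            on_tag = cached;             // cache is fresher: rewrite the tag
    }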
