About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations (NDLTD). Our metadata is collected from universities around the world. If you manage a university, consortium, or country archive and want to be added, details can be found on the NDLTD website.
61

An Interconnection Network Topology Generation Scheme for Multicore Systems

Phanibhushana, Bharath 01 January 2013
Multi-Processor Systems on Chip (MPSoCs), consisting of multiple processing cores connected via a Network on Chip (NoC), have gained prominence over the last decade. The most common way of mapping applications to MPSoCs is to divide the application into small tasks and represent them as a task graph, in which the edges connecting the tasks represent inter-task communication. Task scheduling maps tasks to processor cores so as to meet a specified deadline for the application (task graph). With increasing system complexity and application parallelism, task communication times are approaching task execution times. Hence the NoC, which forms the communication backbone for the cores, plays a critical role in meeting deadlines. In this thesis we present an approach to synthesize a minimal network connecting a set of cores in an MPSoC in the presence of deadlines. Given a task graph and a corresponding task-to-processor schedule, we have developed a partitioning methodology to generate an efficient interconnection network for the cores. We adopt a two-phase design flow: the first phase synthesizes the network, and the second performs a statistical analysis of the generated network. We compare our model with a simulated-annealing-based scheme, a static graph-based greedy scheme, and the standard mesh topology. The proposed solution offers significant area and performance benefits over the alternatives compared in this work.
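The synthesis step starts from two inputs: a task graph and a task-to-core schedule. A minimal sketch, assuming communication volumes are annotated on the task-graph edges and that links are bidirectional, of how these inputs collapse into the core-to-core traffic that any candidate network must carry. The function name `core_traffic` and the dictionary layout are hypothetical; this is not the thesis's partitioning methodology.

```python
from collections import defaultdict

def core_traffic(task_edges, schedule):
    """Aggregate inter-task communication into core-to-core traffic.

    task_edges: {(src_task, dst_task): volume}
    schedule:   {task: core_id}
    Returns {(core_a, core_b): volume} with core_a < core_b.
    Edges between tasks on the same core need no network link and are dropped.
    Assumption: links are bidirectional, so both directions are summed together.
    """
    traffic = defaultdict(int)
    for (src, dst), volume in task_edges.items():
        a, b = schedule[src], schedule[dst]
        if a != b:
            traffic[(min(a, b), max(a, b))] += volume
    return dict(traffic)

edges = {("t0", "t1"): 5, ("t1", "t2"): 3, ("t0", "t2"): 2, ("t2", "t3"): 4, ("t3", "t0"): 1}
sched = {"t0": 0, "t1": 0, "t2": 1, "t3": 2}
# Heaviest core pairs first: the pairs a minimal network should connect most directly.
for pair, vol in sorted(core_traffic(edges, sched).items(), key=lambda kv: -kv[1]):
    print(pair, vol)
```

A network generator can then treat the heaviest pairs as candidates for direct links and route lighter pairs through shared routers.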
62

Design and Optimization of OpenFOAM-based CFD Applications for Modern Hybrid and Heterogeneous HPC Platforms

AlOnazi, Amani 02 1900
The progress of high-performance computing platforms is dramatic, and most of the simulations carried out on these platforms result in improvements on one level yet expose shortcomings of current CFD packages. Therefore, hardware-aware design and optimization are crucial to exploiting modern computing resources. This thesis proposes optimizations aimed at accelerating numerical simulations, illustrated in OpenFOAM solvers. A hybrid MPI and GPGPU parallel conjugate gradient linear solver has been designed and implemented to solve the sparse linear algebraic kernel that derives from two CFD solvers: icoFoam, an incompressible flow solver, and laplacianFoam, which solves the Poisson equation for, e.g., thermal diffusion. A load-balancing step is applied using heterogeneous decomposition, which decomposes the computations taking into account the performance of each computing device while seeking to minimize communication. In addition, we implemented the recently developed pipelined conjugate gradient as an algorithmic improvement and parallelized it using MPI, GPGPU, and a hybrid technique. While many questions of ultimately attainable per-node performance and multi-node scaling remain, the experimental results show that the hybrid implementation of both solvers significantly outperforms state-of-the-art implementations of a widely used open-source package.
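For readers unfamiliar with the kernel, a minimal serial sketch of the standard (non-pipelined) conjugate gradient method, applied to a 1D Poisson (Laplacian) matrix as a small stand-in for the sparse systems laplacianFoam produces. The names `laplacian_1d_matvec` and `conjugate_gradient` are illustrative; this is not OpenFOAM's solver, and it omits the MPI/GPGPU distribution and the pipelined reformulation.

```python
def laplacian_1d_matvec(x):
    """y = A x for the tridiagonal 1D Laplacian A = tridiag(-1, 2, -1), zero boundary values."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A given only as a matvec function."""
    x = [0.0] * len(b)
    r = list(b)              # residual r = b - A x, with initial guess x = 0
    p = list(r)              # first search direction
    rr = dot(r, r)           # global reduction no. 1
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rr / dot(p, ap)   # global reduction no. 2
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

x = conjugate_gradient(laplacian_1d_matvec, [1.0] * 50)
```

Each iteration contains global reductions (the dot products). In a distributed run these are synchronization points; the pipelined variant reorganizes the recurrences so that the reductions can be overlapped with the matrix-vector product.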
63

Optimum Microarchitectures for Neuromorphic Algorithms

Wang, Shu January 2011
No description available.
64

Novel Methods to Improve the Energy Efficiency of Multi-core Synchronization Primitives

Vadambacheri Manian, Karthik January 2017
No description available.
65

Reducing Cache Access Time in Multicore Architectures Using Hardware and Software Techniques

Avakian, Annie 27 September 2012
No description available.
66

Architecture-Aware Mapping and Optimization on Heterogeneous Computing Systems

Daga, Mayank 06 June 2011
The emergence of scientific applications embedded with multiple modes of parallelism has made heterogeneous computing systems indispensable in high-performance computing. The popularity of such systems is evident from the fact that three of the five fastest supercomputers in the world employ heterogeneous computing, i.e., they use dissimilar computational units. A closer look at the performance of these supercomputers reveals that they achieve only around 50% of their theoretical peak performance. This suggests that applications tuned for erstwhile homogeneous computing may not be efficient on today's heterogeneous systems, and hence novel optimization strategies are required. However, optimizing an application for heterogeneous computing systems is extremely challenging, primarily due to the architectural differences between the computational units in such systems. This thesis is intended as a cookbook for optimizing applications on heterogeneous computing systems that employ graphics processing units (GPUs) as accelerators. We discuss optimization strategies for multicore CPUs as well as for the two popular GPU platforms, i.e., GPUs from AMD and NVIDIA. Optimization strategies for NVIDIA GPUs have been well studied, but when applied to AMD GPUs they fail to measurably improve performance because of differences in the underlying architecture. To the best of our knowledge, this research is the first to propose optimization strategies for AMD GPUs. Even on NVIDIA GPUs, there exists a lesser-known but extremely severe performance pitfall called partition camping, which can degrade application performance by up to seven-fold. To facilitate the detection of this phenomenon, we have developed a performance prediction model that analyzes and characterizes the effect of partition camping in GPU applications.
We have used a large-scale molecular modeling application to validate and verify all the optimization strategies. Our results show that, if appropriately optimized, AMD and NVIDIA GPUs can provide 371-fold and 328-fold improvements, respectively, over a hand-tuned, SSE-optimized serial implementation. / Master of Science
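Partition camping arises because GPU global memory is split into partitions interleaved by address, so concurrently active blocks whose requests land in the same partition contend for that partition while the others sit idle. A minimal sketch, assuming simple round-robin interleaving over 8 partitions of 256 bytes each (illustrative values, roughly those described for older NVIDIA GT200-class GPUs; they are not taken from this thesis and differ across architectures), that counts how concurrent requests spread over partitions. All names are hypothetical; this is not the thesis's prediction model.

```python
NUM_PARTITIONS = 8      # assumed number of memory partitions
PARTITION_WIDTH = 256   # assumed bytes mapped to one partition before moving to the next

def partition_of(addr):
    """Map a byte address to its partition under round-robin interleaving."""
    return (addr // PARTITION_WIDTH) % NUM_PARTITIONS

def partition_histogram(addresses):
    """Count how many of the given concurrent requests hit each partition."""
    counts = [0] * NUM_PARTITIONS
    for addr in addresses:
        counts[partition_of(addr)] += 1
    return counts

def camping_factor(counts):
    """Busiest partition's load divided by the average load (1.0 means evenly spread)."""
    return max(counts) * len(counts) / sum(counts)

# Eight concurrently active blocks, each issuing one 256-byte request.
ROW_PITCH = 2048  # bytes per matrix row (512 floats), a multiple of 8 * 256
spread = [blk * PARTITION_WIDTH for blk in range(8)]  # blocks walk along a row
camped = [blk * ROW_PITCH for blk in range(8)]        # blocks walk down a column

print(partition_histogram(spread), camping_factor(partition_histogram(spread)))
print(partition_histogram(camped), camping_factor(partition_histogram(camped)))
```

With a row pitch that is a multiple of the full interleave span, every column-wise request maps to partition 0, so one partition serves all eight blocks; padding the row pitch by one partition width is a common way to spread such accesses.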
67

Real time image processing : algorithm parallelization on multicore multithread architecture / Imagerie temps réel : parallélisation d’algorithmes sur plate-forme multi-processeurs

Mahmoudi, Ramzi 13 December 2011
Topological features of an object are fundamental in image processing. In many applications, including medical imaging, it is important to maintain or control the topology of the image. However, designing transformations that preserve both the topology and the geometric characteristics of the input image is a complex task, especially in the case of parallel processing. Parallel processing accelerates computation by sharing the workload among multiple processors. In terms of algorithm design, parallel computing strategies exploit the natural parallelism (also called the partial order of algorithms) present in an algorithm, which provides two main sources of parallelism: data parallelism and functional parallelism. In terms of architectural design, it is essential to link the spectacular evolution of parallel architectures with parallel processing. Indeed, parallelization strategies have become necessary because of considerable improvements in multiprocessing systems and the rise of multi-core processors. All these reasons make multiprocessing very practical. On SMP (shared-memory) machines, immediate sharing of data provides additional flexibility in designing such strategies and exploiting data and functional parallelism, notably as the interconnection systems between processors evolve. In this perspective, we propose a new parallelization strategy, called SD&M (Split, Distribute and Merge), that covers a large class of topological operators. SD&M was developed to provide parallel processing for many topological transformations. Based on this strategy, we propose a series of parallel topological algorithms (new or adapted versions). Our main contributions are: (1) A new approach to computing the watershed transform based on the MSF (minimum spanning forest) transform. The proposed algorithm is parallel, preserves topology, does not need prior minima extraction, and is suited to SMP machines. It uses Jean Cousty's streaming approach and requires neither a sorting step nor a hierarchical queue. This contribution followed an intensive study of existing watershed transform algorithms in the discrete case. (2) A similar study of thinning transforms, covering sixteen parallel thinning algorithms that preserve topology. In addition to performance criteria, we introduce two qualitative criteria to compare and classify them; these criteria are based on the relationship between the medial axis and the obtained homotopic skeleton. After this classification, we sought better results through a new adapted version of Michel Couprie's filtered thinning algorithm, obtained by applying our strategy. (3) An enhanced computation method for topological smoothing that combines parallel computation of the Euclidean distance transform (using Meijster's algorithm) with parallel thinning–thickening processes (using the adapted version of Couprie's algorithm mentioned above).
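Contribution (3) relies on Meijster's algorithm, whose two phases are independent per column and then per row, which makes it a natural fit for a split, distribute, and merge decomposition. A minimal sketch of the squared Euclidean distance transform organized that way, assuming a binary image given as nested lists of booleans with `True` marking feature pixels. The thread-pool decomposition only illustrates the split/distribute/merge steps; it is not the thesis's SD&M implementation, and CPython threads will not speed up this pure-Python loop. Function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def _column_pass(img, x, inf):
    """Phase 1: vertical distance from each pixel in column x to the nearest feature pixel."""
    rows = len(img)
    g = [0] * rows
    g[0] = 0 if img[0][x] else inf
    for y in range(1, rows):                 # top-to-bottom scan
        g[y] = 0 if img[y][x] else 1 + g[y - 1]
    for y in range(rows - 2, -1, -1):        # bottom-to-top scan
        if g[y + 1] < g[y]:
            g[y] = 1 + g[y + 1]
    return g

def _row_pass(g):
    """Phase 2: lower envelope of the parabolas f(x, i) = (x - i)^2 + g[i]^2 along one row."""
    n = len(g)

    def f(x, i):
        return (x - i) ** 2 + g[i] ** 2

    def sep(i, u):
        # Last x at which site i is at least as close as site u (floor division).
        return (u * u - i * i + g[u] ** 2 - g[i] ** 2) // (2 * (u - i))

    s, t, q = [0] * n, [0] * n, 0            # envelope sites, segment starts, stack top
    for u in range(1, n):
        while q >= 0 and f(t[q], s[q]) > f(t[q], u):
            q -= 1
        if q < 0:
            q, s[0] = 0, u
        else:
            w = 1 + sep(s[q], u)
            if w < n:
                q += 1
                s[q], t[q] = u, w
    out = [0] * n
    for u in range(n - 1, -1, -1):           # read the envelope back right to left
        out[u] = f(u, s[q])
        if u == t[q]:
            q -= 1
    return out

def squared_edt(img, workers=4):
    """Squared Euclidean distance from every pixel to the nearest True pixel."""
    rows, cols = len(img), len(img[0])
    inf = rows + cols                        # exceeds any real distance in the image
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Split + distribute: columns are independent in phase 1.
        g_cols = list(pool.map(lambda x: _column_pass(img, x, inf), range(cols)))
        # Merge: regroup the column results by row.
        g_rows = [[g_cols[x][y] for x in range(cols)] for y in range(rows)]
        # Split + distribute again: rows are independent in phase 2; merge by collecting.
        return list(pool.map(_row_pass, g_rows))
```

Taking a square root of each entry gives the exact Euclidean distance; keeping squared integers avoids floating point until the very end.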
68

Anfragebearbeitung auf Mehrkern-Rechnerarchitekturen / Query Processing on Multi-Core Computer Architectures

Huber, Frank 24 May 2012
The upcoming generation of many-core architectures poses several new challenges for software development: software design and implementation have to change from sequential execution to highly parallel execution in order to take full advantage of the steadily growing number of cores on a single processor. In this thesis, we investigate such highly parallel program execution in the context of relational database management systems (RDBMSs). We consider the complete query-processing pipeline and identify four problem areas that are crucial for efficient parallel query processing on many-core architectures: hardware, the physical data model, query execution, and query optimization. We present a framework that covers all four areas in turn. First, we give a detailed survey of computer hardware, with a special focus on memory and processors. Based on this survey, we propose a hardware model that abstracts from the specific hardware architecture; the abstraction aims to simplify software development on many-core hardware without sacrificing functionality or performance. Building on the hardware model, we investigate physical data models and evaluate how the physical data model can support optimal query execution by providing efficient, parallelizable data structures. In addition, we design a new index structure that exploits data-parallel execution through SIMD operations. The next layer of the framework is query execution, for which we present a new task-based query execution model that allows lightweight parallelism. Finally, we cover query optimization by describing approaches for optimizing resource utilization from both a query-local and a query-global point of view.
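The abstract does not detail the SIMD-based index. As a generic illustration of the idea only (not the thesis's index structure), a minimal sketch of k-ary search over a sorted array: at each step the key is compared against k - 1 separators at once, which on real hardware maps to a single SIMD comparison followed by a population count. Plain Python stands in for the vector instruction, and the name `kary_search` is hypothetical.

```python
def kary_search(keys, key, k=4):
    """Return the first index i with keys[i] >= key (the same result as bisect.bisect_left).

    keys must be sorted ascending; k is the fan-out (k - 1 separators per step).
    """
    lo, hi = 0, len(keys)
    while hi - lo > k:
        step = (hi - lo) // k
        seps = [lo + step * j for j in range(1, k)]   # k - 1 evenly spaced separators
        # One "vector" comparison: how many separators are smaller than the key?
        n_less = sum(keys[p] < key for p in seps)
        if n_less < k - 1:
            hi = seps[n_less]            # answer is at or before this separator
        if n_less > 0:
            lo = seps[n_less - 1] + 1    # answer is after the last smaller separator
    # Final short range: a linear scan (again one SIMD compare in practice).
    return lo + sum(keys[i] < key for i in range(lo, hi))

data = sorted((i * 37) % 101 for i in range(200))
print(kary_search(data, 50))
```

Because the separators are sorted, the count of smaller separators directly selects the child range, so no branch per comparison is needed; that branch-free step is what makes the layout SIMD-friendly.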
69

Ambientes de execução para o modelo de atores em plataformas hierárquicas de memória compartilhada com processadores de múltiplos núcleos / Dealing with actor runtime environments on hierarchical shared memory multi-core platforms

Francesquini, Emilio de Camargo 16 May 2014
The actor model is used in several mission-critical systems, such as those supporting WhatsApp and Facebook Chat. These systems serve thousands of simultaneously connected users under strict performance and interactivity constraints, and therefore demand substantial computing resources, usually provided by multi-processor, multi-core platforms. Non-Uniform Memory Access (NUMA) architectures account for an important share of these platforms. Yet research on the suitability of current actor runtime environments for these machines is very limited: current runtime environments generally assume a flat (uniform) memory space, which can cause serious performance problems. In this thesis we study the challenges that hierarchical shared-memory multi-core platforms present to actor runtime environments, in particular memory management, scheduling, and load balancing. We also analyze and characterize actor-based applications; this analysis highlighted that benchmarks and applications create peculiar communication structures among actors. We argue that understanding these structures, together with knowledge of the underlying hardware architecture, can be used to improve application performance. As a proof of concept, we evaluated the communication graphs and implemented our proposal in a real actor runtime environment, the Erlang Virtual Machine (VM). Concurrency in Erlang is based on the actor model, and the language has a clear and consistent syntax for actor handling. Our modifications to the Erlang VM significantly improved the performance of some applications through better communication affinity among actors and better-informed scheduling and load-balancing decisions.
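A minimal sketch of communication-aware placement in the spirit of this abstract, assuming the runtime has already recorded message counts between actor pairs: actors joined by the heaviest edges are grouped first (up to a per-node capacity), and each group is then placed on the least-loaded NUMA node. The function `place_actors`, its greedy union-find heuristic, and the capacity limit are hypothetical illustrations, not the modifications made to the Erlang VM.

```python
from collections import defaultdict

def place_actors(messages, num_nodes, capacity):
    """Greedy affinity-based placement of actors onto NUMA nodes.

    messages: {(actor_a, actor_b): message_count}
    capacity: maximum number of actors per NUMA node (assumed limit)
    Returns {actor: node}. Raises ValueError if a group cannot be placed.
    """
    parent, size = {}, {}

    def find(a):
        parent.setdefault(a, a)
        size.setdefault(a, 1)
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    # 1. Merge actors along the heaviest communication edges first (ties broken by name).
    for (a, b), _count in sorted(messages.items(), key=lambda kv: (-kv[1], kv[0])):
        ra, rb = find(a), find(b)
        if ra != rb and size[ra] + size[rb] <= capacity:
            parent[rb] = ra
            size[ra] += size[rb]

    groups = defaultdict(list)
    for actor in list(parent):
        groups[find(actor)].append(actor)

    # 2. Place the largest groups first, each on the currently least-loaded node.
    load = [0] * num_nodes
    placement = {}
    for members in sorted(groups.values(), key=lambda m: (-len(m), sorted(m))):
        node = min(range(num_nodes), key=lambda n: (load[n], n))
        if load[node] + len(members) > capacity:
            raise ValueError("not enough NUMA node capacity for this grouping")
        for actor in members:
            placement[actor] = node
        load[node] += len(members)
    return placement

msgs = {("a1", "a2"): 10, ("a2", "a3"): 10, ("a1", "a3"): 10,
        ("b1", "b2"): 10, ("b2", "b3"): 10, ("b1", "b3"): 10,
        ("a1", "b1"): 1}
print(place_actors(msgs, num_nodes=2, capacity=3))
```

The weak `a1`/`b1` edge is the last one considered and would overflow the capacity, so each chatty cluster stays on its own node and only one message pair crosses NUMA nodes.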