• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 145
  • 30
  • 21
  • 15
  • 6
  • 6
  • 3
  • 2
  • 1
  • 1
  • 1
  • 1
  • Tagged with
  • 265
  • 76
  • 50
  • 50
  • 48
  • 38
  • 35
  • 35
  • 33
  • 32
  • 31
  • 30
  • 30
  • 29
  • 27
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Providing Support for the Movidius Myriad1 Platform in the SkePU Skeleton Programming Framework

Cuello, Rosandra January 2014 (has links)
The Movidius Myriad1 Platform is a multicore embedded platform primed to offer high performance and power efficiency for computer vision applications in mobile devices. The challenges of programming multicore environments are well known and skeleton programming offers a high-level programming alternative for parallel computing, intended to hide the complexities of the system from the programmer. The SkePU Skeleton Programming Framework includes backend implementations for CPU and GPU systems and it has the capacity to support more platforms by extending its backend implementations. With this master thesis project we aim to extend the SkePU Skeleton Programming Framework to provide support for execution in the Movidius Myriad1 embedded platform. Our SkePU backend for Myriad1 consists on a set of macros and functions to compose the different elements of a Myriad1 application, data communication structures to exchange data between the host systems and Myriad1, and a helper script and auxiliary files to generate a Myriad1 application.Evaluation and testing demonstrate that our backend is usable, however further optimizations are needed to obtain good performance that would make it practical to use in real life applications, particularly when it comes to data communication. As part of this project, we have outlined some improvements that could be applied to obtain better performance overall in the future, addressing the issues found with the methods of data communication.
22

Efficiently mapping high-performance early vision algorithms onto multicore embedded platforms

Apewokin, Senyo 09 January 2009 (has links)
The combination of low-cost imaging chips and high-performance, multicore, embedded processors heralds a new era in portable vision systems. Early vision algorithms have the potential for highly data-parallel, integer execution. However, an implementation must operate within the constraints of embedded systems including low clock rate, low-power operation and with limited memory. This dissertation explores new approaches to adapt novel pixel-based vision algorithms for tomorrow's multicore embedded processors. It presents : - An adaptive, multimodal background modeling technique called Multimodal Mean that achieves high accuracy and frame rate performance with limited memory and a slow-clock, energy-efficient, integer processing core. - A new workload partitioning technique to optimize the execution of early vision algorithms on multi-core systems. - A novel data transfer technique called cat-tail dma that provides globally-ordered, non-blocking data transfers on a multicore system. By using efficient data representations, Multimodal Mean provides comparable accuracy to the widely used Mixture of Gaussians (MoG) multimodal method. However, it achieves a 6.2x improvement in performance while using 18% less storage than MoG while executing on a representative embedded platform. When this algorithm is adapted to a multicore execution environment, the new workload partitioning technique demonstrates an improvement in execution times of 25% with only a 125 ms system reaction time. It also reduced the overall number of data transfers by 50%. Finally, the cat-tail buffering technique reduces the data-transfer latency between execution cores and main memory by 32.8% over the baseline technique when executing Multimodal Mean. This technique concurrently performs data transfers with code execution on individual cores, while maintaining global ordering through low-overhead scheduling to prevent collisions.
23

Scratchpad Management in Software Managed Manycore Architectures

January 2017 (has links)
abstract: Caches have long been used to reduce memory access latency. However, the increased complexity of cache coherence brings significant challenges in processor design as the number of cores increases. While making caches scalable is still an important research problem, some researchers are exploring the possibility of a more power-efficient SRAM called scratchpad memories or SPMs. SPMs consume significantly less area, and are more energy-efficient per access than caches, and therefore make the design of on-chip memories much simpler. Unlike caches, which fetch data from memories automatically, an SPM requires explicit instructions for data transfers. SPM-only architectures are thus named as software managed manycore (SMM), since the data movements of such architectures rely on software. SMM processors have been widely used in different areas, such as embedded computing, network processing, or even high performance computing. While SMM processors provide a low-power platform, the hardware alone does not guarantee power efficiency, if applications on such processors deliver low performance. Efficient software techniques are therefore required. A big body of management techniques for SMM architectures are compiler-directed, as inserting data movement operations by hand forces programmers to trace flow of data, which can be error-prone and sometimes difficult if not impossible. This thesis develops compiler-directed techniques to manage data transfers for embedded applications on SMMs efficiently. The techniques analyze and find out the proper program points and insert data movement instructions accordingly. The techniques manage code, stack and heap data of applications, and reduce execution time by 14%, 52% and 80% respectively compared to their predecessors on typical embedded applications. On top of managing local data, a technique is also developed for shared data in SMM architectures. Experimental results show it achieves more than 2X speedup than the previous technique on average. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2017
24

Multicore Scalability Through Asynchronous Work

Mathew, Ajit 13 January 2020 (has links)
With the end of Moore's Law, computer architects have turned to multicore architecture to provide high performance. Unfortunately, to achieve higher performance, multicores require programs to be parallelized which is an untamed problem. Amdahl's law tells that the maximum theoretical speedup of a program is dictated by the size of the non-parallelizable section of a program. Hence to achieve higher performance, programmers need to reduce the size of sequential code in the program. This thesis explores asynchronous work as a means to reduce sequential portions of program. Using asynchronous work, a programmer can remove tasks which do not affect data consistency from the critical path and can be performed using background thread. Using this idea, the thesis introduces two systems. First, a synchronization mechanism, Multi-Version Read-Log-Update(MV-RLU), which extends Read-Log-Update (RLU) through multi-versioning. At the core of MV-RLU design is a concurrent garbage collection algorithm which reclaims obsolete versions asynchronously reducing blocking of threads. Second, a concurrent and highly scalable index-structure called Hydralist for multi-core. The key idea behind design of Hydralist is that an index-structure can be divided into two component (search layer and data layer) and updates to data layer can be done synchronously while updates to search layer can be propagated asynchronously using background threads. / Master of Science / Up until mid-2000s, Moore's law predicted that performance CPU doubled every two years. This is because improvement in transistor technology allowed smaller transistor which can switch at higher frequency leading to faster CPU clocks. But faster clock leads to higher heat dissipation and as chips reached their thermal limits, computer architects could no longer increase clock speeds. Hence they moved to multicore architecture, wherein a single die contains multiple CPUs, to allow higher performance. Now programmers are required to parallelize their code to take advangtage of all the CPUs in a chip which is a non trivial problem. The theoretical speedup achieved by a program on multicore architecture is dictated by Amdahl's law which describes the non parallelizable code in a program as the limiting factor for speedup. For example, a program with 99% parallelizable code can achieve speedup of 20 whereas a program with 50% parallelizable code can only achieve speedup of 2. Therefore to achieve high speedup, programmers need to reduce size of serial section in their program. One way to reduce sequential section in a program is to remove non-critical task from the sequential section and perform the tasks asynchronously using background thread. This thesis explores this technique in two systems. First, a synchronization mechanism which is used co-ordinate access to shared resource called Multi-Version Read-Log-Update (MV-RLU). MV-RLU achieves high performance by removing garbage collection from critical path and performing it asynchronously using background thread. Second, an index structure, Hydralist, which based on the insight that an index structure can be decomposed into two components, search layer and data layer, and decouples updates to both the layer which allows higher performance. Updates to search layer is done synchronously while updates to data layer is done asynchronously using background threads. Evaluation shows that both the systems perform better than state-of-the-art competitors in a variety of workloads.
25

Optimal Implementation of Simulink Models on Multicore Architectures with Partitioned Fixed Priority Scheduling

Bansal, Shamit 02 August 2018 (has links)
Model-based design based on the Simulink modeling formalism and the associated toolchain has gained its popularity in the development of complex embedded control systems. However,the current research on software synthesis for Simulink models has a critical gap for providing a deterministic, semantics-preserving implementation on multicore architectures with partitioned fixed-priority scheduling. In this thesis, we propose to judiciously assign task offset, task priority, and task communication mechanism, to avoid simultaneous access to shared memory by tasks on different cores, to preserve the model semantics, and to optimize the control performance. We develop two approaches to solve the problem: (a) a mixed integer linear programming (MILP) formulation; and (b) a problem specific exact algorithm that may run several magnitudes faster than MILP. / Master of Science / To save development time and money, automotive industries have been developing models using software, before implementing them directly on hardware. For reliability, the model generated from the software tool should behave in a well defined manner, coherent to the ideal design of the model. While the current tools are able to generate this reliable model for a single processor system, they are not able to do so for a system with multiple processors. When two or more processors contend to access the same resource at the same time, the existing tools are unable to provide a well defined execution order in their model. Since modern embedded systems need multiple processors to meet their increasing performance demands, it is imperative that the software tools scale up to multiple processors as well. In this work, we seek to bridge this gap by presenting two solutions that generate a deterministic software implementation of a system with multiple processors. In our solutions, we generate a model with well defined execution order by ensuring that at any given time, only one processor accesses a given resource. Furthermore, apart from ensuring determinism, we also improve upon the performance of the generated model by ensuring that there is minimal end-to-end latency in the system.
26

A Scalable, Memory Efficient Multicore TEIRESIAS Implementation

Nau, Lee J. 26 July 2011 (has links)
No description available.
27

Scalable and High Performance Collective Communication for Next Generation Multicore Infiniband Clusters

Mamidala, Amith Rajith 24 June 2008 (has links)
No description available.
28

Etude et amélioration de l'exploitation des architectures NUMA à travers des supports exécutifs / Studying and improving the use of NUMA architectures through runtime systems

Virouleau, Philippe 05 June 2018 (has links)
L'évolution du calcul haute performance est aujourd'hui dirigée par les besoins des applications de simulation numérique.Ces applications sont exécutées sur des supercalculateurs qui peuvent proposer plusieurs milliers de cœurs, et qui sont découpés en un très grand nombre de nœuds de calcul ayant eux un nombre de cœurs beaucoup plus faible.Chacun de ces nœuds de calcul repose sur une architecture à mémoire partagée, dont la mémoire est découpée en plusieurs blocs physiques différents : cela implique un temps d'accès dépendant à la fois de la donnée accédée ainsi que du processeur y accédant.On appelle ce genre d'architectures NUMA (pour emph{Non Uniform Memory Access}).La manière actuelle de les exploiter tend vers l'utilisation d'un modèle de programmation à base de tâches, qui permet de traiter des programmes irréguliers au delà du simple parallélisme de boucle.L'exploitation efficace des machines NUMA est critique pour l'amélioration globale des performances des supercalculateurs.Cette thèse a été axée sur l'amélioration des techniques usuelles pour leur exploitation : elle propose une réponse au compromis qu'il faut faire entre localité des données et équilibrage de charge, qui sont deux points critiques dans l'ordonnancement d'applications.Les contributions de cette thèse peuvent se découper en deux parties : une partie dédiée à fournir au programmeur les moyens de comprendre, analyser, et mieux spécifier le comportement des parties critiques de son application, et une autre partie dédiée à différentes améliorations du support exécutif.Cette seconde partie a été évaluée sur différentes applications, ce qui a permis de montrer des gains de performances significatifs. / Nowadays the evolution of High Performance Computing follows the needs of numerical simulations.These applications are executed on supercomputers which can offer several thousands of cores, split into a large number of computing nodes, which possess a relatively low number of cores.Each of these nodes consists of a shared memory architecture, in which the memory is physically split into several distinct blocks: this implies that the memory access time depends both on which data is accessed, and on which core tries to access it.These architectures are named NUMA (for Non Uniform Memory Access).The current way to exploit them tends to be through a tasks-based programming model, which can handle irregular applications beyond a simple loop-based parallelism.Efficient use of NUMA architectures is critical for the overall performance improvements of supercomputers.This thesis has been targetted at improving common techniques for their exploitation: it proposes an answer to the tradeoff that has to be made between data locality and load balancing, that are two critical aspects of applications scheduling.Contributions of this thesis can be split into two parts: the first part is dedicated to providing the programmer with means to understand, analyze, and better characterize the behavior of their applications' critical parts, and the second part is dedicated to several improvements made to the runtime systems.This last part has been evaluated on various applications, and has shown some significant performance gains.
29

Análisis de rendimiento y optimización de algoritmos paralelos Best-First Search sobre multicore y cluster de multicore

Sanz, Victoria María January 2015 (has links)
El objetivo general de esta tesis se centra en la investigación y desarrollo de algoritmos paralelos de búsqueda en grafos best-first search para arquitecturas multicore y cluster de multicore, que mejoran los existentes y se utilizan para resolver problemas de optimización combinatoria y de planificación, acompañado de un análisis de rendimiento (speedup, eficiencia, escalabilidad) de los mismos. La temática propuesta es de interés en la actualidad por la complejidad computacional de dichos algoritmos de búsqueda y las posibilidades que brindan las arquitecturas mencionadas. Los algoritmos presentados en esta tesis pueden aplicarse para resolver problemas reales como planificación de rutas óptimas, navegación automática de un robot o vehículo, alineamiento óptimo de secuencias, entre otros. Los temas de investigación derivados son múltiples y se refieren tanto a la paralelización de algoritmos sobre (a) arquitecturas de memoria compartida, como son los multicore (b) arquitecturas de memoria distribuida, como son los clusters (c) y también sobre arquitecturas híbridas, tal es el caso de los clusters de multicore. El aporte de la tesis es el desarrollo de dos algoritmos paralelos best-first-search propios, uno apto para su ejecución sobre máquinas de memoria compartida (multicore) y otro apto para máquinas de memoria distribuida (cluster), basados en el algoritmo HDA* (Hash Distributed A*), en los cuales se incluyen técnicas originales que optimizan su rendimiento. Asimismo, se presenta un análisis de rendimiento de los algoritmos desarrollados a medida que escala la carga de trabajo y la arquitectura paralela subyacente. Para finalizar, se compara la memoria consumida por ambos algoritmos y el rendimiento alcanzado cuando se los ejecuta sobre una máquina multicore; estos análisis presentan originalidad en el área. Los resultados arrojados indican que se obtendría un beneficio al convertir HDA* en una aplicación híbrida, cuando la arquitectura subyacente es un cluster de multicore, por lo que se sientan las bases para éste algoritmo híbrido.
30

Melhoria de desempenho da máquina virtual Java na plataforma Cell B.E. / Java virtual machine performance improvement in Cell B.E. architecture

Firmino, Raoni Fassina 16 August 2018 (has links)
Orientador: Rodolfo Jardim de Azevedo / Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação / Made available in DSpace on 2018-08-16T21:29:21Z (GMT). No. of bitstreams: 1 Firmino_RaoniFassina_M.pdf: 582747 bytes, checksum: c50225f2dc75c4235a785d90a82d71b2 (MD5) Previous issue date: 2010 / Resumo: Esta dissertação concentra-se no atual momento de transição entre as atuais e as novas arquiteturas de processadores, oferecendo uma alternativa para minimizar o impacto desta mudança. Para tal utiliza-se a plataforma Java, que possibilita que o desenvolvimento de aplicações seja independente da arquitetura em que serão executadas. Considerando a arquitetura Cell B.E. como uma nova plataforma que promete desempenho elevado, este trabalho propõe melhorias na Máquina Virtual Java que propiciem um ganho de desempenho na execução de aplicações Java executadas sobre o processador Cell. O objetivo proposto é atingido por meio da utilização do ambiente disponível na própria plataforma Java, o Java Native Interface (JNI), para a implementação de interfaces entre bibliotecas nativas construídas para a arquitetura Cell - com a intenção de obter o máximo desempenho possível - e as aplicações Java. É proposto um modelo para porte e criação das interfaces para bibliotecas e mostra-se a viabilidade da abordagem proposta através de implementações de bibliotecas selecionadas, consolidando a metodologia utilizada. Duas bibliotecas foram portadas completamente como prova de conceito, uma multiplicação de matrizes grandes e o algoritmo RC5. A multiplicação de matrizes obteve um desempenho e escalablidade comparável ao código original em C e em escala muitas vezes superior ao código JNI para arquitetura x86 a ao código Java executando em arquiteturas x86 e Cell. O RC5 executou apenas aproximadamente 0,3 segundos mais lento que o código C original (perda citada em segundos pois se manteve constante independente do tempo levado para as diferentes configurações de execução) / Abstract: This dissertation focuses on the present moment of transition between the current and new processor architectures, offering an alternative to minimize the impact of this change. For this, we use the Java platform, which enables an architecture-independent application development. Considering the Cell BE architecture as a new platform that promises high performance, this paper proposes improvements in the Java Virtual Machine that provide performance gains in the execution of Java applications running on the Cell processor. The proposed objective is achieved through the use of the environment available on the Java platform itself, the Java Native Interface (JNI), to implement interfaces between native libraries built for the Cell architecture - with the intention of obtaining the maximum possible performance - and the Java applications. It is proposed a model to port and build interfaces to libraries and it shows the viability of the proposed methodology with the implementation of selected libraries, consolidating the used methodology. Two libraries were completely ported as proof of concept, a multiplication of large matrices and a RC5 algorithm implementation. The matrices multiplication achieved scalability and performance in the same basis as the native implementation and incomparable with JNI implementation targering x86 architecture and Java implementation running in x86 and Cell architectures. The RC5 was just 0.3 seconds slower than the original C code (the loss is put in seconds since it was constant, independent of the execution time taken by different configurations of execution) / Mestrado / Computação / Mestre em Ciência da Computação

Page generated in 0.0779 seconds