  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Podpora DMA pro rodinu mikrokontrolerů HCS08 / DMA Support for HCS08 Microcontrollers Family

Novosád, Adrián January 2013
Embedded systems are dedicated to specific tasks, so design engineers can optimize them to reduce product size and cost and to increase reliability and performance. However, as a result of these optimizations, some architectures may lack commonly used technologies such as direct memory access (DMA). The HCS08 family of microcontrollers is one such case. The main theme of this work is the design of a DMA controller that can be added to the HCS08 family of microcontrollers.
12

Optimizing Memory Systems for High Efficiency in Computing Clusters

Liu, Wenjie January 2022
DRAM-based memory systems suffer from increasingly severe row buffer interference, which causes significant performance degradation and power consumption. As DRAM scales, the overheads of row buffer interference grow worse because of higher row activation and precharge latency. Clusters have been a prevalent and successful computing framework for processing large amounts of data, thanks to their distributed and parallel working paradigm. A task submitted to a cluster is typically divided into a number of subtasks, which are assigned to different work nodes that run the same code but each process a different, equal-sized portion of the dataset. Because of heterogeneity, work nodes finish their subtasks at different rates, so stragglers can easily slow down the entire job. As problem complexity increases, more irregular applications are deployed on high-performance clusters to exploit the parallel working paradigm, and they yield irregular memory access behaviors across nodes. However, the irregularity of memory access behaviors has not been studied comprehensively, which results in low utilization of the integrated hybrid memory system composed of stacked DRAM and off-chip DRAM. This dissertation presents our research results on these three challenges, with the goal of optimizing the memory system for high efficiency in computing clusters.
To address the low row buffer utilization caused by row buffer interference, we propose the Row Buffer Cache (RBC) architecture, which efficiently mitigates row buffer interference overheads. At the core of the RBC architecture, DRAM pages with good locality are cached and thereby escape row buffer interference. The RBC architecture significantly reduces the overheads caused by row activation and precharge, and thus improves overall system performance and energy efficiency. We evaluate RBC using SPEC CPU2006 on a DDR4 memory, comparing it against a commodity baseline memory system and the state-of-the-art methods DICE and Bingo. Results show that RBC improves memory performance by up to 2.24X (16.1% on average) and reduces overall memory energy by up to 68.2% (23.6% on average) in single-core simulations. In multi-core simulations, RBC increases performance by up to 1.55X (16.7% on average) and reduces energy by up to 35.4% (21.3% on average). Compared with the state-of-the-art methods, RBC outperforms DICE and Bingo by 8% and 5.1% on average in the single-core scenario, and by 10.1% and 4.7% in the multi-core scenario.
To relax the straggling effect observed in clusters, we aim to speed up straggling work nodes, and thus the overall processing, by leveraging the performance variation that the nodes exhibit. We propose StragglerHelper, which conveys the memory access characteristics experienced by the forerunner to the stragglers, so that the stragglers can be sped up through accurately informed memory prefetching. A Progress Monitor supervises the progress of each work node and passes the forerunner's memory access patterns to the straggling nodes. Our evaluation with SPEC MPI 2007 and BigDataBench on a cluster of 64 work nodes shows that StragglerHelper improves the execution time of stragglers by up to 99.5% (61.4% on average), contributing to an overall improvement of the entire cluster of up to 46.7% (9.9% on average) compared with the baseline cluster.
To address the performance differences in irregular applications, we devise a novel method called Similarity-Managed Hybrid Memory System (SM-HMS), which improves hybrid memory system performance by leveraging the memory access similarity among nodes in a cluster. SM-HMS comprises two techniques: Memory Access Similarity Measuring and Similarity-based Memory Access Behavior Sharing. To quantify memory access similarity, the memory access behaviors of each node are vectorized, and the distance between two vectors is used as the memory access similarity. The calculated similarity is used to share memory access behaviors precisely across nodes. With the shared memory access behaviors, SM-HMS divides the stacked DRAM into two sections, the sliding window section and the outlier section. The shared memory access behaviors guide replacement in the sliding window section, while the outlier section is managed in an LRU manner. Our evaluation with a set of irregular applications on various clusters of up to 256 nodes shows that SM-HMS outperforms the state-of-the-art approaches Cameo, Chameleon, and Hybrid2 in job finish time reduction by up to 58.6%, 56.7%, and 31.3% (46.1%, 41.6%, and 19.3% on average), respectively. SM-HMS also achieves up to 98.6% (91.9% on average) of the ideal hybrid memory system performance. / Computer and Information Science
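The similarity measure described in this abstract (vectorize each node's memory access behavior, then take the distance between vectors) can be sketched in a few lines. The abstract does not say how behaviors are vectorized or which distance is used, so the sketch below assumes normalized per-page access counts and Euclidean distance; `access_vector`, `distance`, and `most_similar` are hypothetical names, not SM-HMS interfaces, and the traces are made-up data.

```python
import math
from collections import Counter

def access_vector(trace, num_pages):
    """Turn a node's page-access trace into a normalized frequency vector.

    Assumption: one dimension per page, value = share of accesses to it.
    """
    counts = Counter(trace)
    total = len(trace) or 1  # avoid dividing by zero on an empty trace
    return [counts.get(page, 0) / total for page in range(num_pages)]

def distance(u, v):
    """Euclidean distance; a smaller distance means more similar behavior."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def most_similar(target, vectors):
    """Return the name of the node whose vector is closest to `target`'s."""
    others = (name for name in vectors if name != target)
    return min(others, key=lambda name: distance(vectors[target], vectors[name]))

# Toy traces of page numbers touched by three nodes (made-up data).
traces = {
    "node0": [0, 1, 1, 2, 2, 2],
    "node1": [0, 1, 1, 2, 2, 3],
    "node2": [5, 5, 6, 6, 7, 7],
}
vectors = {name: access_vector(t, num_pages=8) for name, t in traces.items()}
print(most_similar("node0", vectors))  # node1
```

In a scheme of this kind, a node would share access behavior only with peers whose distance falls below some threshold; how SM-HMS chooses that threshold is not stated in the abstract.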
13

Designing Support For MPI-2 Programming Interfaces On Modern InterConnects

Gangadharappa, Tejus A. 02 September 2009
No description available.
14

Método otimizado de arquitetura de coerência de cache baseado em sistemas embarcados multinúcleos. / Optimized method for cache coherence architecture based on multicore embedded systems.

Kofuji, Jussara Marândola 01 December 2011
This thesis presents a cache coherence architecture method specialized for embedded systems. One of its main contributions is a proposed shared-memory CMP architecture oriented by memory access patterns, together with a hybrid coherence protocol. The principal contribution is the specification of a new hardware component, called the pattern table, which is validated by a formal representation and by an implementation of the pattern table structure. Based on this table, a message transaction model was developed for the hybrid protocol that distinguishes classical messages from speculative ones. The final contribution is an analytical model of the effective performance cost of the hybrid protocol.
16

Automatic Parallel Memory Address Generation for Parallel DSP Computing

Dai, Jiehua January 2008
The concept of Parallel Vector (scratch pad) Memories (PVM) was introduced as one solution for parallel computing in DSP; it can provide parallel memory addressing efficiently with minimum latency. Parallel programming becomes more efficient with the parallel address generator for parallel vector memory (PVM) proposed in this thesis. However, because no cache hides the complexities of memory access, the cost of programming is high. To minimize the programming cost, automatic parallel memory address generation is needed to hide these complexities.
This thesis investigates methods for implementing conflict-free vector addressing algorithms on a parallel hardware structure. In particular, it matches vector addressing requirements extracted from the behaviour model to a prepared parallel memory addressing template, in order to supply data in parallel from the main memory to the on-chip vector memory.
According to the template and the usage of the main memory and the on-chip parallel vector memory, models for data pre-allocation and permutation in the scratch pad memories of an ASIP can be decided and configured. By exposing the parallel memory accesses in the source code, a memory access flow graph (MFG) is generated. The MFG is then combined with hardware information to match templates in the template library. When it matches a template, a suitable permutation equation is obtained, and a permutation table that includes the target addresses for data pre-allocation and permutation is created. It is thus possible to generate memory addresses for parallel memory accesses automatically.
A tool for achieving this goal, Permutator, was created; it is implemented in C++ combined with XML. A memory access coding template is selected, and as a result the permutation formulas are specified. The PVM address table can then be generated to perform the data pre-allocation, so that efficient parallel memory access is possible.
The results show that Permutator hides the memory access complexities, so the programming cost is reduced. It works well when each algorithm, together with its related hardware information, corresponds to a template case, so that extra memory cost is eliminated.
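As a concrete picture of what "conflict-free" vector addressing means, the sketch below uses a classical skewed storage scheme for a square matrix spread over as many memory banks as it has columns. This is a textbook technique chosen for illustration, not the permutation that Permutator derives; `NUM_BANKS`, `bank_of`, and `is_conflict_free` are hypothetical names.

```python
NUM_BANKS = 4  # assumption: N banks serving an N x N matrix

def bank_of(row, col, skewed=True):
    """Bank that holds element (row, col).

    Plain layout: bank = col, so a whole column lives in one bank.
    Skewed layout: each row is rotated by its index, bank = (row + col) mod N.
    """
    return (row + col) % NUM_BANKS if skewed else col % NUM_BANKS

def is_conflict_free(elements, skewed=True):
    """A parallel access is conflict-free if every element hits a distinct bank."""
    banks = [bank_of(r, c, skewed) for r, c in elements]
    return len(set(banks)) == len(banks)

row_access = [(1, c) for c in range(NUM_BANKS)]     # one full row
column_access = [(r, 2) for r in range(NUM_BANKS)]  # one full column

print(is_conflict_free(row_access, skewed=False))     # True
print(is_conflict_free(column_access, skewed=False))  # False
print(is_conflict_free(row_access, skewed=True))      # True
print(is_conflict_free(column_access, skewed=True))   # True
```

With this particular skew and an even number of banks, the main diagonal still conflicts (elements (0, 0) and (2, 2) both map to bank 0), which illustrates why the access pattern of each algorithm has to be matched to a suitable permutation.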
17

Optimisation des transferts de données sur systèmes multiprocesseurs sur puce / Optimizing Data Transfers for Multiprocessor Systems on Chips

Saidi, Selma 24 October 2012
Multiprocessor systems on chip (MPSoCs), such as the CELL processor or the more recent Platform 2012, are heterogeneous multi-core architectures with a powerful host processor and a computation fabric consisting of several smaller cores, whose intended role is to act as a general-purpose programmable accelerator. Computation-intensive (and parallelizable) parts of an application, initially intended to be executed by the host processor, are therefore offloaded to the multi-core fabric for execution. These parts of the application are often data intensive, operating on large arrays of data initially stored in a remote off-chip memory whose access time is about 100 times slower than that of a core's local memory. Accessing data in the off-chip memory then becomes a main performance bottleneck. A major characteristic of these platforms is a software-controlled local memory rather than a hidden cache mechanism: data movements in the memory hierarchy, typically performed by a DMA (Direct Memory Access) engine, are explicitly managed by the software. In this thesis, we attempt to optimize such data transfers in order to reduce or hide the off-chip memory latency.
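One common way to hide off-chip latency on platforms of this kind is double buffering: while a core computes on one local buffer, the DMA engine fills the other. The abstract does not state which techniques the thesis uses, so the timing model below is only an illustrative sketch; the function names and the per-block costs are assumptions, and write-back of results is ignored.

```python
def single_buffer_time(num_blocks, transfer, compute):
    """Each block is fetched, then processed; nothing overlaps."""
    return num_blocks * (transfer + compute)

def double_buffer_time(num_blocks, transfer, compute):
    """Fetch of block i+1 overlaps with compute on block i.

    The first transfer and the last compute cannot be overlapped; every
    step in between costs the slower of the two activities.
    """
    if num_blocks == 0:
        return 0
    return transfer + (num_blocks - 1) * max(transfer, compute) + compute

# Made-up costs in cycles per block.
blocks, transfer, compute = 8, 100, 120
print(single_buffer_time(blocks, transfer, compute))  # 1760
print(double_buffer_time(blocks, transfer, compute))  # 1060
```

In this model the transfer time is fully hidden whenever computing on a block takes at least as long as fetching it; otherwise the transfers dominate and only the compute time is hidden.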
18

Acesso e memória: a informação nos arquivos das arquidioceses da Paraíba e de Olinda/Recife / Access and memory: information in the archives of the archdioceses of Paraíba and Olinda/Recife

Queiroz, Anna Carla Silva de 12 April 2011
The main objective of this study was to compare the archives of the archdioceses of Paraíba and of Olinda/Recife with respect to access to information and the construction of memory. Its importance derives from a very rich body of documents: in the periods before the Proclamation of the Republic (1889), the Catholic Church produced records of birth, baptism, and property, among others. Throughout this long period of Brazilian history the padroado (patronage) regime was in force, under which the Church was tied to the State and produced and accumulated a vast cultural, social, economic, and political output; from an archival point of view, this resulted in a confluence of civil and ecclesiastical information. The methodology consisted of a diagnosis of the archives based on the recommendations of CONARQ, focusing on document types and media types; physical structure, projects, and coordination of activities; internal regulations; budget; cataloguing; accessibility; and human resources. Data were collected through a structured questionnaire and simple interviews with the notaries responsible for the collections. The results lead us to infer that the collections contrast: the Paraíba collection is organized, whereas the Pernambuco collection is in a precarious state. Only in the first case, therefore, does the archive comply with the guidelines proposed by the Pontifical Commission for the Cultural Patrimony of the Church. / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES
20

Profile guided hybrid compilation / Compilation hybride guidée pour profilage

Nunes Sampaio, Diogo 14 December 2016
The author did not provide an abstract in French. / The end of chip frequency scaling, due to heat dissipation limits, made manufacturers search for an alternative way to sustain the growth in processing capacity. The chosen solution was to increase hardware parallelism by packing multiple independent processors into a single chip, in a Multiple-Instruction Multiple-Data (MIMD) fashion, each with special instructions that operate over a vector of data in a Single-Instruction Multiple-Data (SIMD) manner. This paradigm change left software developers with the convoluted task of producing efficient and scalable applications. Programming languages and associated tools evolved to aid this task for newly developed applications, but automated optimizations that can exploit such complex new hardware for legacy, single-threaded applications are still lacking.
To apply code transformations, developers and compilers alike must ensure that doing so does not change the expected behavior of the application and produce unexpected results. But syntactically poor code, such as pointer parameters with multiple possible indirections, complex loop structures, or incomplete code, makes it very hard to extract application behavior solely from the source code, in what is called static analysis. To cope with the lack of information extracted from the source code, many tools and much research have addressed dynamic analysis, which profiles the application using run-time information to fill in what is missing. The combination of static and dynamic information to characterize an application is called hybrid analysis. This work advocates the use of hybrid analysis to enable optimizations on loops, the regions where most computation is done. It proposes a framework capable of statically applying some complex loop transformations that would previously have been considered unsafe, by guaranteeing their safe use at run time with a lightweight test.
The proposed framework uses application execution profiling to help the static loop optimizer to: 1) identify and classify program hot spots, so as to focus only on regions vital for the execution time; 2) understand the overall loop behavior, so as to reduce the search space of valid loop transformations; and 3) statically build, from the instructions' memory access functions, a lightweight run-time test that determines, based on the values of the program parameters, whether a given optimization is safe to use. Its applicability is shown by performing complex loop transformations on a variety of loops taken from applications in different fields, and by demonstrating that the run-time overhead is insignificant compared with the loop execution time or the performance gained, in the vast majority of cases.
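The idea of guarding an aggressive loop transformation with a lightweight run-time test can be sketched as follows. The loop, its bulk variant, and the test are illustrative assumptions, not output of the framework described in the abstract: the test is derived by hand from the loop's two memory access functions, `src + i` and `dst + i`, and all indices are assumed to stay within the buffer.

```python
def shift_add_loop(buf, src, dst, n):
    """Original loop: iterations run strictly in order."""
    for i in range(n):
        buf[dst + i] = buf[src + i] + 1

def shift_add_bulk(buf, src, dst, n):
    """Transformed loop: read all inputs first, then write (vector-style)."""
    buf[dst:dst + n] = [x + 1 for x in buf[src:src + n]]

def bulk_is_safe(src, dst, n):
    """Run-time test built from the access functions src+i and dst+i.

    The bulk version is wrong only if an iteration writes a cell that a
    later iteration reads, i.e. dst + i == src + j with j > i, which
    happens exactly when 0 < dst - src < n.
    """
    return not (0 < dst - src < n)

def shift_add(buf, src, dst, n):
    """Dispatch on parameter values: fast path if proven safe, else fallback."""
    if bulk_is_safe(src, dst, n):
        shift_add_bulk(buf, src, dst, n)
    else:
        shift_add_loop(buf, src, dst, n)

data = list(range(8))
shift_add(data, src=0, dst=2, n=4)  # overlapping: falls back to the loop
print(data)  # [0, 1, 1, 2, 2, 3, 6, 7]
```

The test costs one subtraction and two comparisons regardless of the trip count, which is the sense in which such a guard is lightweight compared with the loop it protects.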
