Global ETD Search

361	Operating System Management of Shared Caches on Multicore Processors Tam, David Kar Fai 01 September 2010 (has links) Our thesis is that operating systems should manage the on-chip shared caches of multicore processors for the purposes of achieving performance gains. Consequently, this dissertation demonstrates how the operating system can profitably manage these shared caches. Two shared-cache management principles are investigated: (1) promoting shared use of the shared cache, demonstrated by an automated online thread clustering technique, and (2) providing cache space isolation, demonstrated by a software-based cache partitioning technique. In support of providing isolation, cache provisioning is also investigated, demonstrated by an automated online technique called RapidMRC. We show how these software-based techniques are feasible on existing multicore systems with the help of their hardware performance monitoring units and their associated hardware performance counters. On a 2-chip IBM POWER5 multicore system, promoting sharing reduced processor pipeline stalls caused by cross-chip cache accesses by up to 70%, resulting in performance improvements of up to 7%. On a larger 8-chip IBM POWER5+ multicore system, the potential for up to 14% performance improvement was measured. Providing isolation improved performance by up to 50%, using an exhaustive offline search method to determine optimal partition size. On the other hand, up to 27% performance improvement was extracted from the corresponding workload using an automated online approximation technique, made possible by RapidMRC. operating systems multicore processors shared caches promoting sharing providing isolation cache provisioning thread clustering cache partitioning approximating MRCs 0984
362	Operating System Management of Shared Caches on Multicore Processors Tam, David Kar Fai 01 September 2010 (has links) Our thesis is that operating systems should manage the on-chip shared caches of multicore processors for the purposes of achieving performance gains. Consequently, this dissertation demonstrates how the operating system can profitably manage these shared caches. Two shared-cache management principles are investigated: (1) promoting shared use of the shared cache, demonstrated by an automated online thread clustering technique, and (2) providing cache space isolation, demonstrated by a software-based cache partitioning technique. In support of providing isolation, cache provisioning is also investigated, demonstrated by an automated online technique called RapidMRC. We show how these software-based techniques are feasible on existing multicore systems with the help of their hardware performance monitoring units and their associated hardware performance counters. On a 2-chip IBM POWER5 multicore system, promoting sharing reduced processor pipeline stalls caused by cross-chip cache accesses by up to 70%, resulting in performance improvements of up to 7%. On a larger 8-chip IBM POWER5+ multicore system, the potential for up to 14% performance improvement was measured. Providing isolation improved performance by up to 50%, using an exhaustive offline search method to determine optimal partition size. On the other hand, up to 27% performance improvement was extracted from the corresponding workload using an automated online approximation technique, made possible by RapidMRC. operating systems multicore processors shared caches promoting sharing providing isolation cache provisioning thread clustering cache partitioning approximating MRCs 0984
363	Efficient synchronization and communication in many-core chip multiprocessors Abellán Miguel, José Luis 21 December 2012 (has links) En esta tesis hemos identificado tres de los mayores cuellos de botella para el rendimiento y escalabilidad de las arquitecturas many-core CMP de memoria compartida. En particular, los mecanismos de sincronización de barrera y cerrojo cuando presentan alta contención, así como los protocolos hardware de coherencia de caché en el mantenimiento de la coherencia del uso de bloques memoria compartidos en una jerarquía de memoria. Para paliar estas deficiencias y aprovechar más el rendimiento de estas arquitecturas, hemos propuesto tres mecanismos hardware: GBarrier, para un mecanismo de barreras eficiente; GLock, para un manejo justo y eficiente de la contención en el acceso a las secciones críticas protegidas por cerrojos; y ECONO, un protocolo de coherencia muy simple que aporta gran eficiencia a bajo costo. La tesis concluye que nuestras propuestas resuelven de manera eficiente los problemas de rendimiento derivados de implementaciones ineficientes para sincronización y coherencia en arquitecturas many-core CMP. / In this thesis we have identified three of the major problems that restrict efficiency and scalability in future shared-memory tiled many-core CMPs. In particular, the synchronization operations of barriers and locks under highly-contended scenarios, and the hardware-based cache coherence protocols when dealing with the maintenance of coherence of all memory blocks across all levels of a memory hierarchy. To alleviate such performance bottlenecks in order to harness the computational power of such systems, we have proposed three hardware-based mechanisms: GBarrier, a very efficient barrier mechanism; GLock, an efficient and fair mechanism to implement highly-contended locks; and ECONO, a simple and efficient hardware coherence protocol. In light of our performance results obtained in this thesis, we can affirm that our proposals represent a step forward towards the resolution of the challenges that many-core CMP architectures will pose to computer architects. sincronización hardware hardware synchronization coherencia de cache cache coherency multiprocesadores chip multiprocessors programación paralela parallel programming Arquitectura de computadores 62
364	Analysis and optimization of question answering systems Domínguez Sal, David 23 April 2010 (has links) No description available. Question answering Distributed systems Cooperative cache Evolutive summary counters Load balancing Multi-layer cache Search engine Information retrieval 004
365	Cache design and timing analysis for preemptive multi-tasking real-time uniprocessor systems Tan, Yudong 18 April 2005 (has links) In this thesis, we propose an approach to estimate the Worst Case Response Time (WCRT) of each task in a preemptive multi-tasking single-processor real-time system utilizing an L1 cache. The approach combines inter-task cache eviction analysis and intra-task cache access analysis to estimate the Cache Related Preemption Delay (CRPD). CRPD caused by preempting task(s) is then incorporated into WCRT analysis. We also propose a prioritized cache to reduce CRPD by exploiting cache partitioning technique. Our WCRT analysis approach is then applied to analyze the behavior of a prioritized cache. Four sets of applications with up to six concurrent tasks running are used to test our WCRT analysis approach and the prioritized cache. The experimental results show that our WCRT analysis approach can tighten the WCRT estimate by up to 32% (1.4X) over prior state-of-the-art. By using a prioritized cache, we can reduce the WCRT estimate of tasks by up to 26%, as compared to a conventional set associative cache. Real-time systems Multi-tasking Cache Timing analysis Real-time data processing Cache memory Memory management (Computer science)
366	Experiments with the pentium Performance monitoring counters Agarwal, Gunjan 06 1900 (has links) Performance monitoring counters are implemented in most recent microprocessors. In this thesis, we describe various performance measurement experiments for a program and a system that we conducted on a Linux operating system using the Pentium performance counters. We carried out our performance measurements on a Pentium II microprocessor. The Pentium II performance counters can be configured to count events such as cache misses, TLB misses, instructions executed etc. We used a low intrusive overhead technique to access these performance counters. We used these performance counters to measure the cache miss overheads due to context switches in Linux system. Our methodology involves sampling the hardware counters every 50ps. The sampling was set up using signals related to interval timers. We describe an analytical cache performance model under multiprogrammed condition from the literature and validate it using the performance monitoring counters. We next explores the long term performance of a system under different workload conditions. Various performance monitoring events - data cache h, data TLB misses, data cache reads or writes, branches etc. - are monitored over a 24 hour period. This is useful in identifying activities which cause loss of system performance. We used timer interrupts for sampling the performance counters. We develop a profiling methodology to give a perspective of performance of the different functions of a program, not only on the basis of execution-time but also on the data cache misses. Available tools like prof on Unix can be used to pinpoint the regions of performance loss of programs, but they mainly rely on an execution-time profiles. This does not give insight into problems in cache performance for that program. So we develop this methodology to get the performance of each function of the program not only on the basis of its execution time but also on the basis of its cache behavior. Computer and Information Science Computer Performance Evaluation Monitoring System Performance Cache Performance Pentium Performance Monitoring TLB misses Cache misses Profiling Methodology
367	Efficient graph algorithm execution on data-parallel architectures Bangalore Lakshminarayana, Nagesh 12 January 2015 (has links) Mechanisms for improving the execution efficiency of graph algorithms on Data-Parallel Architectures were proposed and identified. Execution of graph algorithms on GPGPU architectures, the prevalent data-parallel architectures was considered. Irregular and data dependent accesses in graph algorithms were found to cause significant idle cycles in GPGPU cores. A prefetching mechanism that reduced the amount of idle cycles by prefetching a data-dependent access pattern found in graph algorithms was proposed. Storing prefetches in unused spare registers in addition to storing them in the cache was shown to be more effective by the prefetching mechanism. The design of the cache hierarchy for graph algorithms was explored. First, an exclusive cache hierarchy was shown to be beneficial at the cost of increased traffic; a region based exclusive cache hierarchy was shown to be similar in performance to an exclusive cache hierarchy while reducing on-chip traffic. Second, bypassing cache blocks at both the level one and level two caches was shown to be beneficial. Third, the use of fine-grained memory accesses (or cache sub-blocking) was shown to be beneficial. The combination of cache bypassing and fine-grained memory accesses was shown to be more beneficial than applying the two mechanisms individually. Finally, the impact of different implementation strategies on algorithm performance was evaluated for the breadth first search algorithm using different input graphs and heuristics to identify the best performing implementation for a given input graph were also discussed. Graph algorithms Data-parallel architectures GPGPU architectures Prefetching Cache hierarchy Inclusion property Cache bypass Fine-grained accesses BFS characterization
368	Dynamic detection of the communication pattern in shared memory environments for thread mapping / Detecção dinâmica do padrão de comunicação em ambientes de memória compartilhada para o mapeamento de threads Cruz, Eduardo Henrique Molina da January 2012 (has links) As threads de aplicações paralelas cooperam a fim de cumprir suas tarefas, dessa forma, comunicação é realizada entre elas. A latência de comunicação entre os núcleos em arquiteturas multiprocessadas diferem dependendo da hierarquia de memória e das interconexões. Com o aumento do número de núcleos por chip e número de threads por núcleo, esta diferença entre as latências de comunicação está aumentando. Portanto, é importante mapear as threads de aplicações paralelas levando em conta a comunicação entre elas. Em aplicações paralelas baseadas no paradigma de memória compartilhada, a comunicação é implícita e ocorre através de acessos à variáveis compartilhadas, o que torna difícil a descoberta do padrão de comunicação entre as threads. Mecanismos tradicionais usam simulação para monitorar os acessos à memória realizados pela aplicação, requerendo modificações no código fonte e aumentando drasticamente a sobrecarga. Nesta dissertação de mestrado, são introduzidos dois mecanismos inovadores com uma baixa sobrecarga para se detectar o padrão de comunicação entre threads. O primeiro mecanismo faz uso de informações sobre linhas compartilhadas de caches providas por protocolos de coerência de cache. O segundo mecanismo utiliza a Translation Lookaside Buffer (TLB) para detectar quais páginas de memória cada núcleo está acessando. Ambos os mecanismos dependem totalmente do hardware, o que torna o mapeamento de threads transparente aos programadores e permite que ele seja realizado dinamicamente pelo sistema operacional. Além disto, nenhuma tarefa de alta sobrecarga, como simulação, é requerida. As propostas foram avaliadas com o NAS Parallel Benchmarks (NPB), obtendo representações precisas dos padrões de comunicação. Mapeamentos para as threads foram gerados utilizando os padrões de comunicação descobertos e um algoritmo de mapeamento. O problema do mapeamento é NP-Difícil. Portanto, de forma a se atingir uma complexidade polinomial, o algoritmo empregado é heurístico, baseado no algoritmo de emparelhamento de grafos de Edmonds. Executando as aplicações com o mapeamento resultou em um ganho de desempenho de até 15; 3%. O número de faltas na cache, invalidações em linhas de cache e transações de espionagem foram reduzidos em até 31; 9%, 41% e 65; 4%, respectivamente. / The threads of parallel applications cooperate in order to fulfill their tasks, thereby communication is performed among themselves. The communication latency between the cores in a multiprocessor architecture differs depending on the memory hierarchy and the interconnections. With the increase in the number of cores per chip and the number of threads per core, this difference between the communication latencies is increasing. Therefore, it is important to map the threads of parallel applications taking into account the communication between them. In parallel applications based on the shared memory paradigm, the communication is implicit and occurs through accesses to shared variables, which makes difficult to detect the communication pattern between the threads. Traditional approaches use simulation to monitor the memory accesses performed by the application, requiring modifications to the source code and drastically increasing the overhead. In this master thesis, we introduce two novel light-weight mechanisms to find the communication pattern of threads. The first mechanism makes use of the information about shared cache lines provided by cache coherence protocols. The second mechanism makes use of the Translation Lookaside Buffer (TLB) to detect which memory pages each core is accessing. Both our mechanisms rely entirely on hardware features, which makes the thread mapping transparent to the programmer and allows it to be performed dynamically by the operating system. Moreover, no time consuming task, such as simulation, is required. We evaluated our mechanisms with the NAS Parallel Benchmarks (NPB) and obtained accurate representations of the communication patterns. We generated thread mappings from the detected communication patterns using a mapping algorithm. Mapping is a NP-Hard problem. Therefore, in order to achieve a polynomial complexity, we designed a heuristic method based on the Edmonds graph matching algorithm. Running the applications with these mappings resulted in performance improvements of up to 15.3% compared to the original scheduler of the operating system. The number of cache misses, cache line invalidations and snoop transactions were reduced by up to 31.9%, 41% and 65.4%, respectively. Processamento paralelo Desempenho : Computadores Processamento distribuido Thread mapping Parallel computer architectures Shared memory Communication Cache memory Cache coherence protocols TLB
369	Increasing energy efficiency of processor caches via line usage predictors / Aumentando a eficiência energética da memória cache de processadores através de preditores de uso de linhas da cache Alves, Marco Antonio Zanata January 2014 (has links) O consumo de energia se torna cada vez mais importante para a arquitetura de processadores, onde o número de cores dentro de um mesmo chip está aumentando mas o total de energia disponível se mantém no mesmo nível ou até mesmo se reduz. Assim, técnicas para economizar energia, tais como opções de escala de frequência e desligamento automático de subsistemas, estão sendo usadas para manter a troca entre energia e desempenho. Para se obter alto desempenho, os atuais Chip Multiprocessors (CMPs) integram grandes memórias cache a fim de reduzir a latência média para acesso a memória principal, através da alocação do conjunto de dados da aplicação dentro do chip. Essas memórias cache tem sido projetadas tradicionalmente para explorar a localidade temporal usando políticas de substituição inteligentes e localidade espacial buscando todos os dados da linha da cache após uma falta de dados. Entretanto, estudos recentes mostraram que o número de sub-blocos dentro da linha da memória cache, que são realmente usados, costuma ser baixo, sendo que, os sub-blocos que são usados recebem poucos acessos antes de se tornarem mortos (isto é, nunca mais são acessados). Além disso, muitas da linhas da memória cache permanecem ligadas por longos períodos de tempo, mesmo que os dados não sejam usados novamente ou são inválidos. Para linhas de cache modificadas, a memória cache aguarda até que a linha seja expulsa para que esta seja gravada (write-back) de volta no próximo nível de memória. Essas escritas competem com as requisições de leitura (demanda do processador e prébusca da cache), aumentando a pressão no controlador de memória. Por essas razões, a eficiência energética e o desempenho das memórias cache não são ideais. Essa tese propõe a aplicação de preditores de uso de linhas da cache para aumentar a eficiência energética das memórias cache. São propostos os mecanismos Dead Sub-Block Predictor (DSBP) e Dead Line and Early Write-Back Predictor (DEWP) para permitir economia de energia sem que haja degradação do desempenho. DSBP é usado para prever quais sub-blocos da linha da cache serão usados e quantas vezes eles serão acessados de forma a trazer para a cache apenas os sub-blocos úteis e desliga-los após eles serem acessados pelo número de vezes previsto. DEWP prevê linhas de cache mortas assim que elas recebem o último acesso, desligando essas linhas. As linhas sujas são escalonadas para sofrerem write-back após a última operação de escrita, aumentando o potencial de salvar energia, reduzindo também a pressão no controlador de memória. Ambos os mecanismos propostos também reduzem a poluição nas memórias cache, dando prioridade para a expulsão de linhas mortas, melhorando as atuais políticas de substituição. Embora cada mecanismo apresentado seja capaz de funcionar separadamente dentro do sistema, ambos os mecanismos podem também ser misturados em uma mesma hierarquia de cache. Essa implementação mista é interessante pois a granularidade de sub-bloco é preferível para níveis de cache próximos do processador, onde as linhas de memória cache são expulsas rapidamente, enquanto o último nível de cache tende a usar toda a linha antes da sua expulsão. Com o intuito de avaliar os mecanismos propostos, é apresentado o Simulator of Non- Uniform Cache Architectures (SiNUCA). Esse simulador de microarquitetura com precisão de ciclos é validado em termos de desempenho e consumo de energia através da comparação com um processador real. Os resultados de desempenho foram obtidos executando aplicações das cargas de trabalho single-threaded do conjunto SPEC-CPU2006 e aplicações multi-threaded dos conjuntos SPEC-OMP2001 e NAS-NPB. Os resultados relativos a energia foram obtidos integrando o SiNUCA com as ferramentas de modelagem Multi-core Power, Area, and Timing (McPAT) e CACTI. Quando aplicados os mecanismos em todos os níveis de memória cache, observou-se em média uma redução de 36% no consumo de energia usando o DSBP, 25% usando o DEWP e 37% quando usou-se o DSBP nos níveis L1 e L2 e o DEWP no último nível. Todas essas reduções causaram uma perda desprezível de desempenho de menos de 4% em média. / Energy consumption is becoming more important for processor architectures, where the number of cores inside the chip is increasing and the total power budget is kept at the same level or even reduced. Thus, energy saving techniques such as frequency scaling options and automatic shutdown of sub-systems are being used to maintain the trade-off between power and performance. To deliver high performance, current Chip Multiprocessors (CMPs) integrate large caches in order to reduce the average memory access latency by allocating the applications’ working set on-chip. These cache memories have traditionally been designed to exploit temporal locality by using smart replacement policies, and spatial locality by fetching entire cache lines from memory on a cache miss. However, recent studies have shown that the number of sub-blocks within a line that are actually used is often low, and those sub-blocks that are used are accessed only a few times before becoming dead (that is, never accessed again). Additionally, many of the cache lines remain powered for a long period of time even if the data is not used again, or is invalid. For modified cache lines, the cache memory waits until the line is evicted to perform the write-back to next memory level. These write-backs compete with read requests (processor demand and cache prefetch), increasing the pressure on the memory controller. For these reasons, the energy efficiency and performance of cache memories are not ideal. This thesis introduces cache line usage predictors to increase the energy efficiency of cache memories. We propose the Dead Sub-Block Predictor (DSBP) and Dead Line and Early Write-Back Predictor (DEWP) mechanisms to enable energy savings without performance degradation. DSBP is used to predict which sub-blocks of a cache line will be actually accessed and how many times they will be used in order to bring into the cache only those sub-blocks that are necessary, and power them off after they are accessed the predicted number of times. DEWP predicts dead lines as soon as they receive the last access, and turns off these lines. Dirty lines are scheduled for write-back after the last write operation occurs, increasing the energy savings potential and also reducing the pressure on the memory controller. Both proposed mechanisms also reduce pollution in cache memories by prioritizing dead lines for eviction in the existing replacement policy. Although each introduced mechanism is capable of performing separately inside a system, both mechanisms can also be mixed in the same cache hierarchy. This mixed implementation is interesting because the sub-block granularity is more suitable for cache levels closer to the processor, where the cache lines are quickly evicted, while the Last- Level Cache (LLC) tends to use the whole cache line before its eviction. In order to evaluate our proposed mechanisms, we introduce the Simulator of Non- Uniform Cache Architectures (SiNUCA). This cycle-accurate microarchitecture simulator is validated in terms of performance and energy consumption by comparing it to a real processor. Our performance results were obtained executing single-threaded applications from SPEC-CPU2006 and multi-threaded applications from SPEC-OMP2001 and NASNPB benchmark suites. The energy related results were obtained by integrating SiNUCA with the Multi-core Power, Area, and Timing (McPAT) framework and the CACTI power modeling tool. When applying our mechanisms on all the cache levels, we observe on average a 36% energy reduction for DSBP, 25% energy reduction using DEWP and an average reduction of 37% in the energy consumption applying DSBP on L1 and L2 and DEWP on the LLC. All these reductions caused a negligible performance loss of less than 4% on average. Processadores Memoria cache Multiprocessadores Line usage predictors Sub-block psage predictors Replacement policy Early write-back Cache memories Energy efficient
370	Dynamic detection of the communication pattern in shared memory environments for thread mapping / Detecção dinâmica do padrão de comunicação em ambientes de memória compartilhada para o mapeamento de threads Cruz, Eduardo Henrique Molina da January 2012 (has links) As threads de aplicações paralelas cooperam a fim de cumprir suas tarefas, dessa forma, comunicação é realizada entre elas. A latência de comunicação entre os núcleos em arquiteturas multiprocessadas diferem dependendo da hierarquia de memória e das interconexões. Com o aumento do número de núcleos por chip e número de threads por núcleo, esta diferença entre as latências de comunicação está aumentando. Portanto, é importante mapear as threads de aplicações paralelas levando em conta a comunicação entre elas. Em aplicações paralelas baseadas no paradigma de memória compartilhada, a comunicação é implícita e ocorre através de acessos à variáveis compartilhadas, o que torna difícil a descoberta do padrão de comunicação entre as threads. Mecanismos tradicionais usam simulação para monitorar os acessos à memória realizados pela aplicação, requerendo modificações no código fonte e aumentando drasticamente a sobrecarga. Nesta dissertação de mestrado, são introduzidos dois mecanismos inovadores com uma baixa sobrecarga para se detectar o padrão de comunicação entre threads. O primeiro mecanismo faz uso de informações sobre linhas compartilhadas de caches providas por protocolos de coerência de cache. O segundo mecanismo utiliza a Translation Lookaside Buffer (TLB) para detectar quais páginas de memória cada núcleo está acessando. Ambos os mecanismos dependem totalmente do hardware, o que torna o mapeamento de threads transparente aos programadores e permite que ele seja realizado dinamicamente pelo sistema operacional. Além disto, nenhuma tarefa de alta sobrecarga, como simulação, é requerida. As propostas foram avaliadas com o NAS Parallel Benchmarks (NPB), obtendo representações precisas dos padrões de comunicação. Mapeamentos para as threads foram gerados utilizando os padrões de comunicação descobertos e um algoritmo de mapeamento. O problema do mapeamento é NP-Difícil. Portanto, de forma a se atingir uma complexidade polinomial, o algoritmo empregado é heurístico, baseado no algoritmo de emparelhamento de grafos de Edmonds. Executando as aplicações com o mapeamento resultou em um ganho de desempenho de até 15; 3%. O número de faltas na cache, invalidações em linhas de cache e transações de espionagem foram reduzidos em até 31; 9%, 41% e 65; 4%, respectivamente. / The threads of parallel applications cooperate in order to fulfill their tasks, thereby communication is performed among themselves. The communication latency between the cores in a multiprocessor architecture differs depending on the memory hierarchy and the interconnections. With the increase in the number of cores per chip and the number of threads per core, this difference between the communication latencies is increasing. Therefore, it is important to map the threads of parallel applications taking into account the communication between them. In parallel applications based on the shared memory paradigm, the communication is implicit and occurs through accesses to shared variables, which makes difficult to detect the communication pattern between the threads. Traditional approaches use simulation to monitor the memory accesses performed by the application, requiring modifications to the source code and drastically increasing the overhead. In this master thesis, we introduce two novel light-weight mechanisms to find the communication pattern of threads. The first mechanism makes use of the information about shared cache lines provided by cache coherence protocols. The second mechanism makes use of the Translation Lookaside Buffer (TLB) to detect which memory pages each core is accessing. Both our mechanisms rely entirely on hardware features, which makes the thread mapping transparent to the programmer and allows it to be performed dynamically by the operating system. Moreover, no time consuming task, such as simulation, is required. We evaluated our mechanisms with the NAS Parallel Benchmarks (NPB) and obtained accurate representations of the communication patterns. We generated thread mappings from the detected communication patterns using a mapping algorithm. Mapping is a NP-Hard problem. Therefore, in order to achieve a polynomial complexity, we designed a heuristic method based on the Edmonds graph matching algorithm. Running the applications with these mappings resulted in performance improvements of up to 15.3% compared to the original scheduler of the operating system. The number of cache misses, cache line invalidations and snoop transactions were reduced by up to 31.9%, 41% and 65.4%, respectively. Processamento paralelo Desempenho : Computadores Processamento distribuido Thread mapping Parallel computer architectures Shared memory Communication Cache memory Cache coherence protocols TLB

Search results