Global ETD Search

491	Implementation of coarse-grain coherence tracking support in ring-based multiprocessors
Coté, Edmond A. 25 October 2007
As the number of processors in multiprocessor system-on-chip devices continues to increase, the complexity required for full cache coherence support is often unwarranted for application-specific designs. Bus-based interconnects are no longer suitable for larger-scale systems, and the logic and storage overhead associated with the use of a complex packet-switched network and directory-based cache coherence may be undesirable in single-chip systems. Unidirectional rings are a suitable alternative because they offer many properties favorable to both on-chip implementation and to supporting cache coherence. Reducing the overhead of cache coherence traffic is, however, a concern for these systems. This thesis adapts two filter structures that are based on principles of coarse-grained coherence tracking, and applies them to a ring-based multiprocessor. The first structure tracks the total number of blocks of remote data cached by all processors in a node for a set of regions, where a region is a large area of memory referenced by the upper bits of an address. The second structure records regions of local data whose contents are not cached by any remote node. When used together to filter incoming or outgoing requests, these structures reduce the extent of coherence traffic and limit the transmission of coherent requests to the necessary parts of the system. A complete single-chip multiprocessor system that includes the proposed filters is designed and implemented in programmable logic for this thesis. The system is composed of nodes of bus-based multiprocessors, and each node includes a common memory, two or more pipelined 32-bit processors with coherent data caches, a split-transaction bus with separate lines for requests and responses, and an interface for the system-level ring interconnect. Two coarse-grained filters are attached to each node to reduce the impact of coherence traffic on the system. Cache coherence within the node is enforced through bus snooping, while coherence across the interconnect is supported by a reduced-complexity ring snooping protocol. Main memory is globally shared and is physically distributed among the nodes. Results are presented to highlight the system's key implementation points. Synthesis results are presented in order to evaluate hardware overhead, and operational results are shown to demonstrate the functionality of the multiprocessor system and of the filter structures.
Thesis (Master, Electrical & Computer Engineering) -- Queen's University, 2007-10-24 10:16:47.81
Financial support for this work was provided by the Natural Sciences and Engineering Research Council of Canada, Communications and Information Technology Ontario, and Queen's University.
Keywords: Cache coherence, Ring-based multiprocessor, Coarse-grain coherence tracking, Prototype implementation, Multiprocessor system-on-chip
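Illustration for record 491: the two filter structures described in the abstract can be sketched roughly as below. This is a minimal software model, not the thesis's hardware design. The region size (REGION_BITS), the class and method names, and the choice of which filter guards incoming versus outgoing requests are all assumptions made for the example; the abstract only states that the structures are used together to filter incoming or outgoing requests.

```python
REGION_BITS = 12  # assumed: the low 12 address bits are the offset inside a region


def region_of(addr):
    """Region identifier: the upper bits of the address, as in the abstract."""
    return addr >> REGION_BITS


class RemoteRegionCounter:
    """Hypothetical model of the first structure: per region, how many blocks
    of remote data are currently cached by the processors of this node."""

    def __init__(self):
        self.counts = {}

    def block_cached(self, addr):
        r = region_of(addr)
        self.counts[r] = self.counts.get(r, 0) + 1

    def block_evicted(self, addr):
        r = region_of(addr)
        remaining = self.counts.get(r, 0) - 1
        if remaining > 0:
            self.counts[r] = remaining
        else:
            self.counts.pop(r, None)

    def must_snoop(self, addr):
        # Assumed use: a request arriving from the ring only needs to be
        # snooped inside the node if some block of its region is cached here.
        return self.counts.get(region_of(addr), 0) > 0


class LocalNotSharedRegions:
    """Hypothetical model of the second structure: regions of local data
    known not to be cached by any remote node."""

    def __init__(self):
        self.not_shared = set()

    def mark_not_shared(self, addr):
        self.not_shared.add(region_of(addr))

    def mark_shared(self, addr):
        self.not_shared.discard(region_of(addr))

    def must_broadcast(self, addr):
        # Assumed use: a request for local data only needs to go out on the
        # ring if its region is not recorded as free of remote copies.
        return region_of(addr) not in self.not_shared
```

In this sketch both filters are conservative: an address whose region is unknown is always snooped or broadcast, so a missing entry costs traffic but never correctness.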
492	The Pathogenesis of Cache Valley Virus in the Ovine Fetus
Rodrigues, Aline December 2011
Cache Valley virus (CVV) induced malformations have been previously reproduced in ovine fetuses; however, no studies have established the CVV infection sequence of the cells targeted by the virus or the development of the antiviral response of the early, infected fetus that results in viral clearance before development of immunocompetency. To address these questions, ovine fetuses at 35 dg were inoculated in utero with CVV and euthanized at 7, 10, 14, 21 and 28 dpi. On postmortem examination, arthrogryposis and oligohydramnios were observed in some infected fetuses. Morphologic studies showed necrosis in the central nervous system (CNS) and skeletal muscle of earlier infected fetuses and hydrocephalus, micromyelia and muscular loss in later infected fetuses. Using immunohistochemistry and in situ hybridization, intense CVV viral antigenic signal was detected in the brain, spinal cord, skeletal muscles and fetal membranes of infected fetuses. Viral signal decreased in targeted and infected tissues with the progression of the infection. To determine specific cell types targeted by CVV in the CNS, indirect immunofluorescence was applied to sections of the CNS using a double labeling technique with antibodies against CVV together with antibodies against neurons, astrocytes and microglia. CVV viral antigen was shown within the cytoplasm of neurons in the brain and spinal cord. No viral signal was observed in microglial cells; however, infected animals had marked microgliosis. The antiviral immune response in immature fetuses infected with CVV was evaluated. Gene expression associated with an innate immune response was quantified by real-time quantitative PCR. Upregulated genes in infected fetuses included ISG15, Mx1, Mx2, IL-1, IL-6, TNF-α, TLR-7 and TLR-8. The amount of Mx protein, an interferon-stimulated GTPase capable of restricting growth of bunyaviruses, was elevated in the allantoic and amniotic fluid in infected fetuses. ISG15 protein expression was significantly increased in target tissues of infected animals. B lymphocytes and immunoglobulin-positive cells were detected in lymphoid tissues and in the meninges of infected animals. This demonstrated that the infected ovine fetus is able to stimulate an innate and adaptive immune response before immunocompetency that presumably contributes to viral clearance in infected animals.
Keywords: Cache Valley Virus, Ovine Fetus, Sheep, Bunyavirus, Arthrogryposis, Central Nervous System, Skeletal muscle
493	Multigrid with Cache Optimizations on Adaptive Mesh Refinement Hierarchies
Thorne Jr., Daniel Thomas 01 January 2003
This dissertation presents a multilevel algorithm to solve constant and variable coefficient elliptic boundary value problems on adaptively refined structured meshes in 2D and 3D. Cache-aware algorithms for optimizing the operations to exploit the cache memory subsystem are shown.
Keywords: Multigrid, Cache Aware, Adaptive Mesh Refinement, Partial Differential Equations, Numerical Solution.
494	DESIGN AND IMPLEMENTATION OF THE INSTRUCTION SET ARCHITECTURE FOR DATA LARS
Ponnala, Kalyan 01 January 2010
The ideal memory system assumed by most programmers is one which has high capacity, yet allows any word to be accessed instantaneously. To make the hardware approximate this performance, an increasingly complex memory hierarchy, using caches and techniques like automatic prefetch, has evolved. However, as the gap between processor and memory speeds continues to widen, these programmer-visible mechanisms are becoming inadequate. Part of the recent increase in processor performance has been due to the introduction of programmer/compiler-visible SWAR (SIMD Within A Register) parallel processing on increasingly wide DATA LARs (Line Associative Registers) as a way to both improve data access speed and increase efficiency of SWAR processing. Although the base concept of DATA LARs predates this thesis, this thesis presents the first instruction set architecture specification complete enough to allow construction of a detailed prototype hardware design. This design was implemented and tested using a hardware simulator.
Keywords: Line Associative Registers, DATA LARs, SIMD Within a Register (SWAR), Cache Registers (CRegs), Associativity, Electrical and Computer Engineering
495	CACHE OPTIMIZATION AND PERFORMANCE EVALUATION OF A STRUCTURED CFD CODE - GHOST
Palki, Anand B. 01 January 2006
This research focuses on evaluating and enhancing the performance of an in-house, structured, 2D CFD code, GHOST, on modern commodity clusters. The basic philosophy of this work is to optimize the cache performance of the code by splitting up the grid into smaller blocks and carrying out the required calculations on these smaller blocks. This in turn leads to enhanced code performance on commodity clusters. Accordingly, this work presents a discussion along with a detailed description of two techniques for data access optimization: external and internal blocking. These techniques have been tested on steady, unsteady, laminar, and turbulent test cases and the results are presented. The critical hardware parameters which influenced the code performance were identified. A detailed study investigating the effect of these parameters on the code performance was conducted and the results are presented. The modified version of the code was also ported to current state-of-the-art architectures with successful results.
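Illustration for record 495: the idea of splitting the grid into smaller blocks and doing the calculations block by block can be sketched as generic loop tiling. This is a minimal sketch only. The abstract does not give GHOST's kernels or define its external and internal blocking schemes, so the 5-point averaging sweep, the function names, and the block size are assumptions chosen for the example. Python is used for readability; it will not show the cache effect a compiled code would.

```python
def sweep_plain(src):
    """One Jacobi-style sweep: each interior cell becomes the average of its
    four neighbours in src. Boundary cells are copied unchanged."""
    ny, nx = len(src), len(src[0])
    dst = [row[:] for row in src]
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            dst[j][i] = 0.25 * (src[j - 1][i] + src[j + 1][i]
                                + src[j][i - 1] + src[j][i + 1])
    return dst


def sweep_blocked(src, block=4):
    """Same sweep, but the interior is visited tile by tile so that the cells
    touched together are close together in memory (hypothetical block size)."""
    ny, nx = len(src), len(src[0])
    dst = [row[:] for row in src]
    for j0 in range(1, ny - 1, block):          # top row of each tile
        for i0 in range(1, nx - 1, block):      # left column of each tile
            for j in range(j0, min(j0 + block, ny - 1)):
                for i in range(i0, min(i0 + block, nx - 1)):
                    dst[j][i] = 0.25 * (src[j - 1][i] + src[j + 1][i]
                                        + src[j][i - 1] + src[j][i + 1])
    return dst
```

Because every cell is computed from src only, the tiled traversal produces exactly the same result as the plain one; only the order in which memory is touched changes.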
496	PERFORMANCE EVALUATION AND OPTIMIZATION OF THE UNSTRUCTURED CFD CODE UNCLE
Gupta, Saurabh 01 January 2006
Numerous advancements made in the field of computational sciences have made CFD a viable solution to modern-day fluid dynamics problems. Progress in computer performance allows us to solve a complex flow field in practical CPU time. Commodity clusters are also gaining popularity as a computational research platform for various CFD communities. This research focuses on evaluating and enhancing the performance of an in-house, unstructured, 3D CFD code on modern commodity clusters. The fundamental idea is to tune the code to optimize the cache behavior of the node on commodity clusters to achieve enhanced code performance. Accordingly, this work presents a discussion of various available techniques for data access optimization and a detailed description of those which yielded improved code performance. These techniques were tested on various steady, unsteady, laminar, and turbulent test cases and the results are presented. The critical hardware parameters which influenced the code performance were identified. A detailed study investigating the effect of these parameters on the code performance was conducted and the results are presented. The successful single-node improvements were also efficiently tested on a parallel platform. The modified version of the code was also ported to different hardware architectures with successful results. Loop blocking is established as a predictor of code performance.
497	LINE ASSOCIATIVE REGISTERS
Melarkode, Krishna 01 January 2004
As technological advances have improved processor speed, main memory speed has lagged behind. Even with advanced RAM technologies, it has not been possible to close the gap in speeds. Ideally, a CPU can deliver good performance when the right data is made available to it at the right time. Caches and registers solved the problem to an extent. Instead of using various add-on mechanisms to mask high memory latency, this thesis takes the approach of creating a new memory access model that is simpler and more efficient. The Line Associative Registers have the functionality of a cache, scalar registers and vector registers built into them. This new model qualitatively changes how the processor accesses memory.
498	AN EFFECTIVE CACHE FOR THE ANYWHERE PIXEL ROUTER
Raghunathan, Vijai 01 January 2007
Designing hardware to output pixels for light field displays or multi-projector systems is challenging owing to the memory bandwidth and speed of the application. A hardware technique that implements 'anywhere pixel routing' was designed earlier at the University of Kentucky. This technique uses hardware to route pixels from input to output based upon a lookup table (LUT). The initial design suffered from high memory latency due to random accesses to the DDR SDRAM input buffer. This thesis presents a cache design that alleviates the memory latency issue by reducing the number of random SDRAM accesses. The cache is implemented in the block RAM of a field programmable gate array (FPGA). A number of simulations are conducted to find an efficient cache. It is found that the cache takes only a few kilobits, about 7% of the block RAM, and on average speeds up the memory accesses by 20-30%.
Keywords: Pixel router, LUT, Memory latencies, Block RAM, Cache, Electrical and Computer Engineering
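Illustration for record 498: LUT-driven pixel routing with a small cache in front of the input buffer can be modelled in software as below. This is a rough sketch under stated assumptions, not the thesis's design. The abstract does not give the cache organisation, so the direct-mapped layout, the line size, the number of lines, and all names here are hypothetical. The miss counter stands in for the number of random SDRAM accesses.

```python
class LineCache:
    """Hypothetical direct-mapped cache over a slow input buffer."""

    def __init__(self, memory, line_size=8, num_lines=4):
        self.memory = memory            # stands in for the SDRAM input buffer
        self.line_size = line_size      # pixels fetched per miss (assumed)
        self.num_lines = num_lines      # cache slots (assumed)
        self.tags = [None] * num_lines
        self.lines = [None] * num_lines
        self.misses = 0                 # each miss models one SDRAM line fetch

    def read(self, addr):
        line_no = addr // self.line_size
        slot = line_no % self.num_lines
        if self.tags[slot] != line_no:
            # Miss: fetch the whole line containing addr from the buffer.
            start = line_no * self.line_size
            self.lines[slot] = self.memory[start:start + self.line_size]
            self.tags[slot] = line_no
            self.misses += 1
        return self.lines[slot][addr % self.line_size]


def route_pixels(lut, cache):
    """Output pixel k is the input pixel at address lut[k]."""
    return [cache.read(src_addr) for src_addr in lut]
```

For example, with a 64-pixel input buffer and the LUT [3, 2, 1, 0, 8, 9, 3, 40, 0], the nine output pixels need only three line fetches in this model, because neighbouring LUT entries tend to fall in the same line.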
499	Low-Power Soft-Error-Robust Embedded SRAM
Shah, Jaspal Singh 06 November 2014
Soft errors are radiation-induced ionization events (induced by energetic particles such as alpha particles and cosmic neutrons) that cause transient errors in integrated circuits. The circuit can always recover from such errors because the underlying semiconductor material is not damaged; hence, they are called soft errors. In nanometer technologies, the reduced node capacitance and supply voltage coupled with high packing density and lack of masking mechanisms are primarily responsible for the increased susceptibility of SRAMs towards soft errors. Coupled with these are the process variations (effective length, width, and threshold voltage), which are prominent in scaled-down technologies. Typically, SRAM constitutes up to 90% of the die in microprocessors and SoCs (System-on-Chip). Hence, the soft errors in SRAMs pose a potential threat to the reliable operation of the system. In this work, a soft-error-robust eight-transistor SRAM cell (8T) is proposed to establish a balance between low power consumption and soft error robustness. The proposed approach is evaluated using metrics such as access time, leakage power, and sensitivity to single event transients (SET). For analysis and comparison, the results of the 8T cell are compared with a standard 6T SRAM cell and the state-of-the-art soft-error-robust SRAM cells. Based on simulation results in a 65-nm commercial CMOS process, the 8T cell demonstrates higher immunity to SETs along with smaller area and comparable leakage power. A 32-kb array of 8T cells was fabricated in silicon. After functional verification of the test chip, a radiation test was conducted to evaluate the soft error robustness. As SRAM cells are scaled aggressively to increase the overall packing density, the smaller transistors exhibit higher degrees of process variation and mismatch, leading to larger offset voltages. For SRAM sense amplifiers, higher offset voltages lead to an increased likelihood of an incorrect decision. To address this issue, a sense amplifier capable of cancelling the input offset voltage is presented. The simulated and measured results in 180-nm technology show that the sense amplifier is capable of detecting a 4 mV differential input signal under dc and transient conditions. The proposed sense amplifier, when compared with a conventional sense amplifier, has a similar die area and a greatly reduced offset voltage. Additionally, a dual-input sense amplifier architecture is proposed with corroborating silicon results to show that it requires a smaller differential input to evaluate correctly.
Keywords: VLSI, Embedded SRAM, Cache, Soft Error, Offset cancellation, Sense amplifier, Low Power
500	Exploiting Parallelism in GPUs
Hechtman, Blake Alan January 2014
Heterogeneous processors with accelerators provide an opportunity to improve performance within a given power budget. Many of these heterogeneous processors contain Graphics Processing Units (GPUs) that can perform graphics and embarrassingly parallel computation orders of magnitude faster than a CPU while using less energy. Beyond these obvious applications for GPUs, a larger variety of applications can benefit from a GPU's large computation and memory bandwidth. However, many of these applications are irregular and, as a result, require synchronization and scheduling that are commonly believed to perform poorly on GPUs. The basic building block of synchronization and scheduling is memory consistency, which is, therefore, the first place to look for improving performance on irregular applications. In this thesis, we approach the programmability of irregular applications on GPUs by thinking across traditional boundaries of the compute stack. We think about architecture, microarchitecture and runtime systems from the programmer's perspective. To this end, we study architectural memory consistency on future GPUs with cache coherence. In addition, we design a GPU memory system microarchitecture that can support fine-grain and coarse-grain synchronization without sacrificing throughput. Finally, we develop a task runtime that embraces the GPU microarchitecture to perform well on fork/join parallelism desired by many programmers. Overall, this thesis contributes non-intuitive solutions to improve the performance and programmability of irregular applications from the programmer's perspective.
Dissertation
Keywords: Computer engineering, Computer science, Cache Coherence, GPU, Memory Consistency, Task Parallelism
