Global ETD Search

481	Efficient fault tolerance for pipelined structures and its application to superscalar and dataflow machines Mizan, Elias, 1976- 10 October 2012 Silicon reliability has reemerged as a very important problem in digital system design. As voltage and device dimensions shrink, combinational logic is becoming sensitive to temporary errors caused by single event upsets, transistor and interconnect aging, and circuit variability. In particular, computational functional units are very challenging to protect because current redundant execution techniques have a high power and area overhead, cannot guarantee detection of some errors, and cause a substantial performance degradation. As traditional worst-case design rules that guarantee error avoidance become too conservative to be practical, new microarchitectures need to be investigated to address this problem. To this end, this dissertation introduces Self-Imposed Temporal Redundancy (SITR), a speculative microarchitectural temporal redundancy technique suitable for pipelined computational functional units. SITR is able to detect most temporary errors, is area- and energy-efficient, and can be easily incorporated in an out-of-order microprocessor. SITR can also be used as a throttling mechanism against thermal viruses and, in some cases, allows designers to design very aggressive bypass networks capable of achieving high instruction throughput by tolerating timing violations. To address the performance degradation caused by redundant execution, this dissertation proposes using a tiled-dataflow model of computation because it enables the design of scalable, resource-rich computational substrates. Starting with the WaveScalar tiled-dataflow architecture, we enhance the reliability of its datapath, including computational logic, interconnection network, and storage structures.
Computations are performed speculatively using SITR, while traditional information redundancy techniques are used to protect data transmission and storage. Once a value has been verified, confirmation messages are transmitted to consumer instructions. Upon error detection, nullification messages are sent to the instructions affected by the error. Our experimental results demonstrate that the slowdown due to redundant computation and error recovery on the tiled-dataflow machine is consistently smaller than on a superscalar von Neumann architecture. However, the number of additional messages required to support SITR execution is substantial, increasing power consumption. To reduce this overhead without significantly affecting performance, we introduce wave-based speculation, a mechanism targeted at dataflow architectures that enables speculation only when it is likely to benefit performance. / text Fault-tolerant computing Computers, Pipeline--Reliability Data flow computing Computer architecture
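The confirm/nullify behaviour described in entry 481 can be illustrated with a small Python sketch. It assumes that temporal redundancy means running the same operation twice on the same unit and comparing the two results; the abstract does not spell out the mechanism, so the names `sitr_execute` and `make_transient_faulty_add` and the bit-flip fault model are hypothetical and only serve the illustration.

```python
from typing import Callable, Tuple

def sitr_execute(op: Callable[[int, int], int], a: int, b: int) -> Tuple[int, str]:
    """Run `op` twice and compare (assumed model of temporal redundancy).

    The first result is treated as the speculative value that consumers
    may already be using; the second run is the redundant check.
    """
    speculative = op(a, b)   # forwarded to consumers speculatively
    check = op(a, b)         # redundant execution at a later time
    if speculative == check:
        return speculative, "confirm"   # consumers may commit
    return speculative, "nullify"       # consumers must discard and re-execute

def make_transient_faulty_add(fail_on_call: int) -> Callable[[int, int], int]:
    """Hypothetical adder that flips the lowest result bit on one call only."""
    calls = {"n": 0}

    def add(a: int, b: int) -> int:
        calls["n"] += 1
        result = a + b
        if calls["n"] == fail_on_call:
            result ^= 1      # inject a single transient bit flip
        return result

    return add
```

A fault that hits only one of the two executions produces a mismatch and therefore a nullification. A fault that corrupts both executions identically would go undetected in this simple model.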
482	OS-aware architecture for improving microprocessor performance and energy efficiency Li, Tao 28 August 2008 Not available / text Operating systems (Computers) Computer architecture Microprocessors--Design and construction Microprocessors--Energy consumption
483	Architectures and algorithms for high performance switching Prakash, Amit 28 August 2008 Not available / text Packet switching (Data transmission) Computer algorithms Parallel algorithms Computer architecture Production scheduling Computer networks
484	Adaptive predication via compiler-microarchitecture cooperation Kim, Hyesoon, 1974- 28 August 2008 Not available / text Computer organization Computer architecture Code generators Compiling (Electronic computers) Compilers (Computer programs)
485	Scalable hardware memory disambiguation Sethumadhavan, Lakshminarasimhan, 1978- 29 August 2008 Not available Memory management (Computer science) Computer storage devices Microprocessors--Design and construction Computer architecture
486	Architecting heterogeneous memory systems with 3D die-stacked memory Sim, Jae Woong 21 September 2015 The main objective of this research is to efficiently enable 3D die-stacked memory and heterogeneous memory systems. 3D die-stacking is an emerging technology that allows for large amounts of in-package high-bandwidth memory storage. Die-stacked memory has the potential to provide extraordinary performance and energy benefits for computing environments, from data-intensive to mobile computing. However, incorporating die-stacked memory into computing environments requires innovations across the system stack, from hardware to software. This dissertation presents several architectural innovations to practically deploy die-stacked memory into a variety of computing systems. First, this dissertation proposes using die-stacked DRAM as a hardware-managed cache in a practical and efficient way. The proposed DRAM cache architecture employs two novel techniques: hit-miss speculation and self-balancing dispatch. The proposed techniques virtually eliminate the hardware overhead of maintaining a multi-megabyte SRAM structure when scaling to gigabytes of stacked DRAM caches, and improve overall memory bandwidth utilization. Second, this dissertation proposes a DRAM cache organization that provides a high level of reliability for die-stacked DRAM caches in a cost-effective manner. The proposed DRAM cache uses error-correcting codes (ECCs), strong checksums (CRCs), and dirty data duplication to detect and correct a wide range of stacked DRAM failures (from traditional bit errors to large-scale row, column, bank, and channel failures) within the constraints of commodity, non-ECC DRAM stacks. With only a modest performance degradation compared to a DRAM cache with no ECC support, the proposed organization can correct all single-bit failures and 99.9993% of all row, column, and bank failures.
Third, this dissertation proposes architectural mechanisms to use large, fast, on-chip memory structures as part of memory (PoM) seamlessly through the hardware. The proposed design achieves the performance benefit of on-chip memory caches without sacrificing a large fraction of total memory capacity to serve as a cache. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefits. Lastly, this dissertation explores a new usage model for die-stacked DRAM involving a hybrid of caching and virtual memory support. In the common case where the system’s physical memory is not over-committed, die-stacked DRAM operates as a cache to provide performance and energy benefits to the system. However, when the workload’s active memory demands exceed the capacity of the physical memory, the proposed scheme dynamically converts the stacked DRAM cache into a fast swap device to avoid the otherwise grievous performance penalty of swapping to disk. Stacked DRAM Die-stacking Memory systems Heterogeneous memory Hardware management Computer architecture
487	Design of platforms for computing context with spatio-temporal locality Ziotopoulos, Agisilaos Georgios 02 June 2011 This dissertation is in the area of pervasive computing. It focuses on designing platforms for storing, querying, and computing contextual information. More specifically, we are interested in platforms for storing and querying spatio-temporal events where queries exhibit locality. Recent advances in sensor technologies have made it possible to gather a variety of information on the status of users, the environment, machines, etc. Combining this information with computation, we are able to extract context, i.e., a filtered high-level description of the situation. In many cases, the information gathered exhibits locality both in space and time, i.e., an event is likely to be consumed in a location close to the location where the event was produced, at a time which is close to the time the event was produced. This dissertation builds on this observation to create better platforms for computing context. We claim three key contributions. We have studied the problem of designing and optimizing spatial organizations for exchanging context. Our thesis has original theoretical work on how to create a platform based on cells of a Voronoi diagram for optimizing the energy and bandwidth required for mobiles to exchange contextual information that is tied to specific locations in the platform. Additionally, we applied our results to the problem of optimizing a system for surveilling the locations of entities within a given region. We have designed a platform for storing and querying spatio-temporal events exhibiting locality. Our platform is based on a P2P infrastructure of peers organized based on the Voronoi diagram associated with their locations to store events based on their own associated locations. We have developed theoretical results based on spatial point processes for the delay experienced by a typical query in this system.
Additionally, we used simulations to study heuristics to improve the performance of our platform. We also developed protocols for the replicated storage of events in order to increase the fault-tolerance of our platform. Finally, in this thesis we propose a design for a platform, based on RFID tags, to support context-aware computing for indoor spaces. Our platform exploits the structure found in most indoor spaces to encode contextual information in suitably designed RFID tags. The elements of our platform collaborate based on a set of messages we developed to offer context-aware services to the users of the platform. We validated our research with an example hardware design of the RFID tag and a software emulation of the tag's functionality. / text Stochastic geometry Pervasive computing Computer architecture Spatial data Spatio-temporal data P2P networks
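The Voronoi-based storage idea in entry 487 can be sketched in a few lines of Python. A point lies in a peer's Voronoi cell exactly when that peer is the nearest one, so this sketch uses a brute-force nearest-peer search instead of building the diagram. The function names, the dictionary-based storage, and the tie-breaking rule (first peer wins) are assumptions made for illustration, not details from the dissertation.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def owning_peer(peers: Dict[str, Point], location: Point) -> str:
    """Return the peer whose Voronoi cell contains `location`.

    Equivalent to finding the nearest peer; ties go to the peer
    that appears first in the dictionary (an arbitrary choice).
    """
    return min(peers, key=lambda name: math.dist(peers[name], location))

def store_event(storage: Dict[str, List[str]], peers: Dict[str, Point],
                event: str, location: Point) -> str:
    """Store `event` at the peer responsible for `location`; return that peer."""
    peer = owning_peer(peers, location)
    storage.setdefault(peer, []).append(event)
    return peer
```

A query issued near the place where an event was produced resolves to the same peer, which is the locality property the abstract relies on.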
488	A hybrid MPI/OpenMP parallelization of the adaptive integral method for multi-core clusters Wei, Fangzhou 02 August 2011 A hybrid of message passing and shared memory techniques is presented for scalable parallelization of the adaptive integral method (AIM), an FFT-based algorithm, on clusters of identical multi-core processors. The proposed hybrid MPI/OpenMP parallelization scheme is based on a nested one-dimensional (1-D) slab decomposition of the 3-D auxiliary uniform grid and the associated AIM calculations: if there are M processors and T cores per processor, the scheme (i) divides the uniform grid into M slabs and MT sub-slabs, (ii) assigns each slab/sub-slab and the associated operations to one of the processors/cores, and (iii) uses MPI for inter-processor data communication and OpenMP for intra-processor data exchange. The MPI/OpenMP parallel AIM is used to accelerate the MOM solution of combined-field integral equations pertinent to the analysis of scattering from perfectly conducting surfaces. The scalability and efficiency of the implementation are investigated theoretically and verified numerically by solving benchmark scattering problems on a (near) petaflop supercomputing cluster of quad-core processors. The timing and speedup results on up to 1024 processors show that the proposed hybrid MPI/OpenMP parallelization exhibits better strong scalability (fixed problem size speedup) compared to pure MPI parallelization when multiple cores are used on each processor. / text AIM CFIE MPI/OpenMP Multi-core processor cluster Multi-core processing Computer architecture
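The nested slab decomposition in entry 488 (M slabs, each split into T sub-slabs, MT in total) can be sketched as index arithmetic along one grid axis. This is a minimal sketch that assumes the grid is cut into contiguous, near-equal ranges of planes; the helper names `split_range` and `nested_slabs` are hypothetical, and no MPI or OpenMP calls are involved.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open interval [lo, hi) of grid planes

def split_range(start: int, stop: int, parts: int) -> List[Range]:
    """Split [start, stop) into `parts` contiguous, near-equal pieces."""
    size, extra = divmod(stop - start, parts)
    pieces = []
    lo = start
    for i in range(parts):
        hi = lo + size + (1 if i < extra else 0)  # first `extra` pieces get one more plane
        pieces.append((lo, hi))
        lo = hi
    return pieces

def nested_slabs(n_planes: int, M: int, T: int) -> List[List[Range]]:
    """Return M slabs (one per processor), each split into T sub-slabs (one per core)."""
    return [split_range(lo, hi, T) for lo, hi in split_range(0, n_planes, M)]
```

For example, `nested_slabs(8, 2, 2)` gives two slabs of four planes, each split into two sub-slabs of two planes. In the scheme the abstract describes, the outer list would map to MPI ranks and the inner list to OpenMP threads.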
489	Efficient Handling of Narrow Width and Streaming Data in Embedded Applications Li, Bengu January 2006 The embedded environment imposes severe constraints on the system resources available to embedded applications. Performance, memory footprint, and power consumption are critical factors for embedded applications. Meanwhile, the data in embedded applications demonstrate unique properties. More specifically, narrow width data are data representable in considerably fewer bits than one word, which nevertheless occupy an entire register or memory word, and streaming data are the input data processed sequentially by an application, which stay in the system for a short duration and thus exhibit little data locality. Narrow width and streaming data affect the efficiency of registers, caches, and memory and must be taken into account when optimizing for performance, memory footprint, and power consumption. This dissertation proposes methods to efficiently handle narrow width and streaming data in embedded applications. Quantitative measurements of narrow width and streaming data are performed to provide guidance for optimizations. Novel architectural features and associated compiler algorithms are developed. To efficiently handle narrow width data in registers, two register allocation schemes are proposed for the ARM processor to allocate two narrow width variables to one register. A static scheme exploits maximum bitwidth. A speculative scheme further exploits dynamic bitwidth. Both result in reduced spill cost and performance improvement. To efficiently handle narrow width data in memory, a memory layout method is proposed to coalesce multiple narrow width data in one memory location in a DSP processor, leading to fewer explicit address calculations. This method improves performance and shrinks memory footprint. To efficiently handle streaming data in network processors, two cache mechanisms are proposed to enable the reuse of data and computation.
The slack created is further transformed into a reduction in energy consumption through a fetch gating mechanism. compiler computer architecture narrow width data streaming data embedded system embedded application
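The idea in entry 489 of placing two narrow width variables in one register can be illustrated with plain bit manipulation. This is a minimal sketch that assumes a 32-bit register and two unsigned 16-bit values; the function names and the fixed 16/16 split are assumptions for illustration and do not reproduce the dissertation's register allocation schemes.

```python
MASK16 = 0xFFFF  # assumed maximum bitwidth of each narrow variable

def pack(lo: int, hi: int) -> int:
    """Pack two unsigned 16-bit values into one 32-bit word."""
    if not (0 <= lo <= MASK16 and 0 <= hi <= MASK16):
        raise ValueError("both values must fit in 16 bits")
    return (hi << 16) | lo

def unpack(word: int) -> tuple:
    """Recover the (lo, hi) pair from a packed 32-bit word."""
    return word & MASK16, (word >> 16) & MASK16
```

The range check plays the role of the static guarantee on maximum bitwidth: a value that does not fit cannot share the register.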
490	Efficient Multi-ported Memories for FPGAs LaForest, Charles Eric 15 February 2010 Multi-ported memories are challenging to implement on FPGAs since the provided block RAMs typically have only two ports. In this dissertation we present a thorough exploration of the design space of FPGA multi-ported memories by evaluating conventional solutions to this problem, and introduce a new design that efficiently combines block RAMs into multi-ported memories with arbitrary numbers of read and write ports and true random access to any memory location, while achieving significantly higher operating frequencies than conventional approaches. For example, we build a 256-location, 32-bit, 12-ported (4-write, 8-read) memory that operates at 281 MHz on Altera Stratix III FPGAs while consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84% area reduction over a pure ALM implementation, and a 61% speed improvement over a pure "multipumped" implementation, although the pure multipumped implementation is 7.2-fold smaller. FPGA computer architecture multi-ported memory multipumping soft processor parallelism FIFO buffers 0544
