471

Cache-Oblivious Searching and Sorting in Multisets

Farzan, Arash January 2004 (has links)
We study three problems related to searching and sorting in multisets in the cache-oblivious model: finding the most frequent element (the mode), duplicate elimination, and finally multi-sorting. We are interested in minimizing the cache complexity (the number of cache misses) of algorithms for these problems in the setting where the cache size and block size are unknown. We start by showing the lower bounds in the comparison model. Then we present the lower bounds in the cache-aware model, which are also lower bounds in the cache-oblivious model. We consider an input multiset of size N with multiplicities N_1, ..., N_k. The lower bound for the cache complexity of determining the mode is Ω((N/B) log_{M/B} (N/(fB))), where f is the frequency of the mode and M, B are the cache size and block size respectively. The cache complexities of duplicate removal and multi-sorting have lower bounds of Ω((N/B) log_{M/B} (N/B) − Σ_{i=1..k} (N_i/B) log_{M/B} (N_i/B)). We present two deterministic approaches to designing algorithms: selection and distribution. The algorithms obtained with these deterministic approaches differ from the lower bounds by at most an additive term of (N/B) log log M. However, since log log M is very small in real applications, the gap is tiny. Nevertheless, the ideas behind our deterministic algorithms can be used to design cache-aware algorithms for these problems, and the resulting algorithms turn out to be simpler than the previously known cache-aware algorithms. Another approach to designing algorithms for these problems is the probabilistic approach. In contrast to the deterministic algorithms, our randomized cache-oblivious algorithms are all optimal, and their cache complexities exactly match the lower bounds. All of our algorithms are within a constant factor of optimal in terms of the number of comparisons they perform.
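For readability, the two lower bounds quoted in this abstract can be typeset explicitly. Writing Q(N) for the cache complexity is a label introduced here, not notation from the thesis; M, B, f, and the N_i are as defined above.

    \[
      Q_{\text{mode}}(N) \;=\; \Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{fB}\right),
    \]
    \[
      Q_{\text{dup/multisort}}(N) \;=\; \Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\;-\;\sum_{i=1}^{k}\frac{N_i}{B}\,\log_{M/B}\frac{N_i}{B}\right).
    \]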
472

Technology Impacts of CMOS Scaling on Microprocessor Core Design for Hard-Fault Tolerance in Single-Core Applications and Optimized Throughput in Throughput-Oriented Chip Multiprocessors

Bower, Fred January 2010 (has links)
The continued march of technological progress, epitomized by Moore's Law, provides the microarchitect with increasing numbers of transistors to employ as we continue to shrink feature geometries. Physical limitations impose new constraints upon designers in the areas of overall power and localized power density. Techniques to scale threshold and supply voltages to lower values in order to reduce power consumption of the part have also run into physical limitations, exacerbating power and cooling problems in deep sub-micron CMOS process generations. Smaller device geometries are also subject to increased sensitivity to common failure modes as well as manufacturing process variability.

In the face of these added challenges, we observe a shift in the focus of the industry, away from building ever-larger single-core chips, whose focus is on reducing single-threaded latency, toward a design approach that employs multiple cores on a single chip to improve throughput. While the early multicore era utilized the existing single-core designs of the previous generation in small numbers, subsequent generations have introduced cores tailored to multicore use. These cores seek to achieve power-efficient throughput and have led to a new emphasis on throughput-oriented computing, particularly for Internet workloads, where the end-to-end computational task is dominated by long-latency network operations. The ubiquity of these workloads makes a compelling argument for throughput-oriented designs, but does not free the microarchitect fully from the latency demands of common workloads in the enterprise and desktop application spaces.

We believe that a continued need for both throughput-oriented and latency-sensitive processors will exist in coming generations of technology. We further opine that making effective use of the additional transistors that will be available may require different techniques for latency-sensitive designs than for throughput-oriented ones, since we may trade latency or throughput for the desired attribute of a core in each of the respective paradigms.

We make three major contributions with this thesis. Our first contribution is a fine-grained fault diagnosis and deconfiguration technique for array structures, such as the ROB, within the microprocessor core. We present and evaluate two variants of this technique. The first variant uses an existing fault detection and correction technique, whose scope is the processor core execution pipeline, to ensure correct processor operation. The second variant integrates fault detection and correction into the array structure itself to provide a self-contained, fine-grained fault detection, diagnosis, and repair technique.

In our second contribution, we develop a lightweight, fine-grained fault diagnosis mechanism for the processor core. In this work, we leverage the first contribution's methods to provide deconfiguration of faulty array elements. We additionally extend the scope of that work to include all pipeline circuitry from instruction issue to retirement.

In our third and final contribution, we focus on throughput-oriented core data cache design. In this work, we study the demands of the throughput-oriented core running a representative workload and then propose and evaluate an alternative data cache implementation that more closely matches the demands of the core. We then show that a better-matched cache design can be exploited to provide improved throughput under a fixed power budget.

Our results show that typical latency-sensitive cores have sufficient redundancy to make fine-grained hard-fault tolerance an affordable alternative for hardening complex designs. Our designs suffer little or no performance loss when no faults are present and retain nearly the same performance characteristics in the presence of small numbers of hard faults in protected structures. In our study of the latency-sensitive core, we have shown that SRAM-based designs have low latencies that end up providing less benefit to a throughput-oriented core and workload than a better-fitted data cache composed of DRAM. The move from a high-power, fast technology to a lower-power, slower technology allows us to increase L1 data cache capacity, which is a net benefit for the throughput-oriented core. / Dissertation
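As a purely illustrative sketch (not taken from the dissertation), the C fragment below shows one way a fine-grained deconfiguration map for an array structure such as the ROB could be kept: one bit per entry marks it as diagnosed faulty, and allocation simply skips deconfigured entries, so capacity degrades gracefully instead of the whole core being retired. All names and sizes are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>

    #define ROB_ENTRIES 128

    /* One bit per ROB entry: 1 = diagnosed faulty, i.e. deconfigured. */
    static uint64_t rob_fault_map[ROB_ENTRIES / 64];

    static inline bool rob_entry_faulty(unsigned idx) {
        return (rob_fault_map[idx / 64] >> (idx % 64)) & 1u;
    }

    /* Diagnosis logic calls this after attributing a hard fault to an entry. */
    static inline void rob_deconfigure(unsigned idx) {
        rob_fault_map[idx / 64] |= 1ull << (idx % 64);
    }

    /* Allocation skips deconfigured entries; returns -1 if none are usable. */
    static int rob_alloc_next(unsigned head) {
        for (unsigned i = 0; i < ROB_ENTRIES; i++) {
            unsigned idx = (head + i) % ROB_ENTRIES;
            if (!rob_entry_faulty(idx))
                return (int)idx;
        }
        return -1;
    }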
473

The Linux Porting and Integration Verification of An Academic 32-bit Processor

Chen, Chien-Chih 10 September 2012 (has links)
To improve the performance and range of applications of a microprocessor, it is necessary to integrate the pipelined core, exception control unit, cache unit, and memory management unit (MMU). Booting an operating system is an effective way to verify microprocessor integration; however, pinpointing the exact design bug from an operating-system boot crash is not a feasible debugging methodology. We observe that the main execution features of an operating system are data transfer and exception handling, and we propose an integration verification methodology based on these execution features. The methodology verifies concurrent cache transfer operations, consecutive cache transfer operations, external interrupt exception handling, page fault exception handling, and multiple interrupt exception handling for microprocessor integration. We apply the methodology to the ARM7-Like processor developed by our laboratory. The proposed software-based verification methodology effectively detects design bugs in RTL simulation, and the modified ARM7-Like microprocessor successfully boots the Linux kernel and executes user applications on an FPGA.
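Purely as an illustration of the methodology's structure (this is not the thesis's testbench, and all names are invented), a bare-metal C skeleton enumerating the five verification scenarios might look like this; each stub would be replaced by a directed test running against the RTL model.

    typedef int (*integration_test_fn)(void);

    static int test_concurrent_cache_transfer(void)  { return 1; } /* I-fetch and D-access overlapping   */
    static int test_consecutive_cache_transfer(void) { return 1; } /* back-to-back line fills/write-backs */
    static int test_external_interrupt(void)         { return 1; } /* IRQ taken and returned from        */
    static int test_page_fault(void)                 { return 1; } /* MMU miss handled by the OS path    */
    static int test_multiple_interrupts(void)        { return 1; } /* nested/overlapping exceptions      */

    /* Returns the number of failing scenarios; the stubs above always "pass". */
    int run_integration_suite(void) {
        const integration_test_fn tests[] = {
            test_concurrent_cache_transfer,
            test_consecutive_cache_transfer,
            test_external_interrupt,
            test_page_fault,
            test_multiple_interrupts,
        };
        int failures = 0;
        for (unsigned i = 0; i < sizeof(tests) / sizeof(tests[0]); i++)
            failures += tests[i]() ? 0 : 1;
        return failures;
    }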
474

DESIGNING COST-EFFECTIVE COARSE-GRAINED RECONFIGURABLE ARCHITECTURE

Kim, Yoonjin 2009 May 1900 (has links)
Application-specific optimization of embedded systems has become inevitable as designers try to satisfy market demand while meeting ever tighter constraints on cost, performance, and power. On the other hand, the flexibility of a system is also important to accommodate the short time-to-market requirements of embedded systems. To reconcile these incompatible demands, coarse-grained reconfigurable architecture (CGRA) has emerged as a suitable solution. A typical CGRA requires many processing elements (PEs) and a configuration cache for reconfiguration of its PE array. However, such a structure consumes significant area and power. Therefore, designing cost-effective CGRAs has been a serious concern for the reliability of CGRA-based embedded systems. As an effort to provide such a cost-effective design, the first half of this work focuses on reducing power in the configuration cache. For power saving in the configuration cache, a low-power reconfiguration technique is presented based on reusable context pipelining, achieved by merging the concept of context reuse into context pipelining. In addition, we propose dynamic context compression, which keeps only the required bits of each context word enabled and disables the redundant bits. Finally, we provide dynamic context management, which reduces power consumption in the configuration cache by controlling read/write operations on the redundant context words. In the second part of this dissertation, we focus on designing a cost-effective PE array to reduce area and power. For area and power saving in a PE array, we devise a cost-effective array fabric that addresses a novel rearrangement of processing elements and their interconnection designs to reduce area and power consumption. In addition, hierarchical reconfigurable computing arrays are proposed, consisting of two reconfigurable computing blocks with two types of communication structure. The two computing blocks share critical resources, and such a sharing structure provides an efficient communication interface between them while reducing overall area. Based on the proposed design approaches, a CGRA combining the multiple design schemes is shown to verify the synergy effect of the integrated approach. Experimental results show that the integrated approach reduces area by 23.07% and power by up to 72% when compared with the conventional CGRA.
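To make the dynamic context compression idea concrete, here is an illustrative C sketch; the field set, widths, and names are invented and are not the dissertation's actual context-word format. Only the fields whose enable bit is set are stored in (and read from) the configuration cache, and disabled fields receive a default value when the context word is expanded for the PE.

    #include <stdint.h>

    enum ctx_field { CTX_OPCODE, CTX_SRC_MUX, CTX_DST_MUX, CTX_IMM, CTX_FIELD_COUNT };

    typedef struct {
        uint8_t  enable_mask;              /* bit i set => field i is present        */
        uint16_t fields[CTX_FIELD_COUNT];  /* packed values for the enabled fields   */
    } compressed_context;

    /* Expand a compressed context into a full context word; disabled fields keep a
       default (don't-care) value, so the PE still sees a complete configuration. */
    static void expand_context(const compressed_context *c,
                               uint16_t full[CTX_FIELD_COUNT]) {
        unsigned next = 0;
        for (unsigned f = 0; f < CTX_FIELD_COUNT; f++) {
            if (c->enable_mask & (1u << f))
                full[f] = c->fields[next++];  /* only enabled fields consume storage */
            else
                full[f] = 0;                  /* redundant field: default value      */
        }
    }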
475

The Umbrella File System: Storage Management Across Heterogeneous Devices

Garrison, John Allen 2010 May 1900 (has links)
With the advent of Flash-based solid state devices (SSDs), the differences among the physical devices used to store data in computers are becoming more and more pronounced. Effectively mapping those device-level differences to the files, and to the applications using the devices, is the problem addressed in this dissertation. This dissertation presents the Umbrella File System (UmbrellaFS), a layered file system designed to effectively map file and device level differences, while maintaining a single coherent directory structure for users. Particular files are directed to appropriate underlying file systems by intercepting the system calls connecting the Virtual File System (VFS) to the underlying file systems. Files are evaluated by a policy module that can examine both filenames and file metadata to make decisions about final placement. Files are transparently directed to and moved between appropriate file systems based on their characteristics. A prototype of UmbrellaFS is implemented as a loadable kernel module in the 2.4 and 2.6 Linux kernels. In addition to providing the ability to direct files to file systems, UmbrellaFS enables different decisions at other layers of the storage stack. In particular, alternate page cache writeback methods are presented through the use of UmbrellaFS. A multiple queue strategy based on file sequentiality and a sorting strategy are presented as alternatives to standard Linux cache writeback protocols. These strategies are implemented in a 2.6 Linux kernel and show improvements in a variety of benchmarks and tests.
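A toy illustration of the kind of decision such a policy module makes follows; the function, rules, and threshold are invented for illustration and are not UmbrellaFS's actual policy interface. A file's name and stat metadata are examined and a backing file system is chosen.

    #include <string.h>
    #include <sys/stat.h>

    enum backing_fs { FS_SSD, FS_HDD };

    /* Toy placement policy: source files go to the SSD-backed file system,
       large files to the disk-backed one; everything else defaults to SSD. */
    static enum backing_fs place_file(const char *name, const struct stat *st) {
        const char *ext = strrchr(name, '.');
        if (ext && (strcmp(ext, ".c") == 0 || strcmp(ext, ".h") == 0))
            return FS_SSD;                    /* filename-based rule            */
        if (st->st_size > (16L << 20))
            return FS_HDD;                    /* metadata-based rule: > 16 MiB  */
        return FS_SSD;
    }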
476

Mobile Home Node: Improving Directory Cache Coherence Performance in NoCs via Exploitation of Producer-Consumer Relationships

Soni, Tarun 2010 August 1900 (has links)
The implementation of multiple processors on a single chip has been made possible by advancements in process technology. The benefits of having multiple cores on a single chip bring with them a new set of constraints for maintaining fast and consistent memory accesses. Cache coherence protocols are needed to maintain the consistency of shared memory across individual caches. Current cache coherence protocols are either snoop-based, which is not scalable but provides fast access for a small number of cores, or directory-based, which involves a directory that acts as the ordering point, providing scalability with relatively slower access. Our focus is on improving the memory access time of the scalable directory protocol. We have observed that most memory requests follow a pattern wherein one of the processors, which we dub the Producer, repeatedly writes to a particular memory location, while a subset of the remaining cores, which we dub the Consumers, repeatedly read the data from that same memory location. In our implementation we exploit this relationship to provide direct cache-to-cache transfers and minimize access time by avoiding the indirection through the directory. We move the directory temporarily to the Producer node so that a Consumer can request the cache line directly from the Producer. Our technique improves memory access time by 13 percent and reduces network traffic by 30 percent over a standard directory coherence protocol, with very little area overhead.
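A schematic sketch of the underlying idea (not the thesis's actual hardware): the directory keeps a small amount of per-block state to recognize a stable producer and then migrates the block's home node to that producer, so subsequent consumer reads become direct cache-to-cache transfers. Field names and the threshold below are invented.

    #include <stdint.h>

    typedef struct {
        uint16_t last_writer;    /* core id of the most recent writer          */
        uint16_t writer_streak;  /* consecutive writes by that same core       */
        uint16_t home_node;      /* tile currently acting as ordering point    */
    } dir_entry;

    #define PRODUCER_THRESHOLD 4 /* writes in a row before we call it a producer */

    static void on_write(dir_entry *e, uint16_t writer) {
        if (writer == e->last_writer) {
            e->writer_streak++;
        } else {
            e->last_writer = writer;
            e->writer_streak = 1;
        }
        /* Once a stable producer is seen, move the home node to it so consumers'
           read requests go straight to the producer instead of indirecting
           through the original directory tile. */
        if (e->writer_streak >= PRODUCER_THRESHOLD)
            e->home_node = writer;
    }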
477

High-Availability for ZOPE

Damaschke, Marko 11 June 2005 (has links) (PDF)
This thesis investigates which mechanisms for ensuring the highest possible availability (high availability), which load-balancing mechanisms based on the ZEO product or similar, and which caching strategies can sensibly be deployed on a ZOPE server. In doing so, it examines the applicability of existing products from the ZOPE ecosystem and the possible need to implement further ones in-house. The server infrastructure of the Bildungsmarktplatz Sachsen forms the setting for this work.
478

Enhanced font services for X Window system

Tsang, Pong-fan, Dex. January 2000 (has links)
Thesis (M. Phil.)--University of Hong Kong, 2001. / Includes bibliographical references (leaves 80-84).
479

Mitigating DRAM complexities through coordinated scheduling policies

Stuecheli, Jeffrey Adam 04 June 2012 (has links)
Contemporary DRAM systems have maintained impressive scaling by managing a careful balance between performance, power, and storage density. Achieving these goals, however, has come at the cost of significant operational complexity. To realize good performance, systems must properly manage the large number of structural and timing restrictions of the DRAM devices. DRAM's efficient use is further complicated in many-core systems, where the memory interface has to be shared among multiple cores/threads competing for memory bandwidth. In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs and are mostly oblivious to the main memory. This work demonstrates that the era of many-core architectures has created new main memory bottlenecks and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes dramatically expands the memory controller's visibility into processor behavior, at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this work demonstrates that the performance-limiting effects of highly-threaded architectures combined with complex DRAM operation can be overcome. This work shows that, with an awareness of the physical main memory layout and by focusing on writes, both average read and write latency can be shortened, memory power reduced, and overall system performance improved. The use of the "Page-Mode" feature of DRAM devices can mitigate many DRAM constraints. Current open-page policies attempt to garner the highest number of page hits. In an effort to achieve this, such greedy schemes map sequential address sequences to a single DRAM resource. This non-uniform resource usage pattern introduces high levels of conflict when multiple workloads in a many-core system map to the same set of resources. This work presents a scheme that strikes a careful balance between the benefits (increased performance and decreased power) and the detractors (unfairness) of page-mode accesses. In the proposed Minimalist approach, the system targets "just enough" page-mode accesses to garner page-mode benefits while avoiding system unfairness. This is accomplished with the use of a fair memory hashing scheme to control the maximum number of page-mode hits. High-density memory is becoming ever more important as many execution streams are consolidated onto single-chip many-core processors. DRAM is ubiquitous as a main memory technology, but while DRAM's per-chip density and frequency continue to scale, the time required to refresh its dynamic cells has grown at an alarming rate. This work shows how currently employed methods to schedule refresh operations are ineffective in mitigating the significant performance degradation caused by longer refresh times. Current approaches are deficient: they do not effectively exploit the flexibility of DRAMs to postpone refresh operations. This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed by the industry-standard DRAM memory specifications. The proposed mechanisms are shown to mitigate much of the penalty seen with dense DRAM devices. In summary, this work presents a significant improvement in the ability to exploit the capabilities of high-density, high-frequency DRAM devices in a many-core environment. This is accomplished through coordination of previously disparate system components, exploiting their integration into highly integrated system designs. / text
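An illustrative address-to-bank mapping in the spirit of the "just enough page-mode" idea described above (the field widths and function are examples, not the dissertation's exact hash): a short run of consecutive cache lines shares one DRAM page, after which higher-order bits, XOR-folded with the row, steer the stream to a different bank so no single bank is monopolized by one sequential workload.

    #include <stdint.h>

    #define LINE_BITS 6   /* 64-byte cache line                              */
    #define RUN_BITS  2   /* at most 4 consecutive lines per page-mode run   */
    #define BANK_BITS 4   /* 16 banks                                        */

    /* Map a physical address to a DRAM bank, limiting page-mode run length. */
    static unsigned bank_of(uint64_t paddr) {
        uint64_t line = paddr >> LINE_BITS;
        uint64_t bank = (line >> RUN_BITS) & ((1u << BANK_BITS) - 1);
        uint64_t row  = line >> (RUN_BITS + BANK_BITS);
        /* XOR-fold the row into the bank index so different workloads'
           sequential streams land on different banks (fairness). */
        return (unsigned)((bank ^ row) & ((1u << BANK_BITS) - 1));
    }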
480

Memory-subsystem resource management for the many-core era

Kaseridis, Dimitrios 11 July 2012 (has links)
As semiconductor technology continues to scale lower into the nanometer era, the communication between processor and main memory has been particularly challenged. The well-studied frequency, memory, and power "walls" have redirected architects towards utilizing Chip Multiprocessors (CMPs) as an attractive architecture for leveraging technology scaling. In order to achieve high efficiency and throughput, CMPs rely heavily on sharing resources among multiple cores, especially in the case of the memory hierarchy. Unfortunately, such sharing introduces resource contention and interference between the multiple executing threads. The ever-increasing access latency difference between processor and memory, the gradually increasing memory bandwidth demands on main memory, and the decreasing cache capacity available to each core due to multiple-core integration have made the need for efficient memory-subsystem resource management more critical than ever before. This dissertation focuses on managing the sharing of Last-Level Cache (LLC) capacity and main memory bandwidth, as the two most important resources that significantly affect system performance and energy consumption. The presented schemes include efficient solutions to all three basic requirements for implementing a resource management scheme, that is: a) profiling mechanisms to capture applications' resource requirements, b) microarchitecture mechanisms to enforce a resource allocation scheme, and c) resource allocation algorithms/policies to manage the available memory resources throughout the whole memory hierarchy of a CMP system. To achieve these targets, the dissertation first describes a set of low-overhead, non-invasive profiling mechanisms that are able to project applications' memory resource requirements and memory sharing behavior. Two memory resource partitioning schemes are presented. The first one, the Bank-aware dynamic partitioning scheme, provides a low-overhead solution for partitioning the cache resources of large CMP architectures that are based on a Dynamic Non-Uniform Cache Architecture (DNUCA) last-level cache design, consistent with current industry trends. In addition, the second scheme, the Bandwidth-aware dynamic scheme, presents a system-wide optimization of memory-subsystem resource allocation and job scheduling for large, multi-chip CMP systems. The scheme seeks optimizations both within and across single CMP chips, aiming at overall system throughput and efficiency improvements. As cache partitioning schemes with isolated partitions impose a set of restrictions on the use of the last-level cache, which can severely affect the performance of large CMP designs, this dissertation presents a Quasi-partitioning scheme that breaks such restrictions while providing most of the benefits of cache partitioning. The presented solution is able to efficiently scale to a significantly larger number of cores than previously described schemes based on isolated partitions can achieve. Finally, as the memory controller is one of the fundamental components of the memory subsystem, a well-designed memory-subsystem resource management scheme needs to carefully utilize the memory controller resources and coordinate its functionality with the operation of the main memory and the last-level cache. To improve execution fairness and system throughput, this dissertation presents a criticality-based priority scheme for memory controller requests.
The scheme ranks demand read and prefetch operations based on their latency sensitivity, while it coordinates its operation with the DRAM page-mode policy and the memory data prefetcher. / text
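As a simplified illustration of how profiled resource requirements can drive an allocation policy (this is a generic utility-based way allocation, not the dissertation's Bank-aware or Bandwidth-aware algorithm), the sketch below greedily gives each additional LLC way to the core whose profiled miss curve benefits most from it; the miss curves are assumed to be non-increasing in the number of ways.

    #define MAX_CORES 8
    #define LLC_WAYS  16

    /* misses[c][w] = profiled misses of core c when given w ways (w = 0..LLC_WAYS),
       assumed non-increasing in w. alloc[c] receives the number of ways for core c. */
    void partition_ways(const unsigned long misses[MAX_CORES][LLC_WAYS + 1],
                        unsigned alloc[MAX_CORES]) {
        for (unsigned c = 0; c < MAX_CORES; c++)
            alloc[c] = 0;
        for (unsigned given = 0; given < LLC_WAYS; given++) {
            unsigned best_core = 0;
            unsigned long best_gain = 0;
            for (unsigned c = 0; c < MAX_CORES; c++) {
                unsigned w = alloc[c];
                unsigned long gain = misses[c][w] - misses[c][w + 1]; /* marginal utility */
                if (gain > best_gain) {
                    best_gain = gain;
                    best_core = c;
                }
            }
            alloc[best_core]++;  /* next way goes to the core that benefits most */
        }
    }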
