Global ETD Search

341	Auto-Determination of Cache/TLB parameters Kommanaboina, Kishor Yadav 23 August 2013 (has links) No description available. Computer Science Cache TLB Microbenchmarks inclusive caches exclusive caches
342	The Development of Cooperative Enterprises in Cache Valley 1865-1900 Felix, Joseph Carl 01 January 1956 (has links) (PDF) As one studies the history of Cache Valley, he becomes increasingly aware of the presence of church-sponsored cooperative stores, farms, and mills, in every community in the valley. True, there are only scattered remains of a once rather extensive movement, but there is enough evidence to cause one to wonder what influence the cooperative enterprises had in the settlement of Cache Valley. This study has been made to determine the extent of this contribution and to preserve as much information as possible concerning a very important phase of the settlement days in Cache Valley.This study includes only the period from 1865 to 1900. These are the important years of church-sponsored cooperative institutions in Cache Valley. The general plan of cooperation was introduced formally in the October Conference of 1868. There were a few cooperative stores in operation prior to this time, however. The movement grew to magnanimous proportions before dwindling to a mere trickle by 1900. There were only a few concerns that extended beyond this date.Data for this study has been obtained from many sources. Newspapers, journals, and other manuscripts have been the most valuable sources. Other important sources have included personal interviews, secondary sources, and company records. Cache Valley Utah Idaho History Cultural History Mormon Studies Sociology
343	Analyzing Instructtion Based Cache Replacement Policies Xiang, Ping 01 January 2010 (has links) The increasing speed gap between microprocessors and off-chip DRAM makes last-level caches (LLCs) a critical component for computer performance. Multi core processors aggravate the problem since multiple processor cores compete for the LLC. As a result, LLCs typically consume a significant amount of the die area and effective utilization of LLCs is mandatory for both performance and power efficiency. We present a novel replacement policy for last-level caches (LLCs). The fundamental observation is to view LLCs as a shared resource among multiple address streams with each stream being generated by a static memory access instruction. The management of LLCs in both single-core and multi-core processors can then be modeled as a competition among multiple instructions. In our proposed scheme, we prioritize those instructions based on the number of LLC accesses and reuses and only allow cache lines having high instruction priorities to replace those of low priorities. The hardware support for our proposed replacement policy is light-weighted. Our experimental results based on a set of SPEC 2006 benchmarks show that it achieves significant performance improvement upon the least-recently used (LRU) replacement policy for benchmarks with high numbers of LLC misses. To handle LRU-friendly workloads, the set sampling technique is adopted to retain the benefits from the LRU replacement policy. memory hierarchy cache replacement policy Computer Engineering Engineering
344	High-Performance Matrix Multiplication: Hierarchical Data Structures, Optimized Kernel Routines, and Qualitative Performance Modeling Wu, Wenhao 02 August 2003 (has links) The optimal implementation of matrix multiplication on modern computer architectures is of great importance for scientific and engineering applications. However, achieving the optimal performance for matrix multiplication has been continuously challenged both by the ever-widening performance gap between the processor and memory hierarchy and the introduction of new architectural features in modern architectures. The conventional way of dealing with these challenges benefits significantly from the blocking algorithm, which improves the data locality in the cache memory, and from the highly tuned inner kernel routines, which in turn exploit the architectural aspects on the specific processor to deliver near peak performance. A state-of-art improvement of the blocking algorithm is the self-tuning approach that utilizes "heroic" combinatorial optimization of parameters spaces. Other recent research approaches include the approach that explicitly blocks for the TLB (Translation Lookaside Buffer) and the hierarchical formulation that employs memoryriendly Morton Ordering (a spaceilling curve methodology). This thesis compares and contrasts the TLB-blocking-based and Morton-Order-based methods for dense matrix multiplication, and offers a qualitative model to explain the performance behavior. Comparisons to the performance of self-tuning library and the "vendor" library are also offered for the Alpha architecture. The practical benchmark experiments demonstrate that neither conventional blocking-based implementations nor the self-tuning libraries are optimal to achieve consistent high performance in dense matrix multiplication of relatively large square matrix size. Instead, architectural constraints and issues evidently restrict the critical path and options available for optimal performance, so that the relatively simple strategy and framework presented in this study offers higher and flatter overall performance. Interestingly, maximal inner kernel efficiency is not a guarantee of global minimal multiplication time. Also, efficient and flat performance is possible at all problem sizes that fit in main memory, rather than "jagged" performance curves often observed in blocking and self-tuned blocking libraries. performance tuning matrix multiplication hierarchical matrix storage cache model
345	Improving Hard Disk Drive Write IO Performance with Phase Change Memory as a Buffer Cache Balasubramanian, Sanchayeni January 2017 (has links) No description available. Computer Engineering Buffer cache PCM Hard Disk Drives performance
346	A STUDY OF CLUSTER PAGING METHODS TO BOOST VIRTUAL MEMORY PERFORMANCE RAMAN, VENKATESH 11 March 2002 (has links) No description available. virtual memory page fault cluster paging swap cache operating systems
347	A STUDY OF SWAP CACHE BASED PREFETCHING TO IMPROVE VITUAL MEMORY PERFORMANCE KUNAPULI, UDAYKUMAR 11 March 2002 (has links) No description available. virtual memory page fault swap cache prediction operating systems
348	A Hybrid Network-on-Chip and Segmented Bus Architecture for Large Caches Velayutham, Chandru 20 April 2009 (has links) No description available. Engineering Network on Chip bus cache NUCA multi-core
349	Intelligent Caching to Mitigate the Impact of Web Robots on Web Servers Rude, Howard Nathan January 2016 (has links) No description available. Computer Science web cache web robots crawlers prefetching prediction LSTM
350	On Optimizing and Leveraging Distributed Shared Memory for High Performance, Resource Aggregation, and Cache-coherent Heterogeneous-ISA Processors Chuang, Ho-Ren 28 June 2022 (has links) This dissertation focuses on the problem space of heterogeneous-ISA multiprocessors – an architectural design point that is being studied by the academic research community and increasingly available in commodity systems. Since such architectures usually lack globally coherent shared memory, software-based distributed shared memory (DSM) is often used to provide the illusion of such a memory. The DSM abstraction typically provides this illusion using a reader-replicate, writer-invalidate memory consistency protocol that operates at the granularity of memory pages and is usually implemented as a first-class operating system abstraction. This enables symmetric multiprocessing (SMP) programming frameworks, augmented with a heterogeneous-ISA compiler, to use CPU cores of different ISAs for parallel computations as if they are of the same ISA, improving programmability, especially for legacy SMP applications which therefore can run unmodified on such hardware. Past DSMs have been plagued by poor performance, in part due to the high latency and low bandwidth of interconnect network infrastructures. The dissertation revisits DSM in light of modern interconnects that reverse this performance trend. The dissertation presents Xfetch, a bulk page prefetching mechanism designed for the DEX DSM system. Xfetch exploits spatial locality, and aggressively and sequentially prefetches pages before potential read faults, improving DSM performance. Our experimental evaluations reveal that Xfetch achieves up to ≈142% speedup over the baseline DEX DSM that does not prefetch page data. SMP programming models often allow primitives that permit weaker memory consistency semantics, where synchronization updates can be delayed, permitting greater parallelism and thereby higher performance. Inspired by such primitives, the dissertation presents a DSM protocol called MWPF that trades-off memory consistency for higher performance in select SMP code regions, targeting heterogeneous-ISA multiprocessor systems. MWPF also overcomes performance bottlenecks of past DSM systems for heterogeneous-ISA multiprocessors such as due to significant number of invalidation messages, false page sharing, large number of read page faults, and large synchronization overheads by using efficient protocol primitives that delay and batch invalidation messages, aggressively prefetch data pages, and perform cross-domain synchronization with low overhead. Our experimental evaluations reveal that MWPF achieves, on average, 11% speedup over the baseline DSM implementation. The dissertation presents PuzzleHype, a distributed hypervisor that enables a single virtual machine (VM) to use fragmented resources in distributed virtualized settings such as CPU cores, memory, and devices of different physical hosts, and thereby decrease resource fragmentation and increase resource utilization. PuzzleHype leverages DSM implemented in host operating systems to present an unified and consistent view of a continuous pseudo-physical address space to guest operating systems. To transparently utilize CPU and I/O resources, PuzzleHype integrates multiple physical CPUs into a single VM by migrating threads, forwarding interrupts, and by delegating I/O. Our experimental evaluations reveal that PuzzleHype yields speedups in the range of 355%–173% over baseline over-provisioning scenarios which are otherwise necessary due to resource fragmentation. To enable a distributed hypervisor to adapt to resource and workload changes, the dissertation proposes the concept of CPU borrowing that allows a VM's virtual CPU (vCPU) to migrate to an available physical CPU (pCPU) and release it when it is no longer necessary, i.e., CPU returning. CPU borrowing can thus be used when a node is over-committed, and CPU returning can be used when the borrowed CPU resource is no longer necessary. To transparently migrate a vCPU at runtime without incurring a significant downtime, the dissertation presents a suite of techniques including leveraging thread migration, loading/restoring vCPU in KVM states, maintaining a global vCPU location table, and creating a DSM kernel thread for handling on-demand paging. Our experimental evaluations reveal that migrating vCPUs to resource-available nodes achieves a speedup of 1.4x over running the vCPUs on distributed nodes. When a VM spans multiple nodes, it is likelihood for failure increases. To mitigate this, the dissertation presents a distributed checkpoint/restart mechanism that allows a distributed VM to tolerate failures. A user interface is introduced for sending/receiving checkpoint/restart commands to a distributed VM. We implement the checkpoint/restart technique in the native KVM tool, and extend it to a distributed mode by converting Inter-Process Communication (IPC) into message passing between nodes, pausing/resuming distributed vCPU executions, and loading/restoring runtime states on the correct set of nodes. Our experimental evaluations indicate that the overhead of checkpointing a distributed VM is ≈10% or less than that of the native KVM tool with our checkpoint support. Restarting a distributed VM is faster than native KVM with our restart support because no additional page faults occur during restarting. The dissertation's final contribution is PopHype, a system software stack that allows simulation of cache-coherent, shared memory heterogeneous-ISA hardware. PopHype includes a Linux operating system that implements DSM as an OS abstraction for processes, i.e., allows multiple processes running on multiple (ISA-different) machines to share memory. With KVM-enabled, this OS becomes a hypervisor that allows multiple, process-based instances of an architecture emulator such as QEMU to execute in a shared address space, allowing multiple QEMU instances to emulate different ISAs in shared memory, i.e., emulate shared memory heterogeneous-ISA hardware. PopHype also includes a modified QEMU to use process-level DSM and an optimized guest OS kernel for improved performance. Our experimental studies confirm PopHype's effectiveness, and reveal that PopHype achieves an average speedup of 7.32x over a baseline that runs multiple QEMU instances in shared memory atop a single host OS. / Doctor of Philosophy / Computing devices are ubiquitous around us. Each of these devices is powered by specialized chips called processors. These processors take in instructions, process them, and produce output. Such processing is what enables us, humans, to send messages to our loved ones, take photographs, as well as carry out various business functions such as using spreadsheet software. The kinds of instructions these processors execute are classified into so-called Instruction Set Architectures or ISAs. Chip designers build processors adopting different ISAs for various applications ranging from computing on mobile phones to cloud computing data centers used by large technology companies. Within a data center, there are typically hundreds of thousands of computing devices that serve an organization's purpose to serve millions or even billions of users. Programming these computers individually to serve a collective goal is an arduous task requiring hundreds of software engineering experts. To simplify programming these computers on a large scale, this thesis envisions an abstraction where tens of devices appear as one computing unit to the programmer, allowing them to program multiple computers as if they are one. This allows for better resource utilization in the sense that the power of multiple computing devices can be pooled together without the need to acquire newer, larger, and more-expensive computers. Furthermore, such pooling allows the software to leverage multiple different ISAs on different computers instead of a single ISA on one computer. This thesis also envisions a way for software to run on multiple computers with potentially different ISAs without exposing the difficulty of managing them to the software engineers. Distributed Shared Memory Virtualization Scale-out Non-cache-coherent Domains

Search results