Fault-tolerant aspects of memory systems

Memory system design is important for providing high reliability and availability. This dissertation presents a memory architecture that supports checkpoints to improve reliability, along with algorithms to improve recoverable virtual memory. In addition, two novel techniques of reliability analysis are presented that account for program and operating system behavior.

Checkpoint and rollback recovery is a method that allows a system to tolerate a failure by periodically saving its state and, if an error occurs, rolling back to the prior checkpoint. A technique is proposed that embeds the support for checkpoint and rollback recovery directly into the virtual memory translation hardware. A system with both highly reliable and normal memory enables recoverable virtual memory by placing modified data in the highly reliable memory and read-only data in normal memory. Hybrid algorithms are proposed for use in systems with multiple classes of physical memory; that is, one virtual memory policy for the highly reliable memory and one for the normal memory. These techniques are analyzed with a trace-driven simulation.

Reliability analysis of memories and their relationship to system reliability is an important aspect of system design, and the dynamic aspects of memory matter in particular. Two aspects studied here are the memory usage patterns of a program and memory allocation by the operating system. A new model is developed for the successful execution of a program, taking into account memory reference patterns. This model is contrasted with traditional memory reliability calculations, showing that the actual reliability may be higher than those calculations suggest when program behavior is considered. A new theory is proposed to explain correlations between increased workloads and increased failure rates. The tradeoffs in performance and reliability for memory management policies (e.g., virtual or cache memory) are studied as a function of the block-miss reload time. A very small percentage of the memory is found to contribute to a majority of the unreliability. Techniques are proposed to dramatically improve the reliability: an algorithm called selective scrubbing and the use of very small amounts of highly reliable memory.
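The checkpoint and rollback idea described in the abstract can be illustrated with a minimal software sketch. This is not the dissertation's mechanism, which embeds the support in the virtual memory translation hardware; it only shows the general save, fail, roll back, and re-execute cycle. All names here (`CheckpointedState`, `run_with_recovery`, the `counter` field, the simulated fault) are hypothetical and chosen for illustration.

```python
import copy


class CheckpointedState:
    """Toy in-memory state with a single saved checkpoint (illustrative only)."""

    def __init__(self, initial):
        self.state = dict(initial)
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a full copy of the current state.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and restore the last checkpoint.
        self.state = copy.deepcopy(self._saved)


def run_with_recovery(mem, steps, interval):
    """Run steps in order, checkpointing every `interval` completed steps.

    On a RuntimeError, roll back and resume from the first step after the
    last checkpoint. Assumes faults are transient; a permanent fault would
    make this loop retry forever.
    """
    resume_at = 0   # index of the first step not covered by the checkpoint
    i = 0
    rollbacks = 0
    while i < len(steps):
        try:
            steps[i](mem.state)
            i += 1
            if i % interval == 0:
                mem.checkpoint()
                resume_at = i
        except RuntimeError:
            mem.rollback()
            i = resume_at
            rollbacks += 1
    return rollbacks


def increment(state):
    state["counter"] += 1


def make_transient_fault():
    """Return a step that corrupts progress and fails on its first call only."""
    fired = {"done": False}

    def step(state):
        state["counter"] += 1  # update happens before the simulated error
        if not fired["done"]:
            fired["done"] = True
            raise RuntimeError("simulated transient memory error")

    return step


mem = CheckpointedState({"counter": 0})
steps = [increment, increment, increment, make_transient_fault(), increment]
rollbacks = run_with_recovery(mem, steps, interval=2)
print(mem.state["counter"], rollbacks)
```

The sketch copies the whole state at each checkpoint, which is the simplest choice but also the most expensive one; reducing that cost is the kind of concern that motivates hardware support of the sort the abstract proposes.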

Identifier: oai:union.ndltd.org:UMASS/oai:scholarworks.umass.edu:dissertations-8252
Date: 01 January 1992
Creators: Bowen, Nicholas S
Publisher: ScholarWorks@UMass Amherst
Source Sets: University of Massachusetts, Amherst
Language: English
Detected Language: English
Type: text
Source: Doctoral Dissertations Available from Proquest
