1 |
Enhancing Storage Dependability and Computing Energy Efficiency for Large-Scale High Performance Computing Systems
Huang, Song 05 1900
With the advent of the information explosion age, larger-capacity disk drives are used to store data and powerful devices are used to process big data. As the scale and complexity of computer systems increase, we expect these systems to provide dependable and energy-efficient services and computation. Although hard drives are reliable in general, they are the most commonly replaced hardware components. Disk failures cause data corruption and even data loss, which can significantly degrade system performance and cause financial losses. In this dissertation research, I analyze different manifestations of disk failures in production data centers and explore data mining techniques combined with statistical analysis methods to discover categories of disk failures and their distinctive properties. I use similarity measures to quantify the degradation process of each failure type and derive the degradation signature. The derived degradation signatures are further leveraged to forecast when future disk failures may happen. This dissertation also studies the energy efficiency of high performance computers. Specifically, I characterize the power and energy consumption of Haswell processors, which are used in multiple supercomputers, and analyze the power and energy consumption of Legion, a data-centric programming model and runtime system, and of Legion applications. We find that power and energy efficiency can be improved significantly by optimizing the settings and runtime scheduling of processors, and that the Legion runtime performs well for larger-scale computation in terms of power and energy consumption.
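The abstract does not say which similarity measure is used, so the following is only an illustrative sketch of the general idea: it assumes Euclidean distance between each health record and the record taken at failure time, and the attribute values are made up. The function names are hypothetical and are not taken from the dissertation.

```python
import math


def distance_to_failure(records):
    """Euclidean distance from each record to the last (failure-time) record."""
    failure = records[-1]
    return [math.dist(r, failure) for r in records]


def normalized_degradation(records):
    """Scale the distances to [0, 1]: 1.0 is farthest from the failure state,
    0.0 is the failure state itself."""
    d = distance_to_failure(records)
    peak = max(d)
    return [x / peak if peak > 0 else 0.0 for x in d]


# Made-up, already-normalized health attributes (two per record), oldest first.
history = [(0.0, 0.0), (0.1, 0.0), (0.3, 0.2), (0.6, 0.5), (1.0, 1.0)]
signature = normalized_degradation(history)
```

Plotting such a curve against time would give one candidate "degradation signature" for a drive; a curve of this kind could then be compared across drives of the same failure category.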
|
2 |
Reliability Modelling Of Whole RAID Storage Subsystems
Karmakar, Prasenjit 04 1900
Reliability modelling of RAID storage systems with their various components, such as RAID controllers, enclosures, expanders, interconnects and disks, is important from a storage system designer's point of view. A model that can express all the failure characteristics of the whole RAID storage system can be used to evaluate design choices, perform cost-reliability trade-offs and conduct sensitivity analyses.
We present a reliability model for RAID storage systems where we try to model all the components as accurately as possible. We use several state-space reduction techniques, such as aggregating all in-series components and hierarchical decomposition, to reduce the size of our model. To automate computation of reliability, we use the PRISM model checker as a CTMC solver where appropriate.
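To make the CTMC idea concrete, here is a minimal sketch of a textbook three-state Markov model for one single-parity RAID group with exponential failure and repair times. It is not the thesis's model and does not use PRISM; the state layout, function name and parameter values are assumptions chosen for illustration.

```python
def mttdl_single_parity(n_disks, fail_rate, repair_rate):
    """Mean time to data loss (MTTDL) for a 3-state CTMC:

    state 0: all disks healthy
    state 1: one disk failed, rebuild in progress
    state 2: second failure during rebuild (data loss, absorbing)

    Rates are per hour; disks fail independently at `fail_rate`.
    """
    lam, mu, n = fail_rate, repair_rate, n_disks
    exit_degraded = (n - 1) * lam + mu        # total exit rate of state 1
    p_loss = (n - 1) * lam / exit_degraded    # P(state 1 -> state 2)
    # Expected absorption times satisfy
    #   T0 = 1/(n*lam) + T1
    #   T1 = 1/exit_degraded + (1 - p_loss) * T0
    # Substituting T1 into the first equation and solving for T0:
    return (1 / (n * lam) + 1 / exit_degraded) / p_loss


# Hypothetical numbers: 8 disks, one failure per 10^6 hours, 10-hour rebuild.
mttdl_hours = mttdl_single_parity(n_disks=8, fail_rate=1e-6, repair_rate=0.1)
```

The result agrees with the closed form ((2N-1)λ + μ) / (N(N-1)λ²). A model of a whole storage system has far more states, which is why a solver and state-space reduction become necessary.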
Initially, we assume a simple 3-state disk reliability model with independent disk failures. Later, we assume a Weibull model for the disks; we also consider a correlated disk failure model to check correspondence with the available field data. For all other components in the system, we assume an exponential failure distribution. To use the CTMC solver, we approximate the Weibull distribution for a disk using a sum of exponentials, and we first confirm that this model gives results that are in reasonably good agreement with those from sequential Monte Carlo simulation methods for RAID disk subsystems.
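The difference between the two lifetime assumptions can be seen in a small sketch. It does not reproduce the sum-of-exponentials approximation described above; it only shows that a Weibull hazard rate varies with age (an exponential one is constant) and checks a plain Monte Carlo estimate of the mean lifetime against the analytic value. The scale and shape values are hypothetical.

```python
import math
import random


def weibull_hazard(t, scale, shape):
    """Instantaneous failure rate of a Weibull lifetime at age t.
    With shape == 1 this reduces to the constant exponential rate 1/scale."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_mean(scale, shape):
    """Analytic mean lifetime: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1 + 1 / shape)


def simulated_mean_lifetime(scale, shape, trials, seed=0):
    """Monte Carlo estimate of the mean lifetime from `trials` samples."""
    rng = random.Random(seed)
    total = sum(rng.weibullvariate(scale, shape) for _ in range(trials))
    return total / trials


# Hypothetical disk: scale of 100,000 hours, shape 1.2 (wear-out behaviour).
scale, shape = 100_000.0, 1.2
analytic = weibull_mean(scale, shape)
estimate = simulated_mean_lifetime(scale, shape, trials=200_000)
```

Because the hazard depends on age when the shape is not 1, a Weibull lifetime is not memoryless and cannot be used directly as a single CTMC transition, which motivates approximating it with exponential stages.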
Next, our model for whole RAID storage systems (which includes, for example, disks, expanders and enclosures) uses Weibull distributions and, where appropriate, correlated failure modes for disks, and exponential distributions with independent failure modes for all other components. Since the CTMC solver cannot handle the size of the resulting models, we solve such models using a hierarchical decomposition technique. We are able to model fairly large configurations with up to 600 disks using this model.
We can use such reasonably complete models to conduct several "what-if" analyses for many RAID storage systems of interest. Our results show that, depending on the configuration, spanning a RAID group across enclosures may increase or decrease reliability. Another key finding from our model results is that redundancy mechanisms such as multipathing are beneficial only if a single failure of some other component does not make a whole RAID group inaccessible.
|
3 |
COMPUTATIONAL FRAMEWORK TO ASSESS ROLE OF MANUFACTURING IN MATERIAL-DEFECT RELATED FAILURE RISK
Subramanian, Rohit 02 October 2014
No description available.
|