301

Stateless Parallel Processing Architecture for Extreme Scale HPC and Auction-based Clouds

Taifi, Moussa January 2013 (has links)
Extreme-scale HPC (high-performance computing) applications require massive numbers of nodes. At these scales, transient hardware and software failures, as well as network congestion and disconnections, increase linearly with the number of components. This volatility has contributed to a dramatic decrease in applications' MTBF (mean time between failures). The semantics of traditional point-to-point transmission APIs are ill-suited to supporting applications at extreme scale. In this thesis, we investigate an application-dependent network design that focuses on the sustainability of extreme-scale high-performance computing applications using packet-switching-inspired statistical multiplexing of semantic data tuples and decoupled computations. We report the design and implementation of a distributed tuple space using Cassandra and Zookeeper for tunable spatial and temporal redundancies without negative impact on application performance. We detail the various failure scenarios that can be handled seamlessly by our system and describe the advantages of Stateless Parallel Processing for HPC applications. We report our results on performance, reliability and overall application sustainability. In preliminary tests, for the most common HPC application categories, the prototype demonstrated sustained performance while providing a reliable computing architecture that can withstand multiple failure types without manual checkpoint-restart (CPR). The feasibility of efficient non-stop HPC enables auction-based clouds for more cost-efficient HPC applications. For all HPC application categories, we first report a novel method for determining bid-aware checkpointing intervals using fluctuating cloud providers' pricing histories. Subsequently, we explore the effects of bidding in the case of virtual HPC clusters composed of EC2 Spot Instances. We expose the counter-intuitive effects of uniform versus non-uniform bidding, especially in terms of failure rate and failure model, and we propose a method to deal with the problem of predicting the runtime of parallel applications under various bidding strategies. We then show that CPR-free HPC applications require a new optimization strategy. As extreme-scale HPC and auction-based cloud computing offer the ultimate computational scale and resource efficiency, they challenge the very foundations of computer science research and development. This thesis answers some critical questions about these challenges, and we hope to pave the way for future improvements of the HPC field under increasingly harsh and volatile conditions. / Computer and Information Science
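The bid-aware checkpointing result above can be made concrete with a small illustration. The sketch below is not the thesis's method: it merely estimates an MTBF from how often a spot-price history exceeds a fixed bid and then applies Young's classical first-order approximation for the checkpoint interval. All names, prices and constants are hypothetical.

```python
import math

def estimated_mtbf(price_history, bid, interval_hours=1.0):
    """Estimate mean time between out-bid revocations from a price history.

    price_history: hourly spot prices; bid: our fixed bid.
    A 'failure' occurs whenever the market price rises above the bid.
    """
    revocations = sum(1 for price in price_history if price > bid)
    if revocations == 0:
        return float("inf")
    return len(price_history) * interval_hours / revocations

def checkpoint_interval(checkpoint_cost_hours, mtbf_hours):
    """Young's first-order approximation of the optimal checkpoint interval."""
    if math.isinf(mtbf_hours):
        return float("inf")  # never out-bid: checkpointing brings no benefit
    return math.sqrt(2 * checkpoint_cost_hours * mtbf_hours)

history = [0.031, 0.029, 0.045, 0.101, 0.030, 0.028, 0.095, 0.033]
mtbf = estimated_mtbf(history, bid=0.05)   # 4.0 hours for this toy history
print(checkpoint_interval(0.05, mtbf))     # ~0.63 hours between checkpoints
```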
302

A High Level Synthesis Approach for Reduced Interconnects and Fault Tolerance

Lemstra, David 01 1900 (has links)
High Level Synthesis (HLS) is a promising approach to managing design complexity at a more abstract level as integrated circuit technology edges deeper into sub-micron design. One useful facet of HLS is the ability to automatically integrate architectural components that can address potential reliability issues, which may be on the increase due to miniaturization. Research into harnessing HLS for fault tolerance (FT) has been progressing since the early 1990s. There now exists a large body of work on methods to incorporate capabilities such as fault detection, compensation, and recovery into HLS design.

While many avenues of FT have been explored in the HLS environment, very little work has considered the effectiveness and feasibility of these techniques in the context of large HLS systems, which presumably are the raison d'être of HLS. While existing HLS FT approaches are often elegant and involve highly sophisticated techniques to achieve optimal solutions, the scalability costs of the HLS infrastructure are not well reported. The intent of this thesis is to explore the ramifications of applying common HLS techniques to large designs.

Furthermore, a new HLS tool named RIFT is presented that is specifically designed to mitigate the infrastructure costs that mount as greater parallelism is exploited. RIFT is named for its design philosophy of "Reducing Interconnects for Fault Tolerance". RIFT iteratively builds a logical hardware representation, consisting of both the instantiated components and their interconnections, one operation at a time. It chooses the next operation to be "mapped" onto the growing design based on scheduling constraints as well as the extra hardware and interconnect costs required to support a particular selection. Emphasis is placed on minimizing the delay of the datapath in an effort to reduce the performance cost associated with the extra interconnects needed for FT. RIFT has been used to generate efficient solutions for FT designs requiring as many as a thousand operations. / Thesis / Master of Applied Science (MASc)
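RIFT's operation-at-a-time construction can be sketched generically. The following greedy skeleton is illustrative only, assuming a caller-supplied incremental cost function; it is not RIFT's actual selection heuristic, and all identifiers are invented.

```python
def map_operations(ops, deps, cost):
    """Greedy, one-operation-at-a-time mapping in the spirit of RIFT.

    ops:  set of operation ids
    deps: dict op -> set of ops that must be mapped first (scheduling)
    cost: function(op, mapped) -> estimated extra hardware/interconnect
          cost of adding `op` to the partial design `mapped`
    """
    mapped = []
    remaining = set(ops)
    while remaining:
        # operations whose scheduling constraints are already satisfied
        ready = [op for op in remaining if deps.get(op, set()) <= set(mapped)]
        # pick the ready operation that adds the least hardware/interconnect
        best = min(ready, key=lambda op: cost(op, mapped))
        mapped.append(best)
        remaining.remove(best)
    return mapped

# toy example: b and c depend on a; the cost function is a stand-in
order = map_operations(
    ops={"a", "b", "c"},
    deps={"b": {"a"}, "c": {"a"}},
    cost=lambda op, mapped: len(mapped) + (op == "c"),
)
print(order)  # ['a', 'b', 'c']
```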
303

Optimising Fault Tolerance in Real-time Cloud Computing IaaS Environment

Mohammed, Bashir, Kiran, Mariam, Awan, Irfan U., Maiyama, Kabiru M. 22 August 2016 (has links)
Yes / Fault tolerance is the ability of a system to respond swiftly to an unexpected failure. Failures in a cloud computing environment are the norm rather than the exception, but fault detection and system recovery in a real-time cloud system is a crucial issue. To deal with this problem and to minimize the risk of failure, an optimal fault tolerance mechanism was introduced in which fault tolerance is achieved through the combination of a Cloud Master, Compute nodes, a Cloud load balancer, a Selection mechanism and a Cloud Fault handler. In this paper, we propose an optimized fault tolerance approach in which a model is designed to tolerate faults based on the reliability of each compute node (virtual machine), and a node can be replaced if its performance is not optimal. Preliminary tests of our algorithm indicate that the rate of increase in pass rate exceeds the rate of decrease in failure rate; the approach also considers forward and backward recovery using diverse software tools. Our results are demonstrated through experimental validation, laying a foundation for a fully fault-tolerant IaaS Cloud environment and suggesting good performance of our model compared to existing approaches. / Petroleum Technology Development Fund (PTDF)
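A minimal sketch of the reliability-driven replacement idea described above, assuming a sliding window of pass/fail observations per virtual machine; the class name, threshold and window size are illustrative, not taken from the paper.

```python
class NodeReliabilityTracker:
    """Track per-VM pass/fail history and flag nodes for replacement,
    loosely following the paper's reliability-driven replacement idea."""

    def __init__(self, threshold=0.8, window=20):
        self.threshold = threshold
        self.window = window          # only the most recent results count
        self.history = {}             # node id -> list of booleans

    def record(self, node, passed):
        results = self.history.setdefault(node, [])
        results.append(passed)
        del results[:-self.window]    # keep a sliding window

    def reliability(self, node):
        results = self.history.get(node, [])
        return sum(results) / len(results) if results else 1.0

    def nodes_to_replace(self):
        return [n for n in self.history
                if self.reliability(n) < self.threshold]

tracker = NodeReliabilityTracker(threshold=0.8)
for ok in (True, True, False, False, False):
    tracker.record("vm-3", ok)
print(tracker.nodes_to_replace())  # ['vm-3'] -> hand off to the load balancer
```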
304

A secure IoT-based modern healthcare system with fault-tolerant decision making process

Gope, P., Gheraibia, Y., Kabir, Sohag, Sikdar, B. 11 October 2020 (has links)
Yes / The advent of the Internet of Things (IoT) has escalated information sharing among various smart devices manyfold, irrespective of their geographical locations. Recently, applications like e-healthcare monitoring have attracted wide attention from the research community, where both the security and the effectiveness of the system are greatly imperative. However, to the best of our knowledge, none of the existing literature accomplishes both of these objectives (e.g., existing systems are not secure against physical attacks). This paper addresses these shortcomings in existing IoT-based healthcare systems. We propose an enhanced system by introducing a Physical Unclonable Function (PUF)-based authentication scheme and a data-driven fault-tolerant decision-making scheme for designing an IoT-based modern healthcare system. Analyses show that our proposed scheme is more secure and efficient than existing systems. Hence, it will be useful in designing an advanced IoT-based healthcare system. / Supported in part by Singapore Ministry of Education Academic Research Fund Tier 1 (R-263-000-D63-114). / Research Development Fund Publication Prize Award winner, July 2020.
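The abstract does not spell out the PUF-based scheme, but the generic challenge-response pattern such schemes build on can be sketched as follows, with the PUF simulated as a secret table plus response noise; every name, parameter and tolerance here is hypothetical.

```python
import os, secrets

def hamming(a: bytes, b: bytes) -> int:
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

class Verifier:
    """Holds pre-enrolled challenge-response pairs (CRPs) for one device."""
    def __init__(self, crps, tolerance=4):
        self.crps = dict(crps)        # challenge -> expected PUF response
        self.tolerance = tolerance    # allowed noisy bits in the response

    def authenticate(self, device):
        challenge = secrets.choice(list(self.crps))
        response = device.respond(challenge)
        ok = hamming(response, self.crps[challenge]) <= self.tolerance
        del self.crps[challenge]      # each CRP is used only once
        return ok

class NoisyPufDevice:
    """Stand-in for silicon: a fixed secret mapping plus response noise."""
    def __init__(self, table):
        self.table = table

    def respond(self, challenge):
        r = bytearray(self.table[challenge])
        r[0] ^= 0x01                  # one flipped bit, modeling PUF noise
        return bytes(r)

crps = {os.urandom(8): os.urandom(16) for _ in range(4)}
device = NoisyPufDevice(dict(crps))
print(Verifier(crps).authenticate(device))  # True despite one noisy bit
```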
305

Effective Fusion and Separation of Distribution, Fault-Tolerance, and Energy-Efficiency Concerns

Kwon, Young Woo 03 July 2014 (has links)
As software applications become increasingly distributed and mobile, their design and implementation are characterized by distributed software architectures, the possibility of faults, and the need for energy awareness. Thus, software developers should be able to simultaneously reason about and handle the concerns of distribution, fault-tolerance, and energy-efficiency. Being closely intertwined, these concerns can introduce significant complexity into the design and implementation of modern software. In other words, to develop reliable and energy-efficient applications, software developers must understand how distribution, fault-tolerance, and energy-efficiency interact with each other and how to implement these concerns while keeping the complexity in check. This dissertation addresses five technical issues that stand in the way of engineering reliable and energy-efficient software: (1) how can developers select and parameterize middleware to achieve the requisite levels of performance, reliability, and energy-efficiency? (2) how can one streamline the process of implementing and reusing fault tolerance functionality in distributed applications? (3) can automated techniques be developed to help transition centralized applications to using cloud-based services efficiently and reliably? (4) how can one leverage cloud-based resources to improve the energy-efficiency of mobile applications? (5) how can middleware be adapted to improve the energy-efficiency of distributed mobile applications operated over heterogeneous mobile networks? To address these issues, this research studies the concerns of distribution, fault-tolerance, and energy-efficiency as well as their interaction. It also develops novel approaches, techniques, and tools that effectively fuse and separate these concerns as required by particular software development scenarios. The specific innovations include (1) a systematic assessment of the performance, conciseness, complexity, reliability, and energy consumption of middleware mechanisms for accessing remote functionality, (2) a declarative approach to hardening distributed applications with resiliency against partial failure, (3) cloud refactoring, a set of automated program transformations for transitioning to using cloud-based services efficiently and reliably, (4) a cloud offloading approach that improves the energy-efficiency of mobile applications without compromising their reliability, and (5) a middleware mechanism that optimizes energy consumption by adapting execution patterns dynamically in response to fluctuations in network conditions. / Ph. D.
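The cloud offloading question (4) rests on an energy break-even argument that a short sketch can make concrete. The model and constants below (power draws, cloud speed-up) are invented for illustration and are not measurements or formulas from the dissertation.

```python
def should_offload(cycles, data_bytes, bandwidth_bps,
                   local_speed_hz=1e9, p_compute=0.9,
                   p_transmit=1.3, p_idle=0.3):
    """Classic energy break-even test for mobile cloud offloading.

    Offload when the energy to ship the data and idle while the cloud
    computes is lower than the energy to compute locally. All power
    figures (watts) and the 10x cloud speed-up are illustrative.
    """
    t_local = cycles / local_speed_hz
    e_local = p_compute * t_local                 # joules spent on-device

    t_transfer = (data_bytes * 8) / bandwidth_bps
    t_remote = cycles / (10 * local_speed_hz)     # assume cloud is 10x faster
    e_offload = p_transmit * t_transfer + p_idle * t_remote

    return e_offload < e_local

# heavy computation over little data: offloading pays off
print(should_offload(cycles=5e9, data_bytes=200_000, bandwidth_bps=10e6))
# light computation over much data: stay local
print(should_offload(cycles=1e8, data_bytes=20_000_000, bandwidth_bps=10e6))
```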
306

On the Fault-tolerance and High Performance of Replicated Transactional Systems

Hirve, Sachin 28 September 2015 (has links)
With the technological developments of the last few decades, there has been a notable shift in the way business and consumer transactions are conducted. These transactions are usually triggered over the internet, and transactional systems working in the background ensure that they are processed. The majority of these transactions nowadays fall into the Online Transaction Processing (OLTP) category, where low latency is a preferred characteristic. In addition to low latency, OLTP transaction systems also require high service continuity and dependability. Replication is a common technique that makes services dependable and therefore helps in providing reliability, availability and fault-tolerance. Deferred Update Replication (DUR) and Deferred Execution Replication (DER) represent the two well-known transaction execution models for replicated transactional systems. Under DUR, a transaction is executed locally at one node before a global certification is invoked to resolve conflicts against other transactions running on remote nodes. On the other hand, DER postpones transaction execution until agreement on a common order of transaction requests is reached. Both DUR and DER require a distributed ordering layer, which ensures a total order of transactions even in case of faults. In today's distributed transactional systems, performance is of paramount importance. Any loss in performance, e.g., increased latency due to slow processing of client requests, may entail loss of revenue for businesses. On one hand, the DUR model is a good candidate for transaction processing when conflicts among transactions are rare, but it can be detrimental for high-conflict workload profiles. On the other hand, the DER model is an attractive choice because its behavior is independent of the characteristics of the workload, but trivial realizations of the model ultimately do not offer a good margin for performance increase. Indeed, transactions are executed sequentially, and the total order layer can be a serious bottleneck for latency and scalability. This dissertation proposes novel solutions and system optimizations to enhance the overall performance of replicated transactional systems. The first result presented is HiperTM, a DER-based transaction replication solution that is able to alleviate the costs of the total order layer via speculative execution techniques. HiperTM exploits the time between the broadcast of a client request and the finalization of the order for that request to speculatively execute the request, so as to overlap replica coordination with transaction execution. HiperTM proposes two main components: OS-Paxos, a novel total order layer that is able to early-deliver requests optimistically according to a tentative order, which is then either confirmed or rejected by a final total order; and SCC, a lightweight speculative concurrency control protocol that is able to exploit the optimistic delivery of OS-Paxos and execute transactions in a speculative fashion. SCC still processes write transactions serially in order to minimize code instrumentation overheads, but it is able to parallelize the execution of read-only transactions thanks to its built-in object multiversion scheme.
The second contribution of this dissertation is X-DUR, a novel transaction replication system that addresses the high cost of local and remote aborts that adversely affects the performance of DUR-based approaches under high contention on shared objects. Exploiting knowledge of clients' transaction locality, X-DUR incorporates the benefits of the state machine approach to scale up the distributed performance of DUR systems. As the third contribution, this dissertation proposes Archie, a DER-based replicated transactional system that improves on HiperTM in two aspects. First, Archie includes a highly optimized total order layer that combines optimistic delivery and batching, thus allowing the anticipation of a large amount of work before the total order is finalized. Second, its concurrency control is able to process transactions speculatively and with a higher degree of parallelism, although the order of the speculative commits still follows the order defined by the optimistic delivery. Both HiperTM and Archie perform well up to a certain number of nodes in the system, beyond which their performance is impacted by the limitations of a single-leader-based total order layer. This motivates the design of Caesar, the fourth contribution of this dissertation, which is a transactional system based on a novel multi-leader partial order protocol. Caesar enforces a partial order on the execution of transactions according to their conflicts, letting non-conflicting transactions proceed in parallel and without enforcing any synchronization during execution (e.g., no locks). As the last contribution, this dissertation presents Dexter, a replication framework that exploits the commonly observed fact that not all read-only workloads require up-to-date data. It harnesses the application-specific freshness and content-based constraints of read-only transactions to achieve high scalability. Dexter services read-only requests according to the freshness guarantees specified by the application and routes the read-only workload accordingly in the system to achieve high performance and low latency. As a result, the Dexter framework also alleviates the interference between read-only and read-write requests, thereby helping to improve the performance of read-write request execution as well. / Ph. D.
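HiperTM's core overlap between ordering and execution can be illustrated abstractly: execute along the optimistic order, then commit the speculative results that the final order confirms and re-execute the rest. The sketch below assumes deterministic, side-effect-free transaction execution and elides OS-Paxos and the SCC protocol entirely; it is a conceptual illustration, not the system's algorithm.

```python
def speculative_replica(optimistic_order, final_order, execute):
    """Overlap ordering and execution, in the spirit of optimistic
    delivery: run transactions in the tentative order, then commit
    speculative results whose prefix the final order confirms.

    execute(txn) -> result; assumed deterministic and free of visible
    side effects until commit, as in speculative concurrency control.
    """
    speculative = [(txn, execute(txn)) for txn in optimistic_order]

    committed = []
    for i, txn in enumerate(final_order):
        if i < len(speculative) and speculative[i][0] == txn:
            committed.append(speculative[i][1])   # speculation confirmed
        else:
            committed.append(execute(txn))        # mis-speculated: redo
    return committed

results = speculative_replica(
    optimistic_order=["t1", "t2", "t3"],
    final_order=["t1", "t3", "t2"],      # final order diverges after t1
    execute=lambda txn: f"{txn}-done",
)
print(results)  # ['t1-done', 't3-done', 't2-done']
```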
307

Error isolation in distributed systems

Behrens, Diogo 25 May 2016 (has links) (PDF)
In distributed systems, if a hardware fault corrupts the state of a process, the error might propagate as a corrupt message and contaminate other processes in the system, causing severe outages. Recently, state corruptions of this nature have been observed surprisingly often in large computer populations, e.g., in large-scale data centers. Moreover, since the resilience of processors is expected to decline in the near future, the likelihood of state corruptions will increase even further. In this work, we argue that preventing the propagation of state corruption should be a first-class requirement for large-scale fault-tolerant distributed systems. In particular, we propose that developers target error isolation, the property whereby each correct process ignores any corrupt message it receives. Typically, a process cannot decide whether a received message is corrupt or not. Therefore, we introduce hardening as a class of principled approaches to implement error isolation in distributed systems. Hardening techniques are (semi-)automatic transformations that ensure that each process appends evidence of good behavior, in the form of error codes, to all messages it sends. The techniques “virtualize” state corruptions into more benign failures such as crashes and message omissions: if a faulty process fails to detect its state corruption and abort, then hardening guarantees that any corrupt message the process sends has invalid error codes. Correct processes can then inspect received messages and drop them in case they are corrupt. With this dissertation, we contribute theoretically and practically to the state of the art in fault-tolerant distributed systems. To show that hardening is possible, we design, formalize, and prove correct different hardening techniques that enable existing crash-tolerant designs to handle state corruption with minimal developer intervention. To show that hardening is practical, we implement and evaluate these techniques, analyzing their effect on system performance and their ability to detect state corruptions in practice.
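The hardening idea, appending evidence of good behavior as error codes and dropping messages whose codes do not verify, admits a deliberately simplified sketch using a plain CRC-32. The dissertation's actual techniques must also survive corruption of the sender's own state, which a bare checksum like this does not address.

```python
import zlib

def harden(payload: bytes) -> bytes:
    """Append an error code as 'evidence of good behavior' (here CRC-32,
    a stand-in for the dissertation's more elaborate codes)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_and_strip(message: bytes):
    """Return the payload, or None if the message must be ignored."""
    payload, code = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != code:
        return None            # corrupt message: drop, never deliver
    return payload

msg = harden(b"state-update:42")
print(check_and_strip(msg))                   # b'state-update:42'

corrupted = bytes([msg[0] ^ 0x80]) + msg[1:]  # a bit flip in transit
print(check_and_strip(corrupted))             # None -> a benign omission
```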
308

LDPC decoders running on error-prone devices: theoretical limits and practical assessment of the error correction performance

Kameni Ngassa, Christiane 13 October 2014 (has links)
Over the past few years, there has been increasing interest in error correction decoders built out of unreliable components. Indeed, it is widely accepted that future generations of electronic circuits will be inherently unreliable, due to increasing integration density and aggressive voltage scaling. Furthermore, error correction decoders play a crucial role both in the reliable transmission of information and in the design of reliable storage systems. It is therefore important to investigate the robustness of error correction decoders in the presence of hardware noise. In this thesis we focus on LDPC decoders built out of unreliable computing units. We consider three types of LDPC decoders: the finite-precision Min-Sum (MS) decoder, the Self-Corrected Min-Sum (SCMS) decoder and the Stochastic decoder. We begin our study with a statistical analysis of the finite-precision Min-Sum decoder with probabilistic components. To this end, we first introduce probabilistic models for the arithmetic and logic units of the decoder and discuss their symmetry properties. We conduct a thorough asymptotic analysis and derive density evolution equations for the noisy Min-Sum decoder. We highlight that in some particular cases, the noise introduced by the device can increase the correction capacity of the noisy Min-Sum decoder with respect to the noiseless decoder. We also reveal the existence of a specific threshold phenomenon, referred to as the functional threshold, which can be viewed as a generalization of the threshold definition for noisy decoders. We then corroborate the asymptotic results through Monte-Carlo simulations. Since density evolution cannot be defined for decoders with memory, the analysis of noisy Self-Corrected Min-Sum decoders and noisy Stochastic decoders was restricted to Monte-Carlo simulations. We emulate the noisy SCMS decoders with various noise parameters and show that noisy SCMS decoders perform close to the noiseless SCMS decoder for a wide range of noise parameters. Therefore, one can think of the self-correction circuit as a noisy patch applied to the noisy MS decoder in order to improve its robustness to hardware defects. We also evaluate the impact of decoder scheduling on the robustness of the noisy MS and SCMS decoders and show that when serial scheduling is used, neither the noisy MS decoder nor the noisy SCMS decoder can provide acceptable error correction. Finally, we investigate the performance of stochastic decoders with edge-memories in the presence of hardware noise. We propose two error models for the noisy components. We show that in some cases, the hardware noise can be used to lower the error floor of the decoder, meaning that stochastic decoders have an inherent fault-tolerance capability.
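The noisy finite-precision Min-Sum setting can be illustrated with a single check-node update whose outputs are perturbed by an assumed bit-flip error model; the flip probability and message width below are illustrative, not the thesis's circuit-level models.

```python
import random

def check_node_min_sum(incoming, flip_prob=0.0, rng=random):
    """Min-Sum check-node update with a simple transient-error model:
    each outgoing magnitude may have its least significant bit flipped
    with probability flip_prob, a crude stand-in for probabilistic gates.
    Messages are small signed integers, as in a finite-precision decoder."""
    out = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = 1
        for m in others:
            sign *= 1 if m >= 0 else -1
        mag = min(abs(m) for m in others)
        if rng.random() < flip_prob:       # hardware noise on the output
            mag ^= 1                       # flip the least significant bit
        out.append(sign * mag)
    return out

random.seed(0)
llrs = [3, -2, 5, 4]
print(check_node_min_sum(llrs))                 # noiseless: [-2, 3, -2, -2]
print(check_node_min_sum(llrs, flip_prob=0.2))  # occasionally perturbed
```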
309

A failure detection architecture optimized for grid computing platforms

Lemos, Fernando Tarlá Cardoso 07 November 2012 (has links)
In distributed platforms, fault detection is an essential requirement for a wide range of fault tolerance techniques, such as restoring the state of distributed applications with checkpointing and message logging. However, fault detection often depends on reliable communication between the processing nodes and the fault detection modules. Direct communication between nodes and detection modules is often impossible in hierarchical grid computing platforms. The geographic distance between the institutions and resources available on the grid, and thus the reliance on long-distance networks to connect them, is another factor that makes fault detection in computing grids a challenge. This thesis presents a fault detection architecture for distributed platforms, optimized for use in hierarchical grids and thus taking their restrictions and requirements into account.
The architecture, named GFDA (Grid Fault Detection Architecture), is structured as fault detection modules for faults that affect the computing nodes available on the grid, detection modules for faults that affect the distributed applications, and modules that collect, process and forward the fault and recovery notifications generated by the detection modules. This thesis presents implementation details, a verification of the correctness of the designed architecture, and results obtained by deploying parts of the architecture on a cluster simulated with virtual machines. Techniques to optimize the quality of the fault detection service are proposed, and the results obtained with these techniques are compared to those of traditional approaches. Positive results were obtained even under adverse connectivity conditions using techniques such as the processing of fault and recovery notifications and the introduction of redundant information in the messages exchanged between the detection modules. It is concluded that the GFDA architecture contributes to establishing a viable solution for fault detection in hierarchical grid computing platforms with connectivity restrictions between computing nodes.
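A generic timeout-based heartbeat detector with redundant notification forwarding, of the kind GFDA's node-level modules build on, can be sketched as follows. This is not the GFDA protocol itself; the timeout, redundancy level and message format are invented.

```python
import time

class HeartbeatDetector:
    """Timeout-based failure detector with redundant notifications,
    a generic sketch of the building blocks described in the thesis."""

    def __init__(self, timeout_s=2.0, redundancy=3):
        self.timeout_s = timeout_s
        self.redundancy = redundancy   # resend each notification N times
        self.last_seen = {}

    def heartbeat(self, node):
        self.last_seen[node] = time.monotonic()

    def suspected(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout_s]

    def notify(self, send, node, event):
        # redundant copies tolerate lossy links between grid levels
        for seq in range(self.redundancy):
            send({"node": node, "event": event, "seq": seq})

det = HeartbeatDetector(timeout_s=0.1)
det.heartbeat("worker-7")
time.sleep(0.2)                         # worker-7 misses its deadline
for node in det.suspected():
    det.notify(print, node, "FAILURE")  # three redundant notifications
```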
310

Early evaluation of multicore systems soft error reliability using virtual platforms

Rosa, Felipe Rocha da January 2018 (has links)
The increasing computing capacity of multicore components such as processors and graphics processing units (GPUs) offers new opportunities for the embedded and high-performance computing (HPC) domains. The progressively growing computing capacity of multicore-based systems makes it possible to perform complex application workloads efficiently at a lower power consumption than traditional single-core solutions. This efficiency and the ever-increasing complexity of application workloads encourage industry to integrate more and more computing components into the same system. The number of computing components employed in large-scale HPC systems already exceeds a million cores, while 1000-core on-chip platforms are available in the embedded community.
Beyond the massive number of cores, the increasing computing capacity, as well as the number of internal memory cells (e.g., registers, internal memory) inherent to emerging processor architectures, is making large-scale systems more vulnerable to both hard and soft errors. Moreover, to meet emerging performance and power requirements, the underlying processors usually run at aggressive clock frequencies and across multiple voltage domains, increasing their susceptibility to soft errors, such as those caused by radiation effects. The occurrence of soft errors or Single Event Effects (SEEs) may cause critical failures in system behavior, which may lead to financial or human-life losses. While a rate of 280 soft errors per day has been observed during the flight of a spacecraft, electronic computing systems working at ground level are expected to experience at least one soft error per day in the near future. The increased susceptibility of multicore systems to SEEs necessarily calls for novel cost-effective tools to assess the soft error resilience of the underlying multicore components, together with their complex software stacks (operating system, drivers), early in the design phase. The primary goal of this Thesis is to propose and develop a fault injection framework using state-of-the-art virtual platforms, to propose a set of novel fault injection techniques that direct the fault campaigns according to the characteristics of the software stack, and to validate the framework extensively with over a million simulation hours. The second goal of this Thesis is to set the foundations for a new discipline in soft error reliability management for emerging multi/manycore systems using machine learning techniques. It identifies and proposes techniques that can be used to provide different levels of reliability according to the application workload and its criticality.
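The basic step of such a fault injection campaign, flipping one architectural register bit and classifying the outcome against a golden run, can be sketched outside any virtual platform; the register names, widths and toy workload below are illustrative, not part of the thesis's framework.

```python
import random

def inject_bit_flip(registers, rng=random):
    """Single-bit-flip injection into one architectural register, the
    basic step a virtual-platform fault campaign repeats millions of
    times (32-bit registers assumed here for illustration)."""
    target = rng.choice(list(registers))
    bit = rng.randrange(32)
    registers[target] ^= 1 << bit
    return target, bit

def run_campaign(workload, n_runs=1000, rng=random):
    """Classify each injection by comparing against a golden run."""
    golden = workload(None)
    outcomes = {"masked": 0, "silent data corruption": 0}
    for _ in range(n_runs):
        result = workload(lambda regs: inject_bit_flip(regs, rng))
        key = "masked" if result == golden else "silent data corruption"
        outcomes[key] += 1
    return outcomes

def toy_workload(fault):
    regs = {"r1": 7, "r2": 5, "r3": 0}
    if fault:
        fault(regs)                        # inject before the computation
    regs["r3"] = regs["r1"] + regs["r2"]   # result depends only on r1, r2
    return regs["r3"]

random.seed(1)
print(run_campaign(toy_workload))  # flips in r3 are masked by the write
```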
