Global ETD Search

1	Metamori: A library for Incremental File Checkpointing Jeyakumar, Ashwin Raju 21 June 2004 The advent of cluster computing has driven a push towards software mechanisms for reliability on clusters. The prevalent model for such mechanisms is to take a snapshot of the state of an application, called a checkpoint, and commit it to stable storage. The checkpoint carries sufficient metadata that, if the application fails, it can be restarted from the checkpoint. This operation is called a restore. In order to record a process's complete state, both its volatile and persistent state must be checkpointed. Several libraries exist for checkpointing volatile state. Some of these libraries feature incremental checkpointing, where only the changes since the last checkpoint are recorded in the next checkpoint. Incremental checkpointing is advantageous because otherwise the time taken for each successive checkpoint grows larger and larger. Also, when checkpointing is done in increments, state can be restored to any of the previous checkpoints, a vital feature for adaptive applications. This thesis presents Metamori, a user-level incremental checkpointing library for files. It brings the advantages of incremental memory checkpointing to files as well, thereby providing a low-overhead approach to checkpointing persistent state. Thus, the complete state of an application can now be checkpointed incrementally, whereas earlier approaches checkpointed volatile state incrementally but offered no such facility for persistent state. / Master of Science Fault-tolerance File Checkpointing Checkpointing Rollback-Recovery
2	A Rollback Mechanism to Recover from Software Failures in Role-based Adaptive Software Systems Taing, Nguonly, Springer, Thomas, Cardozo, Nicolás, Schill, Alexander 23 June 2021 Context-dependent applications are relatively complex due to their multiple variations caused by context activation, especially in the presence of unanticipated adaptation. Testing these systems is challenging, as it is hard to reproduce the same execution environments. Therefore, a software failure caused by bugs is no exception. This paper presents a rollback mechanism to recover from software failures as part of a role-based runtime with support for unanticipated adaptation. The mechanism performs checkpoints before each adaptation and employs specialized sensors to detect bugs resulting from recent configuration changes. When the runtime detects a bug, it assumes that the bug belongs to the latest configuration. The runtime rolls back to the most recent checkpoint to recover and subsequently notifies the developer to fix the bug and re-apply the adaptation through unanticipated adaptation. We prototype the concept as part of our role-based runtime engine LyRT and demonstrate the applicability of the rollback recovery mechanism for unanticipated adaptation in erroneous situations. info:eu-repo/classification/ddc/004 ddc:004
3	Fault-Tolerant Average Execution Time Optimization for General-Purpose Multi-Processor System-On-Chips Väyrynen, Mikael January 2009 Owing to developments in semiconductor technology, fault tolerance is important not only for safety-critical systems but also for general-purpose (non-safety-critical) systems. However, for general-purpose systems it is important to minimize the average execution time (AET) while ensuring fault tolerance, rather than to guarantee that deadlines are always met. For a given job and a soft (transient) no-error probability, we define mathematical formulas for AET using voting (active replication), rollback-recovery with checkpointing (RRC), and a combination of these (CRV), where bus communication overhead is included. For a given multi-processor system-on-chip (MPSoC), we also define integer linear programming (ILP) models that minimize the AET, including bus communication overhead, when: (1) selecting the number of checkpoints when using RRC or a combination that includes RRC, (2) finding the number of processors and the job-to-processor assignment when using voting or a combination that includes voting, and (3) defining the fault tolerance scheme (voting, RRC or CRV) per job and defining its usage for each job. Experiments demonstrate significant savings in AET. Fault tolerance Execution time optimization Rollback recovery with checkpointing Active replication MPSoC Computer science
4	Applying Classification and Regression Trees to manage financial risk Martin, Stephen Fredrick 16 August 2012 The goal of this project is to develop a set of business rules to mitigate risk related to a specific financial decision within the prepaid debit card industry. Under certain circumstances, issuers of prepaid debit cards may need to decide whether funds on hold can be released early for use by card holders prior to the final transaction settlement. After a brief introduction to the prepaid card industry and the financial risk associated with the early release of funds on hold, the paper presents the motivation for applying the CART (Classification and Regression Trees) method. The paper provides a tutorial on the CART algorithms formally developed by Breiman, Friedman, Olshen and Stone in the monograph Classification and Regression Trees (1984), as well as a detailed explanation of the R programming code that implements the RPART function (Therneau 2010). Special attention is given to parameter selection and to the process of finding an optimal solution that balances complexity against predictive classification accuracy, as measured against an independent data set through a cross-validation process. Lastly, the paper presents an analysis of the financial risk mitigation based on the resulting business rules. / text CART Classification and Regression Trees Breiman Risk Prepaid Debit cards Rollback R RPART Cross validation
5	Attentiveness: Reactivity at Scale Hartman, Gregory S. 01 December 2010 Clients of reactive systems often change their priorities. For example, a human user of an email viewer may attempt to display a message while a large attachment is downloading. To the user, an email viewer that delayed display of the message would exhibit a failure similar to priority inversion in real-time systems. We propose a new quality attribute, attentiveness, that provides a unified way to model the forms of redirection offered by application-level reactive systems to accommodate the changing priorities of their clients, which may be either humans or system components. Modeling attentiveness as a quality attribute provides system designers with a single conceptual framework for policy and architectural decisions that address trade-offs among criteria such as responsiveness, overall performance, behavioral predictability, and state consistency. Reactive systems responsiveness state consistency concurrency distributed systems data race detection cancel rollback
6	Implementação de um mecanismo de recuperação por retorno para a ferramenta ourgrid / Implementation of a rollback recovery mechanism for ourGrid toolkit Silva, Hélio Antônio Miranda da January 2007 Grid computing has emerged as an important research area because it allows geographically distributed computing resources to be shared among many users. However, resources in a grid are highly heterogeneous and dynamic, which complicates the development and execution of applications. OurGrid is a software platform that aims to reduce these difficulties. Besides allowing the execution of distributed applications in grid environments, it offers and manages an exchange of favors between users. Under this scheme, institutions (or users) that have idle resources can offer them to others that need them. The more resources a domain offers to the grid, the more it is favored when in need: it receives higher priority when requesting machines from the grid. MyGrid is the main component of OurGrid: it is the interface through which the user interacts with the grid, submitting and managing applications. In the execution model of MyGrid, tasks are launched by a central node (the home machine), which coordinates the scheduling of all tasks to be executed in the grid. This node is a "single point of failure", because its crash causes the results of the processing in progress to be lost. Depending on the application, this can mean the loss of hours or even days of processing. This dissertation aims to mitigate this problem by describing the design and implementation of a checkpointing (state-saving) mechanism, used as the basis for rollback recovery. Checkpoints are taken synchronously with each state change of the task scheduler on the central node. After a failure affecting the home machine of MyGrid, the system recovers the state of the application (data structures and essential control information) and the results of previous computation from stable storage, returning to a consistent state and minimizing the loss of data. The efficiency of the recovery mechanism and its impact on MyGrid's performance are evaluated through experiments that exercise the mechanism in scenarios with different characteristics. Mobile computing Fault tolerance Distributed processing Grid computation Rollback-recovery Checkpointing OurGrid
7	Suporte a rollback em sistemas de gerenciamento de mudanças em TI. / Roolback support in IT change management systems Machado, Guilherme Sperb January 2008 Current research on IT change management has investigated several aspects of this new discipline, but it usually assumes that the changes expressed in Request for Change (RFC) documents will be executed successfully over the managed IT infrastructure. This assumption is not realistic in actual IT systems, because failures during the execution of changes do happen and cannot simply be ignored. To address this issue, we propose a solution in which tightly related activities in a change plan are grouped together, forming atomic groups of activities. These groups are atomic in the sense that if one activity fails, all other already executed activities of the same group must roll back, returning the system to the previous consistent state. Automating change rollback is especially convenient because it relieves the human IT operator of manually undoing the activities of a change group that has failed. To evaluate the proposed solution and its technical feasibility, we implemented a prototype system that, using elements of the Business Process Execution Language (BPEL), makes it possible to define how IT change management systems must handle atomic groups of activities in order to capture and identify failures. The results show that the proposed solution not only generates complete and correct plans from a set of atomicity marks, but also that the generation of rollback plans interferes only minimally with the change scheduling process. Computer network management Computer network security IT change management Rollback plan Change failures ITIL
