1 |
Towards Model-Based Fault Management for Computing Systems
Jia, Rui, 07 May 2016
Large-scale distributed computing systems are used extensively to host critical applications in national defense, finance, scientific research, commerce, and other fields. However, applications in distributed systems face the risk of service outages due to inevitable faults. Without proper fault management methods, faults can lead to significant revenue loss and degradation of Quality of Service (QoS). An ideal fault management solution should guarantee fast and accurate fault diagnosis, scalability in distributed systems, portability across a variety of systems, and the versatility to recover from different types of faults. This dissertation presents a model-based fault management structure that automatically recovers computing systems from faults. The structure can recover a system from common faults while minimizing the impact on the system's QoS. It covers all stages of fault management, including fault detection, identification and recovery, and it has the flexibility to incorporate various fault diagnosis methods. When faults occur, the approach identifies the fault type and intensity and then computes the optimal recovery plan with minimum performance degradation, based on a cost function that defines performance objectives and on a predictive control algorithm. The fault management approach has been verified on a centralized Web application testbed and on a distributed big data processing testbed with four types of simulated faults: memory leak, network congestion, CPU hog and disk failure. The feasibility of the fault recovery control algorithm is also verified. Simulation results show that the approach enables effective automatic recovery from faults. Performance evaluation reveals that the CPU and memory overhead of the fault management process is negligible. To let domain engineers conveniently apply the proposed fault management structure to their specific systems, a component-based modeling environment is developed.
The meta-model of the fault management structure is developed with the Unified Modeling Language as an abstraction of a general fault recovery solution for computing systems. It defines the fundamental reusable components that make up such a system, including the connections among them, the attributes of each component, and the constraints. The meta-model can be interpreted into a user-friendly graphical modeling environment for creating application models of practical domain-specific systems and generating executable code for them.
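To make the idea of choosing a recovery plan from a cost function over a prediction horizon more concrete, here is a minimal sketch. It is not taken from the dissertation: the action names, the cost and repair numbers, the one-line fault model in `plan_cost`, and the exhaustive search are all assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Action:
    name: str
    downtime_cost: float  # hypothetical QoS penalty paid when the action runs
    repair: float         # hypothetical fraction of fault intensity removed


def plan_cost(intensity, plan, weight=1.0):
    """Accumulate action cost plus a penalty for the fault that remains."""
    total = 0.0
    for action in plan:
        # Assumed fault model: each action scales the remaining intensity.
        intensity = max(0.0, intensity * (1.0 - action.repair))
        total += action.downtime_cost + weight * intensity
    return total


def best_plan(intensity, actions, horizon=2):
    """Search every action sequence of length `horizon` and keep the cheapest."""
    return min(product(actions, repeat=horizon),
               key=lambda plan: plan_cost(intensity, plan))


# Illustrative action set; none of these values come from the source.
ACTIONS = [
    Action("wait", 0.0, 0.0),
    Action("restart_service", 2.0, 0.9),
    Action("migrate_vm", 5.0, 1.0),
]

if __name__ == "__main__":
    print([a.name for a in best_plan(10.0, ACTIONS)])
```

With these made-up numbers a severe fault is answered by an early restart, while a very mild fault is cheapest to leave alone; a real controller would replace the toy fault model with one identified from the managed system.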
|
2 |
Um serviço de self-healing baseado em P2P para manutenção de redes de computadores / A P2P based self-healing service for computer networks maintenance
Duarte, Pedro Arthur Pinheiro Rosa, January 2015
In recent years, network complexity has risen sharply, and with it many new management challenges. For instance, the heterogeneity of managed entities forces administrators and managers to deal with cumbersome implementation and deployment specificities. Moreover, the current size and growth trends of infrastructures show that it is becoming infeasible to rely on human-in-the-loop management techniques. Within network management, fault management is appealing because of its impact on operational costs. Studies estimate that more than 33% of operational costs are related to preventing and recovering from faults, and that about 40% of this investment goes to solving human-caused operational errors. Hence, reducing human interaction is clearly needed. Among the different approaches, self-healing, a property proposed by Autonomic Network Management, aims to remove human interactions and decisions from fault management loops, thereby relieving administrators and managers of fault management tasks. Some research on self-healing assumes that fault management capabilities are planned at design time; such approaches cannot be applied to legacy systems. Other research suggests runtime analysis and instrumentation of applications' bytecode.
Although applicable to some legacy systems, these proposals are tightly coupled to implementation details of the underlying technologies. For this reason, they are hard to apply end to end, for example in a scenario encompassing many managed entities implemented with different technologies. However, it is possible to offer administrators and managers facilities to express their knowledge about network anomalies and faults, together with facilities to leverage this knowledge. This master's dissertation presents and evaluates a solution that gives network management systems self-healing capabilities. The solution relies on workplans as a means of capturing administrators' and managers' knowledge of how to diagnose and heal network anomalies and faults. In addition, the design and implementation of a standard framework for customizing fault detection and notification is discussed, with a P2P-based network management system as its foundation. Finally, an experimental evaluation demonstrates the proposal's feasibility from an operational point of view.
|
5 |
An approach for Self-healing Transactional Composite Services / Une approche auto-corrective pour des services composites transactionnels
Angarita Arocha, Rafael Enrique, 11 December 2015
In this thesis, we present a self-healing approach for composite services supported by knowledge-based agents capable of making decisions at runtime. First, we introduce our formal definition of composite services, their execution processes, and their fault tolerance mechanisms using Colored Petri nets.
We implement the following recovery mechanisms: backward recovery through compensation; forward recovery through service retry and service replacement; and checkpointing as an alternative strategy, from which execution of the service can be resumed later. We introduce the concept of Service Agents, which are software components in charge of component services and of controlling their fault-tolerant execution. We then extend our approach with self-healing capabilities. In this self-healing extension, Service Agents are knowledge-based agents; that is, they are self- and context-aware. To make decisions about the selection of recovery and proactive fault tolerance strategies, Service Agents make deductions based on the information they have about the whole composite service, about themselves, and about what is expected and what is really happening at runtime. Finally, we illustrate our approach and evaluate it experimentally using a case study.
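The recovery mechanisms listed above can be pictured with a small sketch. It is an illustration under stated assumptions, not the thesis's Colored Petri net formalization: the `Step` type, the retry limit and the fixed order "retry, then replacement, then compensation" are hypothetical, and checkpointing is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    invoke: Callable[[], None]      # raises an exception on failure
    compensate: Callable[[], None]  # undoes the step after it has completed
    replacement: Optional[Callable[[], None]] = None  # equivalent service


def try_forward(step, max_retries):
    """Forward recovery: retry the service, then try its replacement."""
    candidates = [step.invoke] * (1 + max_retries)
    if step.replacement is not None:
        candidates.append(step.replacement)
    for call in candidates:
        try:
            call()
            return True
        except Exception:
            continue
    return False


def execute(steps, max_retries=1):
    """Run steps in order; on an unrecoverable failure, compensate backwards."""
    done = []
    for step in steps:
        if try_forward(step, max_retries):
            done.append(step)
        else:
            # Backward recovery: undo completed steps in reverse order.
            for finished in reversed(done):
                finished.compensate()
            return "compensated"
    return "committed"
```

A knowledge-based agent, as described in the abstract, would choose among these strategies at runtime rather than follow a fixed order.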
|
6 |
Using unsupervised machine learning for fault identification in virtual machines
Schneider, C., January 2015
Self-healing systems promise operating cost reductions in large-scale computing environments through the automated detection of, and recovery from, faults. However, at present there appears to be little known empirical evidence comparing the different approaches, or demonstrations that such implementations reduce costs. This thesis compares previous and current self-healing approaches before demonstrating a new, unsupervised approach that combines artificial neural networks with performance tests to perform fault identification in an automated fashion, i.e. the correct and accurate determination of which computer features are associated with a given performance test failure. Several key contributions are made in the course of this research, including an analysis of the different types of self-healing approaches based on their contextual use, a baseline for future comparisons between self-healing frameworks that use artificial neural networks, and a successful, automated fault identification in cloud infrastructure, and more specifically virtual machines. This approach uses three established machine learning techniques: Naïve Bayes, Baum-Welch, and Contrastive Divergence Learning. The latter demonstrates minimisation of human interaction beyond previous implementations by producing a list of potential root causes (i.e. fault hypotheses) in decreasing order of likelihood, which brings the state of the art one step closer to fully self-healing systems. This thesis also examines the impact that different types of faults have on their respective identification. This helps to understand the validity of the data being presented, and how the field is progressing, whilst examining the differences in impact on identification between emulated thread crashes and errant user changes, a contribution believed to be unique to this research.
Lastly, future research avenues and conclusions in automated fault identification are described along with lessons learned throughout this endeavor. This includes the progression of artificial neural networks, how learning algorithms are being developed and understood, and possibilities for automatically generating feature locality data.
|
7 |
Vers le recouvrement automatique dans la composition de services WEB basée protocole / Towards automatic recovery in protocol-based Web service composition
Menadjelia, Nardjes, 15 July 2013
In a protocol-based Web service composition, a set of available component services collaborate to provide a new composite service. Services export their protocols as finite state machines (FSMs). A transition in an FSM represents the execution of a task that moves the service to its next state. An execution of the composite corresponds to a sequence of transitions in which each task is delegated to a component service.
While the composite runs, one or more delegated components may become unavailable due to hard or soft problems on the network. This unavailability may result in a failed execution of the composite. This thesis provides a formal study of the automatic recovery problem in protocol-based Web service composition. Recovery consists in transforming the failed execution into a recovery execution, by compensating some transitions and executing others. The recovery execution is an alternative execution of the composite that can still reach a final state. The recovery problem then consists in finding the best recovery execution(s) among those available. The best recovery execution is reachable from the failed execution with a minimal number of compensations visible to the client. For a given recovery execution, we prove that the decision problem associated with computing the number of invisibly compensated transitions is NP-complete. We therefore conclude that deciding the best recovery execution is in $\Sigma_2^P$.
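As a rough picture of "compensating some transitions and executing others", the sketch below compares a failed execution with candidate alternatives. It rests on a strong simplifying assumption that is not the thesis's model: every transition after the longest common prefix is compensated and counted, with no distinction between visible and invisible compensations (the distinction that makes the real problem hard). All function and task names are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading transitions shared by two executions."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def recovery_plan(failed, alternative):
    """Return (transitions to compensate, transitions to execute)."""
    k = common_prefix_len(failed, alternative)
    compensate = list(reversed(failed[k:]))  # undo most recent work first
    execute = list(alternative[k:])
    return compensate, execute


def best_alternative(failed, alternatives):
    """Pick the alternative that needs the fewest compensations."""
    return min(alternatives,
               key=lambda alt: len(recovery_plan(failed, alt)[0]))


if __name__ == "__main__":
    failed = ["search", "reserve", "pay_card"]
    alternatives = [
        ["search", "reserve", "pay_transfer", "confirm"],
        ["search", "book_direct", "confirm"],
    ]
    print(recovery_plan(failed, best_alternative(failed, alternatives)))
```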
|
8 |
Cross-layer self-diagnosis for services over programmable networks / Auto-diagnostic multi-couche pour services sur réseaux programmables
Sánchez Vílchez, José Manuel, 07 July 2016
Current networks serve billions of mobile customer devices. They encompass heterogeneous equipment, transport and management protocols, and vertical management tools, which are very difficult and costly to integrate. Fault management operations are far from being automated and intelligent: around 40% of alarms are redundant, and at most around 1-2% of alarms are correlated in a medium-size operational center. This indicates a significant alarm overflow for human administrators, which results in high OPEX because of the growing need to employ highly skilled people for fault management tasks. In conclusion, the current level of automation in fault management in telecom networks is not at all adequate for programmable networks, which promise a high degree of programmability and flexibility to reduce time-to-market. Automated fault management becomes more necessary with the advent of programmable networks, led by SDN (Software-Defined Networking), NFV (Network Functions Virtualization) and the Cloud. Indeed, the rise of these paradigms accelerates the convergence between the network and IT realms, which in turn accelerates the transformation of current networks and leads to rethinking network and service management and operations, in particular fault management operations. This thesis envisages the application of self-healing principles in combined SDN and NFV infrastructures, focusing on self-diagnosis as the main enabler of self-healing.
The core of the thesis is to devise a self-diagnosis approach able to diagnose at run-time the dynamic virtualized networking services and their dependencies on the virtualized resources (VNFs and virtual links), as well as the dependencies of those virtualized resources on the underlying network infrastructure, taking into account the mobility, dynamicity, and sharing of resources in the underlying infrastructure.
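One simple way to picture diagnosis across service, virtual and physical layers is a dependency graph in which failing services implicate the resources they share. The sketch below is only an illustration under assumptions: the graph, the node names and the "shared by all failing services, not used by any healthy one" rule are hypothetical and are not the thesis's algorithm.

```python
def transitive_deps(node, depends_on):
    """All resources reachable from `node` through dependency edges."""
    seen, stack = set(), [node]
    while stack:
        for dep in depends_on.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def suspects(failing, healthy, depends_on):
    """Resources shared by every failing service and used by no healthy one."""
    if not failing:
        return set()
    shared = set.intersection(
        *(transitive_deps(s, depends_on) for s in failing))
    exonerated = set().union(
        *(transitive_deps(s, depends_on) for s in healthy))
    return shared - exonerated


# Hypothetical three-layer topology: services -> VNFs/virtual links -> physical.
DEPENDS_ON = {
    "video": ["vnf_fw", "vlink1"],
    "voip": ["vnf_fw", "vlink2"],
    "web": ["vnf_lb", "vlink2"],
    "vnf_fw": ["host1"],
    "vnf_lb": ["host2"],
    "vlink1": ["switch1"],
    "vlink2": ["switch1"],
}

if __name__ == "__main__":
    print(sorted(suspects(["video", "voip"], ["web"], DEPENDS_ON)))
```

In a programmable network the dependency graph itself changes as services and VNFs are created or migrated, so a run-time approach has to keep such a model up to date, which this static sketch does not attempt.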
|