Global ETD Search

21	Efficient end-to-end monitoring for fault management in distributed systems Feng, Dawei 27 March 2014 In this dissertation, we present our work on fault management in distributed systems, motivated by the monitoring of faults and abrupt changes in large computing systems such as the grid and the cloud. Instead of building complete a priori knowledge of the software and hardware infrastructures, as conventional detection or diagnosis methods do, we propose to use appropriate techniques to perform end-to-end monitoring of such large-scale systems, leaving the inaccessible details of the components involved in a black box. For the fault monitoring of a distributed system, we first model this probe-based application as a static collaborative prediction (CP) task, and experimentally demonstrate the effectiveness of CP methods by using the max-margin matrix factorization method. We further introduce active learning to the CP framework and show its critical advantage in dealing with highly imbalanced data, which is especially useful for identifying the minority fault class. We then extend static fault monitoring to the sequential case by proposing the sequential matrix factorization (SMF) method. SMF takes a sequence of partially observed matrices as input and produces predictions using information from both the current and past time windows. Active learning is also applied to SMF, so that the highly imbalanced data can be handled properly. In addition to the sequential methods, smoothing the estimation sequence has been shown to be a practically useful trick for enhancing sequential prediction performance. Since the stationarity assumption used in static and sequential fault monitoring becomes unrealistic in the presence of abrupt changes, we propose a semi-supervised online change detection (SSOCD) framework to detect intended changes in time series data. 
In this way, the static model of the system can be recomputed once an abrupt change is detected. In SSOCD, an unsupervised offline method is proposed to analyze a sample data series. The change points thus detected are used to train a supervised online model, which decides online whether a change is present in the arriving data sequence. State-of-the-art change detection methods are employed to demonstrate the usefulness of the framework. All presented work is verified on real-world datasets. Specifically, the fault monitoring experiments are conducted on a dataset collected from the Biomed grid infrastructure within the European Grid Initiative, and the abrupt change detection framework is verified on a dataset concerning performance changes of a high-traffic online site. Fault management; Collaborative prediction; End-to-end monitoring; Sequential matrix factorization; Sequential change detection; Semi-supervised change detection
22	Enhanced fault recovery methods for protected traffic services in GMPLS networks Calle Ortega, Eusebi 07 May 2004 New network technologies allow us to transport ever larger volumes of information and network traffic with different priority levels. In this scenario, where better quality of service is offered, the consequences of a failure in a link or a node become more important. Multiprotocol Label Switching (MPLS), together with its extension to Generalized MPLS (GMPLS), provides fast failure recovery mechanisms by establishing redundant paths, Label Switched Paths (LSPs), to be used as alternative paths. In case of failure, these paths can be used to redirect traffic. The main objective of this thesis has been to improve some of the current MPLS/GMPLS failure recovery mechanisms, in order to support the protection requirements of the services provided by the new Internet. To carry out this evaluation, several quality-of-protection parameters have been taken into account, such as failure recovery time, packet loss and resource consumption. In this thesis we present a complete review and comparison of the main MPLS-based failure recovery methods. This analysis includes path protection methods (global backups, reverse backups and 1+1 protection), local protection methods and segment protection methods. The extension of these mechanisms to optical networks through the control plane provided by GMPLS has also been taken into account. In a first phase of this work, each failure recovery method is analyzed without taking resource or topology constraints into account. This analysis gives a first classification of the best protection mechanisms in terms of packet loss and recovery time. This first analysis is not applicable to real networks. 
To take this new scenario into account, in a second phase we analyze routing algorithms in which these network limitations and constraints are considered. Some of the main quality-of-service routing algorithms and some of the main routing proposals for MPLS networks are presented. Most current routing algorithms do not consider the establishment of alternative routes, or they use the same objectives to select both working and protection paths. To improve the level of protection, we introduce and formalize two new concepts: the network failure probability and the failure impact. An analysis of the network at the physical level provides a first element for evaluating the level of protection in terms of network reliability and availability. We formalize the impact of a failure in terms of quality-of-service degradation (delay and packet loss). We explain our proposal for reducing the failure probability and the failure impact. Finally, we give a new definition and classification of network services according to the required values of failure probability and impact. One of the aspects we highlight from the results of this thesis is that global path protection mechanisms maximize network reliability, while local or segment protection techniques minimize the failure impact. Minimum impact and maximum reliability can therefore be achieved by applying local protection throughout the network, but this is not a scalable proposal in terms of resource consumption. We propose an intermediate mechanism that applies segment protection combined with our failure probability evaluation model. In summary, this thesis presents several mechanisms for analyzing the protection level of the network. The results of the proposed models and mechanisms improve reliability and minimize the impact of a failure on the network. 
/ New network technology enables increasingly higher volumes of information to be carried. Various types of mission-critical, higher-priority traffic are now transported over these networks. In this scenario, when offering better quality of service, the consequences of a fault in a link or node become more pronounced. Multiprotocol Label Switching (MPLS) and its extension, Generalized MPLS (GMPLS), provide fast mechanisms for recovery from failures by establishing redundant Label Switched Paths (LSPs) as backup paths. With these backups, traffic can always be redirected in case of failure. The main objective of this thesis is to improve some of the current MPLS/GMPLS fault recovery methods in order to support the protection requirements of the new Internet services. Quality-of-protection parameters such as fault recovery time, packet loss and resource consumption are considered. In this thesis a review and detailed comparison of the MPLS fault recovery methods are presented. Path protection methods (global backups, reverse backups and 1+1 methods), as well as segment protection and local methods, are included in this analysis. The extension of these mechanisms to optical networks using the GMPLS control plane is also taken into account. In the first phase, MPLS fault recovery methods are analyzed without taking into account resource or network topology constraints. This analysis yields a first classification of the best protection methods in terms of packet loss and recovery time. This first analysis cannot be applied to real networks. In real networks, bandwidth or network topology constraints can force a change in the a priori optimal protection choice. In this new scenario, current routing algorithms must be analyzed. The main aspects of the QoS routing methods are introduced, and some of these mechanisms are described and compared. 
QoS routing algorithms do not include protection as a main objective and, moreover, the same QoS objectives used for selecting the working path are used for selecting the backup path. In order to evaluate the quality of protection, two novel concepts are introduced and analyzed: the network failure probability and the failure impact. The physical network provides an initial value of the network protection level in terms of network reliability and availability. A proposal to evaluate network reliability is introduced, and a formulation to calculate the failure impact (the QoS degradation in terms of packet loss and delay) is presented. A proposal to reduce the failure probability and failure impact, as well as the enhancement of some current routing algorithms in order to achieve better protection, are explained. A review of the protection requirements of traffic services and a new classification, based on the failure probability and failure impact values, are also provided in this work. Results show that path protection schemes improve network reliability. Segment/local protection schemes reduce the network failure impact. Minimum impact with maximum reliability can be achieved using local protection throughout the entire network. However, it is not scalable in terms of resource consumption. In this case our failure probability evaluation model can be used to minimize the required resources. Results demonstrate the reduction of the failure impact when segment protection is combined with our network reliability evaluation model in different network scenarios. In summary, an in-depth analysis is carried out and a formulation to evaluate the network protection level is presented. This evaluation is based on network reliability maximization and failure impact reduction in terms of QoS degradation. A scalable proposal in terms of resource consumption, detailed and experimentally analyzed, offers the required level of protection in different network scenarios for different traffic services. 
Fault management; QoS routing algorithms; Path protection; MPLS/GMPLS; 004; 68
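The reliability result reported in entry 22 (path protection lowers the probability that a connection is lost) can be illustrated with a small worked example. This is a minimal sketch and not the formulation used in the thesis: it assumes that links fail independently, that the backup path shares no link with the working path, and the per-link failure probabilities are made-up values. The function names are hypothetical.

```python
from math import prod


def path_failure_probability(link_failure_probs):
    """Probability that at least one link on a path fails.

    Assumes links fail independently (an illustrative assumption).
    """
    return 1.0 - prod(1.0 - p for p in link_failure_probs)


def protected_failure_probability(working_links, backup_links):
    """Failure probability of a connection with a link-disjoint backup path.

    The connection is lost only if both paths fail, so under independence
    the two path failure probabilities multiply.
    """
    return (path_failure_probability(working_links)
            * path_failure_probability(backup_links))


# Made-up example values: three working links, four backup links.
working = [0.01, 0.02, 0.01]
backup = [0.01, 0.01, 0.02, 0.01]

unprotected = path_failure_probability(working)
protected = protected_failure_probability(working, backup)
print(f"unprotected: {unprotected:.6f}, with disjoint backup: {protected:.6f}")
```

Under these assumptions the protected connection fails only when both paths fail, which is why a global backup reduces the failure probability even though the backup path itself is longer.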
23	Gerenciamento de Faltas em Computação em Nuvem: Um Mapeamento Sistemático de Literatura [Fault Management in Cloud Computing: A Systematic Literature Mapping] Leite Neto, Clodoaldo Brasilino 22 January 2014 Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Background: With the growing popularity of cloud computing, one challenge in this discipline is managing the faults that may occur in such large infrastructures, which, because of their size, are more prone to faults, errors and failures. A work that efficiently maps the solutions already created by researchers should help to visualize gaps in these research fields. Aims: This work aims to find research gaps in the cloud computing and fault management domains, as well as to build a social network of researchers in the area. Method: We conducted a systematic mapping study to collect, filter and classify scientific works in this area. The 4535 scientific papers found on major search engines were filtered, and the remaining 166 papers were classified according to a taxonomy described in this work. Results: We found that IaaS is the most explored service model in the selected studies. The main dependability functions explored were Tolerance and Removal, and the attributes were Reliability and Availability. Most papers were classified by research type as Solution Proposal. Conclusion: This work summarizes and classifies the research effort conducted on fault management in cloud computing, providing a good starting point for further research in this area. / Background: With the great growth in popularity of cloud computing, one challenge in this area is managing the failures that may occur in the large computing infrastructures built to support computing as a service. 
Because they are extensive, they experience more faults, failures and errors. A work that maps the solutions already created by researchers in a simple and efficient way can help to visualize opportunities and saturated topics in this research area. Aims: This work aims to systematically map the whole research effort applied to fault management in cloud computing, so as to make it easier to identify little-explored areas that may eventually represent new research opportunities. Methodology: This work uses the evidence-based research methodology, through the systematic mapping method, with the study built in three stages of study selection. Method: We conducted a systematic mapping to collect, filter and classify scientific works in the area. Initially, 4535 scientific papers were collected from the major search engines; after three filtering stages, these were reduced to 166 papers. The remaining papers were classified according to the taxonomy defined in this work. Results: IaaS is the most explored area in the selected studies. The most explored fault management functions are Tolerance and Removal, and the attributes are Reliability and Availability. Most works were classified under the research type Solution Proposal. Conclusion: This work summarizes and classifies the research effort conducted on fault management in cloud computing, providing a starting point for future research in this area. Cloud Computing; Fault Management; Dependability; Mapping Study; Evidence-Based Research; Secondary Study
24	Adaptive Fault Tolerance Strategies for Large Scale Systems George, Cijo January 2012 Exascale systems of the future are predicted to have a mean time between node failures (MTBF) of less than one hour. At such a low MTBF, the number of processors available to a long-running application can vary widely throughout its execution. Employing traditional fault tolerance strategies like periodic checkpointing in these highly dynamic environments may not be effective because of the high number of application failures, which result in a large amount of work lost to rollbacks in addition to increased recovery overheads. In this context, it is necessary to have fault tolerance strategies that can adapt to changing node availability and also help avoid a significant number of application failures. In this thesis, we present two adaptive fault tolerance strategies that make use of node failure prediction mechanisms to provide proactive fault tolerance for long-running parallel applications on large-scale systems. The first part of the thesis deals with an adaptive fault tolerance strategy for malleable applications. We present ADFT, an adaptive fault tolerance framework for long-running malleable applications that maximizes application performance in the presence of failures. We first develop cost models that consider factors such as the accuracy of node failure predictions and application scalability, for evaluating the benefits of various fault tolerance actions including checkpointing, live migration and rescheduling. Our adaptive framework then uses the cost models to make runtime decisions, dynamically selecting fault tolerance actions at different points of application execution to minimize application failures and maximize performance. 
Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in work done by the application in the presence of failures, and is effective even for petascale and exascale systems. In the second part of the thesis, we present a fault tolerance strategy based on adaptive process replication, which provides fault tolerance by partially replicating a set of application processes. This fault tolerance framework periodically changes the set of replicated processes (the replicated set) based on node failure predictions to avoid application failures. We have developed a prototype MPI implementation, PAREP-MPI, that allows the replicated set of processes to be changed dynamically for MPI applications. Experiments with real scientific applications on real systems have shown that the overhead of PAREP-MPI is minimal. We have shown using simulations with real and synthetic failure traces that our strategy involving adaptive process replication significantly outperforms existing mechanisms, providing up to 20% improvement in application efficiency even for exascale systems. Significant observations are also made which can drive future research efforts in fault tolerance for large and very large scale systems. Fault-tolerant Computing; Large Scale Systems; Adaptive Fault Tolerance; Adaptive Process Replication; Large Scale Systems - Fault Tolerance; Malleability and Rescheduling; Large Scale Parallel Systems; Proactive Fault Tolerance; High Performance Computing; Adaptive Fault Management; Computer Science
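Entry 24 argues that periodic checkpointing becomes less effective as MTBF drops below one hour. A rough way to see the trend is Young's first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF), where C is the time to write one checkpoint. The sketch below uses that textbook estimate, which is valid when C is much smaller than the MTBF; it is not the cost model developed in the thesis, and the five-minute checkpoint cost and the function names are assumptions made for illustration.

```python
from math import sqrt


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the compute time between checkpoints (seconds)."""
    return sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_overhead(checkpoint_cost_s, mtbf_s):
    """Fraction of each compute-plus-checkpoint cycle spent writing the checkpoint."""
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / (interval + checkpoint_cost_s)


CHECKPOINT_COST_S = 300.0  # assumed value: 5 minutes per checkpoint
for mtbf_hours in (24.0, 1.0):
    mtbf_s = mtbf_hours * 3600.0
    print(f"MTBF {mtbf_hours:>4} h: interval {young_interval(CHECKPOINT_COST_S, mtbf_s):7.0f} s, "
          f"checkpoint overhead {checkpoint_overhead(CHECKPOINT_COST_S, mtbf_s):.1%}")
```

With these assumed numbers, the time spent writing checkpoints rises from about 4% of each cycle at a 24-hour MTBF to about 17% at a one-hour MTBF, and that figure does not yet count the work lost to rollbacks after each failure.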
25	Cross-layer self-diagnosis for services over programmable networks / Auto-diagnostic multi-couche pour services sur réseaux programmables Sánchez Vílchez, José Manuel 07 July 2016 Today's networks serve millions of mobile customers and are characterized by heterogeneous equipment, heterogeneous transport and management protocols, and vertical management tools that are very difficult to integrate into their infrastructure. Fault management is far from being automated and intelligent: about 40% of alarms are redundant, and at most 1 or 2% of alarms are correlated in an operations center. This indicates a significant overflow of alarms toward human administrators, which results in high OPEX given the need to hire expert staff to carry out fault management tasks. In conclusion, the current level of automation of fault management tasks in telecom networks is not at all adequate for programmable networks, which promise resource programmability and flexibility in order to reduce the time-to-market of new services. Automating fault management becomes increasingly necessary with the arrival of programmable networks, SDN (Software-Defined Networking), NFV (Network Functions Virtualization) and the Cloud. Indeed, these paradigms accelerate the convergence between the networking and IT domains, which in turn increasingly accelerates the transformation of today's telecom networks and leads to rethinking network and service management operations, in particular fault management operations. This thesis considers the application of self-healing principles in SDN- and NFV-based infrastructures, focusing on self-diagnosis as the main enabler of self-healing. 
The core of this thesis is the design of a diagnosis approach able to continuously diagnose dynamic virtualized services and their dependencies on virtual resources (VNFs and virtual links), as well as the dependencies of those virtual resources on the underlying physical infrastructure, taking into account mobility, dynamicity and resource sharing in the underlying infrastructure. / Current networks serve billions of mobile customer devices. They encompass heterogeneous equipment, transport and management protocols, and vertical management tools, which are very difficult and costly to integrate. Fault management operations are far from being automated and intelligent: around 40% of alarms are redundant, and at most around 1-2% of alarms are correlated in a medium-size operational center. This indicates a significant alarm overflow for human administrators, which results in high OPEX due to the growing need to employ highly skilled people to perform fault management tasks. In conclusion, the current level of automation in fault management tasks in telecom networks is not at all adequate for programmable networks, which promise a high degree of programmability and flexibility to reduce the time-to-market. Automation of fault management becomes more necessary with the advent of programmable networks, led by SDN (Software-Defined Networking), NFV (Network Functions Virtualization) and the Cloud. Indeed, the rise of those paradigms accelerates the convergence between the network and IT realms, which in turn accelerates the transformation of current networks, leading to a rethinking of network and service management and operations, in particular fault management operations. This thesis envisages the application of self-healing principles in combined SDN and NFV infrastructures, focusing on self-diagnosis tasks as the main enabler of self-healing. 
The core of the thesis is to devise a self-diagnosis approach able to diagnose, at run time, dynamic virtualized networking services and their dependencies on virtualized resources (VNFs and virtual links), as well as the dependencies of those virtualized resources on the underlying network infrastructure, taking into account the mobility, dynamicity and sharing of resources in the underlying infrastructure. Autonomics; Self-healing systems; Programmable networks; SDN; NFV; Fault management; Bayesian networks; Self-modeling; Self-diagnosis; Alarm correlation; Fault isolation
