1 |
Scalable Robust Models Under Adversarial Data CorruptionZhang, Xuchao 04 April 2019 (has links)
The presence of noise and corruption in real-world data can be inevitably caused by accidental outliers, transmission loss, or even adversarial data attacks. Unlike traditional random noise usually assume a specific distribution with low corruption ratio, the data collected from crowdsourcing or labeled by weak annotators can contain adversarial data corruption. More challenge, the adversarial data corruption can be arbitrary, unbounded and do not follow any specific distribution. In addition, in the era of data explosion, the fast-growing amount of data makes the robust models more difficult to handle large-scale data sets.
This thesis focuses on the development of methods for scalable robust models under the adversarial data corruption assumptions. Four methods are proposed, including robust regression via heuristic hard-thresholding, online and distributed robust regression with adversarial noises, self-paced robust learning for leveraging clean labels in noisy data, and robust regression via online feature selection with adversarial noises. Moreover, I extended the self-paced robust learning method to its distributed version for the scalability of the proposed algorithm, named distributed self-paced learning in alternating direction method of multiplier. Last, a robust multi-factor personality prediction model is proposed to hand the correlated data noises.
For the first method, existing solutions for robust regression lack rigorous recovery guarantee of regression coefficients under the adversarial data corruption with no prior knowledge of corruption ratio. The proposed contributions of our work include: (1) Propose efficient algorithms to address the robust least-square regression problem; (2) Design effective approaches to estimate the corruption ratio; (3) Provide a rigorous robustness guarantee for regression coefficient recovery; and (4) Conduct extensive experiments for performance evaluation.
For the second method, existing robust learning methods typically focus on modeling the entire dataset at once; however, they may meet the bottleneck of memory and computation as more and more datasets are becoming too large to be handled integrally. The proposed contributions of our work for this task include: (1) Formulate a framework for the scalable robust least-squares regression problem; (2) Propose online and distributed algorithms to handle the adversarial corruption; (3) Provide a rigorous robustness guarantee for regression coefficient recovery; and (4) Conduct extensive experiments for performance evaluations.
For the third method, leveraging the prior knowledge of clean labels in noisy data is actually a crucial issue in practice, but existing robust learning methods typically focus more on eliminating noisy data. However, the data collected by ``weak annotator" or crowd-sourcing can be too noisy for existing robust methods to train an accurate model. Moreover, existing work that utilize additional clean labels are usually designed for some specific problems such as image classification. These methods typically utilize clean labels in large-scale noisy data based on their additional domain knowledge; however, these approaches are difficult to handle extremely noisy data and relied on their domain knowledge heavily, which makes them difficult be used in more general problems. The proposed contributions of our work for this task include: (1) Formulating a framework to leverage the clean labels in noisy data; (2) Proposing a self-paced robust learning algorithm to train models under the supervision of clean labels; (3) Providing a theoretical analysis for the convergence of the proposed algorithm; and (4) Conducting extensive experiments for performance evaluations.
For the fourth method, the presence of data corruption in user-generated streaming data, such as social media, motivates a new fundamental problem that learns reliable regression coefficient when features are not accessible entirely at one time. Until now, several important challenges still cannot be handled concurrently: 1) corrupted data estimation when only partial features are accessible; 2) online feature selection when data contains adversarial corruption; and 3) scaling to a massive dataset. This paper proposes a novel RObust regression algorithm via Online Feature Selection (textit{RoOFS}) that concurrently addresses all the above challenges. Specifically, the algorithm iteratively updates the regression coefficients and the uncorrupted set via a robust online feature substitution method. We also prove that our algorithm has a restricted error bound compared to the optimal solution. Extensive empirical experiments in both synthetic and real-world data sets demonstrated that the effectiveness of our new method is superior to that of existing methods in the recovery of both feature selection and regression coefficients, with very competitive efficiency.
For the fifth method, existing self-paced learning approaches typically focus on modeling the entire dataset at once; however, this may introduce a bottleneck in terms of memory and computation, as today's fast-growing datasets are becoming too large to be handled integrally. The proposed contributions of our work for this task include: (1) Reformulate the self-paced problem into a distributed setting.; (2) A distributed self-paced learning algorithm based on consensus ADMM is proposed to solve the textit{SPL} problem in a distributed setting; (3) A theoretical analysis is provided for the convergence of our proposed textit{DSPL} algorithm; and (4) Extensive experiments have been conducted utilizing both synthetic and real-world data based on a robust regression task.
For the last method, personality prediction in multiple factors, such as openness and agreeableness, is growing in interest especially in the context of social media, which contains massive online posts or likes that can potentially reveal an individual's personality. However, the data collected from social media inevitably contains massive amounts of noise and corruption. To address it, traditional robust methods still suffer from several important challenges, including 1) existence of correlated corruption among multiple factors, 2) difficulty in estimating the corruption ratio in multi-factor data, and 3) scalability to massive datasets. This paper proposes a novel robust multi-factor personality prediction model that concurrently addresses all the above challenges by developing a distributed robust regression algorithm. Specifically, the algorithm optimizes regression coefficients of each factor in parallel with a heuristically estimated corruption ratio and then consolidates the uncorrupted set from multiple factors in two strategies: global consensus and majority voting. We also prove that our algorithm benefits from strong guarantees in terms of convergence rates and coefficient recovery, which can be utilized as a generic framework for the multi-factor robust regression problem with correlated corruption property. Extensive experiment on synthetic and real dataset demonstrates that our algorithm is superior to those of existing methods in both effectiveness and efficiency. / Doctor of Philosophy / Social media has experienced a rapid growth during the past decade. Millions of users of sites such as Twitter have been generating and sharing a wide variety of content including texts, images, and other metadata. In addition, social media can be treated as a social sensor that reflects different aspects of our society. Event analytics in social media have enormous significance for applications like disease surveillance, business intelligence, and disaster management. Social media data possesses a number of important characteristics including dynamics, heterogeneity, noisiness, timeliness, big volume, and network properties. These characteristics cause various new challenges and hence invoke many interesting research topics, which will be addressed here. This dissertation focuses on the development of five novel methods for social media-based spatiotemporal event detection and forecasting. The first of these is a novel unsupervised approach for detecting the dynamic keywords of spatial events in targeted domains. This method has been deployed in a practical project for monitoring civil unrest events in several Latin American regions. The second builds on this by discovering the underlying development progress of events, jointly considering the structural contexts and spatiotemporal burstiness. The third seeks to forecast future events using social media data. The basic idea here is to search for subtle patterns in specific cities as indicators of ongoing or future events, where each pattern is defined as a burst of context features (keywords) that are relevant to a specific event. For instance, an initial expression of discontent gas price increases could actually be a potential precursor to a more general protest about government policies. Beyond social media data, in the fourth method proposed here, multiple data sources are leveraged to reflect different aspects of the society for event forecasting. This addresses several important problems, including the common phenomena that different sources may come from different geographical levels and have different available time periods. The fifth study is a novel flu forecasting method based on epidemics modeling and social media mining. A new framework is proposed to integrate prior knowledge of disease propagation mechanisms and real-time information from social media.
|
2 |
Software Techniques For Dependable ExecutionJanuary 2018 (has links)
abstract: Advances in semiconductor technology have brought computer-based systems intovirtually all aspects of human life. This unprecedented integration of semiconductor based systems in our lives has significantly increased the domain and the number
of safety-critical applications – application with unacceptable consequences of failure. Software-level error resilience schemes are attractive because they can provide commercial-off-the-shelf microprocessors with adaptive and scalable reliability.
Among all software-level error resilience solutions, in-application instruction replication based approaches have been widely used and are deemed to be the most effective. However, existing instruction-based replication schemes only protect some part of computations i.e. arithmetic and logical instructions and leave the rest as unprotected. To improve the efficacy of instruction-level redundancy-based approaches, we developed several error detection and error correction schemes. nZDC (near Zero silent
Data Corruption) is an instruction duplication scheme which protects the execution of whole application. Rather than detecting errors on register operands of memory and control flow operations, nZDC checks the results of such operations. nZDC en
sures the correct execution of memory write instruction by reloading stored value and checking it against redundantly computed value. nZDC also introduces a novel control flow checking mechanism which replicates compare and branch instructions and
detects both wrong direction branches as well as unwanted jumps. Fault injection experiments show that nZDC can improve the error coverage of the state-of-the-art schemes by more than 10x, without incurring any more performance penalty. Further
more, we introduced two error recovery solutions. InCheck is our backward recovery solution which makes light-weighted error-free checkpoints at the basic block granularity. In the case of error, InCheck reverts the program execution to the beginning of last executed basic block and resumes the execution by the aid of preserved in formation. NEMESIS is our forward recovery scheme which runs three versions of computation and detects errors by checking the results of all memory write and branch
operations. In the case of a mismatch, NEMESIS diagnosis routine decides if the error is recoverable. If yes, NEMESIS recovery routine reverts the effect of error from the program state and resumes program normal execution from the error detection
point. / Dissertation/Thesis / Doctoral Dissertation Computer Engineering 2018
|
3 |
InCheck - An Integrated Recovery Methodology for Fine-grained Soft-Error Detection SchemesJanuary 2016 (has links)
abstract: Soft errors are considered as a key reliability challenge for sub-nano scale transistors. An ideal solution for such a challenge should ultimately eliminate the effect of soft errors from the microprocessor. While forward recovery techniques achieve fast recovery from errors by simply voting out the wrong values, they incur the overhead of three copies execution. Backward recovery techniques only need two copies of execution, but suffer from check-pointing overhead.
In this work I explored the efficiency of integrating check-pointing into the application and the effectiveness of recovery that can be performed upon it. After evaluating the available fine-grained approaches to perform recovery, I am introducing InCheck, an in-application recovery scheme that can be integrated into instruction-duplication based techniques, thus providing a fast error recovery. The proposed technique makes light-weight checkpoints at the basic-block granularity, and uses them for recovery purposes.
To evaluate the effectiveness of the proposed technique, 10,000 fault injection experiments were performed on different hardware components of a modern ARM in-order simulated processor. InCheck was able to recover from all detected errors by replaying about 20 instructions, however, the state of the art recovery scheme failed more than 200 times. / Dissertation/Thesis / Masters Thesis Computer Science 2016
|
4 |
Error isolation in distributed systemsBehrens, Diogo 25 May 2016 (has links) (PDF)
In distributed systems, if a hardware fault corrupts the state of a process, this error might propagate as a corrupt message and contaminate other processes in the system, causing severe outages. Recently, state corruptions of this nature have been observed surprisingly often in large computer populations, e.g., in large-scale data centers. Moreover, since the resilience of processors is expected to decline in the near future, the likelihood of state corruptions will increase even further.
In this work, we argue that preventing the propagation of state corruption should be a first-class requirement for large-scale fault-tolerant distributed systems. In particular, we propose developers to target error isolation, the property in which each correct process ignores any corrupt message it receives.
Typically, a process cannot decide whether a received message is corrupt or not. Therefore, we introduce hardening as a class of principled approaches to implement error isolation in distributed systems. Hardening techniques are (semi-)automatic transformations that enforce that each process appends an evidence of good behavior in the form of error codes to all messages it sends. The techniques “virtualize” state corruptions into more benign failures such as crashes and message omissions: if a faulty process fails to detect its state corruption and abort, then hardening guarantees that any corrupt message the process sends has invalid error codes. Correct processes can then inspect received messages and drop them in case they are corrupt.
With this dissertation, we contribute theoretically and practically to the state of the art in fault-tolerant distributed systems. To show that hardening is possible, we design, formalize, and prove correct different hardening techniques that enable existing crash-tolerant designs to handle state corruption with minimal developer intervention. To show that hardening is practical, we implement and evaluate these techniques, analyzing their effect on the system performance and their ability to detect state corruptions in practice.
|
5 |
Scalable error isolation for distributed systems: modeling, correctness proofs, and additional experimentsBehrens, Diogo, Serafini, Marco, Arnautov, Sergei, Junqueira, Flavio, Fetzer, Christof 01 June 2016 (has links) (PDF)
This technical report complements the paper entitled “Scalable error isolation for distributed systems” published at USENIX NSDI 15.
|
6 |
Error isolation in distributed systemsBehrens, Diogo 14 January 2016 (has links)
In distributed systems, if a hardware fault corrupts the state of a process, this error might propagate as a corrupt message and contaminate other processes in the system, causing severe outages. Recently, state corruptions of this nature have been observed surprisingly often in large computer populations, e.g., in large-scale data centers. Moreover, since the resilience of processors is expected to decline in the near future, the likelihood of state corruptions will increase even further.
In this work, we argue that preventing the propagation of state corruption should be a first-class requirement for large-scale fault-tolerant distributed systems. In particular, we propose developers to target error isolation, the property in which each correct process ignores any corrupt message it receives.
Typically, a process cannot decide whether a received message is corrupt or not. Therefore, we introduce hardening as a class of principled approaches to implement error isolation in distributed systems. Hardening techniques are (semi-)automatic transformations that enforce that each process appends an evidence of good behavior in the form of error codes to all messages it sends. The techniques “virtualize” state corruptions into more benign failures such as crashes and message omissions: if a faulty process fails to detect its state corruption and abort, then hardening guarantees that any corrupt message the process sends has invalid error codes. Correct processes can then inspect received messages and drop them in case they are corrupt.
With this dissertation, we contribute theoretically and practically to the state of the art in fault-tolerant distributed systems. To show that hardening is possible, we design, formalize, and prove correct different hardening techniques that enable existing crash-tolerant designs to handle state corruption with minimal developer intervention. To show that hardening is practical, we implement and evaluate these techniques, analyzing their effect on the system performance and their ability to detect state corruptions in practice.
|
7 |
Scalable error isolation for distributed systems: modeling, correctness proofs, and additional experimentsBehrens, Diogo, Serafini, Marco, Arnautov, Sergei, Junqueira, Flavio, Fetzer, Christof 01 June 2016 (has links)
This technical report complements the paper entitled “Scalable error isolation for distributed systems” published at USENIX NSDI 15.
|
8 |
Non-Equilibrium Disordering Processes In binary Systems Due to an Active AgentTriampo, Wannapong 11 April 2001 (has links)
In this thesis, we study the kinetic disordering of systems interacting with an agent or a walker. Our studies divide naturally into two classes: for the first, the dynamics of the walker conserves the total magnetization of the system, for the second, it does not. These distinct dynamics are investigated in part I and II respectively.
In part I, we investigate the disordering of an initially phase-segregated binary alloy due to a highly mobile vacancy which exchanges with the alloy atoms. This dynamics clearly conserves the total magnetization. We distinguish three versions of dynamic rules for the vacancy motion, namely a pure random walk , an "active" and a biased walk. For the random walk case, we review and reproduce earlier work by Z. Toroczkai et. al., [9] which will serve as our base-line. To test the robustness of these findings and to make our model more accessible to experimental studies, we investigated the effects of finite temperatures ("active walks") as well as external fields (biased walks). To monitor the disordering process, we define a suitable disorder parameter, namely the number of broken bonds, which we study as a function of time, system size and vacancy number. Using Monte Carlo simulations and a coarse-grained field theory, we observe that the disordering process exhibits three well separated temporal regimes. We show that the later stages exhibit dynamic scaling, characterized by a set of exponents and scaling functions. For the random and the biased case, these exponents and scaling functions are computed analytically in excellent agreement with the simulation results. The exponents are remarkably universal. We conclude this part with some comments on the early stage, the interfacial roughness and other related features.
In part II, we introduce a model of binary data corruption induced by a Brownian agent or random walker. Here, the magnetization is not conserved, being related to the density of corrupted bits ρ. Using both continuum theory and computer simulations, we study the average density of corrupted bits, and the associated density-density correlation function, as well as several other related quantities. In the second half, we extend our investigations in three main directions which allow us to make closer contact with real binary systems. These are i) a detailed analysis of two dimensions, ii) the case of competing agents, and iii) the cases of asymmetric and quenched random couplings. Our analytic results are in good agreement with simulation results. The remarkable finding of this study is the robustness of the phenomenological model which provides us with the tool, continuum theory, to understand the nature of such a simple model. / Ph. D.
|
Page generated in 0.0927 seconds