11

Using Reinforcement Learning in Partial Order Plan Space

Ceylan, Hakan 05 1900 (has links)
Partial order planning is an important approach that solves planning problems without completely specifying the orderings between the actions in the plan. This property provides greater flexibility in executing plans, making partial order planners a preferred choice over other planning methodologies. However, to find partially ordered plans, partial order planners search in plan space rather than in the space of world states, and an uninformed search in plan space leads to poor efficiency. In this thesis, I discuss applying a reinforcement learning method, the first-visit Monte Carlo method, to partial order planning in order to design agents which do not need any training data or heuristics but are still able to make informed decisions in plan space based on experience. Communicating effectively with the agent is crucial in reinforcement learning. I address how this task was accomplished in plan space and present results from an evaluation on a blocks world test bed.
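The first-visit Monte Carlo method referred to above estimates the value of a state by averaging the returns observed after the first time that state is visited in each episode. A minimal sketch of the idea is given below; the episode generator, the state representation, and the reward signal are illustrative assumptions rather than the thesis's actual plan-space design.

```python
from collections import defaultdict

def first_visit_mc(generate_episode, num_episodes=1000, gamma=1.0):
    """First-visit Monte Carlo estimation of state values.

    `generate_episode` is assumed to return a list of (state, reward)
    pairs for one complete episode; states must be hashable.
    """
    values = defaultdict(float)     # running value estimates
    returns_count = defaultdict(int)

    for _ in range(num_episodes):
        episode = generate_episode()
        # Compute the return following each step, working backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G in returns:
            if state in seen:       # only the first visit to a state counts
                continue
            seen.add(state)
            returns_count[state] += 1
            # Incremental average of the observed returns.
            values[state] += (G - values[state]) / returns_count[state]
    return values
```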
12

Using Markov Decision Processes and Reinforcement Learning to Guide Penetration Testers in the Search for Web Vulnerabilities / Användandet av Markov Beslutsprocesser och Förstärkt Inlärning för att Guida Penetrationstestare i Sökandet efter Sårbarheter i Webbapplikationer

Pettersson, Anders, Fjordefalk, Ossian January 2019 (has links)
Bug bounties are an increasingly popular way of performing penetration tests of web applications. User statistics from bug bounty platforms show that many hackers struggle to find bugs. This report explores a way of using Markov decision processes and reinforcement learning to help hackers find vulnerabilities in web applications by building a tool that suggests attack surfaces to examine and vulnerability reports to read to gain the relevant knowledge. The attack surfaces, vulnerabilities and reports are all derived from a taxonomy of web vulnerabilities created in a collaborating project. A Markov decision process (MDP) was defined; it includes the environment, different states of knowledge, and actions that can take a user from one state of knowledge to another. To be able to suggest the best possible next action, the MDP uses a policy that describes the value of entering each state. Each state is given a value, called a Q-value, that indicates how close that state is to a state where a vulnerability has been found: a state has a high Q-value if the knowledge it represents gives a user a high probability of finding a vulnerability, and vice versa. This policy was created using a reinforcement learning algorithm called Q-learning. The tool was implemented as a web application using Java Spring Boot and ReactJS. The resulting tool is best suited for new hackers in the learning process. The current version is trained on the indexed reports of the vulnerability taxonomy, but future versions should be trained on user behaviour collected from the tool.
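Q-learning, the algorithm named above, learns these Q-values by repeatedly taking actions and nudging the value of each visited state-action pair toward the observed reward plus the discounted value of the best next action. A minimal tabular sketch follows; the state and action sets and the `step` function are hypothetical placeholders, not the taxonomy or environment used in the report.

```python
import random
from collections import defaultdict

def q_learning(states, actions, step, episodes=500, max_steps=100,
               alpha=0.1, gamma=0.9, epsilon=0.2):
    """Tabular Q-learning sketch.

    `step(state, action)` is assumed to return (next_state, reward, done);
    for example, reward could be 1 when a state corresponding to a found
    vulnerability is reached (an illustrative assumption, not the report's
    actual reward design).
    """
    Q = defaultdict(float)  # maps (state, action) -> Q-value

    for _ in range(episodes):
        state = random.choice(states)
        for _ in range(max_steps):  # cap episode length
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = step(state, action)
            # Update toward reward plus discounted value of the best next action.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            if done:
                break
            state = next_state
    return Q
```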
13

Approximate Dynamic Programming and Reinforcement Learning - Algorithms, Analysis and an Application

Lakshminarayanan, Chandrashekar January 2015 (has links) (PDF)
Problems involving optimal sequential decision making in uncertain dynamic systems arise in domains such as engineering, science and economics. Such problems can often be cast in the framework of a Markov Decision Process (MDP). Solving an MDP requires computing the optimal value function and the optimal policy. The ideas of dynamic programming (DP) and the Bellman equation (BE) are at the heart of solution methods. The three important exact DP methods are value iteration, policy iteration and linear programming. The exact DP methods compute the optimal value function and the optimal policy. However, they are inadequate in practice because the state space is often large, and one might have to resort to approximate methods that compute sub-optimal policies. Further, in certain cases, the system observations are known only in the form of noisy samples and we need to design algorithms that learn from these samples. In this thesis we study theoretical questions pertaining to approximate and learning algorithms, and also present an interesting application of MDPs in the domain of crowd sourcing.

Approximate Dynamic Programming (ADP) methods handle the issue of large state spaces by computing an approximate value function and/or a sub-optimal policy. In this thesis, we are concerned with conditions that result in provably good policies. Motivated by the limitations of the projected Bellman equation (PBE) in conventional linear algebra, we study the PBE in the (min, +) linear algebra. It is a well-known fact that deterministic optimal control problems with cost/reward criterion are (min, +)/(max, +) linear, and ADP methods have been developed for such systems in the literature. However, it is straightforward to show that infinite horizon discounted reward/cost MDPs are neither (min, +) nor (max, +) linear. We develop novel ADP schemes, namely the Approximate Q Iteration (AQI) and Variational Approximate Q Iteration (VAQI), where the approximate solution is a (min, +) linear combination of a set of basis functions whose span constitutes a subsemimodule. We show that the new ADP methods are convergent and present a bound on the performance of the sub-optimal policy.

The Approximate Linear Program (ALP) makes use of linear function approximation (LFA) and offers theoretical performance guarantees. Nevertheless, the ALP is difficult to solve due to the presence of a large number of constraints, and in practice a reduced linear program (RLP) is solved instead. The RLP has a tractable number of constraints sampled from the original constraints of the ALP. Though the RLP is known to perform well in experiments, theoretical guarantees are available only for a specific RLP obtained under idealized assumptions. In this thesis, we generalize the RLP to define a generalized reduced linear program (GRLP), which has a tractable number of constraints obtained as positive linear combinations of the original constraints of the ALP. The main contribution here is a novel theoretical framework for obtaining error bounds for any given GRLP.

Reinforcement Learning (RL) algorithms can be viewed as sample-trajectory-based solution methods for solving MDPs. Typically, RL algorithms that make use of stochastic approximation (SA) are iterative schemes taking small steps towards the desired value at each iteration. Actor-critic algorithms form an important sub-class of RL algorithms, wherein the critic is responsible for policy evaluation and the actor is responsible for policy improvement. The actor and critic iterations have different step-size schedules; in particular, the step-sizes used by the actor updates generally have to be much smaller than those used by the critic updates. SA schemes that use different step-size schedules for different sets of iterates are known as multi-timescale stochastic approximation schemes. One of the most important conditions required to ensure the convergence of the iterates of a multi-timescale SA scheme is that the iterates need to be stable, i.e., they should be uniformly bounded almost surely. However, the conditions that imply the stability of the iterates in a multi-timescale SA scheme have not been well established. In this thesis, we provide verifiable conditions that imply stability of two-timescale stochastic approximation schemes. As an example, we also demonstrate that the stability of a widely used actor-critic RL algorithm follows from our analysis.

Crowd sourcing (crowd) is a new mode of organizing work by breaking it into smaller chunks of tasks and outsourcing them to a large, distributed group of people in the form of an open call. Recently, crowd sourcing has become a major pool for human intelligence tasks (HITs) such as image labeling, form digitization, natural language processing, machine translation evaluation and user surveys. Large organizations/requesters are increasingly interested in crowd sourcing the HITs generated out of their internal requirements. Task starvation leads to huge variation in the completion times of the tasks posted to the crowd. This is an issue for frequent requesters who desire predictability in completion times, specified in terms of the percentage of tasks completed within a stipulated amount of time. An important task attribute that affects the completion time of a task is its price. However, a pricing policy that does not take the dynamics of the crowd into account might fail to achieve the desired predictability in completion times. Here, we make use of the MDP framework to compute a pricing policy that achieves predictable completion times in simulations as well as real-world experiments.
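For reference, value iteration, one of the three exact DP methods named in the abstract, repeatedly applies the Bellman optimality backup until the value function stops changing. The sketch below is a generic implementation for a small finite MDP; it is not any of the thesis's approximate schemes (AQI, VAQI, or the GRLP), and the transition and reward array shapes are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a finite discounted MDP.

    P: array of shape (A, S, S), P[a, s, s'] = transition probability.
    R: array of shape (A, S), expected immediate reward for taking a in s.
    Returns the optimal value function and a greedy optimal policy.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=0)   # greedy policy w.r.t. the converged values
    return V, policy
```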
