1

Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games

Lanctot, Marc Unknown Date
No description available.
2

Audit Games

Sinha, Arunesh 01 July 2014 (has links)
Modern organizations (e.g., hospitals, banks, social networks, search engines) hold large volumes of personal information, and rely heavily on auditing for enforcement of privacy policies. These audit mechanisms combine automated methods with human input to detect and punish violators. Since human audit resources are limited, and often not sufficient to investigate all potential violations, current state-of-the-art audit tools provide heuristics to guide human effort. However, numerous reports of privacy breaches caused by malicious insiders call into question the effectiveness of these audit mechanisms. Our thesis is that effective audit resource allocation and punishment levels can be efficiently computed by modeling the audit process as a game between a rational auditor and a rational or worst-case auditee. We present several results in support of the thesis. In the worst-case adversary setting, we design a game model taking into account the organizational cost of auditing and the loss from violations. We propose the notion of low regret as a desired audit property and provide a regret minimizing audit algorithm that outputs an optimal audit resource allocation strategy. The algorithm improves upon prior regret bounds in the partial information setting. In the rational adversary setting, we enable punishments by the auditor, and model the adversary's utility as a trade-off between the benefit from violations and the loss due to punishment when detected. Our Stackelberg game model generalizes an existing deployed security game model with punishment parameters. It applies to natural auditing settings with multiple auditors where each auditor is restricted to audit a subset of the potential violations. We provide novel polynomial time algorithms to approximate the non-convex optimization problem used to compute the Stackelberg equilibrium. The algorithms output the optimal audit resource allocation strategy and punishment levels. We also provide a method to reduce the optimization problem size, achieving up to a 5x speedup for realistic instances of the audit problem and for the related security game instances.
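The partial-information regret minimization described here belongs to the family of adversarial bandit algorithms. As a purely illustrative sketch of that family (an EXP3-style update under an assumed loss-feedback hook, not the thesis's algorithm, whose bounds are tighter), one round of audit allocation might look like:

    import math
    import random

    def exp3_audit(num_types, rounds, observe_loss, gamma=0.1):
        # Exponential-weights selection of which violation type to audit each
        # round. observe_loss(i) is a hypothetical hook returning a loss in
        # [0, 1] for the audited type i; only that type's loss is observed
        # (partial information / bandit feedback).
        weights = [1.0] * num_types
        for _ in range(rounds):
            total = sum(weights)
            probs = [(1 - gamma) * w / total + gamma / num_types for w in weights]
            chosen = random.choices(range(num_types), weights=probs)[0]
            loss = observe_loss(chosen)
            estimate = loss / probs[chosen]          # importance-weighted loss estimate
            weights[chosen] *= math.exp(-gamma * estimate / num_types)
        return probs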
3

Using counterfactual regret minimization to create a competitive multiplayer poker agent

Abou Risk, Nicholas 11 1900 (has links)
Games have been used to evaluate and advance techniques in the field of Artificial Intelligence since before computers were invented. Many of these games have been deterministic perfect information games (e.g. Chess and Checkers). A deterministic game has no chance element and in a perfect information game, all information is visible to all players. However, many real-world scenarios involving competing agents can be more accurately modeled as stochastic (non-deterministic), imperfect information games, and this dissertation investigates such games. Poker is one such game played by millions of people around the world; it will be used as the testbed of the research presented in this dissertation. For a specific set of games, two-player zero-sum perfect recall games, a recent technique called Counterfactual Regret Minimization (CFR) computes strategies that are provably convergent to an ε-Nash equilibrium. A Nash equilibrium strategy is very useful in two-player games as it maximizes its utility against a worst-case opponent. However, once we move to multiplayer games, we lose all theoretical guarantees for CFR. Furthermore, we have no theoretical guarantees about the performance of a strategy from a multiplayer Nash equilibrium against two arbitrary opponents. Despite the lack of theoretical guarantees, my thesis is that CFR-generated agents may perform well in multiplayer games. I created several 3-player limit Texas Hold'em Poker agents, and the results of the 2009 Computer Poker Competition demonstrate that these are the strongest 3-player computer Poker agents in the world. I also contend that a good strategy can be obtained by grafting a set of two-player subgame strategies onto a 3-player base strategy when one of the players is eliminated.
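At the core of CFR is regret matching applied at every information set: cumulative per-action regrets are mapped to a mixed strategy, and the average strategy converges toward equilibrium in the two-player zero-sum case. A minimal regret-matching sketch (illustrative only; full CFR additionally weights regrets by counterfactual reach probabilities, and the names below are assumptions):

    def regret_matching(cumulative_regrets):
        # Map cumulative per-action regrets to a mixed strategy: normalize the
        # positive regrets; if none are positive, fall back to uniform play.
        positive = [max(r, 0.0) for r in cumulative_regrets]
        total = sum(positive)
        if total > 0:
            return [p / total for p in positive]
        return [1.0 / len(cumulative_regrets)] * len(cumulative_regrets)

    def update_regrets(cumulative_regrets, action_utilities, strategy):
        # After a traversal, each action's regret grows by how much better that
        # action would have been than the current mixed strategy.
        expected = sum(p * u for p, u in zip(strategy, action_utilities))
        return [r + (u - expected)
                for r, u in zip(cumulative_regrets, action_utilities)]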
4

Using counterfactual regret minimization to create a competitive multiplayer poker agent

Abou Risk, Nicholas Unknown Date
No description available.
5

Minimizing Regret in Combinatorial Bandits and Reinforcement Learning

Talebi Mazraeh Shahi, Mohammad Sadegh January 2017 (has links)
This thesis investigates sequential decision making tasks that fall in the framework of reinforcement learning (RL). These tasks involve a decision maker who repeatedly interacts with an environment modeled by an unknown finite Markov decision process (MDP) and who wishes to maximize a notion of reward accumulated during her experience. Her performance can be measured through the notion of regret, which compares her accumulated expected reward against that achieved by an oracle algorithm always following an optimal behavior. In order to maximize her accumulated reward, or equivalently to minimize the regret, she needs to face a trade-off between exploration and exploitation. The first part of this thesis investigates combinatorial multi-armed bandit (MAB) problems, which are RL problems whose state-space is a singleton. It also addresses some applications that can be cast as combinatorial MAB problems. The number of arms in such problems generically grows exponentially with the number of basic actions, but the rewards of various arms are correlated. Hence, the challenge in such problems is to exploit the underlying combinatorial structure. For these problems, we derive asymptotic (i.e., when the time horizon grows large) lower bounds on the regret of any admissible algorithm and investigate how these bounds scale with the dimension of the underlying combinatorial structure. We then propose several algorithms and provide finite-time analyses of their regret. The proposed algorithms efficiently exploit the structure of the problem, provide better performance guarantees than existing algorithms, and significantly outperform these algorithms in practice. The second part of the thesis concerns RL in an unknown and discrete MDP under the average-reward criterion. We develop some variations of the transportation lemma that could serve as novel tools for the regret analysis of RL algorithms. Revisiting existing regret lower bounds allows us to derive alternative bounds, which suggest that the local variance of the bias function of the MDP, i.e., the variance with respect to next-state transition laws, could serve as a notion of problem complexity for regret minimization in RL. Leveraging these tools also allows us to report a novel regret analysis of the KL-UCRL algorithm for ergodic MDPs. The leading term in our regret bound depends on the local variance of the bias function, thus coinciding with observations obtained from our presented lower bounds. Numerical evaluations in some benchmark MDPs indicate that the leading term of the derived bound can provide an order of magnitude improvement over previously known results for this algorithm.
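In the stochastic bandit special case discussed above, the regret admits a compact definition (standard notation, assumed here rather than taken from the thesis): writing \mu_a for the mean reward of arm a, \mu^{*} = \max_a \mu_a, and A_t for the arm chosen at round t,

    R_T = T\,\mu^{*} - \mathbb{E}\left[ \sum_{t=1}^{T} \mu_{A_t} \right],

so minimizing regret over the horizon T is equivalent to maximizing the expected cumulative reward.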
6

Teorie her a poker / Game theory and poker

Schmid, Martin January 2013 (has links)
This thesis introduces the basic concepts of game theory. Necessary models and solution concepts are described, followed by a summary of the computational complexity of these concepts and of the corresponding algorithms. Poker is formalized as one of the game-theoretic models. State-of-the-art algorithms for extensive form games are explained, with an application to Poker. The thesis also introduces the Annual Computer Poker Competition and the participating programs. Finally, a new result about extensive form games with many actions is presented. Keywords: Game theory, Poker, Nash equilibrium, Extensive form games
7

Hraní her a Deepstack / General Game Playing and Deepstack

Schlindenbuch, Hynek January 2019 (has links)
General game playing is an area of artificial intelligence which focuses on creating agents capable of playing many games from some class. The agents receive the rules just before the match and therefore cannot be specialized for each game. Deepstack is the first artificial intelligence to beat professional human players in heads-up no-limit Texas hold'em poker. While it is specialized for poker, at its core is a general algorithm for playing two-player zero-sum games with imperfect information - continual resolving. In this thesis we introduce a general version of continual resolving and compare its performance against Online Outcome Sampling Monte Carlo Counterfactual Regret Minimization in several games.
8

Analyse de stratégies bayésiennes et fréquentistes pour l'allocation séquentielle de ressources / Analysis of Bayesian and frequentist strategies for sequential resource allocation

Kaufmann, Emilie 01 October 2014 (has links)
In this thesis, we study strategies for sequential resource allocation, under the so-called stochastic multi-armed bandit model. In this model, when an agent draws an arm, he receives as a reward a realization from a probability distribution associated with the arm. In this document, we consider two different bandit problems. In the reward maximization objective, the agent aims at maximizing the sum of rewards obtained during his interaction with the bandit, whereas in the best arm identification objective, his goal is to find the set of m best arms (i.e. arms with highest mean reward), without suffering a loss when drawing 'bad' arms. For these two objectives, we propose strategies, also called bandit algorithms, that are optimal (or close to optimal) in a sense made precise below. Maximizing the sum of rewards is equivalent to minimizing a quantity called regret. Thanks to an asymptotic lower bound on the regret of any uniformly efficient algorithm given by Lai and Robbins, one can define asymptotically optimal algorithms as algorithms whose regret reaches this lower bound. In this thesis, we propose, for two Bayesian algorithms, Bayes-UCB and Thompson Sampling, a finite-time analysis, that is a non-asymptotic upper bound on their regret, in the particular case of bandits with binary rewards. This upper bound allows us to establish the asymptotic optimality of both algorithms. In the best arm identification framework, a possible goal is to determine the number of samples of the arms needed to identify, with high probability, the set of m best arms. We define a notion of complexity for best arm identification in two different settings considered in the literature: the fixed-budget and fixed-confidence settings. We provide new lower bounds on these complexity terms and we analyse new algorithms, some of which reach the lower bound in particular cases of two-armed bandit models and are therefore optimal.
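For the binary-reward bandits analyzed in this thesis, Thompson Sampling has a particularly short description: keep a Beta posterior per arm, sample a mean from each posterior, and play the arm with the largest sample. A generic textbook sketch (the pull() environment hook is an assumption; this is the algorithm the thesis studies, not its analysis):

    import random

    def thompson_sampling_bernoulli(pull, n_arms, horizon):
        # Beta(1, 1) prior on each arm's success probability.
        successes = [0] * n_arms
        failures = [0] * n_arms
        total_reward = 0
        for _ in range(horizon):
            samples = [random.betavariate(successes[a] + 1, failures[a] + 1)
                       for a in range(n_arms)]
            arm = samples.index(max(samples))   # play the best sampled mean
            reward = pull(arm)                  # hypothetical hook returning 0 or 1
            total_reward += reward
            if reward:
                successes[arm] += 1
            else:
                failures[arm] += 1
        return total_reward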
9

Regret Minimization in the Gain Estimation Problem

Tourkaman, Mahan January 2019 (has links)
A novel approach to the gain estimation problem, using a multi-armed bandit formulation, is studied. The gain estimation problem concerns estimating the largest L2-gain that a signal of bounded norm experiences when passing through a linear and time-invariant system. Under certain conditions, this new approach is guaranteed to surpass traditional System Identification methods in terms of accuracy. The bandit algorithms Upper Confidence Bound, Thompson Sampling and Weighted Thompson Sampling are implemented with the aim of designing the optimal input for maximizing the gain of an unknown system. The regret performance of each algorithm is studied using simulations on a test system. Upper Confidence Bound, with the exploration parameter set to zero, performed the best among all tested values for this parameter. Weighted Thompson Sampling performed better than Thompson Sampling.
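The exploration parameter mentioned above typically scales the confidence bonus in a UCB-style index; setting it to zero collapses the index to the empirical mean. A generic sketch (the parameter name c and the exact bonus form are assumptions, not the thesis's notation):

    import math

    def ucb_index(empirical_mean, pull_count, t, c=1.0):
        # Index = empirical mean + confidence bonus scaled by c. With c = 0 the
        # arm with the best empirical mean is always chosen (pure exploitation).
        if pull_count == 0:
            return float("inf")   # force every arm to be tried at least once
        return empirical_mean + c * math.sqrt(2.0 * math.log(t) / pull_count)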
10

Bluffing AI in Strategy Board Game

Leijonhufvud, Johan, Henriksson, Albin January 2021 (has links)
Games have been a field of interest for research in artificial intelligence for decades. It is now over 5 years since AlphaGo, an AI for the strategy game Go, beat world champion Lee Sedol 4-1, which was considered an enormous milestone for AI. Our goal is to make an AI that can play the classic strategy board game Stratego at a competent level. This is achieved by making the AI learn by repeatedly playing against itself to figure out what strategy to use in various situations, using CFR - counterfactual regret minimization. According to our experiments, we were able to accomplish our goal of making a Stratego AI that could play at a sophisticated level for a smaller version of the game. We concluded that it was able to play better than an amateur human player. / Bachelor's degree project in electrical engineering 2021, KTH, Stockholm
