Global ETD Search

11	Räuber, Rebellen, Rivalen, Rächer : Studien zu Latrones im Römischen Reich / Grünewald, Thomas. January 1999 (has links) Habilitationsschrift--Fachbereich I, Geistes- und Gesellschaftswissenschaften--Duisburg--Gerhard-Mercator-Universität-Gesamthochschule, 1997/98. / Bibliogr. p. 237-238. Index.
12	Contribution à l'étude du banditisme social à Cuba : l'histoire et le mythe de Manuel Garcia, " Rey de los Campos de Cuba ", 1851-1895. Poumier-Taquechel, Maria. January 1986 (has links) TH. 3e cycle--Esp.--Toulouse 2, 1982. / Index.
13	Multi-armed bandits with unconventional feedback / Bandits multi-armés avec rétroaction partielle Gajane, Pratik 14 November 2017 (has links) Dans cette thèse, nous étudions des problèmes de prise de décisions séquentielles dans lesquels, pour chacune de ses décisions, l'apprenant reçoit une information qu'il utilise pour guider ses décisions futures. Pour aller au-delà du retour d’information conventionnel tel qu'il a été bien étudié pour des problèmes de prise de décision séquentielle tels que les bandits multi-bras, nous considérons des formes de retour d’information partielle motivées par des applications pratiques.En premier, nous considérons le problème des bandits duellistes, dans lequel l'apprenant sélectionne deux actions à chaque pas de temps et reçoit en retour une information relative (i.e. de préférence) entre les valeurs instantanées de ces deux actions.En particulier, nous proposons un algorithme optimal qui permet à l'apprenant d'obtenir un regret cumulatif quasi-optimal (le regret est la différence entre la récompense cumulative optimale et la récompense cumulative constatée de l’apprenant). Dans un second temps, nous considérons le problème des bandits corrompus, dans lequel un processus de corruption stochastique perturbe le retour d’information. Pour ce problème aussi, nous concevons des algorithmes pour obtenir un regret cumulatif asymptotiquement optimal. En outre, nous examinons la relation entre ces deux problèmes dans le cadre du monitoring partiel qui est un paradigme générique pour la prise de décision séquentielle avec retour d'information partielle. / The multi-armed bandit (MAB) problem is a mathematical formulation of the exploration-exploitation trade-off inherent to reinforcement learning, in which the learner chooses an action (symbolized by an arm) from a set of available actions in a sequence of trials in order to maximize their reward. In the classical MAB problem, the learner receives absolute bandit feedback i.e. it receives as feedback the reward of the arm it selects. In many practical situations however, different kind of feedback is more readily available. In this thesis, we study two of such kinds of feedbacks, namely, relative feedback and corrupt feedback.The main practical motivation behind relative feedback arises from the task of online ranker evaluation. This task involves choosing the optimal ranker from a finite set of rankers using only pairwise comparisons, while minimizing the comparisons between sub-optimal rankers. This is formalized by the MAB problem with relative feedback, in which the learner selects two arms instead of one and receives the preference feedback. We consider the adversarial formulation of this problem which circumvents the stationarity assumption over the mean rewards for the arms. We provide a lower bound on the performance measure for any algorithm for this problem. We also provide an algorithm called "Relative Exponential-weight algorithm for Exploration and Exploitation" with performance guarantees. We present a thorough empirical study on several information retrieval datasets that confirm the validity of these theoretical results.The motivating theme behind corrupt feedback is that the feedback the learner receives is a corrupted form of the corresponding reward of the selected arm. Practically such a feedback is available in the tasks of online advertising, recommender systems etc. We consider two goals for the MAB problem with corrupt feedback: best arm identification and exploration-exploitation. For both the goals, we provide lower bounds on the performance measures for any algorithm. We also provide various algorithms for these settings. The main contribution of this module is the algorithms "KLUCB-CF" and "Thompson Sampling-CF" which asymptotically attain the best possible performance. We present experimental results to demonstrate the performance of these algorithms. We also show how this problem setting can be used for the practical application of enforcing differential privacy. Bandits Multi-Bras Retour D’information Partielle Dueling Bandits Corrupt Bandits Évaluation du Ranker Vie Privée Différentielle Multi-Armed Bandit Partial Feedback Dueling Bandits Corrupt Bandits Ranker Evaluation Differential Privacy
14	De l'échantillonage optimal en grande et petite dimension / On optimal sampling in high and low dimension Carpentier, Alexandra 05 October 2012 (has links) Pendant ma thèse, j’ai eu la chance d’apprendre et de travailler sous la supervision de mon directeur de thèse Rémi, et ce dans deux domaines qui me sont particulièrement chers. Je veux parler de la Théorie des Bandits et du Compressed Sensing. Je les voie comme intimement liés non par les méthodes mais par leur objectif commun: l’échantillonnage optimal de l’espace. Tous deux sont centrés sur les manières d’échantillonner l’espace efficacement : la Théorie des Bandits en petite dimension et le Compressed Sensing en grande dimension. Dans cette dissertation, je présente la plupart des travaux que mes co-auteurs et moi-même avons écrit durant les trois années qu’a duré ma thèse. / During my PhD, I had the chance to learn and work under the great supervision of my advisor Rémi (Munos) in two fields that are of particular interest to me. These domains are Bandit Theory and Compressed Sensing. While studying these domains I came to the conclusion that they are connected if one looks at them trough the prism of optimal sampling. Both these fields are concerned with strategies on how to sample the space in an efficient way: Bandit Theory in low dimension, and Compressed Sensing in high dimension. In this Dissertation, I present most of the work my co-authors and I produced during the three years that my PhD lasted. Théorie des bandits stochastiques Compressed sensing 006.31
15	El bandolerismo gallego en la primera mitad del siglo XIX / López Morán, Beatriz. January 1995 (has links) Texte remanié de: Tesis doctoral--Facultad de historia--Universidad de Santiago. / Notes bibliogr. Bibliogr. p. 445-454.
16	Algorithmes budgétisés d'itérations sur les politiques obtenues par classification / Budgeted classification-based policy iteration Gabillon, Victor 12 June 2014 (has links) Cette thèse étudie une classe d'algorithmes d'apprentissage par renforcement (RL), appelée « itération sur les politiques obtenues par classification » (CBPI). Contrairement aux méthodes standards de RL, CBPI n'utilise pas de représentation explicite de la fonction valeur. CBPI réalise des déroulés (des trajectoires) et estime la fonction action-valeur de la politique courante pour un nombre limité d'états et d'actions. En utilisant un ensemble d'apprentissage construit à partir de ces estimations, la politique gloutonne est apprise comme le produit d'un classificateur. La politique ainsi produite à chaque itération de l'algorithme, n'est plus définie par une fonction valeur (approximée), mais par un classificateur. Dans cette thèse, nous proposons de nouveaux algorithmes qui améliorent les performances des méthodes CBPI existantes, spécialement lorsque le nombre d’interactions avec l’environnement est limité. Nos améliorations se portent sur les deux limitations de CBPI suivantes : 1) les déroulés utilisés pour estimer les fonctions action-valeur doivent être tronqués et leur nombre est limité, créant un compromis entre le biais et la variance dans ces estimations, et 2) les déroulés sont répartis de manière uniforme entre les états déroulés et les actions disponibles, alors qu'une stratégie plus évoluée pourrait garantir un ensemble d'apprentissage plus précis. Nous proposons des algorithmes CBPI qui répondent à ces limitations, respectivement : 1) en utilisant une approximation de la fonction valeur pour améliorer la précision (en équilibrant biais et variance) des estimations, et 2) en échantillonnant de manière adaptative les déroulés parmi les paires d'état-action. / This dissertation is motivated by the study of a class of reinforcement learning (RL) algorithms, called classification-based policy iteration (CBPI). Contrary to the standard RL methods, CBPI do not use an explicit representation for value function. Instead, they use rollouts and estimate the action-value function of the current policy at a collection of states. Using a training set built from these rollout estimates, the greedy policy is learned as the output of a classifier. Thus, the policy generated at each iteration of the algorithm, is no longer defined by a (approximated) value function, but instead by a classifier. In this thesis, we propose new algorithms that improve the performance of the existing CBPI methods, especially when they have a fixed budget of interaction with the environment. Our improvements are based on the following two shortcomings of the existing CBPI algorithms: 1) The rollouts that are used to estimate the action-value functions should be truncated and their number is limited, and thus, we have to deal with bias-variance tradeoff in estimating the rollouts, and 2) The rollouts are allocated uniformly over the states in the rollout set and the available actions, while a smarter allocation strategy could guarantee a more accurate training set for the classifier. We propose CBPI algorithms that address these issues, respectively, by: 1) the use of a value function approximation to improve the accuracy (balancing the bias and variance) of the rollout estimates, and 2) adaptively sampling the rollouts over the state-action pairs. Jeux de bandits Apprentissage par renforcement 006.31
17	Tutoring Students with Adaptive Strategies Wan, Hao 18 January 2017 (has links) Adaptive learning is a crucial part in intelligent tutoring systems. It provides students with appropriate tutoring interventions, based on studentsâ€™ characteristics, status, and other related features, in order to optimize their learning outcomes. It is required to determine studentsâ€™ knowledge level or learning progress, based on which it then uses proper techniques to choose the optimal interventions. In this dissertation work, I focus on these aspects related to the process in adaptive learning: student modeling, k-armed bandits, and contextual bandits. Student modeling. The main objective of student modeling is to develop cognitive models of students, including modeling content skills and knowledge about learning. In this work, we investigate the effect of prerequisite skill in predicting studentsâ€™ knowledge in post skills, and we make use of the prerequisite performance in different student models. As a result, this makes them superior to traditional models. K-armed bandits. We apply k-armed bandit algorithms to personalize interventions for students, to optimize their learning outcomes. Due to the lack of diverse interventions and small difference of intervention effectiveness in educational experiments, we also propose a simple selection strategy, and compare it with several k-armed bandit algorithms. Contextual bandits. In contextual bandit problem, additional side information, also called context, can be used to determine which action to select. First, we construct a feature evaluation mechanism, which determines which feature to be combined with bandits. Second, we propose a new decision tree algorithm, which is capable of detecting aptitude treatment effect for students. Third, with combined bandits with the decision tree, we apply the contextual bandits to make personalization in two different types of data, simulated data and real experimental data. decision tree prerequisite contextual bandits k-armed bandits student model adaptive learning
18	Exploration-exploitation with Thompson sampling in linear systems / Algorithmes de Thompson sampling pour l’exploration-exploitation dans les systèmes linéaires Abeille, Marc 13 December 2017 (has links) Cette thèse est dédiée à l'étude du Thompson Sampling (TS), une heuristique qui vise à surmonter le dilemme entre exploration et exploitation qui est inhérent à tout processus décisionnel face à l'incertain. Contrairement aux algorithmes issus de l'heuristique optimiste face à l'incertain (OFU), où l'exploration provient du choix du modèle le plus favorable possible au vu de la connaissance accumulée, les algorithmes TS introduisent de l'aléa dans le processus décisionnel en sélectionnant aléatoirement un modèle plausible, ce qui les rend bien moins coûteux numériquement. Cette étude se concentre sur les problèmes paramétriques linéaires, qui autorisent les espaces état-action continus (infinis), en particulier les problèmes de Bandits Linéaires (LB) et les problèmes de contrôle Linéaire et Quadratique (LQ). Nous proposons dans cette thèse de nouvelles analyses du regret des algorithmes TS pour chacun de ces deux problèmes. Bien que notre démonstration pour les LB garantisse une borne supérieure identique aux résultats préexistants, la structure de la preuve offre une nouvelle vision du fonctionnement de l'algorithme TS, et nous permet d'étendre cette analyse aux problèmes LQ. Nous démontrons la première borne supérieure pour le regret de l'algorithme TS dans les problèmes LQ, qui garantie dans le cadre fréquentiste un regret au plus d'ordre O(\sqrt{T}). Enfin, nous proposons une application des méthodes d'exploration-exploitation pour les problèmes d'optimisation de portefeuille, et discutons dans ce cadre le besoin ou non d'explorer activement. / This dissertation is dedicated to the study of the Thompson Sampling (TS) algorithms designed to address the exploration-exploitation dilemma that is inherent in sequential decision-making under uncertainty. As opposed to algorithms derived from the optimism-in-the-face-of-uncertainty (OFU) principle, where the exploration is performed by selecting the most favorable model within the set of plausible one, TS algorithms rely on randomization to enhance the exploration, and thus are much more computationally efficient. We focus on linearly parametrized problems that allow for continuous state-action spaces, namely the Linear Bandit (LB) problems and the Linear Quadratic (LQ) control problems. We derive two novel analyses for the regret of TS algorithms in those settings. While the obtained regret bound for LB is similar to previous results, the proof sheds new light on the functioning of TS, and allows us to extend the analysis to LQ problems. As a result, we prove the first regret bound for TS in LQ, and show that the frequentist regret is of order O(sqrt{T}) which matches the existing guarantee for the regret of OFU algorithms in LQ. Finally, we propose an application of exploration-exploitation techniques to the practical problem of portfolio construction, and discuss the need for active exploration in this setting. Algorithme Thompson sampling Bandits multi-Bras Bandits linéaires 519.62
19	Bandits Manchots sur Flux de Données Non Stationnaires / Multi-armed Bandits on non Stationary Data Streams Allesiardo, Robin 19 October 2016 (has links) Le problème des bandits manchots est un cadre théorique permettant d'étudier le compromis entre exploration et exploitation lorsque l'information observée est partielle. Dans celui-ci, un joueur dispose d'un ensemble de K bras (ou actions), chacun associé à une distribution de récompenses D(µk) de moyenne µk Є [0, 1] et de support [0, 1]. A chaque tour t Є [1, T], il choisit un bras kt et observe la récompense y kt tirée depuis D (µkt). La difficulté du problème vient du fait que le joueur observe uniquement la récompense associée au bras joué; il ne connaît pas celle qui aurait pu être obtenue en jouant un autre bras. À chaque choix, il est ainsi confronté au dilemme entre l'exploration et l'exploitation; explorer lui permet d'affiner sa connaissance des distributions associées aux bras explorés tandis qu'exploiter lui permet d'accumuler davantage de récompenses en jouant le meilleur bras empirique (sous réserve que le meilleur bras empirique soit effectivement le meilleur bras). Dans la première partie de la thèse nous aborderons le problème des bandits manchots lorsque les distributions générant les récompenses sont non-stationnaires. Nous étudierons dans un premier temps le cas où même si les distributions varient au cours du temps, le meilleur bras ne change pas. Nous étudierons ensuite le cas où le meilleur bras peut aussi changer au cours du temps. La seconde partie est consacrée aux algorithmes de bandits contextuels où les récompenses dépendent de l'état de l'environnement. Nous étudierons l'utilisation des réseaux de neurones et des forêts d'arbres dans le cas des bandits contextuels puis les différentes approches à base de méta-bandits permettant de sélectionner en ligne l'expert le plus performant durant son apprentissage. / The multi-armed bandit is a framework allowing the study of the trade-off between exploration and exploitation under partial feedback. At each turn t Є [1,T] of the game, a player has to choose an arm kt in a set of K and receives a reward ykt drawn from a reward distribution D(µkt) of mean µkt and support [0,1]. This is a challeging problem as the player only knows the reward associated with the played arm and does not know what would be the reward if she had played another arm. Before each play, she is confronted to the dilemma between exploration and exploitation; exploring allows to increase the confidence of the reward estimators and exploiting allows to increase the cumulative reward by playing the empirical best arm (under the assumption that the empirical best arm is indeed the actual best arm).In the first part of the thesis, we will tackle the multi-armed bandit problem when reward distributions are non-stationary. Firstly, we will study the case where, even if reward distributions change during the game, the best arm stays the same. Secondly, we will study the case where the best arm changes during the game. The second part of the thesis tacles the contextual bandit problem where means of reward distributions are now dependent of the environment's current state. We will study the use of neural networks and random forests in the case of contextual bandits. We will then propose meta-bandit based approach for selecting online the most performant expert during its learning. Apprentissage en ligne Bandits Manchots Non Stationnarité Apprentissage automatique Online Learning Multi-Armed Bandits Non stationarity Machine learning
20	Online Learning for Linearly Parametrized Control Problems Abbasi-Yadkori, Yasin Unknown Date No description available. Online Learning Confidence Sets Linear Bandits Reinforcement Learning

Search results