Global ETD Search

1	Exploration-exploitation with Thompson sampling in linear systems / Algorithmes de Thompson sampling pour l’exploration-exploitation dans les systèmes linéaires Abeille, Marc 13 December 2017 (has links) Cette thèse est dédiée à l'étude du Thompson Sampling (TS), une heuristique qui vise à surmonter le dilemme entre exploration et exploitation qui est inhérent à tout processus décisionnel face à l'incertain. Contrairement aux algorithmes issus de l'heuristique optimiste face à l'incertain (OFU), où l'exploration provient du choix du modèle le plus favorable possible au vu de la connaissance accumulée, les algorithmes TS introduisent de l'aléa dans le processus décisionnel en sélectionnant aléatoirement un modèle plausible, ce qui les rend bien moins coûteux numériquement. Cette étude se concentre sur les problèmes paramétriques linéaires, qui autorisent les espaces état-action continus (infinis), en particulier les problèmes de Bandits Linéaires (LB) et les problèmes de contrôle Linéaire et Quadratique (LQ). Nous proposons dans cette thèse de nouvelles analyses du regret des algorithmes TS pour chacun de ces deux problèmes. Bien que notre démonstration pour les LB garantisse une borne supérieure identique aux résultats préexistants, la structure de la preuve offre une nouvelle vision du fonctionnement de l'algorithme TS, et nous permet d'étendre cette analyse aux problèmes LQ. Nous démontrons la première borne supérieure pour le regret de l'algorithme TS dans les problèmes LQ, qui garantie dans le cadre fréquentiste un regret au plus d'ordre O(\sqrt{T}). Enfin, nous proposons une application des méthodes d'exploration-exploitation pour les problèmes d'optimisation de portefeuille, et discutons dans ce cadre le besoin ou non d'explorer activement. / This dissertation is dedicated to the study of the Thompson Sampling (TS) algorithms designed to address the exploration-exploitation dilemma that is inherent in sequential decision-making under uncertainty. As opposed to algorithms derived from the optimism-in-the-face-of-uncertainty (OFU) principle, where the exploration is performed by selecting the most favorable model within the set of plausible one, TS algorithms rely on randomization to enhance the exploration, and thus are much more computationally efficient. We focus on linearly parametrized problems that allow for continuous state-action spaces, namely the Linear Bandit (LB) problems and the Linear Quadratic (LQ) control problems. We derive two novel analyses for the regret of TS algorithms in those settings. While the obtained regret bound for LB is similar to previous results, the proof sheds new light on the functioning of TS, and allows us to extend the analysis to LQ problems. As a result, we prove the first regret bound for TS in LQ, and show that the frequentist regret is of order O(sqrt{T}) which matches the existing guarantee for the regret of OFU algorithms in LQ. Finally, we propose an application of exploration-exploitation techniques to the practical problem of portfolio construction, and discuss the need for active exploration in this setting. Algorithme Thompson sampling Bandits multi-Bras Bandits linéaires 519.62
2	Multi-armed bandits with unconventional feedback / Bandits multi-armés avec rétroaction partielle Gajane, Pratik 14 November 2017 (has links) Dans cette thèse, nous étudions des problèmes de prise de décisions séquentielles dans lesquels, pour chacune de ses décisions, l'apprenant reçoit une information qu'il utilise pour guider ses décisions futures. Pour aller au-delà du retour d’information conventionnel tel qu'il a été bien étudié pour des problèmes de prise de décision séquentielle tels que les bandits multi-bras, nous considérons des formes de retour d’information partielle motivées par des applications pratiques.En premier, nous considérons le problème des bandits duellistes, dans lequel l'apprenant sélectionne deux actions à chaque pas de temps et reçoit en retour une information relative (i.e. de préférence) entre les valeurs instantanées de ces deux actions.En particulier, nous proposons un algorithme optimal qui permet à l'apprenant d'obtenir un regret cumulatif quasi-optimal (le regret est la différence entre la récompense cumulative optimale et la récompense cumulative constatée de l’apprenant). Dans un second temps, nous considérons le problème des bandits corrompus, dans lequel un processus de corruption stochastique perturbe le retour d’information. Pour ce problème aussi, nous concevons des algorithmes pour obtenir un regret cumulatif asymptotiquement optimal. En outre, nous examinons la relation entre ces deux problèmes dans le cadre du monitoring partiel qui est un paradigme générique pour la prise de décision séquentielle avec retour d'information partielle. / The multi-armed bandit (MAB) problem is a mathematical formulation of the exploration-exploitation trade-off inherent to reinforcement learning, in which the learner chooses an action (symbolized by an arm) from a set of available actions in a sequence of trials in order to maximize their reward. In the classical MAB problem, the learner receives absolute bandit feedback i.e. it receives as feedback the reward of the arm it selects. In many practical situations however, different kind of feedback is more readily available. In this thesis, we study two of such kinds of feedbacks, namely, relative feedback and corrupt feedback.The main practical motivation behind relative feedback arises from the task of online ranker evaluation. This task involves choosing the optimal ranker from a finite set of rankers using only pairwise comparisons, while minimizing the comparisons between sub-optimal rankers. This is formalized by the MAB problem with relative feedback, in which the learner selects two arms instead of one and receives the preference feedback. We consider the adversarial formulation of this problem which circumvents the stationarity assumption over the mean rewards for the arms. We provide a lower bound on the performance measure for any algorithm for this problem. We also provide an algorithm called "Relative Exponential-weight algorithm for Exploration and Exploitation" with performance guarantees. We present a thorough empirical study on several information retrieval datasets that confirm the validity of these theoretical results.The motivating theme behind corrupt feedback is that the feedback the learner receives is a corrupted form of the corresponding reward of the selected arm. Practically such a feedback is available in the tasks of online advertising, recommender systems etc. We consider two goals for the MAB problem with corrupt feedback: best arm identification and exploration-exploitation. For both the goals, we provide lower bounds on the performance measures for any algorithm. We also provide various algorithms for these settings. The main contribution of this module is the algorithms "KLUCB-CF" and "Thompson Sampling-CF" which asymptotically attain the best possible performance. We present experimental results to demonstrate the performance of these algorithms. We also show how this problem setting can be used for the practical application of enforcing differential privacy. Bandits Multi-Bras Retour D’information Partielle Dueling Bandits Corrupt Bandits Évaluation du Ranker Vie Privée Différentielle Multi-Armed Bandit Partial Feedback Dueling Bandits Corrupt Bandits Ranker Evaluation Differential Privacy
3	Méthodes adaptatives pour les applications d'accès à l'information centrées sur l'utilisateur / Adaptive Methods for User-Centric Information Access Applications Lagrée, Paul 12 October 2017 (has links) Lorsque les internautes naviguent sur le Web, ils laissent de nombreuses traces que nous nous proposons d’exploiter pour améliorer les applications d'accès à l'information. Nous étudions des techniques centrées sur les utilisateurs qui tirent parti des nombreux types de rétroaction pour perfectionner les services offerts aux utilisateurs. Nous nous concentrons sur des applications telles que la recommandation et le marketing d’influence dans lesquelles les utilisateurs génèrent des signaux (clics, "j'aime", etc.) que nous intégrons dans nos algorithmes afin de fournir des services fortement contextualisés. La première partie de cette thèse est consacrée à une approche interactive de la recherche d'information sur les médias sociaux. Le problème consiste à récupérer un ensemble de k résultats dans un réseau social sous la contrainte que la requête peut être incomplète (par exemple, si le dernier terme est un préfixe). Chaque fois que l'utilisateur met à jour sa requête, le système met à jour l'ensemble des résultats de recherche en conséquence. Nous adoptons une interprétation de la pertinence de l'information qui tient compte du réseau, selon laquelle l'information produite par les utilisateurs proches de l'utilisateur faisant la requête est jugée plus pertinente. Ensuite, nous étudions une version générique de la maximisation de l'influence, dans laquelle nous voulons maximiser l'influence des campagnes d'information ou de marketing en sélectionnant de manière adaptative les utilisateurs initiant la propagation de l'information parmi un petit sous-ensemble de la population. Notre approche ne fait aucune hypothèse sur le modèle de diffusion sous-jacent ni même sur la structure du réseau de diffusion. Notre méthode a d'importantes applications dans le marketing d’influence qui vise à s’appuyer sur les influenceurs de réseaux sociaux pour promouvoir des produits ou des idées. Enfin, nous abordons le problème bien connu du démarrage à froid auquel sont confrontés les systèmes de recommandation par une approche adaptative. Si aucune information n’est disponible concernant l'appréciation d’un article, le système de recommandation doit recueillir des signaux (clics, etc.) afin d'estimer la valeur de l'article. Cependant, afin de minimiser les mauvaises recommandations faites aux utilisateurs, le système ne doit pas recueillir ces signaux de façon négligente. Nous introduisons un algorithme dynamique qui vise à alterner intelligemment les recommandations visant à accumuler de l'information et celles s'appuyant sur les données déjà recueillies. / When users interact on modern Web systems, they let numerous footprints which we propose to exploit in order to develop better applications for information access. We study a family of techniques centered on users, which take advantage of the many types of feedback to adapt and improve services provided to users. We focus on applications like recommendation and influencer marketing in which users generate discrete feedback (e.g. clicks, "likes", reposts, etc.) that we incorporate in our algorithms in order to deliver strongly contextualized services. The first part of this dissertation is dedicated to an approach for as-you-type search on social media. The problem consists in retrieving a set of k search results in a social-aware environment under the constraint that the query may be incomplete (e.g., if the last term is a prefix). Every time the user updates his / her query, the system updates the set of search results accordingly. We adopt a "network-aware" interpretation of information relevance, by which information produced by users who are closer to the user issuing a request is considered more relevant. Then, we study a generic version of influence maximization, in which we want to maximize the influence of marketing or information campaigns by adaptively selecting "spread seeds" from a small subset of the population. Influencer marketing is a straightforward application of this, in which the focus of a campaign is placed on precise key individuals who are typically able to reach millions of consumers. This represents an unprecedented tool for online marketing that we propose to improve using an adaptive approach. Notably, our approach makes no assumptions on the underlying diffusion model and no diffusion network is needed. Finally, we propose to address the well-known cold start problem faced by recommender systems with an adaptive approach. If no information is available regarding the user appreciation of an item, the recommender system needs to gather feedback (e.g., clicks) so as to estimate the value of the item. However, in order to minimize "bad" recommendations, a well-designed system should not collect feedback carelessly. We introduce a dynamic algorithm that aims to intelligently achieve the balance between "bad" and "good" recommendations. Méthodes adaptatives Réseaux sociaux Bandits multi-bras Recommandation Marketing d'influence Adaptive methods Online social networks Multi-armed bandits Recommendation Influencer marketing

1

Page generated in 0.0427 seconds