Global ETD Search

31	Cumulative Distribution Networks: Inference, Estimation and Applications of Graphical Models for Cumulative Distribution Functions Huang, Jim C. 01 March 2010 This thesis presents a class of graphical models for directly representing the joint cumulative distribution function (CDF) of many random variables, called cumulative distribution networks (CDNs). Unlike graphical models for probability density and mass functions, in a CDN, the marginal probabilities for any subset of variables are obtained by computing limits of functions in the model. We will show that the conditional independence properties in a CDN are distinct from the conditional independence properties of directed, undirected and factor graph models, but include the conditional independence properties of bidirected graphical models. As a result, CDNs are a parameterization for bidirected models that allows us to represent complex statistical dependence relationships between observable variables. We will provide a method for constructing a factor graph model with additional latent variables for which graph separation of variables in the corresponding CDN implies conditional independence of the separated variables in both the CDN and in the factor graph with the latent variables marginalized out. This will then allow us to construct multivariate extreme value distributions for which both a CDN and a corresponding factor graph representation exist. In order to perform inference in such graphs, we describe the 'derivative-sum-product' (DSP) message-passing algorithm, where messages correspond to derivatives of the joint cumulative distribution function. We will then apply CDNs to the problem of learning to rank, or estimating parametric models for ranking, where CDNs provide a natural means with which to model multivariate probabilities over ordinal variables such as pairwise preferences. 
We will show that many previous probability models for rank data, such as the Bradley-Terry and Plackett-Luce models, can be viewed as particular types of CDN. Applications of CDNs will be described for the problems of ranking players in multiplayer team-based games, document retrieval and discovering regulatory sequences in computational biology using the above methods for inference and estimation of CDNs. Keywords: Graphical models; Cumulative distribution function; Inference; Message-passing; Learning to rank; Information retrieval; Computational biology; Bioinformatics; Genomics; Extreme value distribution; microRNA; Gene regulation; Copula
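Illustrative note on entry 31: the abstract names the Bradley-Terry and Plackett-Luce models as rank-data models that can be viewed as CDNs. The sketch below shows only the textbook forms of those two models, not the CDN parameterization from the thesis. The function names and the use of positive "skill" scores are assumptions made for illustration.

```python
def bradley_terry(skill_i, skill_j):
    """Probability that item i is preferred to item j (skills must be positive)."""
    return skill_i / (skill_i + skill_j)


def plackett_luce(skills_in_rank_order):
    """Probability of observing a full ranking, best item first.

    At each stage the next item is chosen with probability proportional
    to its skill among the items that have not been placed yet.
    """
    probability = 1.0
    remaining = sum(skills_in_rank_order)
    for skill in skills_in_rank_order:
        probability *= skill / remaining
        remaining -= skill
    return probability
```

With two items, `plackett_luce([a, b])` reduces to `bradley_terry(a, b)`, which is the usual sense in which Plackett-Luce generalizes Bradley-Terry to full rankings.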
32	Predicting forest strata from point clouds using geometric deep learning Arvidsson, Simon, Gullstrand, Marcus January 2021 Introduction: Number of strata (NoS) is an informative descriptor of forest structure and is therefore useful in forest management. Collection of NoS as well as other forest properties is performed by fieldworkers and could benefit from automation. Objectives: This study investigates automated prediction of NoS from airborne laser scanned point clouds over Swedish forest plots. Methods: A previously suggested approach of using vertical gap probability is compared through experimentation against the geometric neural network PointNet++ configured for ordinal prediction. For both approaches, the mean accuracy is measured for three datasets: coniferous forest, deciduous forest, and a combination of all forests. Results: PointNet++ displayed a better point performance for two out of three datasets, attaining a top mean accuracy of 46.2%. However, only the coniferous subset displayed a statistically significant superiority for PointNet++. Conclusion: This study demonstrates the potential of geometric neural networks for data mining of forest properties. The results show that impediments in the data may need to be addressed for further improvements. Supervised learning Learning to rank Point-based models Shape inference PointNet layering layers classification ordinal regression forest agency ranking SLU 3D plots Computer Systems Datorsystem
33	Collecte orientée sur le Web pour la recherche d’information spécialisée / Focused document gathering on the Web for domain-specific information retrieval De Groc, Clément 05 June 2013 Les moteurs de recherche verticaux, qui se concentrent sur des segments spécifiques du Web, deviennent aujourd'hui de plus en plus présents dans le paysage d'Internet. Les moteurs de recherche thématiques, notamment, peuvent obtenir de très bonnes performances en limitant le corpus indexé à un thème connu. Les ambiguïtés de la langue sont alors d'autant plus contrôlables que le domaine est bien ciblé. De plus, la connaissance des objets et de leurs propriétés rend possible le développement de techniques d'analyse spécifiques afin d'extraire des informations pertinentes. Dans le cadre de cette thèse, nous nous intéressons plus précisément à la procédure de collecte de documents thématiques à partir du Web pour alimenter un moteur de recherche thématique. La procédure de collecte peut être réalisée en s'appuyant sur un moteur de recherche généraliste existant (recherche orientée) ou en parcourant les hyperliens entre les pages Web (exploration orientée). Nous étudions tout d'abord la recherche orientée. Dans ce contexte, l'approche classique consiste à combiner des mots-clés du domaine d'intérêt, à les soumettre à un moteur de recherche et à télécharger les meilleurs résultats retournés par ce dernier. Après avoir évalué empiriquement cette approche sur 340 thèmes issus de l'OpenDirectory, nous proposons de l'améliorer en deux points. En amont du moteur de recherche, nous proposons de formuler des requêtes thématiques plus pertinentes pour le thème afin d'augmenter la précision de la collecte. Nous définissons une métrique fondée sur un graphe de cooccurrences et un algorithme de marche aléatoire, dans le but de prédire la pertinence d'une requête thématique. 
En aval du moteur de recherche, nous proposons de filtrer les documents téléchargés afin d'améliorer la qualité du corpus produit. Pour ce faire, nous modélisons la procédure de collecte sous la forme d'un graphe triparti et appliquons un algorithme de marche aléatoire biaisé afin d'ordonner par pertinence les documents et termes apparaissant dans ces derniers. Dans la seconde partie de cette thèse, nous nous focalisons sur l'exploration orientée du Web. Au coeur de tout robot d'exploration orientée se trouve une stratégie de crawl qui lui permet de maximiser le rapatriement de pages pertinentes pour un thème, tout en minimisant le nombre de pages visitées qui ne sont pas en rapport avec le thème. En pratique, cette stratégie définit l'ordre de visite des pages. Nous proposons d'apprendre automatiquement une fonction d'ordonnancement indépendante du thème à partir de données existantes annotées automatiquement. / Vertical search engines, which focus on a specific segment of the Web, are becoming more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index to a specific topic. By doing so, language ambiguities are reduced, and both the algorithms and the user interface can take advantage of domain knowledge, such as domain objects or characteristics, to satisfy user information needs. In this thesis, we tackle the first inevitable step of any topical search engine: focused document gathering from the Web. A thorough study of the state of the art leads us to consider two strategies to gather topical documents from the Web: either relying on an existing search engine index (focused search) or directly crawling the Web (focused crawling). The first part of our research has been dedicated to focused search. In this context, a standard approach consists in combining domain-specific terms into queries, submitting those queries to a search engine and downloading top ranked documents. 
After empirically evaluating this approach over 340 topics, we propose to enhance it in two different ways. Upstream of the search engine, we aim at formulating more relevant queries in order to increase the precision of the top retrieved documents. To do so, we define a metric based on a co-occurrence graph and a random walk algorithm, which aims at predicting the topical relevance of a query. Downstream of the search engine, we filter the retrieved documents in order to improve the document collection quality. We do so by modeling our gathering process as a tripartite graph and applying a random walk with restart algorithm so as to simultaneously order by relevance the documents and terms appearing in our corpus. In the second part of this thesis, we turn to focused crawling. We describe our focused crawler implementation that was designed to scale horizontally. Then, we consider the problem of crawl frontier ordering, which is at the very heart of a focused crawler. Such an ordering strategy allows the crawler to prioritize its fetches, maximizing the number of in-domain documents retrieved while minimizing the non-relevant ones. We propose to apply learning to rank algorithms to efficiently order the crawl frontier, and define a method to learn a ranking function from existing crawls. Keywords: Collecte orientée; Recherche d’information Web; Crawling orientée; Recherche orientée; Apprentissage automatique; Recherche de l'information; Focused crawling; Focused search; Domain-specific information retrieval; Web information retrieval; Information Retrieval; Learning to Rank; Machine Learning
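Illustrative note on entry 33: the abstract mentions ordering documents and terms with a random walk with restart over a graph. The sketch below is a generic power-iteration version on a toy graph, not the thesis implementation. The graph, the node names, the restart probability `alpha` and the iteration count are all assumptions.

```python
def random_walk_with_restart(adjacency, restart, alpha=0.15, iterations=100):
    """Approximate the stationary distribution of a random walk with restart.

    adjacency: dict mapping each node to a list of its neighbours.
    restart:   dict mapping seed nodes to probabilities that sum to 1.
    alpha:     probability of jumping back to the restart distribution.
    """
    nodes = list(adjacency)
    scores = {node: restart.get(node, 0.0) for node in nodes}
    for _ in range(iterations):
        # Every step starts with the restart mass.
        updated = {node: alpha * restart.get(node, 0.0) for node in nodes}
        for node in nodes:
            neighbours = adjacency[node]
            walk_mass = (1.0 - alpha) * scores[node]
            if not neighbours:
                # Dead end: send the walking mass back to the seeds.
                for seed, weight in restart.items():
                    updated[seed] += walk_mass * weight
                continue
            share = walk_mass / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        scores = updated
    return scores


# Hypothetical toy graph: a seed term linked to one document,
# which is linked to a second term.
toy_graph = {
    "seed_term": ["doc"],
    "doc": ["seed_term", "other_term"],
    "other_term": ["doc"],
}
toy_scores = random_walk_with_restart(toy_graph, {"seed_term": 1.0})
```

Nodes closer to the seed receive more probability mass, so sorting nodes by score gives a relevance ordering relative to the seed set.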
34	On recommendation systems in a sequential context / Des Systèmes de Recommandation dans un Contexte Séquentiel Guillou, Frédéric 02 December 2016 Cette thèse porte sur l'étude des Systèmes de Recommandation dans un cadre séquentiel, où les retours des utilisateurs sur des articles arrivent dans le système l'un après l'autre. Après chaque retour utilisateur, le système doit le prendre en compte afin d'améliorer les recommandations futures. De nombreuses techniques de recommandation ou méthodologies d'évaluation ont été proposées par le passé pour les problèmes de recommandation. Malgré cela, l'évaluation séquentielle, qui est pourtant plus réaliste et se rapproche davantage du cadre d'évaluation d'un vrai système de recommandation, a été laissée de côté. Le contexte séquentiel nécessite de prendre en considération différents aspects non visibles dans un contexte fixe. Le premier de ces aspects est le dilemme dit d'exploration vs. exploitation: le modèle effectuant les recommandations doit trouver le bon compromis entre recueillir de l'information sur les goûts des utilisateurs à travers des étapes d'exploration, et exploiter la connaissance qu'il a à l'heure actuelle pour maximiser le feedback reçu. L'importance de ce premier point est mise en avant à travers une première évaluation, et nous proposons une approche à la fois simple et efficace, basée sur la Factorisation de Matrice et un algorithme de Bandit Manchot, pour produire des recommandations appropriées. Le second aspect pouvant apparaître dans le cadre séquentiel surgit dans le cas où une liste ordonnée d'articles est recommandée au lieu d'un seul article. Dans cette situation, le feedback donné par l'utilisateur est multiple: la partie explicite concerne la note donnée par l'utilisateur concernant l'article choisi, tandis que la partie implicite concerne les articles cliqués (ou non cliqués) parmi les articles de la liste. 
En intégrant les deux parties du feedback dans un modèle d'apprentissage, nous proposons une approche basée sur la Factorisation de Matrice, qui peut recommander de meilleures listes ordonnées d'articles, et nous évaluons cette approche dans un contexte séquentiel particulier pour montrer son efficacité. / This thesis is dedicated to the study of Recommendation Systems under a sequential setting, where the feedback given by users on items arrives one after another in the system. After each feedback, the system has to integrate it and try to improve future recommendations. Many techniques or evaluation methods have already been proposed to study the recommendation problem. Despite that, such a sequential setting, which is more realistic and represents a closer framework to a real Recommendation System evaluation, has surprisingly been left aside. Under a sequential context, recommendation techniques need to take into consideration several aspects which are not visible in a fixed setting. The first one is the exploration-exploitation dilemma: the model making recommendations needs to find a good balance between gathering information about users' tastes or items through exploratory recommendation steps, and exploiting its current knowledge of the users and items to try to maximize the feedback received. We highlight the importance of this point through the first evaluation study and propose a simple yet efficient approach to make effective recommendations, based on Matrix Factorization and Multi-Armed Bandit algorithms. The second aspect emphasized by the sequential context appears when a list of items is recommended to the user instead of a single item. In such a case, the feedback given by the user includes two parts: the explicit feedback as the rating, but also the implicit feedback given by clicking (or not clicking) on other items of the list. 
By integrating both kinds of feedback into a Matrix Factorization model, we propose an approach which can suggest better ranked lists of items, and we evaluate it in a particular setting. Keywords: Systèmes de Recommandation; Recommandation Séquentielle; Filtrage Collaboratif; Factorisation de Matrice; Bandit Manchot; Feedback Séquentiel; Apprentissage de Classement; Recommendation Systems; Sequential Recommendation; Collaborative Filtering; Matrix Factorization; Multi-Armed Bandits; Sequential Feedback; Learning to Rank
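Illustrative note on entry 34: the abstract describes the exploration-exploitation dilemma and a Multi-Armed Bandit component, but does not say which bandit algorithm is used. The sketch below uses epsilon-greedy purely as a stand-in. The class name, the `epsilon` value and the mapping of arms to items are assumptions, and the Matrix Factorization part of the thesis is not modelled.

```python
import random


class EpsilonGreedy:
    """Minimal epsilon-greedy bandit: each arm stands for one candidate item."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # number of times each arm was played
        self.means = [0.0] * n_arms   # running mean reward of each arm
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best mean.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.means))
        return max(range(len(self.means)), key=self.means.__getitem__)

    def update(self, arm, reward):
        # Incremental mean update after observing the user's feedback.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

In a sequential evaluation loop, `select()` would pick the item to recommend and `update()` would be called once the user's feedback arrives.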
35	Traduction assistée par ordinateur et corpus comparables : contributions à la traduction compositionnelle Delpech, Estelle 02 July 2013 Our work concerns the extraction of bilingual lexicons from comparable corpora, with an application to specialized translation. We first evaluated, from an applied perspective, the classical methods for acquiring lexicons from comparable corpora (based on the distributional hypothesis: the more two terms appear in similar contexts, the more likely they are to be translations of each other). The evaluation showed that translators are uncomfortable with the extracted lexicons: the correct translation is too often buried in a list of candidate translations, and they would prefer to use a smaller but more precise lexicon. Based on this observation, we turned to another approach that has recently proven itself for exploiting comparable corpora and that produces lexicons better suited to translators' needs: compositional translation (the translation of the source term is a function of the translations of its parts). We focused on the translation of single-word units: the source term is split into morphemes, the morphemes are translated and then recombined into a target term. Within this framework, we pursued three lines of research: the generation of fertile translations (cases where the target term contains more lexical words than the source term), independence from morphological structures, and the ranking of candidate translations. Keywords: computer-assisted translation; comparable corpora; compositionality; learning-to-rank; user-centered evaluation; computational morphology
36	Machine Learning and Rank Aggregation Methods for Gene Prioritization from Heterogeneous Data Sources Laha, Anirban January 2013 Gene prioritization involves ranking genes by possible relevance to a disease of interest. This is important in order to narrow down the set of genes to be investigated biologically, and over the years, several computational approaches have been proposed for automatically prioritizing genes using some form of gene-related data, mostly using statistical or machine learning methods. Recently, Agarwal and Sengupta (2009) proposed the use of learning-to-rank methods, which have been used extensively in information retrieval and related fields, to learn a ranking of genes from a given data source, and used this approach to successfully identify novel genes related to leukemia and colon cancer using only gene expression data. In this work, we explore the possibility of combining such learning-to-rank methods with rank aggregation techniques to learn a ranking of genes from multiple heterogeneous data sources, such as gene expression data, gene ontology data, protein-protein interaction data, etc. Rank aggregation methods have their origins in voting theory, and have been used successfully in meta-search applications to aggregate webpage rankings from different search engines. Here we use graph-based learning-to-rank methods to learn a ranking of genes from each individual data source represented as a graph, and then apply rank aggregation methods to aggregate these rankings into a single ranking over the genes. The thesis describes our approach, reports experiments with various data sets, and presents our findings and initial conclusions. Keywords: Gene Prioritization; Gene Ranking; Bipartite Ranking; Learning To Rank; Rank Aggregation Methods; Bipartite Instance Ranking; Rank Aggregation; Ranking of Genes; Gene Data Sources; Genes; Bipartite Graph Ranking; Bioinformatics
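Illustrative note on entry 36: the abstract describes aggregating per-source gene rankings into a single ranking with methods that originate in voting theory. The sketch below uses the Borda count as one classical example of such a method. Whether the thesis uses Borda is not stated, and the gene identifiers and source names are hypothetical.

```python
def borda_aggregate(rankings):
    """Combine several full rankings (best first) with the Borda count.

    Each ranking awards n - 1 points to its top item, n - 2 to the next,
    and so on down to 0. Items are returned by decreasing total score,
    with ties broken alphabetically so the output is deterministic.
    """
    totals = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            totals[item] = totals.get(item, 0) + (n - 1 - position)
    return sorted(totals, key=lambda item: (-totals[item], item))


# Hypothetical rankings from three data sources.
from_expression = ["g1", "g2", "g3"]
from_ontology = ["g2", "g1", "g3"]
from_interactions = ["g1", "g3", "g2"]
consensus = borda_aggregate([from_expression, from_ontology, from_interactions])
```

Here `g1` collects 5 points, `g2` 3 and `g3` 1, so the consensus order is `g1`, `g2`, `g3`.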
37	Optimizing Search Engine Field Weights with Limited Data : Offline exploration of optimal field weight combinations through regression analysis / Optimering av sökmotorers fältvikter med begränsad data : Offline-utforskning av optimala fältviktskombinationer genom regressionsanalys Kader, Zino January 2023 Modern search engines, particularly those utilizing the BM25 ranking algorithm, offer a multitude of tunable parameters designed to refine search results. Among these parameters, the weight of each searchable field plays a crucial role in enhancing search outcomes. Traditional methods of discovering optimal weight combinations, however, are often exploratory, demanding substantial time and risking the delivery of substandard results during testing. This thesis proposes a streamlined solution: an ordinal-regression-based model specifically engineered to identify optimal weight combinations with minimal data input, within an offline testing environment. The evaluation corpus comprises a comprehensive snapshot of a product search database from Tradera. The top 100 search queries and corresponding search results pages on the Tradera platform were divided into a training set and an evaluation set. The model underwent iterative training on the training set, and subsequent testing on the evaluation set, with progressively increasing amounts of labeled data. This methodological approach allowed examining the model's proficiency in deriving high-performance weight combinations from limited data. The empirical experiments conducted confirmed that the proposed model successfully generated promising weight combinations, even with restricted data, and exhibited robust generalization to the evaluation dataset. In conclusion, this research substantiates the significant potential for enhancing search results by tuning searchable field weights using a regression-based model, even in data-scarce scenarios. 
/ Moderna sökmotorer, i synnerhet sådana som använder rankningsalgoritmen BM25, erbjuder en mängd justerbara parametrar utformade för att förbättra sökresultat. Bland dessa parametrar spelar vikten av varje sökbart fält en avgörande roll för att förbättra sökresultaten. Traditionella metoder för att hitta optimala viktkombinationer är dock ofta utforskande, kräver mycket tid och riskerar att ge undermåliga sökresultat under testningsperioden. Denna avhandling föreslår en strömlinjeformad lösning: en ordinal-regressionsbaserad modell specifikt utvecklad för att identifiera optimala viktkombinationer med minimal träningsdata, inom en offline testmiljö. Utvärderingskorpus består av en omfattande ögonblicksbild av en produktsökdatabas från Tradera. De 100 vanligaste sökfrågorna och motsvarande sökresultatssidor på Traderas plattform delades in i en träningsuppsättning och en utvärderingsuppsättning. Modellen genomgick iterativ träning på träningsuppsättningen, och därefter testning på utvärderingsuppsättningen, med successivt ökande mängder av kategoriserad data. Denna metodologiska strategi möjliggjorde undersökning av modellens förmåga att härleda högpresterande viktkombinationer från begränsad data. De empiriska experimenten som genomfördes bekräftade att den föreslagna modellen framgångsrikt genererade lovande viktkombinationer, även med begränsad data, och uppvisade robust generalisering till utvärderingsdatamängden. Sammanfattningsvis bekräftar denna forskning den betydande potentialen för förbättring av sökresultat genom att justera sökbara fältvikter med hjälp av en regressionsbaserad modell, även i datasnåla scenarion. 
Keywords: Information retrieval; Search engines; BM25 (Best Match 25); Regression analysis; Parameter estimation; Learning to rank; Informationsinhämtning; Sökmotorer; BM25 (Best Match 25); Regressionsanalys; Parameterskattning; Maskininlärning för rangordning; Computer Sciences; Datavetenskap (datalogi); Software Engineering; Programvaruteknik; Computer Engineering; Datorteknik
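Illustrative note on entry 37: the abstract concerns tuning the weight of each searchable field. The sketch below assumes that weights enter as a linear combination of per-field relevance scores, which is only one possible scheme and is not confirmed by the abstract. The field names, scores and weights are hypothetical, and the ordinal-regression model of the thesis is not shown.

```python
def weighted_score(field_scores, weights):
    """Combine per-field relevance scores with one weight per field."""
    return sum(weights.get(field, 0.0) * score
               for field, score in field_scores.items())


def rank_documents(documents, weights):
    """Return document ids sorted by decreasing weighted score."""
    return sorted(documents,
                  key=lambda doc_id: weighted_score(documents[doc_id], weights),
                  reverse=True)


# Hypothetical per-field scores for two listings.
listings = {
    "a": {"title": 2.0, "description": 0.5},
    "b": {"title": 0.5, "description": 3.0},
}
title_heavy = rank_documents(listings, {"title": 2.0, "description": 1.0})
description_heavy = rank_documents(listings, {"title": 1.0, "description": 2.0})
```

Changing the weights flips the order of the two listings, which is the effect an offline weight search would be evaluating against labeled results.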
