61 |
The Value of Everything: Ranking and Association with Encyclopedic Knowledge -- Coursey, Kino High, 12 1900 (has links)
This dissertation describes WikiRank, an unsupervised method of assigning relative values to elements of a broad-coverage encyclopedic information source in order to identify those entries that may be relevant to a given piece of text. The valuation given to an entry is based not on textual similarity but on the links that associate entries, and on an estimate of the expected frequency of visitation that each entry would receive, in context, based on those associations. This estimate of the relative frequency of visitation is embodied in modifications to the random-walk interpretation of the PageRank algorithm. WikiRank is an effective algorithm for supporting natural language processing applications. First, it is shown to exceed the performance of previous machine learning algorithms for the task of automatic topic identification, providing results comparable to those of human annotators. Second, WikiRank is found useful for the task of recognizing text-based paraphrases on a semantic level, by comparing the distributions of attention generated by two pieces of text using the encyclopedic resource as a common reference. Finally, WikiRank is shown to be able to use its base of encyclopedic knowledge to recognize terms from different ontologies as describing the same thing, thus allowing the automatic generation of mapping links between ontologies. The conclusion of this thesis is that the "knowledge access heuristic" is valuable, and that a ranking process based on a large encyclopedic resource can form the basis for an extendable general-purpose mechanism capable of identifying relevant concepts by association, which in turn can be effectively utilized for enumeration and comparison at a semantic level.
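As a hedged sketch of the random-walk interpretation that the abstract refers to, the snippet below computes plain PageRank on a toy link graph. The entry names and links are invented for illustration, and WikiRank's context-dependent modifications to the walk are not shown.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate the stationary visit frequency of a random surfer.

    links: dict mapping each node to the list of nodes it links to.
    Returns a dict of node -> visit probability (sums to 1).
    """
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links[v]
            if not out:                       # dangling node: spread uniformly
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                for u in out:
                    new[u] += damping * rank[v] / len(out)
        rank = new
    return rank

# Toy "encyclopedia": the entry linked to by both others ranks highest.
toy = {"Graph": ["PageRank"],
       "PageRank": ["Graph", "Markov chain"],
       "Markov chain": ["PageRank"]}
scores = pagerank(toy)
```

The visit probabilities are conserved at each iteration, so the scores form a distribution over entries that can be compared across contexts.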
|
62 |
Complex graph algorithms using relational database -- Ahmed, Aly, 24 August 2021 (has links)
Data processing for Big Data plays a vital role for decision-makers in organizations and government, enhances the user experience, and provides quality results in predictive analysis. However, many modern data processing solutions, such as Hadoop and Spark, require significant investment in hardware and maintenance, while the well-established and widely used relational database management systems (RDBMSs) are often neglected.
In this dissertation, we study three fundamental graph problems in RDBMSs. The first problem we tackle is computing shortest paths (SP) from a source to a target in large network graphs. We explore SQL-based solutions and leverage the intelligent scheduling that an RDBMS performs when executing
set-at-a-time expansions of graph vertices, in contrast to the vertex-at-a-time expansions of classical SP algorithms. Our algorithms perform orders of magnitude faster than baselines and outperform counterparts in native graph databases.
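A minimal sketch of the set-at-a-time idea, using SQLite as a stand-in RDBMS: each round expands the entire frontier with one set-oriented INSERT...SELECT, letting the database schedule the joins. The schema and the unweighted BFS-style expansion are illustrative assumptions, not the dissertation's actual implementation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
cur.executemany("INSERT INTO edges VALUES (?, ?)",
                [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)])
cur.execute("CREATE TABLE dist (node INTEGER PRIMARY KEY, d INTEGER)")

source, target = 1, 4
cur.execute("INSERT INTO dist VALUES (?, 0)", (source,))
d = 0
while True:
    # Expand the whole frontier at distance d in one statement
    # (set-at-a-time), instead of popping one vertex at a time.
    cur.execute("""
        INSERT OR IGNORE INTO dist
        SELECT DISTINCT e.dst, ?
        FROM edges e JOIN dist s ON e.src = s.node
        WHERE s.d = ? AND e.dst NOT IN (SELECT node FROM dist)
    """, (d + 1, d))
    if cur.rowcount == 0:       # no new vertices reached: done
        break
    d += 1

shortest = cur.execute("SELECT d FROM dist WHERE node = ?",
                       (target,)).fetchone()[0]
```

The `PRIMARY KEY` on `node` plus `INSERT OR IGNORE` guarantees each vertex keeps the first (shortest) distance at which it was reached.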
Second, we study the PageRank problem, which is vital in Google Search and in social network analysis for sorting search results and identifying important nodes in a graph.
PageRank is an iterative algorithm, which poses challenges when implementing it over large graphs. We study computing PageRank for very large graphs using an RDBMS on a consumer-grade machine and compare the results to a dedicated graph database.
We show that our RDBMS solution is able to process graphs of more than a billion edges in a few minutes, whereas native graph databases fail to handle graphs of much smaller sizes.
Last, we present a carefully engineered RDBMS solution to the problem of triangle enumeration for very large graphs. We show that RDBMSs are suitable tools for enumerating billions of triangles in billion-scale networks on a consumer-grade machine.
We also compare our RDBMS solution's performance to a native graph database and show that our solution outperforms it by orders of magnitude. / Graduate
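The classic relational formulation of triangle enumeration is a three-way self-join over an edge table stored with `a < b`, so each triangle is produced exactly once. A small SQLite sketch with an illustrative schema (not the engineered solution from the dissertation):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Undirected edges stored once, with the smaller endpoint first (a < b).
cur.execute("CREATE TABLE edges (a INTEGER, b INTEGER)")
cur.executemany("INSERT INTO edges VALUES (?, ?)",
                [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])

# Each triangle (x, y, z) with x < y < z matches the join exactly once:
# e1 = (x, y), e2 = (x, z), e3 = (y, z).
triangles = cur.execute("""
    SELECT e1.a, e1.b, e2.b
    FROM edges e1
    JOIN edges e2 ON e1.a = e2.a AND e1.b < e2.b
    JOIN edges e3 ON e3.a = e1.b AND e3.b = e2.b
""").fetchall()
```

The ordering predicates both deduplicate the output and shrink the intermediate join results, which is what makes the query plan tractable at scale.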
|
63 |
The Dao of Wikipedia: Extracting Knowledge from the Structure of Wikilinks -- Consonni, Cristian, 24 October 2019 (has links)
Wikipedia is a multilingual encyclopedia written collaboratively by volunteers online, and it is now the largest, most visited encyclopedia in existence. Wikipedia has arisen through the self-organized collaboration of contributors, and since its launch in January 2001, its potential as a research resource has become apparent to scientists. Its appeal lies in the fact that it strikes a middle ground between accurate, manually created, limited-coverage resources and noisy knowledge mined from the web. For this reason, Wikipedia's content has been exploited for a variety of applications: to build knowledge bases, to study interactions between users on the Internet, and to investigate social and cultural issues such as gender bias in history, or the spreading of information.
Similarly to what happened for the Web at large, a structure has emerged from the collaborative creation of Wikipedia: its articles contain hundreds of millions of links. In Wikipedia parlance, these internal links are called wikilinks. These connections explain the topics being covered in articles and provide a way to navigate between different subjects, contextualizing the information, and making additional information available.
In this thesis, we argue that the information contained in the link structure of Wikipedia can be harnessed to gain useful insights by extracting it with dedicated algorithms. More prosaically, in this thesis, we explore the link structure of Wikipedia with new methods.
In the first part, we discuss in depth the characteristics of Wikipedia, and we describe the process and challenges we have faced to extract the network of links. Since Wikipedia is available in several language editions and its entire edition history is publicly available, we have extracted the wikilink network at various points in time, and we have performed data integration to improve its quality.
In the second part, we show that the wikilink network can be effectively used to find the most relevant pages related to an article provided by the user. We introduce a novel algorithm, called CycleRank, that takes advantage of the link structure of Wikipedia considering cycles of links, thus giving weight to both incoming and outgoing connections, to produce a ranking of articles with respect to an article chosen by the user.
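As a hedged illustration of the cycle-based idea behind CycleRank, the sketch below enumerates simple cycles of bounded length through a chosen reference node and credits every node on each cycle. The exact weighting and pruning used by the actual algorithm may differ; the graph is invented.

```python
from collections import defaultdict

def cycle_scores(links, reference, max_len=4):
    """Score nodes by the bounded-length simple cycles through `reference`.

    links: dict mapping each node to the list of nodes it links to.
    Shorter cycles contribute more (1/length per node), so both incoming
    and outgoing connections of the reference node matter.
    """
    scores = defaultdict(float)

    def dfs(node, path):
        for nxt in links.get(node, ()):
            if nxt == reference and len(path) > 1:
                for v in path:                 # credit every node on the cycle
                    scores[v] += 1.0 / len(path)
            elif nxt not in path and len(path) < max_len:
                dfs(nxt, path + [nxt])

    dfs(reference, [reference])
    return dict(scores)

# Toy wikilink graph: B closes a 2-cycle with the reference A,
# while C and D close a longer 3-cycle.
toy = {"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": ["A"]}
scores = cycle_scores(toy, "A")
```

Because only cycles through the reference article count, a page that merely links to it (or is merely linked from it) receives no score, which is the intuition for weighting incoming and outgoing connections jointly.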
In the last part, we explore applications of CycleRank. First, we describe the Engineroom EU project, in which we faced the challenge of finding the most relevant Wikipedia pages connected to the Wikipedia article about the Internet. Finally, we present another contribution that uses Wikipedia article accesses to estimate how information about diseases propagates.
In conclusion, with this thesis, we wanted to show that browsing Wikipedia's wikilinks is not only fascinating and serendipitous, but it is an effective way to extract useful information that is latent in the user-generated encyclopedia.
|
64 |
Algorithms for large graphs -- Das Sarma, Atish, 01 July 2010 (has links)
No description available.
|
65 |
Optimisation de vecteurs propres de Perron et applications : du référencement de pages web à la chronothérapie / Perron Eigenvector Optimization and Applications: From Web Page Ranking to Chronotherapy -- Fercoq, Olivier, 17 September 2012 (has links) (PDF)
Search engines play an essential role on the Web. They gather information about web pages and, for each user query, return an ordered list of relevant pages. They use various algorithms to rank pages according to their textual content or to the hyperlink structure of the Web. Here, we focus on the algorithms that use this hyperlink structure, such as PageRank, HITS, SALSA and HOTS. The fundamental notion for all these algorithms is the web graph: the directed graph with a node for each web page and an arc between nodes i and j if there is a hyperlink between pages i and j. The original problem considered in this thesis is the optimization of the ranking of the pages of a given website. It consists in finding an optimal linking strategy that maximizes a scalar function of a given ranking subject to design constraints. PageRank, HITS and SALSA rank pages by a Perron vector, that is, the principal eigenvector of a matrix with nonnegative entries. When optimizing the ranking, one therefore optimizes a scalar function of the Perron eigenvector over a set of irreducible nonnegative matrices. The matrix is built from the web graph, so controlling the hyperlinks amounts to controlling the matrix itself. We first study a general PageRank optimization problem with a utility function corresponding to the total revenue of the website and with design constraints. This case is of particular interest because for many websites the PageRank value is correlated with turnover. We reduce the PageRank optimization problem to Markov decision problems whose action sets are defined implicitly as the extreme points of polytopes that admit a polynomial-time separation oracle.
We show that such Markov decision problems are solvable in polynomial time, and we give a scalable algorithm for effectively solving the PageRank optimization problem on large datasets. We then study the general problem of optimizing a scalar function of the Perron eigenvector over a set of irreducible nonnegative matrices. This covers all rankings by Perron vectors, including HITS and SALSA. We show that the matrix of partial derivatives of the objective function has low rank and can be computed by an algorithm with the same convergence properties as the power method used to compute the ranking. We give an optimization algorithm that couples power and gradient iterations, and we prove its convergence to a stationary point of the optimization problem. By viewing HOTS as a nonlinear Perron vector, we show that the HOTS algorithm converges geometrically, and we solve the local optimization of HOTS. Finally, we extend the scope of Perron eigenvector and eigenvalue optimization methods to the optimization of chemotherapy, under the assumption that cells behave differently depending on the time of day. We want to exploit this feature to minimize the growth rate of cancer cells while keeping the growth rate of healthy cells above a given toxicity threshold. The objective and the constraint can be written as Floquet eigenvalues of age-structured PDE models with periodic coefficients, which are approximated by Perron eigenvalues in the discretized problem.
We look for locally optimal drug injection strategies by a method of multipliers, in which the unconstrained minimizations are carried out using the algorithm coupling power and gradient iterations developed in the context of ranking optimization.
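The Perron vector at the heart of these rankings is computed by the power method. A minimal pure-Python sketch on an illustrative nonnegative, irreducible matrix (not one from the thesis):

```python
def perron_vector(A, iterations=200, tol=1e-12):
    """Power iteration for the Perron eigenvalue/eigenvector of A.

    A: square nonnegative irreducible matrix as a list of rows.
    Returns (eigenvalue, eigenvector) with the eigenvector normalized
    to sum to 1, as a ranking vector would be.
    """
    n = len(A)
    x = [1.0 / n] * n
    value = 0.0
    for _ in range(iterations):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        s = sum(y)                 # 1-norm; positive since A and x are >= 0
        y = [v / s for v in y]
        converged = max(abs(y[i] - x[i]) for i in range(n)) < tol
        x, value = y, s            # s tends to the Perron eigenvalue
        if converged:
            break
    return value, x

A = [[0.0, 2.0],
     [1.0, 1.0]]                  # Perron eigenvalue 2, eigenvector (1, 1)
value, x = perron_vector(A)
```

The coupled power-gradient scheme described above interleaves steps of exactly this iteration with gradient steps on the objective, reusing the power method's convergence behavior.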
|
66 |
Méthodes d'apprentissage semi-supervisé basé sur les graphes et détection rapide des nœuds centraux / Graph-Based Semi-Supervised Learning Methods and Quick Detection of Central Nodes -- Sokol, Marina, 29 April 2014 (has links) (PDF)
Semi-supervised learning methods form a category of machine learning methods that combine labeled points and unlabeled data to build the classifier. In the first part of the thesis, we propose a general optimization formalism common to all graph-based semi-supervised learning methods, and in particular to the Standard Laplacian, Normalized Laplacian and PageRank methods. Using random walk theory, we characterize the major differences between these semi-supervised learning methods and define operational criteria to guide the choice of the kernel parameters and of the labeled points. We illustrate the scope of the theoretical results on synthetic and real data, for example content-based and user-based classification in peer-to-peer systems. This application shows convincingly that the proposed family of methods scales very well. The algorithms developed in the second part of the thesis can be applied to the selection of labeled data, but also to other applications in information retrieval. More precisely, we propose randomized algorithms for the quick detection of high-degree nodes and of nodes with large personalized PageRank values. At the end of the thesis, we propose a new centrality measure that generalizes both betweenness centrality and PageRank. This new measure is particularly well suited to detecting network vulnerability.
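As a hedged illustration of PageRank-based semi-supervised learning of the kind compared above, the sketch below propagates one labeled seed per class with a personalized random walk and assigns each unlabeled node to the class with the larger score. The graph and labels are toy assumptions.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iterations=100):
    """Personalized PageRank restarting at the given seed set.

    adj: adjacency lists for nodes 0..n-1 (undirected graph, stored both ways).
    Returns a list of scores, one per node.
    """
    n = len(adj)
    restart = [1.0 / len(seeds) if v in seeds else 0.0 for v in range(n)]
    p = restart[:]
    for _ in range(iterations):
        q = [(1 - alpha) * restart[v] for v in range(n)]
        for v in range(n):
            for u in adj[v]:
                q[u] += alpha * p[v] / len(adj[v])
        p = q
    return p

# Two triangles joined by the bridge 2-3; node 0 is labeled A, node 5 is B.
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
score_a = personalized_pagerank(adj, {0})
score_b = personalized_pagerank(adj, {5})
labels = ["A" if a > b else "B" for a, b in zip(score_a, score_b)]
```

Each class's score concentrates on the seed's side of the bridge, so the induced labeling recovers the two communities.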
|
67 |
SEO optimalizace webových stránek / Search Engine Optimization -- Komárek, Miroslav, January 2009 (has links)
The goal of this thesis is to analyze methods and techniques of Search Engine Optimization (SEO). The thesis combines theoretical knowledge extracted from scientific sources with the author's own experience. Search Engine Optimization methods are divided into two major chapters according to their sphere of activity: On-Page Factors, which occur on the web pages themselves, and Off-Page Factors, which affect web pages but occur outside of them. A case study at the end of the thesis applies the acquired knowledge. The case study is based on a real-world project and includes detailed results and economic indicators.
|
68 |
From Correlation to Causality: Does Network Information improve Cancer Outcome Prediction? -- Roy, Janine, 16 April 2014 (has links)
Motivation:
Disease progression in cancer can vary substantially between patients. Yet, patients often receive the same treatment. Recently, there has been much work on predicting disease progression and patient outcome from gene expression in order to personalize treatment options. A widely used approach relies on high-throughput experiments that aim to identify predictive signature genes for the clinical outcome of a disease. Microarray data analysis helps to reveal the underlying biological mechanisms of tumor progression, metastasis, and drug resistance in cancer studies. Despite the first diagnostic kits on the market, open problems remain, such as the choice of random gene signatures or noisy expression data. Experimental or computational noise in the data and the limited number of tissue samples collected from patients may further reduce the predictive power and biological interpretability of such signature genes. Moreover, signature genes predicted by different studies generally show poor overlap, even for the same type of cancer.
Integration of network information with gene expression data could provide more robust signatures for outcome prediction in cancer studies. One approach to dealing with these problems employs gene-gene relationships and ranks genes using the random surfer model of Google's PageRank algorithm. Unfortunately, the majority of published network-based approaches have tested their methods on only a small number of datasets, calling into question the general applicability of network-based methods for outcome prediction.
Methods:
In this thesis, I provide a comprehensive and systematic evaluation of a network-based outcome prediction approach -- NetRank, a PageRank derivative -- applied to several types of gene expression cancer data and four different types of networks. The algorithm identifies a signature gene set for a specific cancer type by incorporating gene network information with the given expression data. To assess the performance of NetRank, I created a benchmark collection comprising 25 cancer outcome prediction datasets from the literature and one in-house dataset.
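A hedged sketch of a NetRank-style ranking: a PageRank-like iteration whose restart vector is each gene's correlation with the outcome, so that network neighbors share evidence. The exact NetRank formulation may differ, and the gene names, network and scores below are invented for illustration.

```python
def netrank(network, gene_score, d=0.5, iterations=100):
    """Rank genes by mixing per-gene outcome correlation with network flow.

    network: dict gene -> list of network neighbors it passes rank to.
    gene_score: dict gene -> nonnegative outcome-correlation score.
    d: damping factor balancing expression evidence vs. network evidence.
    """
    genes = list(gene_score)
    total = sum(gene_score.values())
    s = {g: gene_score[g] / total for g in genes}   # normalized restart vector
    r = dict(s)
    for _ in range(iterations):
        new = {g: (1 - d) * s[g] for g in genes}
        for g in genes:
            nbrs = network.get(g, ())
            for h in nbrs:
                new[h] += d * r[g] / len(nbrs)
        r = new
    return r

# Toy data: TP53 itself correlates weakly with outcome, but both of its
# neighbors correlate strongly, so the network pulls its rank up; the
# unconnected GAPDH stays low.
expr_corr = {"TP53": 0.1, "MDM2": 0.8, "CDKN1A": 0.7, "GAPDH": 0.1}
net = {"MDM2": ["TP53"], "CDKN1A": ["TP53"],
       "TP53": ["MDM2", "CDKN1A"], "GAPDH": []}
ranks = netrank(net, expr_corr)
```

This is the mechanism by which a network-based method can surface a hub gene whose own expression signal is weak but whose neighborhood is informative.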
Results:
NetRank performs significantly better than classical methods such as fold change or the t-test, improving prediction performance by 7% on average. Moreover, by applying a relatively unbiased but fully automated process for biomarker discovery, we approach the accuracy level of the authors' signatures. Despite an order-of-magnitude difference in network size, a regulatory network, a protein-protein interaction network and two predicted networks perform equally well.
Signatures published by the original authors and signatures generated with classical methods do not overlap -- not even for the same cancer type -- whereas the network-based signatures overlap strongly. I analyze and discuss these overlapping genes in terms of the hallmarks of cancer, and in particular single out six transcription factors and seven proteins, discussing their specific roles in cancer progression. Furthermore, several tests are conducted to identify a Universal Cancer Signature. No Universal Cancer Signature could be identified so far, but a cancer-specific combination of general master regulators with specific cancer genes was discovered that achieves the best results for all cancer types.
As NetRank offers great value for cancer outcome prediction, first steps toward a secure usage of NetRank in a public cloud are described.
Conclusion:
Experimental evaluation of network-based methods on a gene expression benchmark dataset suggests that these methods are especially suited for outcome prediction, as they overcome the problems of random gene signatures and noisy expression data. By combining network information with gene expression data, network-based methods identify highly similar signatures across all cancer types, in contrast to classical methods, which fail to identify common gene sets even for the same cancer type.
In general, integrating additional information into gene expression analysis allows the identification of more reliable, accurate and reproducible biomarkers, and provides a deeper understanding of the processes occurring in cancer development and progression.
1 Definition of Open Problems
2 Introduction
2.1 Problems in cancer outcome prediction
2.2 Network-based cancer outcome prediction
2.3 Universal Cancer Signature
3 Methods
3.1 NetRank algorithm
3.2 Preprocessing and filtering of the microarray data
3.3 Accuracy
3.4 Signature similarity
3.5 Classical approaches
3.6 Random signatures
3.7 Networks
3.8 Direct neighbor method
3.9 Dataset extraction
4 Performance of NetRank
4.1 Benchmark dataset for evaluation
4.2 The influence of NetRank parameters
4.3 Evaluation of NetRank
4.4 General findings
4.5 Computational complexity of NetRank
4.6 Discussion
5 Universal Cancer Signature
5.1 Signature overlap – a sign for Universal Cancer Signature
5.2 NetRank genes are highly connected and confirmed in literature
5.3 Hallmarks of Cancer
5.4 Testing possible Universal Cancer Signatures
5.5 Conclusion
6 Cloud-based Biomarker Discovery
6.1 Introduction to secure Cloud computing
6.2 Cancer outcome prediction
6.3 Security analysis
6.4 Conclusion
7 Contributions and Conclusions
|
69 |
NONLINEAR DIFFUSIONS ON GRAPHS FOR CLUSTERING, SEMI-SUPERVISED LEARNING AND ANALYZING PREDICTIONS -- Meng Liu (14075697), 09 November 2022 (has links)
<p>Graph diffusion is the process of spreading information from one or a few nodes to the rest of the graph through edges. The resulting distribution of the information often reflects latent structure of the graph, where more densely connected nodes receive more signal. This makes graph diffusions a powerful tool for local clustering, the problem of finding a cluster or community of nodes around a given set of seeds. Most existing literature on using graph diffusions for local graph clustering concerns linear diffusions, as their dynamics can be fully interpreted through linear systems. They are also referred to as eigenvector, spectral, or random-walk based methods. While efficient, they often have difficulty capturing the correct boundary of a target label or target cluster. In contrast, maxflow-mincut based methods, which can be thought of as 1-norm nonlinear variants of the linear diffusions, seek to "improve" or "refine" a given cluster and can often capture the boundary correctly. However, owing to the complexity of their formulation, there is little literature adopting them for problems such as community detection, local graph clustering, and semi-supervised learning. We addressed these issues by performing extensive numerical experiments that demonstrate the performance of flow-based methods on graphs from various sources. We also developed an efficient LocalGraphClustering Python package that allows others to easily use these methods in their own problems. While studying these flow-based methods, we found that they cannot grow from a small seed set. Although there are hybrid procedures that incorporate ideas from both linear diffusions and flow-based methods, they have many hard-to-set parameters. To tackle these issues, we propose a simple generalization of the objective function behind linear diffusion and flow-based methods, which we call the generalized local graph min-cut problem.
We further show that by introducing a p-norm into this cut problem, we can develop a nonlinear diffusion procedure that can find local clusters from a small seed set and capture the correct boundary simultaneously. Our method can be thought of as a nonlinear generalization of the Andersen-Chung-Lang push procedure for efficiently approximating a personalized PageRank vector, and it is a strongly local algorithm, one whose runtime depends on the size of the output rather than the size of the graph. We also show that the p-norm cut functions improve on the standard Cheeger inequalities for linear diffusion methods. We further extend our generalized local graph min-cut problem and the corresponding diffusion solver to hypergraph-based machine learning problems. Although many methods for local graph clustering exist, there are relatively few for localized clustering in hypergraphs, and those that do exist often lack the flexibility to model a general class of hypergraph cut functions or cannot scale to large problems. Our new hypergraph diffusion method, on the other hand, can compute with a wide variety of cardinality-based hypergraph cut functions while maintaining the strongly local property. We also show that the clusters found by optimizing the new objective function satisfy a Cheeger-like quality guarantee.</p>
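For reference, a sketch of the linear Andersen-Chung-Lang push procedure that the nonlinear method generalizes: it approximates a personalized PageRank vector while touching only nodes near the seed, which is what makes it strongly local. Parameters and the graph are illustrative assumptions.

```python
from collections import deque

def appr_push(adj, seed, alpha=0.15, eps=1e-4):
    """Approximate personalized PageRank by local pushes (ACL-style).

    adj: adjacency lists for nodes 0..n-1 (undirected, stored both ways).
    Maintains p (approximate PPR) and r (residual mass); a node is pushed
    only while its residual exceeds eps times its degree, so the work is
    bounded by the output size, not the graph size.
    """
    p = {}
    r = {seed: 1.0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg = len(adj[u])
        if r.get(u, 0.0) < eps * deg:      # stale queue entry, skip
            continue
        ru = r[u]
        p[u] = p.get(u, 0.0) + alpha * ru  # settle a fraction at u
        r[u] = (1 - alpha) * ru / 2        # lazy walk: keep half...
        for v in adj[u]:                   # ...push half to the neighbors
            old = r.get(v, 0.0)
            r[v] = old + (1 - alpha) * ru / (2 * deg)
            if old < eps * len(adj[v]) <= r[v]:
                queue.append(v)            # v just crossed the threshold
        if r[u] >= eps * deg:
            queue.append(u)
    return p

# Two triangles joined by a bridge; mass stays concentrated near seed 0.
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
p = appr_push(adj, seed=0)
```

Each push moves residual mass into `p` or toward neighbors without ever scanning the whole graph, the property the abstract calls "strongly local".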
<p>Besides clustering, recent work on graph-based learning often focuses on node embeddings and graph neural networks (GNNs). Although GNN-based methods can beat traditional ones, especially when node attribute data is available, they are challenging to understand because they are highly over-parameterized. To address this issue, we propose a novel framework that combines topological data analysis and diffusion to transform the complex prediction space into human-understandable pictures. The method can also be applied to datasets that are not in graph format, scales to large datasets across different domains, and enables us to find many useful insights about the data and the model.</p>
|
70 |
Delving into gene-set multiplex networks facilitated by a k-nearest neighbor-based measure of similarity / k-最近傍法に基づく類似性尺度による、遺伝子セットの多重ネットワーク解析 -- Zheng, Cheng, 25 March 2024 (has links)
Kyoto University / New-system doctoral course / Doctor of Medical Science / Kou No. 25192 / I-Haku No. 5078 / Shinsei||I||1072 (University Library) / Graduate School of Medicine, Kyoto University, Department of Medicine / (Chief examiner) Prof. Yasuhiro Murakawa; Prof. Mitinori Saitou; Prof. Sungrim Seirin-Lee / Eligible under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Agricultural Science / Kyoto University / DFAM
|