Global ETD Search

1	Missing Link Discovery In Wikipedia: A Comparative Study Sunercan, Omer 01 February 2010 (has links) (PDF) The fast growing online encyclopedia concept presents original and innovative features by taking advantage of information technologies. The links connecting the articles is one of the most important instances of these features. In this thesis, we present our work on discovering missing links in Wikipedia articles. This task is important for both readers and authors of Wikipedia. Readers will bene&amp / #64257 / t from the increased article quality with better navigation support. On the other hand, the system can be employed to support authors during editing. This study combines the strengths of different approaches previously applied for the task, and proposes its own techniques to reach satisfactory results. Because of the subjectivity in the nature of the task / automatic evaluation is hard to apply. Comparing approaches seems to be the best method to evaluate new techniques, and we offer a semi-automatized method for evaluation of the results. The recall is calculated automatically using existing links in Wikipedia. The precision is calculated according to manual evaluations of human assessors. Comparative results for different techniques are presented, showing the success of our improvements. Our system employs Turkish Wikipedia (Vikipedi) and, according to our knowledge, it is the &amp / #64257 / rst study on it. We aim to exploit the Turkish Wikipedia as a semantic resource to examine whether it is scalable enough for such purposes. T Information Technology 58.5-58.64 Link Analysis, Link Discovery, Wikipedia
2	Automated Discovery of Pedigrees and Their Structures in Collections of STR DNA Specimens Using a Link Discovery Tool Haun, Alex Brian 01 May 2010 (has links) In instances of mass fatality, such as plane crashes, natural disasters, or terrorist attacks, investigators may encounter hundreds or thousands of DNA specimens representing victims. For example, during the January 2010 Haiti earthquake, entire communities were destroyed, resulting in the loss of thousands of lives. With such a large number of victims the discovery of family pedigrees is possible, but often requires the manual application of analytical methods, which are tedious, time-consuming, and expensive. The method presented in this thesis allows for automated pedigree discovery by extending Link Discovery Tool (LDT), a graph visualization tool designed for discovering linkages in large criminal networks. The proposed algorithm takes advantage of spatial clustering of graphs of DNA specimens to discover pedigree structures in large collections of specimens, saving both time and money in the identification process. DNA pedigree graph link discovery specimen likelihood ratio Bioinformatics Other Genetics and Genomics
3	Automated Discovery of Pedigrees and Their Structures in Collections of STR DNA Specimens Using a Link Discovery Tool Haun, Alex Brian 01 May 2010 (has links) In instances of mass fatality, such as plane crashes, natural disasters, or terrorist attacks, investigators may encounter hundreds or thousands of DNA specimens representing victims. For example, during the January 2010 Haiti earthquake, entire communities were destroyed, resulting in the loss of thousands of lives. With such a large number of victims the discovery of family pedigrees is possible, but often requires the manual application of analytical methods, which are tedious, time-consuming, and expensive. The method presented in this thesis allows for automated pedigree discovery by extending Link Discovery Tool (LDT), a graph visualization tool designed for discovering linkages in large criminal networks. The proposed algorithm takes advantage of spatial clustering of graphs of DNA specimens to discover pedigree structures in large collections of specimens, saving both time and money in the identification process. DNA pedigree graph link discovery specimen likelihood ratio Bioinformatics Other Genetics and Genomics
4	Record Linkage for Web Data Hassanzadeh, Oktie 15 August 2013 (has links) Record linkage refers to the task of finding and linking records (in a single database or in a set of data sources) that refer to the same entity. Automating the record linkage process is a challenging problem, and has been the topic of extensive research for many years. However, the changing nature of the linkage process and the growing size of data sources create new challenges for this task. This thesis studies the record linkage problem for Web data sources. Our hypothesis is that a generic and extensible set of linkage algorithms combined within an easy-to-use framework that integrates and allows tailoring and combining of these algorithms can be used to effectively link large collections of Web data from different domains. To this end, we first present a framework for record linkage over relational data, motivated by the fact that many Web data sources are powered by relational database engines. This framework is based on declarative specification of the linkage requirements by the user and allows linking records in many real-world scenarios. We present algorithms for translation of these requirements to queries that can run over a relational data source, potentially using a semantic knowledge base to enhance the accuracy of link discovery. Effective specification of requirements for linking records across multiple data sources requires understanding the schema of each source, identifying attributes that can be used for linkage, and their corresponding attributes in other sources. Schema or attribute matching is often done with the goal of aligning schemas, so attributes are matched if they play semantically related roles in their schemas. In contrast, we seek to find attributes that can be used to link records between data sources, which we refer to as linkage points. In this thesis, we define the notion of linkage points and present the first linkage point discovery algorithms. We then address the novel problem of how to publish Web data in a way that facilitates record linkage. We hypothesize that careful use of existing, curated Web sources (their data and structure) can guide the creation of conceptual models for semi-structured Web data that in turn facilitate record linkage with these curated sources. Our solution is an end-to-end framework for data transformation and publication, which includes novel algorithms for identification of entity types and their relationships out of semi-structured Web data. A highlight of this thesis is showcasing the application of the proposed algorithms and frameworks in real applications and publishing the results as high-quality data sources on the Web. Data Management Databases Record Linkage Entity Resolution Link Discovery Data Integration Data Quality 0984
5	Record Linkage for Web Data Hassanzadeh, Oktie 15 August 2013 (has links) Record linkage refers to the task of finding and linking records (in a single database or in a set of data sources) that refer to the same entity. Automating the record linkage process is a challenging problem, and has been the topic of extensive research for many years. However, the changing nature of the linkage process and the growing size of data sources create new challenges for this task. This thesis studies the record linkage problem for Web data sources. Our hypothesis is that a generic and extensible set of linkage algorithms combined within an easy-to-use framework that integrates and allows tailoring and combining of these algorithms can be used to effectively link large collections of Web data from different domains. To this end, we first present a framework for record linkage over relational data, motivated by the fact that many Web data sources are powered by relational database engines. This framework is based on declarative specification of the linkage requirements by the user and allows linking records in many real-world scenarios. We present algorithms for translation of these requirements to queries that can run over a relational data source, potentially using a semantic knowledge base to enhance the accuracy of link discovery. Effective specification of requirements for linking records across multiple data sources requires understanding the schema of each source, identifying attributes that can be used for linkage, and their corresponding attributes in other sources. Schema or attribute matching is often done with the goal of aligning schemas, so attributes are matched if they play semantically related roles in their schemas. In contrast, we seek to find attributes that can be used to link records between data sources, which we refer to as linkage points. In this thesis, we define the notion of linkage points and present the first linkage point discovery algorithms. We then address the novel problem of how to publish Web data in a way that facilitates record linkage. We hypothesize that careful use of existing, curated Web sources (their data and structure) can guide the creation of conceptual models for semi-structured Web data that in turn facilitate record linkage with these curated sources. Our solution is an end-to-end framework for data transformation and publication, which includes novel algorithms for identification of entity types and their relationships out of semi-structured Web data. A highlight of this thesis is showcasing the application of the proposed algorithms and frameworks in real applications and publishing the results as high-quality data sources on the Web. Data Management Databases Record Linkage Entity Resolution Link Discovery Data Integration Data Quality 0984
6	[en] AN ARCHITECTURE FOR RDF DATA SOURCES RECOMMENDATION / [pt] ARQUITETURA PARA RECOMENDAÇÃO DE FONTES DE DADOS RDF JOSE EDUARDO TALAVERA HERRERA 25 March 2013 (has links) [pt] Dentro do processo de publicação de dados na Web recomenda-se interligar os dados entre diferentes fontes, através de recursos similares que descrevam um domínio em comum. No entanto, com o crescimento do número dos conjuntos de dados publicados na Web de Dados, as tarefas de descoberta e seleção de dados tornam-se cada vez mais complexas. Além disso, a natureza distribuída e interconectada dos dados, fazem com que a sua análise e entendimento sejam muito demorados. Neste sentido, este trabalho visa oferecer uma arquitetura Web para a identificação de fontes de dados em RDF, com o objetivo de prover melhorias nos processos de publicação, interconex ão, e exploração de dados na Linked Open Data. Para tal, nossa abordagem utiliza o modelo de MapReduce sobre o paradigma de computa ção nas nuvens. Assim, podemos efetuar buscas paralelas por palavraschave sobre um índice de dados semânticos existente na Web. Estas buscas permitem identificar fontes candidatas para ligar os dados. Por meio desta abordagem, foi possível integrar diferentes ferramentas da web semântica em um processo de busca para descobrir fontes de dados relevantes, e relacionar tópicos de interesse denidos pelo usuário. Para atingir nosso objetivo foi necessária a indexação e análise de texto para aperfeiçoar a busca de recursos na Linked Open Data. Para mostrar a ecácia de nossa abordagem foi desenvolvido um estudo de caso, utilizando um subconjunto de dados de uma fonte na Linked Open Data, através do seu serviço SPARQL endpoint. Os resultados do nosso trabalho revelam que a geração de estatísticas sobre os dados da fonte é, de fato, um grande diferencial no processo de busca. Estas estatísticas ajudam ao usuário no processo de escolha de indivíduos. Um processo especializado de extração de palavras-chave é aplicado para cada indivíduo com o objetivo de gerar diferentes buscas sobre o índice semântico. Mostramos a escalabilidade de nosso processo de recomendação de fontes RDF através de diferentes amostras de indivíduos. / [en] In the Web publishing process of data it is recommended to link the data from different sources using similar resources that describe a domain in common. However, the growing number of published data sets on the Web have made the data discovery and data selection tasks become increasingly complex. Moreover, the distributed and interconnected nature of the data causes the understanding and analysis to become too prolonged. In this context, this work aims to provide a Web architecture for identifying RDF data sources with the goal of improving the publishing, interconnection, and data exploration processes within the Linked Open Data. Our approach utilizes the MapReduce computing model on top of the cloud computing paradigm. In this manner, we are able to make parallel keyword searches over existing semantic data indexes available on the web. This will allow to identify candidate sources to link the data. Through this approach, it was possible to integrate different semantic web tools and relevant data sources in a search process, and also to relate topics of interest denied by the user. In order to achieve our objectives it was necessary to index and analyze text to improve the search of resources in the Linked Open Data. To show the effectiveness of our approach we developed a case study using a subset of data from a source in the Linked Open Data through its SPARQL endpoint service. The results of our work reveal that the generation and usage of data source s statistics do make a great difference within the search process. These statistics help the user within the choosing individuals process. Furthermore, a specialized keyword extraction process is run for each individual in order to create different search processes using the semantic index. We show the scalability of our RDF recommendation process by sampling several individuals. [pt] RECUPERACAO DE INFORMACAO [en] INFORMATION RETRIEVAL [pt] SIMILARIDADE [en] SIMILARITY [pt] DESCOBERTA DE LINKS [en] LINK DISCOVERY
7	Using Explicit Semantic Analysis to Link in Multi-Lingual Document Collections / Using Explicit Semantic Analysis to Link in Multi-Lingual Document Collections Žilka, Lukáš January 2012 (has links) Udržování prolinkování dokumentů v ryhle rostoucích kolekcích je problematické. To je dále zvětšeno vícejazyčností těchto kolekcí. Navrhujeme použít Explicitní Sémantickou Analýzu k identifikaci relevantních dokumentů a linků napříč jazyky, bez použití strojového překladu. Navrhli jsme a implementovali několik přistupů v prototypu linkovacího systému. Evaluace byla provedena na Čínské, České, Anglické a Španělské Wikipedii. Diskutujeme evaluační metodologii pro linkovací systémy, a hodnotíme souhlasnost mezi odkazy v různých jazykoých verzích Wikipedie. Hodnotíme vlastnosti Explicitní Sémantické Analýzy důležité pro její praktické použití.
8	Active learning of link specifications using decision tree learning Obraczka, Daniel 13 February 2018 (has links) In this work we presented an implementation that uses decision trees to learn highly accurate link specifications. We compared our approach with three state-of-the-art classifiers on nine datasets and showed, that our approach gives comparable results in a reasonable amount of time. It was also shown, that we outperform the state-of-the-art on four datasets by up to 30%, but are still behind slightly on average. The effect of user feedback on the active learning variant was inspected pertaining to the number of iterations needed to deliver good results. It was shown that we can get FScores above 0.8 with most datasets after 14 iterations. info:eu-repo/classification/ddc/000 ddc:000
9	Link Discovery: Algorithms and Applications Ngonga Ngomo, Axel-Cyrille 03 December 2018 (has links) Ziel dieser Arbeit ist die Erarbeitung von effizienten (semi-)automatischen Verfahren zur Verknüpfung von Wissensbasen. Eine Vielzahl von Lösungsklassen können zu diesem Zweck eingesetzt werden. In dieser Arbeit werden ausschließlich deklarative Ansätze erörtert. Deklarative Ansätze gehen davon aus, dass das direkte Errechnen von Mappings zwischen Mengen von Ressourcen in vielen Fällen nur schwer möglich ist oder eines nicht vertretbaren Aufwands bedarf. Diese Ansätze zielen daher darauf ab, eine Ähnlichkeitsfunktion sowie einen Schwellwert zu finden, die zur Approximation eines Mappings genutzt werden können. Zwei Herausforderungen gehen mit dieser Modellierung des Problems einher: (a) Effizienz sowie (b) Genauigkeit und Vollständigkeit. Lösungen zu beiden Herausforderungen sowie auf echten Daten basierende Anwendungen dieser Lösungen werden in der Arbeit vorgestellt. info:eu-repo/classification/ddc/500 ddc:500
10	Dependency discovery for data integration Bauckmann, Jana January 2013 (has links) Data integration aims to combine data of different sources and to provide users with a unified view on these data. This task is as challenging as valuable. In this thesis we propose algorithms for dependency discovery to provide necessary information for data integration. We focus on inclusion dependencies (INDs) in general and a special form named conditional inclusion dependencies (CINDs): (i) INDs enable the discovery of structure in a given schema. (ii) INDs and CINDs support the discovery of cross-references or links between schemas. An IND “A in B” simply states that all values of attribute A are included in the set of values of attribute B. We propose an algorithm that discovers all inclusion dependencies in a relational data source. The challenge of this task is the complexity of testing all attribute pairs and further of comparing all of each attribute pair's values. The complexity of existing approaches depends on the number of attribute pairs, while ours depends only on the number of attributes. Thus, our algorithm enables to profile entirely unknown data sources with large schemas by discovering all INDs. Further, we provide an approach to extract foreign keys from the identified INDs. We extend our IND discovery algorithm to also find three special types of INDs: (i) Composite INDs, such as “AB in CD”, (ii) approximate INDs that allow a certain amount of values of A to be not included in B, and (iii) prefix and suffix INDs that represent special cross-references between schemas. Conditional inclusion dependencies are inclusion dependencies with a limited scope defined by conditions over several attributes. Only the matching part of the instance must adhere the dependency. We generalize the definition of CINDs distinguishing covering and completeness conditions and define quality measures for conditions. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. The challenge for this task is twofold: (i) Which (and how many) attributes should be used for the conditions? (ii) Which attribute values should be chosen for the conditions? Previous approaches rely on pre-selected condition attributes or can only discover conditions applying to quality thresholds of 100%. Our approaches were motivated by two application domains: data integration in the life sciences and link discovery for linked open data. We show the efficiency and the benefits of our approaches for use cases in these domains. / Datenintegration hat das Ziel, Daten aus unterschiedlichen Quellen zu kombinieren und Nutzern eine einheitliche Sicht auf diese Daten zur Verfügung zu stellen. Diese Aufgabe ist gleichermaßen anspruchsvoll wie wertvoll. In dieser Dissertation werden Algorithmen zum Erkennen von Datenabhängigkeiten vorgestellt, die notwendige Informationen zur Datenintegration liefern. Der Schwerpunkt dieser Arbeit liegt auf Inklusionsabhängigkeiten (inclusion dependency, IND) im Allgemeinen und auf der speziellen Form der Bedingten Inklusionsabhängigkeiten (conditional inclusion dependency, CIND): (i) INDs ermöglichen das Finden von Strukturen in einem gegebenen Schema. (ii) INDs und CINDs unterstützen das Finden von Referenzen zwischen Datenquellen. Eine IND „A in B“ besagt, dass alle Werte des Attributs A in der Menge der Werte des Attributs B enthalten sind. Diese Arbeit liefert einen Algorithmus, der alle INDs in einer relationalen Datenquelle erkennt. Die Herausforderung dieser Aufgabe liegt in der Komplexität alle Attributpaare zu testen und dabei alle Werte dieser Attributpaare zu vergleichen. Die Komplexität bestehender Ansätze ist abhängig von der Anzahl der Attributpaare während der hier vorgestellte Ansatz lediglich von der Anzahl der Attribute abhängt. Damit ermöglicht der vorgestellte Algorithmus unbekannte Datenquellen mit großen Schemata zu untersuchen. Darüber hinaus wird der Algorithmus erweitert, um drei spezielle Formen von INDs zu finden, und ein Ansatz vorgestellt, der Fremdschlüssel aus den erkannten INDs filtert. Bedingte Inklusionsabhängigkeiten (CINDs) sind Inklusionsabhängigkeiten deren Geltungsbereich durch Bedingungen über bestimmten Attributen beschränkt ist. Nur der zutreffende Teil der Instanz muss der Inklusionsabhängigkeit genügen. Die Definition für CINDs wird in der vorliegenden Arbeit generalisiert durch die Unterscheidung von überdeckenden und vollständigen Bedingungen. Ferner werden Qualitätsmaße für Bedingungen definiert. Es werden effiziente Algorithmen vorgestellt, die überdeckende und vollständige Bedingungen mit gegebenen Qualitätsmaßen auffinden. Dabei erfolgt die Auswahl der verwendeten Attribute und Attributkombinationen sowie der Attributwerte automatisch. Bestehende Ansätze beruhen auf einer Vorauswahl von Attributen für die Bedingungen oder erkennen nur Bedingungen mit Schwellwerten von 100% für die Qualitätsmaße. Die Ansätze der vorliegenden Arbeit wurden durch zwei Anwendungsbereiche motiviert: Datenintegration in den Life Sciences und das Erkennen von Links in Linked Open Data. Die Effizienz und der Nutzen der vorgestellten Ansätze werden anhand von Anwendungsfällen in diesen Bereichen aufgezeigt. Datenabhängigkeiten-Entdeckung Datenintegration Schema-Entdeckung Link-Entdeckung Inklusionsabhängigkeit dependency discovery data integration schema discovery link discovery inclusion dependency Data processing Computer science

Search results