1

Missing Link Discovery In Wikipedia: A Comparative Study

Sunercan, Omer 01 February 2010 (has links) (PDF)
The fast-growing online encyclopedia concept presents original and innovative features by taking advantage of information technologies. The links connecting articles are one of the most important of these features. In this thesis, we present our work on discovering missing links in Wikipedia articles. This task is important for both readers and authors of Wikipedia: readers benefit from increased article quality and better navigation support, while the system can also support authors during editing. This study combines the strengths of different approaches previously applied to the task and proposes its own techniques to reach satisfactory results. Because of the subjective nature of the task, automatic evaluation is hard to apply. Comparing approaches seems to be the best way to evaluate new techniques, and we offer a semi-automated method for evaluating the results: recall is calculated automatically using existing links in Wikipedia, while precision is calculated from manual evaluations by human assessors. Comparative results for different techniques are presented, showing the success of our improvements. Our system uses the Turkish Wikipedia (Vikipedi) and, to our knowledge, it is the first such study on it. We aim to exploit the Turkish Wikipedia as a semantic resource and to examine whether it is scalable enough for such purposes.
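To make the semi-automated evaluation described above concrete, the following minimal Python sketch computes recall against an article's existing links. It is illustrative only: the function name and toy data are made up, and precision would still come from human assessors.

def link_recall(suggested_links, existing_links):
    """Recall of a link-suggestion run, measured against the links
    that already exist in the Wikipedia article."""
    existing = set(existing_links)
    if not existing:
        return None  # undefined for an article without links
    return len(existing & set(suggested_links)) / len(existing)

# Toy example: the article already links to three other articles;
# the system suggests four candidates, two of which match.
existing = ["Ankara", "Anadolu", "Osmanli Imparatorlugu"]
suggested = ["Ankara", "Istanbul", "Anadolu", "Turkiye"]
print(link_recall(suggested, existing))  # 2/3, roughly 0.67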
2

Automated Discovery of Pedigrees and Their Structures in Collections of STR DNA Specimens Using a Link Discovery Tool

Haun, Alex Brian 01 May 2010 (has links)
In instances of mass fatality, such as plane crashes, natural disasters, or terrorist attacks, investigators may encounter hundreds or thousands of DNA specimens representing victims. For example, during the January 2010 Haiti earthquake, entire communities were destroyed, resulting in the loss of thousands of lives. With such a large number of victims, the discovery of family pedigrees is possible, but it often requires the manual application of analytical methods, which are tedious, time-consuming, and expensive. The method presented in this thesis allows for automated pedigree discovery by extending the Link Discovery Tool (LDT), a graph visualization tool designed for discovering linkages in large criminal networks. The proposed algorithm takes advantage of spatial clustering of graphs of DNA specimens to discover pedigree structures in large collections of specimens, saving both time and money in the identification process.
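As a rough illustration of the clustering idea only (this is not the actual LDT algorithm; the similarity measure, threshold, and toy STR profiles below are invented), specimens can be grouped into candidate pedigrees by connecting pairs whose profiles share alleles at most loci:

from itertools import combinations

def shared_allele_score(p1, p2):
    """Fraction of common loci at which two STR profiles share an allele."""
    loci = set(p1) & set(p2)
    shared = sum(1 for locus in loci if set(p1[locus]) & set(p2[locus]))
    return shared / len(loci) if loci else 0.0

def candidate_pedigrees(profiles, threshold=0.75):
    """Union-find over specimens whose pairwise similarity exceeds the
    threshold; each connected component is a candidate pedigree."""
    parent = {s: s for s in profiles}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in combinations(profiles, 2):
        if shared_allele_score(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for s in profiles:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())

profiles = {
    "V1": {"D8S1179": (13, 14), "D21S11": (29, 30)},
    "V2": {"D8S1179": (13, 15), "D21S11": (30, 31)},
    "V3": {"D8S1179": (10, 11), "D21S11": (27, 28)},
}
print(candidate_pedigrees(profiles))  # e.g. [['V1', 'V2'], ['V3']]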
3

Record Linkage for Web Data

Hassanzadeh, Oktie 15 August 2013 (has links)
Record linkage refers to the task of finding and linking records (in a single database or in a set of data sources) that refer to the same entity. Automating the record linkage process is a challenging problem, and has been the topic of extensive research for many years. However, the changing nature of the linkage process and the growing size of data sources create new challenges for this task. This thesis studies the record linkage problem for Web data sources. Our hypothesis is that a generic and extensible set of linkage algorithms, integrated within an easy-to-use framework that allows them to be tailored and combined, can effectively link large collections of Web data from different domains. To this end, we first present a framework for record linkage over relational data, motivated by the fact that many Web data sources are powered by relational database engines. This framework is based on a declarative specification of the linkage requirements by the user and allows linking records in many real-world scenarios. We present algorithms for translating these requirements into queries that can run over a relational data source, potentially using a semantic knowledge base to enhance the accuracy of link discovery. Effective specification of requirements for linking records across multiple data sources requires understanding the schema of each source, identifying attributes that can be used for linkage, and finding their corresponding attributes in other sources. Schema or attribute matching is often done with the goal of aligning schemas, so attributes are matched if they play semantically related roles in their schemas. In contrast, we seek attributes that can be used to link records between data sources, which we refer to as linkage points. In this thesis, we define the notion of linkage points and present the first linkage point discovery algorithms. We then address the novel problem of how to publish Web data in a way that facilitates record linkage. We hypothesize that careful use of existing, curated Web sources (their data and structure) can guide the creation of conceptual models for semi-structured Web data that in turn facilitate record linkage with these curated sources. Our solution is an end-to-end framework for data transformation and publication, which includes novel algorithms for identifying entity types and their relationships in semi-structured Web data. A highlight of this thesis is the application of the proposed algorithms and frameworks in real applications and the publication of the results as high-quality data sources on the Web.
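The notion of a linkage point can be illustrated with a small, hedged sketch: attribute pairs from two sources whose value sets overlap strongly are candidates for linking their records. This is only a toy approximation of the idea, not the discovery algorithms proposed in the thesis, and the source data below is invented.

def value_overlap(values_a, values_b):
    """Jaccard overlap between the value sets of two attributes."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def linkage_point_candidates(source_a, source_b, min_overlap=0.3):
    """Rank attribute pairs from two sources by value overlap.
    Each source maps attribute names to lists of values."""
    candidates = []
    for attr_a, vals_a in source_a.items():
        for attr_b, vals_b in source_b.items():
            score = value_overlap(vals_a, vals_b)
            if score >= min_overlap:
                candidates.append((round(score, 2), attr_a, attr_b))
    return sorted(candidates, reverse=True)

# Toy example: two Web sources describing films.
films_source_1 = {"title": ["Alien", "Heat", "Up"], "year": [1979, 1995, 2009]}
films_source_2 = {"label": ["Alien", "Up", "Brazil"], "released": [1979, 2009, 1985]}
print(linkage_point_candidates(films_source_1, films_source_2))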
4

[en] AN ARCHITECTURE FOR RDF DATA SOURCES RECOMMENDATION / [pt] ARQUITETURA PARA RECOMENDAÇÃO DE FONTES DE DADOS RDF

JOSE EDUARDO TALAVERA HERRERA 25 March 2013 (has links)
When publishing data on the Web, it is recommended to link the data from different sources through similar resources that describe a common domain. However, the growing number of data sets published on the Web of Data has made the tasks of data discovery and selection increasingly complex. Moreover, the distributed and interconnected nature of the data makes its analysis and understanding very time-consuming. In this context, this work proposes a Web architecture for identifying RDF data sources, with the goal of improving the processes of publishing, interlinking, and exploring data in the Linked Open Data. Our approach uses the MapReduce model on top of the cloud computing paradigm, which allows us to run parallel keyword searches over an existing semantic data index on the Web. These searches identify candidate sources for linking the data. Through this approach, it was possible to integrate different Semantic Web tools into a search process that discovers relevant data sources and relates them to topics of interest defined by the user. To achieve this objective, text indexing and analysis were needed to improve the search for resources in the Linked Open Data. To show the effectiveness of our approach, we developed a case study using a subset of data from a Linked Open Data source accessed through its SPARQL endpoint service. The results of our work reveal that generating and using statistics about a source's data makes a substantial difference in the search process; these statistics help the user choose individuals. A specialized keyword-extraction process is then applied to each individual to generate different searches over the semantic index. We show the scalability of our RDF source recommendation process on samples of several individuals.
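As a small illustration of the per-individual keyword-extraction step described above (a simplified sketch only: the stop-word list, names, and data are invented, and in the thesis the resulting searches run over a semantic index via MapReduce):

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "is", "de", "da", "do"}

def keywords_for_individual(literals, top_k=5):
    """Most frequent non-stop-word terms in the literal values attached to
    one RDF individual; each term becomes a keyword query against the index."""
    terms = []
    for text in literals:
        terms += [t for t in re.findall(r"[a-zA-Z]{3,}", text.lower())
                  if t not in STOPWORDS]
    return [term for term, _ in Counter(terms).most_common(top_k)]

literals = [
    "Rio de Janeiro is a city in Brazil",
    "Rio de Janeiro hosted the 2016 Summer Olympics",
]
print(keywords_for_individual(literals))  # e.g. ['rio', 'janeiro', ...]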
5

Dependency discovery for data integration

Bauckmann, Jana January 2013 (has links)
Data integration aims to combine data from different sources and to provide users with a unified view of these data. This task is as challenging as it is valuable. In this thesis we propose algorithms for dependency discovery that provide the information necessary for data integration. We focus on inclusion dependencies (INDs) in general and on a special form named conditional inclusion dependencies (CINDs): (i) INDs enable the discovery of structure in a given schema; (ii) INDs and CINDs support the discovery of cross-references or links between schemas. An IND “A in B” simply states that all values of attribute A are included in the set of values of attribute B. We propose an algorithm that discovers all inclusion dependencies in a relational data source. The challenge of this task lies in the complexity of testing all attribute pairs and, further, of comparing all of each attribute pair's values. The complexity of existing approaches depends on the number of attribute pairs, while ours depends only on the number of attributes. Thus, our algorithm makes it possible to profile entirely unknown data sources with large schemas by discovering all INDs. Further, we provide an approach to extract foreign keys from the identified INDs. We extend our IND discovery algorithm to also find three special types of INDs: (i) composite INDs, such as “AB in CD”; (ii) approximate INDs that allow a certain number of values of A not to be included in B; and (iii) prefix and suffix INDs that represent special cross-references between schemas. Conditional inclusion dependencies are inclusion dependencies with a limited scope defined by conditions over several attributes; only the matching part of the instance must adhere to the dependency. We generalize the definition of CINDs by distinguishing covering and completeness conditions and define quality measures for conditions. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. The challenge here is twofold: (i) which (and how many) attributes should be used for the conditions, and (ii) which attribute values should be chosen for the conditions? Previous approaches rely on pre-selected condition attributes or can only discover conditions that meet quality thresholds of 100%. Our approaches were motivated by two application domains: data integration in the life sciences and link discovery for Linked Open Data. We show the efficiency and the benefits of our approaches for use cases in these domains.
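The claim that the cost depends on the number of attributes rather than on attribute pairs can be illustrated with a value-centric sketch: index which attributes each value occurs in, then prune IND candidates value by value. This is a simplified illustration of the general idea, not the algorithm from the thesis, and the column data is invented.

from collections import defaultdict

def discover_unary_inds(columns):
    """Find all unary INDs "A in B" over a dictionary mapping attribute
    names to their values. Candidates are pruned per value, so no explicit
    pairwise comparison of attribute value lists is needed."""
    occurs_in = defaultdict(set)              # value -> attributes containing it
    for attr, values in columns.items():
        for value in values:
            occurs_in[value].add(attr)

    candidates = {a: set(columns) - {a} for a in columns}   # A -> possible B's
    for attrs in occurs_in.values():
        for a in attrs:
            candidates[a] &= attrs            # B must contain every value of A
    return [(a, b) for a, bs in candidates.items() for b in bs]

columns = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.zip": [10115, 14482],
}
print(discover_unary_inds(columns))  # [('orders.customer_id', 'customers.id')]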
6

Automating Geospatial RDF Dataset Integration and Enrichment / Automatische geografische RDF Datensatzintegration und Anreicherung

Sherif, Mohamed Ahmed Mohamed 12 December 2016 (has links) (PDF)
Over the last years, the Linked Open Data (LOD) cloud has evolved from a mere 12 to more than 10,000 knowledge bases. These knowledge bases come from diverse domains including (but not limited to) publications, life sciences, social networking, government, media, and linguistics. Moreover, the LOD cloud also contains a large number of cross-domain knowledge bases such as DBpedia and Yago2. These knowledge bases are commonly managed in a decentralized fashion and contain partly overlapping information. This architectural choice has led to knowledge pertaining to the same domain being published by independent entities in the LOD cloud. For example, information on drugs can be found in Diseasome as well as in DBpedia and Drugbank. Furthermore, certain knowledge bases such as DBLP have been published by several bodies, which in turn has led to duplicated content in the LOD. In addition, large amounts of geo-spatial information have been made available with the growth of the heterogeneous Web of Data. The concurrent publication of knowledge bases containing related information promises to become a phenomenon of increasing importance with the growing number of independent data providers. Enabling the joint use of the knowledge bases published by these providers for tasks such as federated queries, cross-ontology question answering and data integration is most commonly tackled by creating links between the resources described within these knowledge bases. Within this thesis, we spur the transition from isolated knowledge bases to enriched Linked Data sets where information can be easily integrated and processed. To achieve this goal, we provide concepts, approaches and use cases that facilitate the integration and enrichment of information with other data types that are already present on the Linked Data Web, with a focus on geo-spatial data. The first challenge that motivates our work is the lack of measures that use geographic data for linking geo-spatial knowledge bases. This is partly due to geo-spatial resources being described by means of vector geometry. In particular, discrepancies in granularity and error measurements across knowledge bases render the selection of appropriate distance measures for geo-spatial resources difficult. We address this challenge by evaluating the existing literature for point-set measures that can be used to measure the similarity of vector geometries. Then, we present and evaluate the ten measures that we derived from the literature on samples of three real knowledge bases. The second challenge we address in this thesis is the lack of automatic Link Discovery (LD) approaches capable of dealing with geo-spatial knowledge bases with missing and erroneous data. To this end, we present Colibri, an unsupervised approach that discovers links between knowledge bases while improving the quality of the instance data in these knowledge bases. A Colibri iteration begins by generating links between knowledge bases. Then, the approach makes use of these links to detect resources with probably erroneous or missing information, which is finally corrected or added. The third challenge we address is the lack of scalable LD approaches for tackling big geo-spatial knowledge bases. Thus, we present Deterministic Particle-Swarm Optimization (DPSO), a novel load-balancing technique for LD on parallel hardware based on particle-swarm optimization. We combine this approach with the Orchid algorithm for geo-spatial linking and evaluate it on real and artificial data sets. The lack of approaches for automatically updating the links of an evolving knowledge base is our fourth challenge. This challenge is addressed in this thesis by the Wombat algorithm. Wombat is a novel approach for the discovery of links between knowledge bases that relies exclusively on positive examples. Wombat is based on generalisation via an upward refinement operator to traverse the space of Link Specifications (LS). We study the theoretical characteristics of Wombat and evaluate it on different benchmark data sets. The last challenge addressed herein is the lack of automatic approaches for geo-spatial knowledge base enrichment. Thus, we propose Deer, a supervised learning approach based on a refinement operator for enriching Resource Description Framework (RDF) data sets. We show how we can use exemplary descriptions of enriched resources to generate accurate enrichment pipelines. We evaluate our approach against manually defined enrichment pipelines and show that our approach can learn accurate pipelines even when provided with a small number of training examples. Each of the proposed approaches is implemented and evaluated against state-of-the-art approaches on real and/or artificial data sets. Moreover, all approaches are peer-reviewed and published in a conference or journal paper. Throughout this thesis, we detail the ideas, implementation and evaluation of each of the approaches, discuss each approach and present lessons learned. Finally, we conclude this thesis by presenting a set of possible future extensions and use cases for each of the proposed approaches.
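One classic example of the point-set measures discussed above is the Hausdorff distance between the vertex sets of two vector geometries; the sketch below is a plain illustration on toy coordinates, not the thesis implementation or its full set of ten measures.

from math import hypot

def directed_hausdorff(points_a, points_b):
    """Largest distance from a point in A to its nearest neighbour in B."""
    return max(min(hypot(ax - bx, ay - by) for bx, by in points_b)
               for ax, ay in points_a)

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets,
    e.g. the vertices of two polygon geometries."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))

# Two slightly shifted outlines of the same area.
geometry_1 = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
geometry_2 = [(0.1, 0.0), (1.1, 0.0), (1.1, 1.0), (0.1, 1.0)]
print(hausdorff(geometry_1, geometry_2))  # ~0.1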
7

Covering or complete? Discovering conditional inclusion dependencies

Bauckmann, Jana, Abedjan, Ziawasch, Leser, Ulf, Müller, Heiko, Naumann, Felix January 2012 (has links)
Data dependencies, or integrity constraints, are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. In recent years, conditional dependencies have been introduced to analyze and improve data quality. In short, a conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes; only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs, showing their value for solving complex data quality tasks. Further, we define quality measures for conditions inspired by precision and recall. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
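The precision/recall-inspired quality measures can be illustrated with a small, hedged sketch (the exact definitions in the paper may differ; the condition and data below are invented): a covering condition should mostly select tuples that satisfy the inclusion, while a completeness condition should select most of the tuples that satisfy it.

def condition_quality(tuples, condition, satisfies_inclusion):
    """Precision/recall-style quality of a condition for a CIND:
    covering ~ precision (selected tuples that satisfy the inclusion),
    completeness ~ recall (satisfying tuples that are selected)."""
    selected = [t for t in tuples if condition(t)]
    included = [t for t in tuples if satisfies_inclusion(t)]
    both = [t for t in selected if satisfies_inclusion(t)]
    covering = len(both) / len(selected) if selected else 0.0
    completeness = len(both) / len(included) if included else 0.0
    return covering, completeness

# Toy table: the inclusion only holds for rows of type 'journal'.
rows = [
    {"type": "journal", "in_reference_table": True},
    {"type": "journal", "in_reference_table": True},
    {"type": "conference", "in_reference_table": False},
    {"type": "journal", "in_reference_table": False},
]
print(condition_quality(rows,
                        lambda r: r["type"] == "journal",
                        lambda r: r["in_reference_table"]))  # ~ (0.67, 1.0)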
