Global ETD Search

11	Extraction and integration of Web query interfaces Kabisch, Thomas 20 October 2011 This thesis focuses on the integration of Web query interfaces (Web forms). We model the integration process in several steps. First, unknown interfaces have to be classified with respect to their application domain (classification); only then is a domain-wise treatment possible. Second, interfaces must be transformed into a machine-readable format (extraction) to allow their automated analysis. Third, as a prerequisite to integration across databases, pairs of semantically similar elements among multiple interfaces need to be identified (matching). Only when all these tasks have been solved can systems that provide an integrated view of several data sources be set up. This thesis presents new algorithms for each of these steps. We developed a novel extraction algorithm that exploits a small set of commonsense design rules to derive a hierarchical schema (schema tree) for query interfaces. In contrast to prior solutions that mainly use flat schema representations, the hierarchical schema better represents the structure of the interfaces, leading to better accuracy of the integration step; the extraction algorithm also achieves better element-extraction quality than predecessor methods. Next, we describe a multi-step matching method for query interfaces which builds on the hierarchical schema representation. It uses methods from the theory of bipartite graphs to globally optimize the matching result. As a third contribution, we present a new method for the domain classification problem of unknown interfaces that, for the first time, combines lexical and structural properties of schemas. All our new methods have been evaluated on real-life datasets and outperform previous work in their respective fields. Additionally, we present the system VisQI, which implements all introduced algorithmic steps and provides a comfortable graphical user interface to support the integration process.
Informationsextraktion Informationsintegration Schema Matching Web Formulare Information Extraction Information Integration Schema Matching Web Query Interfaces 004 Informatik 28 Informatik, Datenverarbeitung ST 252 ST 515 ddc:004
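The abstract of record 11 mentions a matching result that is optimized globally using bipartite graph theory. As a rough illustration only, the following sketch pairs the element labels of two query interfaces so that the total similarity over all pairs is maximal. The string similarity measure (`difflib.SequenceMatcher`), the brute-force search, and the names `similarity` and `best_matching` are assumptions made for this example; the abstract does not state which measure or algorithm the thesis uses.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an assumed stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matching(left: list[str], right: list[str]) -> list[tuple[str, str]]:
    """Return the one-to-one pairing of left labels with right labels that
    maximizes the summed similarity (maximum-weight bipartite assignment).

    Brute force over all assignments, so only suitable for small interfaces.
    Requires len(left) <= len(right).
    """
    if len(left) > len(right):
        raise ValueError("left must not have more elements than right")
    best_score, best_pairs = -1.0, []
    for perm in permutations(right, len(left)):
        score = sum(similarity(l, r) for l, r in zip(left, perm))
        if score > best_score:
            best_score, best_pairs = score, list(zip(left, perm))
    return best_pairs


if __name__ == "__main__":
    form_a = ["Author", "Title"]
    form_b = ["Book title", "Author name"]
    for pair in best_matching(form_a, form_b):
        print(pair)
```

Choosing the assignment by its total score, rather than greedily taking the best partner for each element in turn, is what makes the optimization global. For larger interfaces a polynomial-time assignment algorithm would replace the brute-force loop.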
12	Scalable and Declarative Information Extraction in a Parallel Data Analytics System Rheinländer, Astrid 06 July 2017 Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system is still a major challenge. This work presents a query-based IE system for a parallel data analysis platform, which is configurable for specific application domains and scales to terabyte-sized text collections. First, configurable operators are defined for basic IE and Web analytics tasks, which can be used to express complex IE tasks in the form of declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) for data flows. Subsequently, we survey the state of the art in optimizing non-relational data flows and show that a comprehensive optimization of UDFs is still a challenge. Based on this observation, an extensible, logical optimizer (SOFA) is introduced, which incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable a comprehensive optimization of data flows. SOFA is able to logically optimize arbitrary data flows from different application areas, resulting in significant runtime improvements compared to other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated: we systematically evaluate the scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets.
Informationsextraktion Optimierung Map/Reduce Data Flow Stratosphere Operatorsemantik Information Extraction Optimization Map/Reduce Data Flow Stratosphere Operator Semantics 004 Datenverarbeitung; Informatik ST 530 ddc:004
13	Extraktion und Identifikation von Entitäten in Textdaten im Umfeld der Enterprise Search / Extraction and identification of entities in text data in the field of enterprise search Brauer, Falk January 2010 Automatic information extraction (IE) from unstructured texts enables new ways to access relevant information and analyze text contents, going beyond existing technologies for keyword-based search in document collections. However, the development of systems for extracting machine-readable data from text still requires the implementation of domain-specific extraction programs. In particular in the field of enterprise search (the retrieval of information in enterprise settings), in which a large number of heterogeneous document types exist, it is often necessary to develop ad-hoc program modules and to combine them with generic program components in monolithic IE systems to extract business-relevant entities. This is particularly critical, as potentially a new IE system must be developed from scratch for each individual application. In this work we examine efficient methods to develop and execute IE systems in the context of enterprise search and effective algorithms to exploit pre-existing structured data in the business context for the extraction and identification of business entities in documents. The basis of this work is a novel platform for the composition of IE systems through the description of the data flow between generic and application-specific IE modules. The platform supports in particular the development and reuse of generic IE modules and is characterized by higher flexibility and expressiveness compared to previous methods. A technique developed in this work interprets the data exchange between IE modules during document processing as data streams and thus enables an extensive parallelization of individual modules. The autonomous execution of each module allows for a significant runtime improvement for individual documents and thus improves response times, e.g. for extraction services. Previous parallelization approaches focused only on improved throughput for large document collections, e.g. by leveraging distributed instances of an IE system. Information extraction in the context of enterprise search differs, for instance, from extraction from the World Wide Web in that a variety of structured reference data (corporate databases or terminologies) is usually available, which often describes the relationships among entities. Furthermore, entity names in a business environment usually have special characteristics. On the one hand, relevant entities such as product identifiers follow certain patterns that are not always known beforehand but can be inferred from known sample entities, so that unknown entities can be extracted. On the other hand, many designators have a more descriptive character (concatenation of descriptive words). The respective references in texts might differ due to the diversity of potential descriptions, often making the identification of such entities difficult. To address IE applications in the presence of available structured data, we study in this work the inference of effective regular expressions from given sample entities. Various generalization and specialization heuristics are used to identify patterns at different syntactic abstraction levels and thus generate regular expressions which promise both high recall and precision. Compared to previous rule learning techniques in the field of information extraction, our technique does not require any annotated document corpus. A method for the identification of entities that are predefined by graph-structured reference data is examined as a third contribution. An algorithm is presented which goes beyond an exact string comparison between text and reference data set. It allows for an effective identification and disambiguation of potentially discovered entities by exploiting approximate matching strategies. The method further leverages relationships among entities for identification and disambiguation. The method presented in this work is superior to previous approaches with regard to precision and recall.
Informationsextraktion Enterprise Search Parallele Datenverarbeitung Grammatikalische Inferenz Graph-basiertes Ranking information extraction enterprise search multi core data processing grammar inference graph-based ranking Data processing Computer science
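The abstract of record 13 describes inferring regular expressions from sample entities alone, without an annotated corpus. The sketch below shows one very simple form of such generalization: each sample is abstracted to runs of character classes, and the resulting patterns are combined into one expression. It uses a single abstraction level and no specialization step, so it is not the set of heuristics developed in the dissertation; the function names `char_class`, `generalize`, and `infer_regex` and the sample identifiers are hypothetical.

```python
import re

# Character classes used for generalization (an assumed, deliberately small set).
CLASSES = ("[0-9]", "[A-Z]", "[a-z]")


def char_class(ch: str) -> str:
    """Map an ASCII digit or letter to its class; keep anything else as an escaped literal."""
    if "0" <= ch <= "9":
        return "[0-9]"
    if "A" <= ch <= "Z":
        return "[A-Z]"
    if "a" <= ch <= "z":
        return "[a-z]"
    return re.escape(ch)


def generalize(sample: str) -> str:
    """Abstract one sample entity into a pattern of character-class runs."""
    tokens: list[str] = []
    for ch in sample:
        tok = char_class(ch)
        if tok in CLASSES:
            if tokens and tokens[-1] == tok + "+":
                continue  # extend the current run of the same class
            tokens.append(tok + "+")
        else:
            tokens.append(tok)  # literals (e.g. separators) are kept exactly
    return "".join(tokens)


def infer_regex(samples: list[str]) -> re.Pattern:
    """Combine the generalized patterns of all samples into one alternation."""
    if not samples:
        raise ValueError("at least one sample entity is required")
    patterns = sorted({generalize(s) for s in samples})
    return re.compile("|".join(f"(?:{p})" for p in patterns))


if __name__ == "__main__":
    rx = infer_regex(["AB-1234", "XY-77"])  # hypothetical product identifiers
    print(rx.pattern)
    print(bool(rx.fullmatch("QQ-5")), bool(rx.fullmatch("qq-5")))
```

Generalizing to class runs favors recall, because unseen identifiers of the same shape are accepted, while keeping separators as exact literals retains some precision. A fuller approach would weigh several abstraction levels against each other, as the abstract indicates.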
14	Serviceorientiertes Text Mining am Beispiel von Entitätsextrahierenden Diensten Pfeifer, Katja 08 September 2014 Most business-relevant knowledge today exists as unstructured information in the form of text data on web pages, in office documents, or in forum posts. A large number of text mining solutions have been developed to extract and exploit this unstructured information. Many of these systems have recently been made available as web services to simplify their use and integration. Combining several such text mining services to solve concrete extraction tasks appears promising, because existing strengths can be exploited, weaknesses of the systems can be minimized, and the use of text mining solutions can be simplified. This thesis addresses the flexible combination of text mining services in a service-oriented system and extends the state of the art with targeted methods for selecting text mining services, aggregating their results, and mapping the classification schemes they employ. First, the currently existing service landscape is analyzed and, building on this, an ontology for the functional description of the services is provided, enabling function-driven selection and combination of text mining services. Furthermore, using entity extraction services as an example, algorithms for the quality-improving combination of extraction results are developed and extensively evaluated. The work is complemented by additional mapping and integration processes that ensure applicability in heterogeneous service landscapes in which different classification schemes are used. In addition, possibilities for transfer to other text mining methods are discussed.
Textmining Informationsextraktion Dienste Entitätsextraktion Entitätserkennung Schemaabbildung Dienstkombination text mining information extraction services NER named entity extraction schema matching ddc:004 rvk:ST 302 rvk:ST 515
15	Federated Product Information Search and Semantic Product Comparisons on the Web / Föderierte Produktinformationssuche und semantischer Produktvergleich im Web Walther, Maximilian Thilo 20 September 2011 Product information search has become one of the most important application areas of the Web. Especially for expensive technical products, consumers tend to carry out intensive research prior to the actual purchase in order to gain a comprehensive view of the product of interest. Federated search backed by ontology-based product information representation shows great promise for easing this research process. This thesis develops a comprehensive technique for locating, extracting, and integrating information on arbitrary technical products in a largely unsupervised manner. The resulting homogeneous information sets allow a potential consumer to effectively compare technical products using an appropriate federated product information system.
Föderierte Suche Informationsextraktion Ontology Matching Facettierte Suche Produktinformationen Federated Search Information Extraction Ontology Matching Facetted Search Product Information ddc:004 rvk:QP 624 rvk:ST 610 Ontologie
16	Serviceorientiertes Text Mining am Beispiel von Entitätsextrahierenden Diensten Pfeifer, Katja 16 June 2014 Most business-relevant knowledge today exists as unstructured information in the form of text data on web pages, in office documents, or in forum posts. A large number of text mining solutions have been developed to extract and exploit this unstructured information. Many of these systems have recently been made available as web services to simplify their use and integration. Combining several such text mining services to solve concrete extraction tasks appears promising, because existing strengths can be exploited, weaknesses of the systems can be minimized, and the use of text mining solutions can be simplified. This thesis addresses the flexible combination of text mining services in a service-oriented system and extends the state of the art with targeted methods for selecting text mining services, aggregating their results, and mapping the classification schemes they employ. First, the currently existing service landscape is analyzed and, building on this, an ontology for the functional description of the services is provided, enabling function-driven selection and combination of text mining services. Furthermore, using entity extraction services as an example, algorithms for the quality-improving combination of extraction results are developed and extensively evaluated. The work is complemented by additional mapping and integration processes that ensure applicability in heterogeneous service landscapes in which different classification schemes are used. In addition, possibilities for transfer to other text mining methods are discussed. info:eu-repo/classification/ddc/004 ddc:004
17	Federated Product Information Search and Semantic Product Comparisons on the Web Walther, Maximilian Thilo 09 September 2011 Product information search has become one of the most important application areas of the Web. Especially for expensive technical products, consumers tend to carry out intensive research prior to the actual purchase in order to gain a comprehensive view of the product of interest. Federated search backed by ontology-based product information representation shows great promise for easing this research process. This thesis develops a comprehensive technique for locating, extracting, and integrating information on arbitrary technical products in a largely unsupervised manner. The resulting homogeneous information sets allow a potential consumer to effectively compare technical products using an appropriate federated product information system.
Contents: 1. Introduction 1.1. Online Product Information Research 1.1.1. Current Online Product Information Research 1.1.2. Aspired Online Product Information Research 1.2. Federated Shopping Portals 1.3. Research Questions 1.4. Approach and Theses 1.4.1. Approach 1.4.2. Theses 1.4.3. Requirements 1.5. Goals and Non-Goals 1.5.1. Goals 1.5.2. Non-Goals 1.6. Contributions 1.7. Structure 2. Federated Information Systems 2.1. Information Access 2.1.1. Document Retrieval 2.1.2. Federated Search 2.1.3. Federated Ranking 2.2. Information Extraction 2.2.1. Information Extraction from Structured Sources 2.2.2. Information Extraction from Unstructured Sources 2.2.3. Information Extraction from Semi-structured Sources 2.3. Information Integration 2.3.1. Ontologies 2.3.2. Ontology Matching 2.4. Information Presentation 2.5. Product Information 2.5.1. Product Information Source Characteristics 2.5.2. Product Information Source Types 2.5.3. Product Information Integration Types 2.5.4. Product Information Types 2.6. Conclusions 3. A Federated Product Information System 3.1. Finding Basic Product Information 3.2. Enriching Product Information 3.3. Administrating Product Information 3.4. Displaying Product Information 3.5. Conclusions 4. Product Information Extraction from the Web 4.1. Vendor Product Information Search 4.1.1. Vendor Product Information Ranking 4.1.2. Vendor Product Information Extraction 4.2. Producer Product Information Search 4.2.1. Producer Product Document Retrieval 4.2.2. Producer Product Information Extraction 4.3. Third-Party Product Information Search 4.4. Conclusions 5. Product Information Integration for the Web 5.1. Product Representation 5.1.1. Domain Product Ontology 5.1.2. Application Product Ontology 5.1.3. Product Ontology Management 5.2. Product Categorization 5.3. Product Specifications Matching 5.3.1. General Procedure 5.3.2. Elementary Matchers 5.3.3. Evolutionary Matcher 5.3.4. Naïve Bayes Matcher 5.3.5. Result Selection 5.4. Product Specifications Normalization 5.4.1. Product Specifications Atomization 5.4.2. Product Specifications Value Normalization 5.5. Product Comparison 5.6. Conclusions 6. Evaluation 6.1. Implementation 6.1.1. Offers Service 6.1.2. Products Service 6.1.3. Snippets Service 6.1.4. Fedseeko 6.1.5. Fedseeko Browser Plugin 6.1.6. Fedseeko Mobile 6.1.7. Lessons Learned 6.2. Evaluation 6.2.1. Evaluation Measures 6.2.2. Gold Standard 6.2.3. Product Document Retrieval 6.2.4. Product Specifications Extraction 6.2.5. Product Specifications Matching 6.2.6. Comparison with Competitors 6.3. Conclusions 7. Conclusions and Future Work 7.1. Summary 7.2. Conclusions 7.3. Future Work A. Pseudo Code and Extraction Properties A.1. Pseudo Code A.2. Extraction Algorithm Properties A.2.1. Clustering Properties A.2.2. Purging Properties A.2.3. Dropping Properties B. Fedseeko Screenshots B.1. Offer Search B.2. Product Comparison
info:eu-repo/classification/ddc/004 ddc:004 Ontologie
18	Column-specific Context Extraction for Web Tables Braunschweig, Katrin, Thiele, Maik, Eberius, Julian, Lehner, Wolfgang 14 June 2022 Relational Web tables have become an important resource for applications such as factual search and entity augmentation. A major challenge for the automatic identification of relevant tables on the Web is that many of these tables have missing or non-informative column labels. Research has focused largely on recovering the meaning of columns by inferring class labels from the instances using external knowledge bases. The table context, which often contains additional information on the table's content, is frequently considered as an indicator of the general content of a table, but not as a source of column-specific details. In this paper, we propose a novel approach to identify and extract column-specific information from the context of Web tables. In our extraction framework, we consider different techniques to extract directly as well as indirectly related phrases. We perform a number of experiments on Web tables extracted from Wikipedia. The results show that column-specific information extracted using our simple heuristic significantly boosts precision and recall for table and column search. info:eu-repo/classification/ddc/004 ddc:004
19	Unsupervised Natural Language Processing for Knowledge Extraction from Domain-specific Textual Resources Hänig, Christian 17 April 2013 This thesis aims to develop a Relation Extraction algorithm to extract knowledge from automotive data. While most approaches to Relation Extraction are only evaluated on newspaper data dealing with general relations from the business world, their applicability to other data sets is not well studied. Part I of this thesis deals with the theoretical foundations of Information Extraction algorithms. Text mining cannot be seen as the simple application of data mining methods to textual data. Instead, sophisticated methods have to be employed to accurately extract knowledge from text, which can then be mined using statistical methods from the field of data mining. Information Extraction itself can be divided into two subtasks: Entity Detection and Relation Extraction. The detection of entities is very domain-dependent due to terminology, abbreviations and general language use within the given domain. Thus, this task has to be solved for each domain by employing thesauri or another type of lexicon. Supervised approaches to Named Entity Recognition will not achieve reasonable results unless they have been trained for the given type of data. The task of Relation Extraction can basically be approached by pattern-based and kernel-based algorithms. The latter achieve state-of-the-art results on newspaper data and point out the importance of linguistic features. In order to analyze relations contained in textual data, syntactic features like part-of-speech tags and syntactic parses are essential. Chapter 4 presents machine learning approaches and linguistic foundations that are essential for the syntactic annotation of textual data and for Relation Extraction. Chapter 6 analyzes the performance of state-of-the-art algorithms for POS tagging, syntactic parsing and Relation Extraction on automotive data.
The findings are that supervised methods trained on newspaper corpora do not achieve accurate results when applied to automotive data. There are several reasons for this. Besides low-quality text, the nature of automotive relations poses the main challenge. Automotive relation types of interest (e.g. component – symptom) are rather arbitrary compared to well-studied relation types like is-a or is-head-of. In order to achieve acceptable results, algorithms have to be trained directly on this kind of data. As the manual annotation of data for each language and data type is too costly and inflexible, unsupervised methods are the ones to rely on. Part II deals with the development of dedicated algorithms for all three essential tasks. Unsupervised POS tagging (Chapter 7) is a well-studied task, and algorithms achieving accurate tagging exist. None of them disambiguates high-frequency words; only out-of-lexicon words are disambiguated. Most high-frequency words bear syntactic information, and thus it is very important to differentiate between their different functions. Domain languages in particular contain ambiguous, high-frequency words bearing semantic information (e.g. pump). In order to improve POS tagging, an algorithm for disambiguation is developed and used to enhance an existing state-of-the-art tagger. This approach is based on context clustering, which is used to detect a word type’s different syntactic functions. Evaluation shows that tagging accuracy is raised significantly. An approach to unsupervised syntactic parsing (Chapter 8) is developed in order to meet the requirements of Relation Extraction. These requirements include high-precision results on nominal and prepositional phrases, as they contain the entities relevant for Relation Extraction. Furthermore, accurate shallow parsing is more desirable than deep binary parsing, as it facilitates Relation Extraction more than deep parsing does.
Endocentric and exocentric constructions can be distinguished, which improves phrase labeling. unsuParse is based on preferred positions of word types within phrases to detect phrase candidates. Iterating the detection of simple phrases successively induces deeper structures. The proposed algorithm fulfills all required criteria and achieves competitive results on standard evaluation setups. Syntactic Relation Extraction (Chapter 9) is an approach exploiting syntactic statistics and text characteristics to extract relations between previously annotated entities. The approach is based on entity distributions given in a corpus and thus provides a possibility to extend text mining processes to new data in an unsupervised manner. Evaluation on two different languages and two different text types of the automotive domain shows that it achieves accurate results on repair order data. Results are less accurate on internet data, but the task of sentiment analysis and extraction of the opinion target can be mastered. Thus, the incorporation of internet data is possible and important, as it provides useful insight into the customer's thoughts. To conclude, this thesis presents a complete unsupervised workflow for Relation Extraction – except for the highly domain-dependent Entity Detection task – improving the performance of each of the involved subtasks compared to state-of-the-art approaches. Furthermore, this work applies Natural Language Processing methods and Relation Extraction approaches to real-world data, unveiling challenges that do not occur in high-quality newspaper corpora. info:eu-repo/classification/ddc/500 ddc:500
20	WebKnox: Web Knowledge Extraction Urbansky, David 26 January 2009 (has links) This thesis focuses on entity and fact extraction from the web. Different knowledge representations and techniques for information extraction are discussed before the design of a knowledge extraction system, called WebKnox, is introduced. The main contributions of this thesis are the trust ranking of extracted facts with a self-supervised learning loop and the extraction system itself, which combines known and refined extraction algorithms. The techniques used show an improvement in precision and recall in most cases for entity and fact extraction compared to the chosen baseline approaches. info:eu-repo/classification/ddc/004 ddc:004
