Global ETD Search

1	BINDING HASH TECHNIQUE FOR XML QUERY OPTIMIZATION BRANT, MICHAEL J. 20 July 2006 No description available. XML Query Processing XML Query Optimization Semi-structured data XPath
2	Adaptive Semi-structured Information Extraction Arpteg, Anders January 2003 The number of domains and tasks where information extraction tools can be used needs to be increased. One way to reach this goal is to construct user-driven information extraction systems where novice users are able to adapt them to new domains and tasks. To accomplish this goal, the systems need to become more intelligent and able to learn to extract information without need of expert skills or time-consuming work from the user. The type of information extraction system that is in focus for this thesis is semi-structural information extraction. The term semi-structural refers to documents that not only contain natural language text but also additional structural information. The typical application is information extraction from World Wide Web hypertext documents. By making effective use of not only the link structure but also the structural information within each such document, user-driven extraction systems with high performance can be built. The extraction process contains several steps where different types of techniques are used. Examples of such types of techniques are those that take advantage of structural, pure syntactic, linguistic, and semantic information. The first step that is in focus for this thesis is the navigation step that takes advantage of the structural information. It is only one part of a complete extraction system, but it is an important part. The use of reinforcement learning algorithms for the navigation step can make the adaptation of the system to new tasks and domains more user-driven. The advantage of using reinforcement learning techniques is that the extraction agent can efficiently learn from its own experience without need for intensive user interactions. An agent-oriented system was designed to evaluate the approach suggested in this thesis. Initial experiments showed that the training of the navigation step and the approach of the system were promising. However, additional components need to be included in the system before it becomes a fully-fledged user-driven system. / Report code: LiU-Tek-Lic-2002:73. Information extraction Artificial intelligence Semi-structured data Reinforced learning Knowledge management Computer science Datavetenskap
3	TagLine: Information Extraction for Semi-Structured Text Elements In Medical Progress Notes Finch, Dezon K. 01 January 2012 Text analysis has become an important research activity in the Department of Veterans Affairs (VA). Statistical text mining and natural language processing have been shown to be very effective for extracting useful information from medical documents. However, neither of these techniques is effective at extracting the information stored in semi-structured text elements. A prototype system (TagLine) was developed as a method for extracting information from the semi-structured portions of text using machine learning. Features for the learning machine were suggested by prior work, as well as by examining the text and selecting those attributes that help distinguish the various classes of text lines. The classes were derived empirically from the text and guided by an ontology developed by the Consortium for Health Informatics Research (CHIR), a nationwide research initiative focused on medical informatics. Decision trees and Levenshtein approximate string matching techniques were tested and compared on 5,055 unseen lines of text. The performance of the decision tree method was found to be superior to the fuzzy string match method on this task. Decision trees achieved an overall accuracy of 98.5 percent, while the string match method only achieved an accuracy of 87 percent. Overall, the results for line classification were very encouraging. The labels applied to the lines were used to evaluate TagLine's performance for identifying the semi-structured text elements, including tables, slots and fillers. Results for slots and fillers were impressive, and the results for tables were acceptable. Information Extraction Machine Learning Natural Language Processing Semi-structured data Computer Sciences Library and Information Science
4	Query Languages for Semi-structured Data Maksimovic, Gordana January 2003 Semi-structured data is defined as irregular data with structure that may change rapidly or unpredictably. An example of such data can be found inside the World-Wide Web. Since the data is irregular, the user may not know the complete structure of the database. Thus, querying such data becomes a difficult issue. In order to write meaningful queries on semi-structured data, there is a need for a query language that will support the features that are presented by this data. Standard query languages, such as SQL for relational databases and OQL for object databases, are too constraining for querying semi-structured data, because they require data to conform to a fixed schema before any data is stored into the database. This paper introduces Lorel, a query language developed particularly for querying semi-structured data. Furthermore, it investigates whether the standardised query languages support any of the criteria presented for semi-structured data. The result is an evaluation of three query languages (SQL, OQL and Lorel) against these criteria. Semi-structured data unstructured data data on the Web database management Lorel Computer Sciences Datavetenskap (datalogi)
5	Adaptive Semi-structured Information Extraction Arpteg, Anders January 2003 The number of domains and tasks where information extraction tools can be used needs to be increased. One way to reach this goal is to construct user-driven information extraction systems where novice users are able to adapt them to new domains and tasks. To accomplish this goal, the systems need to become more intelligent and able to learn to extract information without need of expert skills or time-consuming work from the user. The type of information extraction system that is in focus for this thesis is semi-structural information extraction. The term semi-structural refers to documents that not only contain natural language text but also additional structural information. The typical application is information extraction from World Wide Web hypertext documents. By making effective use of not only the link structure but also the structural information within each such document, user-driven extraction systems with high performance can be built. The extraction process contains several steps where different types of techniques are used. Examples of such types of techniques are those that take advantage of structural, pure syntactic, linguistic, and semantic information. The first step that is in focus for this thesis is the navigation step that takes advantage of the structural information. It is only one part of a complete extraction system, but it is an important part. The use of reinforcement learning algorithms for the navigation step can make the adaptation of the system to new tasks and domains more user-driven. The advantage of using reinforcement learning techniques is that the extraction agent can efficiently learn from its own experience without need for intensive user interactions. An agent-oriented system was designed to evaluate the approach suggested in this thesis. Initial experiments showed that the training of the navigation step and the approach of the system were promising. However, additional components need to be included in the system before it becomes a fully-fledged user-driven system. / Report code: LiU-Tek-Lic-2002:73. Information extraction Artificial intelligence Semi-structured data Reinforced learning Knowledge management Computer Sciences Datavetenskap (datalogi)
6	A Framework for Automatic Ontology Generation from Autonomous Web Applications Modica, Giovanni 13 December 2002 Ontologies capture the structure, relationships, semantics and other essential meta information of an application. This thesis describes a framework to automate application interoperability by using dynamically generated ontologies. We propose a set of techniques to extract ontologies from data accessible on the Web in the form of semi-structured HTML pages. Ontologies retrieved from similar applications are matched together to create a general ontology describing the application domain. Information retrieval and graph matching techniques are used to match and measure the usefulness of the ontologies created. Matching algorithms are combined together to produce global ontologies based on local ontologies inherently present in Web applications. We present a system called OntoBuilder that allows users to drive the ontology creation process using a user-friendly and intuitive interface. We also present experiments for a well-known case study: car-rental applications. We successfully achieve 90% accuracy on ontology extraction and 70% accuracy for ontology matching. HTML/XML semi-structured data information retrieval ontologies web services matching
7	Internet-Scale Information Monitoring: A Continual Query Approach Tang, Wei 08 December 2003 Information monitoring systems are publish-subscribe systems that continuously track information changes and notify users (or programs acting on behalf of humans) of relevant updates according to specified thresholds. Internet-scale information monitoring presents a number of new challenges. First, automated change detection is harder when sources are autonomous and updates are performed asynchronously. Second, information source heterogeneity makes the problem of modelling and representing changes harder than ever. Third, efficient and scalable mechanisms are needed to handle a large and growing number of users and thousands or even millions of monitoring triggers fired at multiple sources. In this dissertation, we model users' monitoring requests using continual queries (CQs) and present a suite of efficient and scalable solutions to large-scale information monitoring over structured or semi-structured data sources. A CQ is a standing query that monitors information sources for interesting events (triggers) and notifies users when new information changes meet specified thresholds. In this dissertation, we first present the system-level facilities for building an Internet-scale continual query system, including the design and development of two operational CQ monitoring systems, OpenCQ and WebCQ, the engineering issues involved, and our solutions. We then describe a number of research challenges that are specific to large-scale information monitoring and the techniques developed in the context of OpenCQ and WebCQ to address these challenges. Example issues include how to efficiently process a large number of continual queries, what mechanisms are effective for building a scalable distributed trigger system that is capable of handling tens of thousands of triggers firing at hundreds of data sources, and how to effectively disseminate fresh information to the right users at the right time. We have developed a suite of techniques to optimize the processing of continual queries, including an effective CQ grouping scheme, an auxiliary data structure to support group-based indexing of CQs, and a differential CQ evaluation algorithm (DRA). The third contribution is the design of an experimental evaluation model and testbed to validate the solutions. We carried out our evaluation using both measurements on real systems (OpenCQ/WebCQ) and a simulation-based approach. To our knowledge, the research documented in this dissertation is to date the first to present a focused study of research and engineering issues in building large-scale information monitoring systems using continual queries. Differential re-evaluation Continual queries Web page monitoring Semi-structured data Information monitoring Web sites Management Information technology Internet programming
8	[en] A MODEL FOR EXPLORATION OF SEMI-STRUCTURED DATASETS / [pt] UM MODELO PARA EXPLORAÇÃO DE DADOS SEMIESTRUTURADOS THIAGO RIBEIRO NUNES 05 February 2018 [pt] Information exploration tasks are known for characteristics such as high complexity, the user's lack of knowledge about the task domain, and uncertainty about solution strategies. The state of the art in data exploration includes a variety of models and tools based on different interaction paradigms, such as keyword search, faceted search, and set-oriented interfaces. Despite the many advances of recent decades, the lack of a formal approach to the exploration process, together with the lack of a more pragmatic adoption of the separation-of-concerns principle in the design of these tools, is the cause of many limitations. Among these limitations, this thesis addresses the lack of expressiveness, characterized by restrictions on the range of possible solution strategies, and the difficulty of analyzing and comparing the proposed tools. Based on this observation, this work proposes a formal model of exploration actions and processes, a new approach to the design of exploration tools, and a tool that generalizes the state of the art in information exploration. Evaluations of the model, carried out through case studies, analyses, and comparisons with the state of the art, corroborate the usefulness of the approach. / [en] Information exploration processes are usually recognized by their inherent complexity, lack of knowledge and uncertainty, concerning both the domain and the solution strategies. 
Even though there has been much work on the development of computational systems supporting exploration tasks, such as faceted search and set-oriented interfaces, the lack of a formal understanding of the exploration process and the absence of a proper separation-of-concerns approach in the design phase are the cause of many expressivity issues and serious limitations. This work proposes a novel design approach for exploration tools based on a formal framework for representing exploration actions and processes. Moreover, we present a new exploration system that generalizes the majority of the state-of-the-art exploration tools. The evaluation of the proposed framework is guided by case studies and comparisons with state-of-the-art tools. The results show the relevance of our approach both for the design of new exploration tools with higher expressiveness and for formal assessments and comparisons between different tools. [pt] FRAMEWORK [en] FRAMEWORK [pt] EXPLORACAO [en] EXPLORATION [pt] MODELO FORMAL [en] FORMAL MODEL [pt] DADOS SEMIESTRUTURADOS [en] SEMI-STRUCTURED DATA
9	Towards Efficient Data Analysis and Management of Semi-structured Data Tatikonda, Shirish 08 September 2010 No description available. Computer Science semi-structured data data mining data management high performance computing databases architecture-conscious techniques trees multicore systems
10	Automated Extraction of Data from Insurance Websites / Automatiserad Datautvinning från Försäkringssidor Hodzic, Amar January 2022 Websites have become a critical source of information for many organizations in today's digital era. However, extracting and organizing semi-structured data from web pages across multiple websites poses challenges. This is especially true when a high level of automation is desired while maintaining generality. A natural progression in the quest for automation is to extend the methods for web data extraction from only being able to handle a single website to handling multiple ones, usually within the same domain. Although these websites share the same domain, the structure of the data can vary greatly. A key question becomes how generalized such a system can be to encompass a large number of websites while maintaining adequate accuracy. The thesis examined the efficiency of automated web data extraction on multiple Swedish insurance company websites. Previous work showed that good results can be achieved with a known English data set that contains web pages from a number of domains. The state-of-the-art model MarkupLM was chosen and trained with supervised learning using two pre-trained models, a Swedish and an English model, on a labeled training set of car insurance customers' web data using zero-shot learning. The results show that such a model can achieve good accuracy on a domain scale with Swedish as the source language with a relatively small data set by leveraging pre-trained models. / Web pages have become a critical source of information for many organizations today. However, extracting and structuring semi-structured data from web pages across multiple websites is a challenge, especially when a high level of automation is desired in combination with a generalizable solution. A natural development in the pursuit of automation is to extend data extraction methods from handling only one specific website to handling multiple websites within the same domain. Even though these websites share the same domain, the structure of the data can vary widely. A key question is then how general such a solution can be while maintaining adequate performance. This work examines the performance of automated data extraction from a number of Swedish insurance websites. Previous work shows that good results can be achieved on a known English dataset containing web pages from several domains. The state-of-the-art model MarkupLM was chosen and trained with two different pre-trained models, one Swedish and one English, on labeled data from consumers' car insurance data. The model was evaluated on data from websites that were not part of the training data. The results show that such a model can achieve good performance at domain scale when the content language is Swedish, despite a relatively small amount of data, when pre-trained models are used. Insurance Semi-structured data Web data extraction Deep learning Försäkring Semistrukturerad data Webbdataextraktion Djupinlärning Computer Sciences Datavetenskap (datalogi)
