41

Automated Extraction of Data from Insurance Websites / Automatiserad Datautvinning från Försäkringssidor

Hodzic, Amar January 2022 (has links)
Websites have become a critical source of information for many organizations in today's digital era. However, extracting and organizing semi-structured data from web pages across multiple websites poses challenges. This is especially true when a high level of automation is desired while maintaining generality. A natural progression in the quest for automation is to extend the methods for web data extraction from only being able to handle a single website to handling multiple ones, usually within the same domain. Although these websites share the same domain, the structure of the data can vary greatly. A key question becomes how generalized such a system can be to encompass a large number of websites while maintaining adequate accuracy. The thesis examined the performance of automated web data extraction on multiple Swedish insurance company websites. Previous work showed that good results can be achieved with a known English data set that contains web pages from a number of domains. The state-of-the-art model MarkupLM was chosen and fine-tuned with supervised learning, starting from two pre-trained models (one Swedish and one English), on a labeled training set of car insurance customers' web data; the fine-tuned model was then evaluated zero-shot on websites that were not part of the training data. The results show that such a model can achieve good accuracy at domain scale with Swedish as the source language, even with a relatively small data set, by leveraging pre-trained models. / Webbplatser har blivit en kritisk källa till information för många organisationer idag. Men att extrahera och strukturera semistrukturerade data från webbsidor från flertal webbplatser är en utmaning. Speciellt när det är önskvärt med en hög nivå av automatisering i kombination med en generaliserbar lösning. En naturlig utveckling i strävan efter automation är att utöka metoderna för datautvinning från att endast kunna hantera en specifik webbplats till flertal webbplatser inom samma domän. Men även om dessa webbplatser delar samma domän så kan strukturen på data variera i stor utsträckning. En nyckelfråga blir då hur pass generell en sådan lösning kan vara samtidigt som en adekvat prestanda upprätthålls. Detta arbete undersöker prestandan av automatiserad datautvinning från ett flertal svenska försäkringssidor. Tidigare arbete visar på att goda resultat kan uppnås på ett känt engelskt dataset som innehåller webbsidor från ett flertal domäner. Den toppmoderna modellen MarkupLM valdes och blev tränad med två olika förtränade modeller, en svensk och en engelsk modell, med märkt data från konsumenters bilförsäkringsdata. Modellen blev utvärderad på data från webbplatser som inte ingick i träningsdatat. Resultaten visar på att en sådan modell kan nå god prestanda på domänskala när innehållsspråket är svenska trots en relativt liten datamängd när förtränade modeller används.
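A minimal sketch of the kind of MarkupLM setup described above, using the Hugging Face Transformers API. The checkpoint name, label set, and sample HTML are illustrative assumptions rather than the thesis's actual configuration, and the supervised fine-tuning loop itself is omitted; only the processor/model wiring and a forward pass are shown.

```python
# Sketch: MarkupLM processor + token-classification head for web attribute extraction.
# Checkpoint, labels and HTML are assumptions; the classification head below is untrained.
import torch
from transformers import MarkupLMProcessor, MarkupLMForTokenClassification

LABELS = ["O", "B-PRICE", "B-COVERAGE", "B-DEDUCTIBLE"]  # hypothetical attribute tags

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
model = MarkupLMForTokenClassification.from_pretrained(
    "microsoft/markuplm-base", num_labels=len(LABELS)  # fine-tuning would train this head
)

html = "<html><body><div class='premium'>Premium: 249 kr/month</div></body></html>"
encoding = processor(html, return_tensors="pt")  # tokens plus XPath embeddings

with torch.no_grad():
    logits = model(**encoding).logits            # shape: (batch, seq_len, num_labels)

token_ids = encoding["input_ids"].squeeze().tolist()
for token, label_id in zip(processor.tokenizer.convert_ids_to_tokens(token_ids),
                           logits.argmax(-1).squeeze().tolist()):
    print(token, LABELS[label_id])
```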
42

Automatisk dataextrahering och kategorisering av kvitton / Automatic data extraction and categorisation of receipts

Larsson, Christoffer, Wångenberg Olsson, Adam January 2019 (has links)
Anställda på företag gör ibland utlägg på köp åt företaget som de behöver dokumentera manuellt. För att underlätta dokumentationen av utlägg hos anställda på företaget Consid AB har detta arbete haft i syfte att utveckla en tjänst som från en bild på ett kvitto kan extrahera relevant data såsom pris, datum och företagsnamn samt kategorisera kvittot. Resultatet av arbetet är en tjänst som kan extrahera text från kvitton med en säkerhet på i snitt 73 % att texten är rätt. Efter tester kan det även fastställas att tjänsten kan hitta pris, datum och företagsnamn på ca 64 % av de testade kvittona med olika kvalitet och innehåll. Tjänsten har även implementerat två olika kategoriseringsmetoder, där hälften av de testade kvittona kan kategoriseras av båda metoderna. Efter analys av metoder och resultat har slutsatsen kunnat dras att tjänsten innehåller ett flertal brister samt att mer tid bör läggas på att optimera och testa tjänsten ytterligare. / Employees at companies sometimes make purchases on behalf of the company which they need to document manually. To ease the documentation of purchases made by employees at Consid AB, this study has had the goal of developing a service that, from an image of a receipt, can extract relevant data such as price, date and company name, along with a category for the purchase. The resulting service can extract text from receipts with an average confidence of 73% that the text is correct. Tests of the service show that it can find the price, date and company name on around 64% of test receipts of varying quality and content. The service also implements two different categorisation methods, and half of the test receipts could be categorised by both methods. After analysing the methods and results, the conclusion is that the service contains numerous flaws and that more time needs to be invested in optimising and testing it further.
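A minimal sketch of the OCR-plus-rules pipeline the abstract describes, assuming Tesseract (via pytesseract) for text extraction. The regular expressions, the first-line heuristic for the company name, and the keyword categoriser are illustrative assumptions, not the two categorisation methods developed in the thesis.

```python
# Sketch: extract price, date and company name from a receipt image, then categorise it.
# Assumes Tesseract with Swedish language data is installed; patterns are illustrative.
import re
import pytesseract
from PIL import Image

CATEGORY_KEYWORDS = {            # hypothetical keyword-based categoriser
    "groceries": ["ica", "coop", "willys"],
    "travel": ["sj", "sl", "taxi"],
}

def extract_receipt_fields(image_path: str) -> dict:
    text = pytesseract.image_to_string(Image.open(image_path), lang="swe")
    lower = text.lower()

    # Total price: take the largest amount that looks like "123,45" or "123.45".
    amounts = [float(a.replace(",", ".")) for a in re.findall(r"\d+[.,]\d{2}", text)]
    price = max(amounts) if amounts else None

    # Date in ISO-like or day/month/year format, e.g. 2019-05-02 or 02/05/2019.
    date_match = re.search(r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}", text)

    # Company name: crude heuristic, first non-empty line of the OCR output.
    company = next((line.strip() for line in text.splitlines() if line.strip()), None)

    category = next(
        (cat for cat, words in CATEGORY_KEYWORDS.items() if any(w in lower for w in words)),
        "uncategorised",
    )
    return {"price": price, "date": date_match.group() if date_match else None,
            "company": company, "category": category}

print(extract_receipt_fields("receipt.jpg"))
```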
43

Data for evidence: Defining, collecting and analysing specific data from pedelec accidents as an example of individual, targeted road safety work for new forms of mobility

Panwinkler, Tobias 19 December 2022 (has links)
Cycling, as one of the oldest forms of mobility, is currently experiencing a renaissance. It supports active mobility and can have a positive influence on public health, the environment, the climate and the traffic situation. Pedelecs (bicycles with an electric motor supporting the rider up to a speed of 25 km/h) represent a new form of active mobility and are currently enjoying great popularity, as they offer the same benefits as conventional bicycles and, in addition, make cycling accessible to new user groups. With the growing number of pedelecs, however, the potential for conflict also increases. Unfortunately, the majority of accidents cannot yet be analysed accordingly, as pedelec-specific characteristics are missing from the accident data; this gap has already been shown to be a barrier. Most accident studies focusing on pedelecs are based on police data from standardised accident forms [e.g. 1, 2, 3, 4]. Their findings can be summarised in the following key statements: accidents with pedelecs are less frequent but more severe than those with conventional bicycles. For both, accidents on urban roads dominate, but pedelec accidents occur significantly more often on rural roads than conventional bicycle accidents. And: injured pedelec users, especially those fatally injured, are on average significantly older than injured users of conventional bicycles. But standardised accident forms were initially designed for accidents with double-track motor vehicles, in particular passenger cars. Accidents with bicycles (especially pedelecs) are difficult to categorise within this scheme, as important information is missing. For example, 'falling to the ground' is not an accident category, since cars normally do not fall over, but for pedelec accidents this information is fundamental. This acts as a barrier, as bicycle-specific causes of accidents cannot be analysed. However, accident statistics are the most important basis for evidence-based measures in road safety work. The aim of this paper is therefore to identify and categorise pedelec-specific accident characteristics and to evaluate pedelec accidents on the basis of these characteristics in order to identify frequent and severe accident constellations. [From: Introduction]
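As a rough illustration of how frequent and severe accident constellations can be tallied once pedelec-specific characteristics are recorded, the following pandas sketch groups a toy accident table by road type and cause. The column names and severity coding are hypothetical, not the variables defined in the study.

```python
# Sketch: rank accident constellations by mean severity and frequency.
# The data, columns and severity coding (1 = slight, 2 = serious, 3 = fatal) are assumed.
import pandas as pd

accidents = pd.DataFrame({
    "road_type": ["urban", "rural", "urban", "rural", "urban"],
    "cause":     ["fall", "collision", "fall", "fall", "collision"],
    "severity":  [2, 3, 1, 3, 2],
})

constellations = (
    accidents.groupby(["road_type", "cause"])
    .agg(count=("severity", "size"), mean_severity=("severity", "mean"))
    .sort_values(["mean_severity", "count"], ascending=False)
)
print(constellations)  # the most severe and frequent constellations appear first
```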
44

Multi agent system for web database processing, on data extraction from online social networks.

Abdulrahman, Ruqayya January 2012 (has links)
In recent years, there has been a flood of continuously changing information from a variety of web resources such as web databases, web sites, web services and programs. Online Social Networks (OSNs) represent such a field where huge amounts of information are being posted online over time. Due to the nature of OSNs, which offer a productive source for qualitative and quantitative personal information, researchers from various disciplines contribute to developing methods for extracting data from OSNs. However, there is limited research which addresses extracting data automatically. To the best of the author's knowledge, there is no research which focuses on tracking the real time changes of information retrieved from OSN profiles over time and this motivated the present work. This thesis presents different approaches for automated Data Extraction (DE) from OSN: crawler, parser, Multi Agent System (MAS) and Application Programming Interface (API). Initially, a parser was implemented as a centralized system to traverse the OSN graph and extract the profile's attributes and list of friends from Myspace, the top OSN at that time, by parsing the Myspace profiles and extracting the relevant tokens from the parsed HTML source files. A Breadth First Search (BFS) algorithm was used to travel across the generated OSN friendship graph in order to select the next profile for parsing. The approach was implemented and tested on two types of friends: top friends and all friends. In case of top friends, 500 seed profiles have been visited; 298 public profiles were parsed to get 2197 top friends profiles and 2747 friendship edges, while in case of all friends, 250 public profiles have been parsed to extract 10,196 friends' profiles and 17,223 friendship edges. This approach has two main limitations. The system is designed as a centralized system that controlled and retrieved information of each user's profile just once. This means that the extraction process will stop if the system fails to process one of the profiles; either the seed profile (first profile to be crawled) or its friends. To overcome this problem, an Online Social Network Retrieval System (OSNRS) is proposed to decentralize the DE process from OSN through using MAS. The novelty of OSNRS is its ability to monitor profiles continuously over time. The second challenge is that the parser had to be modified to cope with changes in the profiles' structure. To overcome this problem, the proposed OSNRS is improved through use of an API tool to enable OSNRS agents to obtain the required fields of an OSN profile despite modifications in the representation of the profile's source web pages. The experimental work shows that using API and MAS simplifies and speeds up the process of tracking a profile's history. It also helps security personnel, parents, guardians, social workers and marketers in understanding the dynamic behaviour of OSN users. This thesis proposes solutions for web database processing on data extraction from OSNs by the use of parser and MAS and discusses the limitations and improvements. / Taibah University
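A minimal sketch of the breadth-first traversal described above. fetch_profile() is a hypothetical stand-in for the parser or API layer, and the try/except guards against the stop-on-failure limitation mentioned in the abstract (which the thesis itself addresses differently, via a multi-agent system).

```python
# Sketch: BFS over a friendship graph, collecting profile attributes and friendship edges.
from collections import deque

def fetch_profile(profile_id):
    """Hypothetical call to the OSN parser or API; returns (attributes, friend_ids)."""
    raise NotImplementedError

def bfs_crawl(seed_id, max_profiles=500):
    visited, queue = set(), deque([seed_id])
    profiles, edges = {}, []
    while queue and len(profiles) < max_profiles:
        current = queue.popleft()
        if current in visited:
            continue
        visited.add(current)
        try:
            attributes, friends = fetch_profile(current)
        except Exception:
            continue  # skip private or unreachable profiles instead of stopping the crawl
        profiles[current] = attributes
        for friend in friends:
            edges.append((current, friend))
            if friend not in visited:
                queue.append(friend)  # enqueue for later parsing
    return profiles, edges
```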
45

Microbial phenomics information extractor (MicroPIE): a natural language processing tool for the automated acquisition of prokaryotic phenotypic characters from text sources

Mao, Jin, Moore, Lisa R., Blank, Carrine E., Wu, Elvis Hsin-Hui, Ackerman, Marcia, Ranade, Sonali, Cui, Hong 13 December 2016 (has links)
Background: The large-scale analysis of phenomic data (i.e., full phenotypic traits of an organism, such as shape, metabolic substrates, and growth conditions) in microbial bioinformatics has been hampered by the lack of tools to rapidly and accurately extract phenotypic data from existing legacy text in the field of microbiology. To quickly obtain knowledge on the distribution and evolution of microbial traits, an information extraction system needed to be developed to extract phenotypic characters from large numbers of taxonomic descriptions so they can be used as input to existing phylogenetic analysis software packages. Results: We report the development and evaluation of Microbial Phenomics Information Extractor (MicroPIE, version 0.1.0). MicroPIE is a natural language processing application that uses a robust supervised classification algorithm (Support Vector Machine) to identify characters from sentences in prokaryotic taxonomic descriptions, followed by a combination of algorithms applying linguistic rules with groups of known terms to extract characters as well as character states. The input to MicroPIE is a set of taxonomic descriptions (clean text). The output is a taxon-by-character matrix, with taxa in the rows and a set of 42 pre-defined characters (e.g., optimum growth temperature) in the columns. The performance of MicroPIE was evaluated against a gold standard matrix and another student-made matrix. Results show that, compared to the gold standard, MicroPIE extracted 21 characters (50%) with a Relaxed F1 score > 0.80 and 16 characters (38%) with Relaxed F1 scores ranging between 0.50 and 0.80. Inclusion of a character prediction component (SVM) improved the overall performance of MicroPIE, notably the precision. Evaluated against the same gold standard, MicroPIE performed significantly better than the undergraduate students. Conclusion: MicroPIE is a promising new tool for the rapid and efficient extraction of phenotypic character information from prokaryotic taxonomic descriptions. However, further development, including incorporation of ontologies, will be necessary to improve the performance of the extraction for some character types.
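A minimal sketch of the kind of supervised sentence classifier (SVM) MicroPIE uses to assign character categories to sentences, here with scikit-learn. The example sentences, labels, and TF-IDF features are invented for illustration and are not MicroPIE's code or training data.

```python
# Sketch: classify sentences from taxonomic descriptions into character categories with an SVM.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

sentences = [
    "Optimal growth occurs at 37 degrees Celsius.",
    "Cells are rod-shaped and motile by a single polar flagellum.",
    "Utilizes glucose and lactate as sole carbon sources.",
]
labels = ["optimum growth temperature", "cell shape", "metabolic substrates"]

# TF-IDF word/bigram features feeding a linear SVM, trained on labeled sentences.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(sentences, labels)

print(classifier.predict(["Growth is observed between 20 and 45 degrees Celsius."]))
```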
46

Medicininių dokumentų automatizuotos analizės metodikos tyrimas / Analysis of automatic data extraction from medical documents

Kazla, Algirdas 25 May 2005 (has links)
Automatic data extraction from medical legacy systems into archetype-based systems is analyzed, developed and tested in this work. An electronic health record system (EHRS) is a must in today’s healthcare environment. Many current medical systems are still built with classic development approaches, with the semantics hard-coded into the system. Modern EHRS standards propose a new “two-level” methodology, which is based on the separation of the knowledge and information levels. This work suggests a methodology for transforming today’s heterogeneous medical legacy systems into systems built with the “two-level” methodology. The transformation is based on the knowledge residing in the new system. By creating a comprehensive transformation scheme, it is possible to analyze and extract relevant data from semi-structured or unstructured text fields with mixed information. The suggested methodology is tested with a software prototype by extracting laboratory results of a clinical blood test from semi-structured fields of a cardiology database. About 95% of the data was successfully transferred from the legacy system. This approach preserves medical data accumulated during long years of work and transforms it into a more useful form, creating structured data from unstructured text fields. It allows medical experts to use automated information-technology tools to analyze and interpret legacy data (draw charts, calculate statistics and so on).
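A minimal sketch of rule-based extraction of blood-test values from a semi-structured text field, in the spirit of the prototype described above. The field names, abbreviations and patterns are illustrative assumptions, not the thesis's transformation scheme.

```python
# Sketch: pull structured laboratory values out of a free-text field with regular expressions.
# Abbreviations, units and the example sentence are assumptions for illustration.
import re

BLOOD_TEST_PATTERNS = {
    "hemoglobin_g_l":      r"\bHb\s*[:=]?\s*(\d+(?:[.,]\d+)?)",
    "leukocytes_10e9_l":   r"\bWBC\s*[:=]?\s*(\d+(?:[.,]\d+)?)",
    "erythrocytes_10e12_l": r"\bRBC\s*[:=]?\s*(\d+(?:[.,]\d+)?)",
}

def extract_blood_test(text_field: str) -> dict:
    results = {}
    for name, pattern in BLOOD_TEST_PATTERNS.items():
        match = re.search(pattern, text_field, flags=re.IGNORECASE)
        if match:
            results[name] = float(match.group(1).replace(",", "."))
    return results

print(extract_blood_test("Kraujo tyrimas: Hb 142, WBC 6.8, RBC 4.7."))
```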
47

Mining Git Repositories : An introduction to repository mining

Carlsson, Emil January 2013 (has links)
When performing an analysis of the evolution of software quality and software metrics, there is a need to get access to as many versions of the source code as possible. There is a lack of research on how data or source code can be extracted from the source control management system Git. This thesis explores different possibilities to resolve this problem. Lately, there has been a boom in the usage of the version control system Git. GitHub alone hosts about 6,100,000 projects. Some well known projects and organizations that use Git are Linux, WordPress, and Facebook. Even with these figures and clients, there are very few tools able to perform data extraction from Git repositories. A pre-study showed that there is a lack of standardization on how to share mining results and the methods used to obtain them. There are several tools available for older version control systems, such as Concurrent Versions System (CVS), but few for Git. The examined repository mining applications for Git are either poorly documented or were built to be very purpose-specific to the project for which they were designed. This thesis compiles a list of general issues encountered when using repository mining as a tool for data gathering. A selection of existing repository mining tools was evaluated against a set of prerequisite criteria. The end result of this evaluation is the creation of a new repository mining tool called Doris. This tool also includes a small code metrics analysis library to show how it can be extended.
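A minimal sketch of the kind of history traversal a Git repository-mining tool needs, here using GitPython. The repository path and file name are placeholders, and this is not the Doris implementation.

```python
# Sketch: walk every commit touching a file and retrieve that version of its source.
from git import Repo

repo = Repo("/path/to/repository")                      # placeholder path
for commit in repo.iter_commits("master", paths="src/main.py"):
    try:
        blob = commit.tree / "src/main.py"
    except KeyError:
        continue                                         # file was removed in this commit
    source = blob.data_stream.read().decode("utf-8", errors="replace")
    # Each version of the file is now available for metrics analysis.
    print(commit.hexsha[:8], commit.committed_datetime, len(source.splitlines()), "lines")
```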
48

Vyhledávač údajů ve webových stránkách / Web page data figure finder

Janata, Dominik January 2016 (has links)
The thesis treats the automatic extraction of semantic data from Web pages. Within this broad problem, it focuses on finding the values of data figures within a page presenting a certain entity (e.g. the price of a laptop). The main idea we wanted to evaluate is that a figure can be found using its context in the page: the words that surround it and the values of the attributes of the containing HTML tags, the class attribute in particular. Our research revealed that there are two types of contemporary solutions to this problem: either the author of the Web page must inline semantic information inside the markup of the page, or there are commercial tools that can be trained to parse a particular page format (targeting pages from a single Web domain). We examined the possibilities of developing a general solution that would, for a given entity, find its properties across Web domains using text analysis and machine learning. The naïve algorithm had about 30% accuracy; the learning algorithms had an accuracy between 40 and 50% in finding the properties. Although the accuracy is not acceptable for a final solution, we believe it confirms the potential of the idea. Keywords: Web page data extraction
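A minimal sketch of collecting the context features the thesis builds on (surrounding words and ancestor class attributes) for each numeric figure on a page, using BeautifulSoup. A real system would feed these features to the trained classifier, which is omitted here; the sample HTML is an assumption.

```python
# Sketch: describe every number on a page by its surrounding words and ancestor classes.
import re
from bs4 import BeautifulSoup

html = """
<div class="product">
  <span class="product-price">Price: 999 USD</span>
  <span class="product-weight">Weight: 1.3 kg</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
candidates = []
for text_node in soup.find_all(string=re.compile(r"\d")):
    ancestor_classes = [
        cls for parent in text_node.parents if parent.name
        for cls in parent.get("class", [])
    ]
    surrounding_words = re.findall(r"[A-Za-z]+", text_node)
    for number in re.findall(r"\d+(?:\.\d+)?", text_node):
        candidates.append({"value": number,
                           "words": surrounding_words,
                           "classes": ancestor_classes})

for candidate in candidates:
    print(candidate)   # feature records a classifier could score per target property
```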
49

Extrakce dat z webu / Web Data Extraction

Novella, Tomáš January 2016 (has links)
Creation of web wrappers (i.e. programs that extract data from the web) is a subject of study in the field of web data extraction. Designing a domain-specific language for a web wrapper is a challenging task, because it introduces trade-offs between the expressiveness of a wrapper's language and safety. In addition, little attention has been paid to the execution of a wrapper in a restricted environment. In this thesis, we present a new wrapping language, Serrano, designed with three goals in mind: (1) the ability to run in a restricted environment, such as a browser extension; (2) extensibility, to balance the trade-offs between the expressiveness of the command set and safety; and (3) processing capabilities, to eliminate the need for additional programs to clean the extracted data. Serrano has been successfully deployed in a number of projects and provided encouraging results.
50

Extrakce dat z dynamických WWW stránek / Data Extraction from Dynamic Web Pages

Puna, Petr January 2009 (has links)
This work gives a brief overview of technologies for representing and obtaining data on the WWW and describes selected web data extraction tools. It then designs a new tool for obtaining pages generated by filling in web forms, which allows its user to define the data of interest on such pages and which can extract those data and offer them in an XML format suitable for further machine processing.
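A minimal sketch of the workflow the abstract describes: submit a web form, extract user-defined fields from the generated page, and emit XML. The URL, form fields and CSS selectors are placeholders, not the tool's actual configuration.

```python
# Sketch: fetch a form-generated result page and export user-defined fields as XML.
import xml.etree.ElementTree as ET

import requests
from bs4 import BeautifulSoup

FORM_URL = "https://example.com/search"                     # placeholder form endpoint
FORM_DATA = {"query": "laptop", "category": "electronics"}  # placeholder form values
FIELD_SELECTORS = {"name": ".result-title", "price": ".result-price"}  # user-defined fields

response = requests.post(FORM_URL, data=FORM_DATA, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

root = ET.Element("records")
for row in soup.select(".result"):
    record = ET.SubElement(root, "record")
    for field, selector in FIELD_SELECTORS.items():
        cell = row.select_one(selector)
        ET.SubElement(record, field).text = cell.get_text(strip=True) if cell else ""

print(ET.tostring(root, encoding="unicode"))  # XML suitable for further machine processing
```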
