Towards More Comprehensive Information Retrieval Systems: Entity Extraction Using XSLT

Retrieving information from electronic documents such as images, Microsoft Office documents, and e-mail remains a problem in document management. Specific data entities must be extracted from these documents before the data can be searched and queried. This study presents a unique approach to extracting these entities: using Extensible Stylesheet Language Transformations (XSLT) to match patterns in text. Because XSLT is processed at run time, new XSLT templates can be created and used without recompiling and redeploying the application. The implementation addressed in this project extracts entities from an image file. The data in the image file is converted to Extensible Markup Language (XML) text via optical character recognition (OCR), and this XML text is then transformed into an organized, well-formed XML output file using an XSLT template. We show that this approach can accurately retrieve the correct data and that the method can be extended to other electronic document sources.
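The pipeline described in the abstract (OCR output as XML, then an XSLT template that pulls out entities) can be illustrated with a minimal sketch. It is not the thesis's implementation: the `page` and `line` element names, the `Invoice No:` and `Date:` labels, the output element names, the sample values, and the `EntityExtractionSketch` class are all hypothetical, chosen only for illustration. The sketch uses the XSLT 1.0 processor that ships with the JDK (`javax.xml.transform`), and the stylesheet is passed in as a string at run time, which mirrors the point that new templates need no recompilation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: all element names, labels, and sample values below are
// assumptions made for illustration; none of them come from the thesis.
class EntityExtractionSketch {

    // Made-up stand-in for the XML an OCR step might emit: one <line> per text line.
    static final String SAMPLE_OCR_XML = String.join("\n",
        "<page>",
        "  <line>ACME Supply Co.</line>",
        "  <line>Invoice No: 10482</line>",
        "  <line>Date: 2005-01-01</line>",
        "</page>");

    // XSLT 1.0 template. Each predicate matches a text pattern in a line and
    // emits the remainder of that line as a named entity.
    static final String SAMPLE_XSLT = String.join("\n",
        "<xsl:stylesheet version=\"1.0\"",
        "    xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">",
        "  <xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>",
        "  <xsl:template match=\"/\">",
        "    <entities><xsl:apply-templates select=\"//line\"/></entities>",
        "  </xsl:template>",
        "  <xsl:template match=\"line[starts-with(normalize-space(.), 'Invoice No:')]\">",
        "    <invoiceNumber>",
        "      <xsl:value-of select=\"normalize-space(substring-after(., 'Invoice No:'))\"/>",
        "    </invoiceNumber>",
        "  </xsl:template>",
        "  <xsl:template match=\"line[starts-with(normalize-space(.), 'Date:')]\">",
        "    <date>",
        "      <xsl:value-of select=\"normalize-space(substring-after(., 'Date:'))\"/>",
        "    </date>",
        "  </xsl:template>",
        "  <!-- Lines that match no pattern produce no output. -->",
        "  <xsl:template match=\"line\"/>",
        "</xsl:stylesheet>");

    // Compiles the stylesheet text at run time and applies it to the OCR XML.
    static String extract(String xslt, String ocrXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(ocrXml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Prints the extracted entities as a small XML document.
        System.out.println(extract(SAMPLE_XSLT, SAMPLE_OCR_XML));
    }
}
```

In a deployed system the stylesheet would more plausibly be read from a file or database than from a string constant, so a new document layout could be supported by adding a template rather than changing compiled code. XSLT 1.0 has no regular expressions, so this sketch relies on `starts-with` and `substring-after`; whether the thesis used these functions is not stated in the abstract.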

UNF

University of North Florida

Computer Sciences

Identifier	oai:union.ndltd.org:unf.edu/oai:digitalcommons.unf.edu:etd-1224
Date	01 January 2005
Creators	McManigal, Chris A
Publisher	UNF Digital Commons
Source Sets	University of North Florida
Detected Language	English
Type	text
Format	application/pdf
Source	UNF Theses and Dissertations
