Levenshtein distance for information extraction in databases and for natural language processing.

Information extraction and natural language processing tasks often run into problems when the data or texts contain noise, typing mistakes, or other kinds of errors. In this thesis we investigate the use of modified Levenshtein edit distances to deal with these problems in two specific tasks. The first is record linkage in databases, where distinct records may represent the same entity. For this task we used and extended the WEKA API for machine learning and showed that a modified Levenshtein distance achieves good precision and recall in detecting records that represent the same entity. The second task is the search and annotation of occurrences of specified words in natural-language texts. Our main result for this task is the implementation of an approximate Gazetteer for GATE, the General Architecture for Text Engineering.
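
As a rough illustration of the idea behind both tasks, the sketch below computes the classic unit-cost Levenshtein distance and uses it for a simple approximate word lookup. It is not the thesis implementation: the abstract does not specify the modified distances, and the thesis work builds on WEKA and GATE rather than on standalone code. The function names `levenshtein` and `approximate_matches` and the `max_distance` threshold are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic unit-cost edit distance (insert, delete, substitute).

    Assumption: this is the textbook distance, not the modified
    variants investigated in the thesis.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop ca
                current[j - 1] + 1,      # add cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def approximate_matches(tokens, gazetteer, max_distance=1):
    """Return (token, entry) pairs within max_distance edits (hypothetical helper)."""
    return [
        (token, entry)
        for token in tokens
        for entry in gazetteer
        if levenshtein(token.lower(), entry.lower()) <= max_distance
    ]
```

With a threshold of one edit, a misspelled token such as "Levenstein" would still be matched against a list entry "Levenshtein", which is the kind of tolerance to typing mistakes the abstract describes. The same distance could serve as a similarity feature when comparing fields of two database records.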

http://www.bd.bibl.ita.br/tde_busca/arquivo.php?codArquivo=529

Text processing

Natural language (computers)

Editing routines (computers)

Information theory

Computing

Identifier	oai:union.ndltd.org:IBICT/oai:agregador.ibict.br.BDTD_ITA:oai:ita.br:529
Date	21 December 2007
Creators	Bruno Woltzenlogel Paleo
Contributors	Carlos Henrique Costa Ribeiro
Publisher	Instituto Tecnológico de Aeronáutica
Source Sets	IBICT Brazilian ETDs
Language	English
Detected Language	English
Type	info:eu-repo/semantics/publishedVersion, info:eu-repo/semantics/masterThesis
Format	application/pdf
Source	reponame:Biblioteca Digital de Teses e Dissertações do ITA, instname:Instituto Tecnológico de Aeronáutica, instacron:ITA
Rights	info:eu-repo/semantics/openAccess
