A problem with automatic tagging and lexical analysis is that it is never 100 % accurate. In order to arrive at better figures, one needs to study the character of what is left untagged by automatic taggers. In this paper untagged residue outputted by the automatic analyser SWETWOL (Karlsson 1992) at Helsinki is studied. SWETWOL assigns tags to words in Swedish texts mainly through dictionary lookup. The contents of the untagged residue files are described and discussed, and possible ways of solving different problems are proposed. One method of tagging residual output is proposed and implemented: the left-stripping method, through which untagged words are bereaved their left-most letters, searched in a dictionary, and if found, tagged according to the information found in the said dictionary. If the stripped word is not found in the dictionary, a match is searched in ending lexica containing statistical information about word classes associated with that particular word form (i.e., final letter cluster, be this a grammatical suffix or not), and the relative frequency of each word class. If a match is found, the word is given graduated tagging according to the statistical information in the ending lexicon. If a match is not found, the word is stripped of what is now its left-most letter and is recursively searched in a dictionary and ending lexica (in that order). The ending lexica employed in this paper are retrieved from a reversed version of Nusvensk Frekvensordbok (Allén 1970), and contain endings of between one and seven letters. The contents of the ending lexica are to a certain degree described and discussed. The programs working according to the principles described are run on files of untagged residual output. Appendices include, among other things, LISP source code, untagged and tagged files, the ending lexica containing one and two letter endings and excerpts from ending lexica containing three to seven letters.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-135294 |
Date | January 1993 |
Creators | Eklund, Robert |
Publisher | Stockholm University, Department of Computational Linguistics, Institute of Linguistics |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0107 seconds