Jak kvalita lemmatizace ovlivňuje výsledky vyhledávání dokumentů v českém jazyce / Effect of the Czech Stemming Algorithm on the Document Retrieval

This thesis measures the quality of stemming/lemmatization algorithms for the Czech language in document processing systems and analyzes the results. The theoretical part describes the principles of full-text search, the options for implementing it, and the common problems that have to be solved when processing natural language. It discusses methods of evaluating the quality of lemmatization using recall and precision, and it covers the measurement of the under-stemming and over-stemming indices, which can be applied for a more detailed evaluation. The second part of the thesis proposes an experiment for evaluating lemmatization algorithms. A specialized application was developed to perform the experiment in three different systems: Apache Lucene, the PostgreSQL database system, and Microsoft SQL Server. The experiment is based on the Prague Dependency Treebank corpus and was carried out both for the corpus as a whole and for selected word classes separately. Further analysis of the results for the Czech stemmer in Apache Lucene leads to a proposal for several modifications of the algorithm, and these modifications result in measurable improvements. The results show how the metrics discussed, together with the measured values, can be used to improve lemmatization algorithms and thus full-text search for the Czech language.
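The abstract does not give the formulas behind these metrics, so the following is only a minimal sketch under stated assumptions. Precision and recall use their standard set-based definitions. The under-stemming and over-stemming indices are assumed here to be pair-counting ratios: the share of same-lemma word pairs that the stemmer fails to merge, and the share of different-lemma word pairs that it wrongly merges. The function names, the toy word list, and the prefix "stemmer" are hypothetical and are not taken from the thesis or from Apache Lucene.

```python
from itertools import combinations


def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def stemming_indices(concept_of, stem):
    """Pair-counting under-/over-stemming indices (an assumed definition).

    concept_of: dict mapping each word to its gold lemma/concept label.
    stem: callable mapping a word to the stem produced by the algorithm.
    Returns (under_stemming_index, over_stemming_index).
    """
    same_total = same_split = diff_total = diff_merged = 0
    for a, b in combinations(sorted(concept_of), 2):
        same_stem = stem(a) == stem(b)
        if concept_of[a] == concept_of[b]:
            same_total += 1
            same_split += not same_stem   # should merge, but did not
        else:
            diff_total += 1
            diff_merged += same_stem      # should stay apart, but merged
    ui = same_split / same_total if same_total else 0.0
    oi = diff_merged / diff_total if diff_total else 0.0
    return ui, oi


# Hypothetical toy data: Czech word forms mapped to their lemmas.
GOLD = {
    "hrad": "hrad", "hradu": "hrad", "hrady": "hrad",
    "hrach": "hrach", "hrachu": "hrach",
    "pes": "pes", "psa": "pes",
}


def toy_stem(word):
    """Deliberately crude stand-in for a stemmer: keep the first 3 letters."""
    return word[:3]


if __name__ == "__main__":
    print(precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5]))
    print(stemming_indices(GOLD, toy_stem))
```

In this toy setup the prefix stemmer merges forms of "hrad" with forms of "hrach" (over-stemming) and keeps "pes" and "psa" apart (under-stemming), which is the kind of trade-off the two indices are meant to expose separately.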

http://www.nusl.cz/ntk/nusl-150220

Identifier	oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:150220
Date	January 2012
Creators	Pytelka, Petr
Contributors	Strossa, Petr, Pinkas, Otakar
Publisher	Vysoká škola ekonomická v Praze
Source Sets	Czech ETDs
Language	Czech
Detected Language	English
Type	info:eu-repo/semantics/masterThesis
Rights	info:eu-repo/semantics/restrictedAccess
