Global ETD Search

1	The effects of indexing strategy-query term combination on retrieval effectiveness in a Swedish full text database
Ahlgren, Per. January 2004.
This thesis deals with Swedish full text retrieval and the problem of morphological variation of query terms in the document database. The study is an information retrieval experiment with a test collection. Since no Swedish test collection was available, one was constructed. It consists of a document database containing 161,336 news articles, and 52 topics with four-graded (0, 1, 2, 3) relevance assessments.
The effects of indexing strategy-query term combination on retrieval effectiveness were studied. Three of the five tested methods involved indexing strategies that used conflation, in the form of normalization. Two of these three combinations also used indexing strategies that employed compound splitting. Normalization and compound splitting were performed by SWETWOL, a morphological analyzer for the Swedish language. A fourth combination attempted to group related terms by right hand truncation of query terms. A search expert performed the truncation. The four combinations were compared to each other and to a baseline combination, in which no attempt was made to counteract the problem of morphological variation of query terms in the document database.
Two situations were examined in the evaluation: the binary relevance situation and the multiple degree relevance situation. In the binary relevance situation, where the three positive relevance degrees (1, 2, 3) were merged into one and precision was used as the evaluation measure, the four alternative combinations outperformed the baseline. The best performing combination was the one that used truncation. This combination performed better than or equal to a median precision value for 41 of the 52 topics. One reason for the relatively good performance of the truncation combination was the capacity of its queries to retrieve different parts of speech.
In the multiple degree relevance situation, where the three positive relevance degrees were retained, retrieval effectiveness was taken to be the accumulated gain the user receives by examining the retrieval result up to given positions. The evaluation measure used was nDCG (normalized discounted cumulated gain). This measure credits retrieval methods that (1) rank highly relevant documents higher than less relevant ones, and (2) rank relevant documents (of any degree) high. With respect to (2), nDCG involves a discount component: the relevance score of a relevant document (of any degree) is discounted, and the discount grows the further down the ranked list of retrieved documents the document appears.
In the multiple degree relevance situation, the five combinations were evaluated under four different user scenarios, where each scenario simulated a certain user type. Again, the four alternative combinations outperformed the baseline for each user scenario. The truncation combination had the best performance under each user scenario. This outcome agreed with the performance result in the binary relevance situation. However, there were also differences between the two relevance situations. For 25 percent of the topics, and with regard to one of the four user scenarios, the set of best performing combinations in the binary relevance situation was disjoint from the set of best performing combinations in the multiple degree relevance situation. In that user scenario, almost all importance was placed on highly relevant documents, and the discount was sharp.
The main conclusion of the thesis is that normalization and right hand truncation (performed by a search expert) enhanced retrieval effectiveness in comparison to the baseline, irrespective of which of the two relevance situations is considered. Further, the three indexing strategy-query term combinations based on normalization were almost as good as the combination that involved truncation. This holds for both relevance situations.
QC 20150813
Keywords: base word form index full text retrieval indexing strategies inflected word form index morphological analysis normalization Swedish SWETWOL truncation user scenarios
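
The nDCG measure described in the abstract above can be sketched as follows. This is a minimal illustration, not the thesis's own implementation: it assumes the common formulation in which the gain at rank i is divided by log_b(i) once i reaches the log base b, and the function names `dcg` and `ndcg`, the parameter `base`, and the sample gains are all hypothetical.

```python
import math


def dcg(gains, base=2.0):
    """Return the discounted cumulated gain at every rank position (1-based).

    Assumption: gains at ranks below the log base are not discounted,
    because log_b(rank) would be smaller than 1 there.
    """
    total = 0.0
    cumulated = []
    for rank, gain in enumerate(gains, start=1):
        discount = math.log(rank, base) if rank >= base else 1.0
        total += gain / discount
        cumulated.append(total)
    return cumulated


def ndcg(ranked_gains, judged_gains, base=2.0):
    """Normalize DCG by the DCG of an ideal ordering of the judged gains."""
    ideal = sorted(judged_gains, reverse=True)[:len(ranked_gains)]
    ideal += [0] * (len(ranked_gains) - len(ideal))  # pad if few judged docs
    actual_dcg = dcg(ranked_gains, base)
    ideal_dcg = dcg(ideal, base)
    return [a / i if i > 0 else 0.0 for a, i in zip(actual_dcg, ideal_dcg)]


# Hypothetical relevance degrees (0-3) of the documents a system returned,
# in rank order, and of all judged documents for the same topic.
returned = [0, 1, 2, 3]
judged = [3, 2, 1, 0]
scores = ndcg(returned, judged)
```

Under these assumptions a smaller `base` gives a sharper discount, and mapping the relevance degrees to different gain values before calling `ndcg` shifts the weight placed on highly relevant documents. A user scenario of the kind the abstract mentions could plausibly be modelled this way, although the abstract does not state the exact parameters used.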
2	A Probabilistic Tagging Module Based on Surface Pattern Matching
Eklund, Robert. January 1993.
A problem with automatic tagging and lexical analysis is that it is never 100% accurate. In order to arrive at better figures, one needs to study the character of what is left untagged by automatic taggers. In this paper, the untagged residue output by the automatic analyser SWETWOL (Karlsson 1992) at Helsinki is studied. SWETWOL assigns tags to words in Swedish texts mainly through dictionary lookup. The contents of the untagged residue files are described and discussed, and possible ways of solving different problems are proposed.
One method of tagging residual output is proposed and implemented: the left-stripping method, in which an untagged word is stripped of its left-most letter and searched for in a dictionary; if it is found, it is tagged according to the information in that dictionary. If the stripped word is not found in the dictionary, a match is searched for in ending lexica containing statistical information about the word classes associated with that particular word form (i.e., final letter cluster, be this a grammatical suffix or not) and the relative frequency of each word class. If a match is found, the word is given graduated tagging according to the statistical information in the ending lexicon. If a match is not found, the word is stripped of what is now its left-most letter and is recursively searched for in the dictionary and the ending lexica (in that order).
The ending lexica employed in this paper are retrieved from a reversed version of Nusvensk Frekvensordbok (Allén 1970), and contain endings of between one and seven letters. The contents of the ending lexica are described and discussed to some extent. The programs working according to the principles described are run on files of untagged residual output.
Appendices include, among other things, LISP source code, untagged and tagged files, the ending lexica containing one and two letter endings, and excerpts from ending lexica containing three to seven letters.
Keywords: Tagging computational linguistics word-class probabilistic morphology swetwol statistical corpus linguistics corpora endings suffixes word class frequency lexical analysis
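
The left-stripping method described in the abstract above can be sketched roughly as follows. This is an illustration in Python rather than the paper's LISP code, and it rests on assumptions: the function name `left_strip_tag`, the toy dictionary, the toy ending lexicon, and the relative frequencies are all invented for the example, and the recursion in the abstract is written as an equivalent loop.

```python
# Hypothetical toy data; the real dictionary and ending lexica are not
# reproduced in the abstract, and these frequencies are invented.
TOY_DICTIONARY = {"bil": "N"}                 # word form -> word class
TOY_ENDINGS = {"ade": {"V": 0.9, "N": 0.1}}   # ending -> class frequencies


def left_strip_tag(word, dictionary, ending_lexicon):
    """Tag an untagged word by repeatedly stripping its left-most letter.

    After each strip, the remainder is looked up first in the dictionary,
    then in the ending lexicon. A dictionary hit gives one certain tag,
    an ending hit gives graduated tags with relative frequencies.
    Returns an empty dict if nothing matches.
    """
    for start in range(1, len(word)):
        remainder = word[start:]
        if remainder in dictionary:
            return {dictionary[remainder]: 1.0}
        if remainder in ending_lexicon:
            return dict(ending_lexicon[remainder])
    return {}


# "sportbil" is stripped down to "bil", which is in the toy dictionary.
compound_tags = left_strip_tag("sportbil", TOY_DICTIONARY, TOY_ENDINGS)
# "googlade" is stripped down to the ending "ade".
ending_tags = left_strip_tag("googlade", TOY_DICTIONARY, TOY_ENDINGS)
```

Checking the dictionary before the ending lexicon at every step mirrors the order stated in the abstract, so a full dictionary match on a shorter remainder is only reached if no longer remainder matched either resource first.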
