Collocations are groups of words which are co-occurring more often than appearing separately. They also include phrases that give a new meaning to a group of unrelated words. This thesis is aimed to find collocations in large data and to create a database that allows their retrieval. The Pointwise Mutual Information, a value based on word frequency, is computed for finding the collocations. Words with the highest value of PMI are considered candidates for good collocations. Chosen collocations are stored in a database in a format that allows searching with Apache Lucene. A part of the thesis is to create a Web user interface as a quick and easy way to search collocations. If this service is fast enough and the collocations are good, translators will be able to use it for finding proper equivalents in the target language. Students of a foreign language will also be able to use it to extend their vocabulary. Such database will be created independently in several languages including Czech and English. Powered by TCPDF (www.tcpdf.org)
Identifer | oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:341207 |
Date | January 2014 |
Creators | Helcl, Jindřich |
Contributors | Hajič, Jan, Mareček, David |
Source Sets | Czech ETDs |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/masterThesis |
Rights | info:eu-repo/semantics/restrictedAccess |
Page generated in 0.002 seconds