
Velký mnohojazyčný korpus (Large Multilingual Corpus)

This thesis introduces the W2C Corpus, which covers 97 languages with more than 10 million words each, for a total of 10.5 billion words. The corpus was built by crawling the Internet, and this work describes the methods and tools used to construct it. The process consisted of building an initial corpus from Wikipedia, developing a language recognizer for 122 languages, implementing a distributed system for crawling and parsing web pages, and finally removing duplicates. The thesis closes with a comparative analysis of Wikipedia and Internet texts, based on basic statistics such as average word and sentence length, conditional entropy, and perplexity.
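The abstract does not say how conditional entropy and perplexity were computed. As an illustration only, the sketch below assumes a bigram model with maximum-likelihood estimates over whitespace-separated tokens; the function names `conditional_entropy` and `perplexity` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def conditional_entropy(tokens):
    """Estimate H(w_i | w_{i-1}) in bits from bigram counts.

    Assumption: plain maximum-likelihood estimates, no smoothing.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])  # counts of each word used as a left context
    total = sum(bigrams.values())
    entropy = 0.0
    for (prev, _cur), count in bigrams.items():
        p_joint = count / total            # P(prev, cur)
        p_cond = count / histories[prev]   # P(cur | prev)
        entropy -= p_joint * math.log2(p_cond)
    return entropy


def perplexity(tokens):
    """Perplexity is 2 raised to the conditional entropy (in bits)."""
    return 2 ** conditional_entropy(tokens)


if __name__ == "__main__":
    sample = "a b a c a b a c a".split()
    print(conditional_entropy(sample), perplexity(sample))
```

Under these assumptions a fully predictable sequence has entropy 0 and perplexity 1. Higher values mean the next word is harder to predict from the previous one, which is what makes these statistics usable for comparing text sources.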

Identifier: oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:313914
Date: January 2011
Creators: Majliš, Martin
Contributors: Žabokrtský, Zdeněk; Spousta, Miroslav
Source Sets: Czech ETDs
Language: English
Detected Language: English
Type: info:eu-repo/semantics/masterThesis
Rights: info:eu-repo/semantics/restrictedAccess
