This thesis introduces the W2C Corpus, which covers 97 languages with more than 10 million words for each language and a total size of 10.5 billion words. The corpus was built by crawling the Internet, and this work describes the methods and tools used in its construction. The complete process consisted of building an initial corpus from Wikipedia, developing a language recognizer for 122 languages, implementing a distributed system for crawling and parsing web pages, and finally removing duplicate content. The thesis concludes with a comparative analysis of Wikipedia texts and texts from the Internet, based on basic statistics such as average word and sentence length, conditional entropy, and perplexity.
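As a rough illustration of the conditional entropy and perplexity statistics mentioned in the abstract, the sketch below computes the character-level conditional entropy H(Y|X) of a bigram model and the derived perplexity 2^H. This is a minimal example under assumed choices (character bigrams, base-2 logarithm, the sample string, and the function names are all illustrative), not the actual implementation described in the thesis.

```python
import math
from collections import Counter

def conditional_entropy(text: str) -> float:
    """Conditional entropy H(Y|X) of a character bigram model, in bits.

    H(Y|X) = -sum over (x, y) of p(x, y) * log2 p(y | x)
    """
    bigrams = Counter(zip(text, text[1:]))   # counts of adjacent pairs (x, y)
    unigrams = Counter(text[:-1])            # counts of each first character x
    total = sum(bigrams.values())            # total number of bigrams
    h = 0.0
    for (x, y), n_xy in bigrams.items():
        p_xy = n_xy / total                  # joint probability p(x, y)
        p_y_given_x = n_xy / unigrams[x]     # conditional probability p(y | x)
        h -= p_xy * math.log2(p_y_given_x)
    return h

def perplexity(entropy_bits: float) -> float:
    """Perplexity is 2 raised to the entropy in bits."""
    return 2 ** entropy_bits

# Hypothetical sample text; the thesis applies such measures to whole corpora.
sample = "the internet corpus contains many languages"
h = conditional_entropy(sample)
print(f"H(Y|X) = {h:.3f} bits, perplexity = {perplexity(h):.2f}")
```

On a real corpus these estimates would be computed over millions of words, where differences in conditional entropy and perplexity between Wikipedia text and general web text become meaningful.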
Identifier | oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:313914 |
Date | January 2011 |
Creators | Majliš, Martin |
Contributors | Žabokrtský, Zdeněk; Spousta, Miroslav |
Source Sets | Czech ETDs |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/masterThesis |
Rights | info:eu-repo/semantics/restrictedAccess |