The purpose of this study is to estimate and compare the entropy and redundancy of written English and Swedish. We also investigate and compare the entropy and redundancy of Twitter language. This is done by extracting sequences of n consecutive characters, called n-grams, and calculating their frequencies. Exact values cannot be obtained, because the amount of text is finite while the entropy is defined as a limit as the text length tends to infinity. However, we do obtain estimates for n = 1,...,6, and the results show that written Swedish has higher entropy and lower redundancy than written English. When comparing Twitter with the standard languages, we find that Twitter language has higher entropy and lower redundancy.
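The n-gram procedure described in the abstract can be illustrated with a minimal Python sketch. It assumes a Shannon-style estimate in which the entropy per character for a given n is the difference between the n-gram and (n-1)-gram block entropies, and in which redundancy is measured against log2 of the alphabet size. The function names, the `alphabet_size` parameter, and the absence of any text preprocessing are illustrative assumptions; the thesis's actual estimator and corpus handling are not specified in this record.

```python
from collections import Counter
from math import log2


def ngram_counts(text, n):
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def block_entropy(text, n):
    """Empirical entropy, in bits, of the n-gram distribution of text."""
    if n == 0:
        return 0.0
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def conditional_entropy(text, n):
    """Assumed estimate F_n = H_n - H_(n-1), in bits per character."""
    return block_entropy(text, n) - block_entropy(text, n - 1)


def redundancy(text, n, alphabet_size):
    """Assumed definition: 1 - F_n / log2(alphabet_size)."""
    return 1.0 - conditional_entropy(text, n) / log2(alphabet_size)
```

Calling `conditional_entropy(corpus, n)` for n = 1,...,6 on a hypothetical string `corpus` would give one estimate per n. On a finite text the n-gram counts become sparse as n grows, so such estimates are only approximations of the limiting value, which matches the caveat in the abstract.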
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:lnu-64952 |
Date | January 2017 |
Creators | Juhlin, Sanna |
Publisher | Linnéuniversitetet, Institutionen för matematik (MA) |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |