Specificity is an important factor in text analysis. While much research on sentence specificity has focused on news, very little is known about press releases. This study examines specificity in press releases, which are journalistic documents that companies share with the press and other media outlets. We analyze press releases about digital transformation written by pump companies and develop tools for automatically measuring sentence specificity. The goals of the research are to 1) explore the effects of combining training data, 2) analyze features for specificity prediction, and 3) compare the effectiveness of classification and probability estimation. Through experiments on various combinations of training data, we find that adding news data to the model effectively improves probability estimation but has no noticeable effect on classification. In terms of features, we find that sentence length plays an essential role in specificity prediction. Removing twelve insignificant features yields a model that runs faster while achieving comparable scores. We also find that both classification and probability estimation have drawbacks. For probability estimation, models can score well by only making predictions around the threshold. Binary classification depends on the threshold, so the threshold must be chosen with care. Moreover, classification scores cannot filter out models that make unreliable judgments about high- and low-specificity sentences.
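The drawbacks described at the end of the abstract can be illustrated with a small toy example. The sketch below is not taken from the thesis: the gold scores, the two hypothetical models (`hedging` and `confident`), and the choice of mean absolute error as the probability-estimation score are all assumptions made for illustration only.

```python
# Toy illustration (all numbers are invented, not results from the thesis).

def mean_absolute_error(gold, predicted):
    """Average absolute gap between gold and predicted specificity scores."""
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)

def accuracy(gold, predicted, threshold=0.5):
    """Share of sentences whose binary label (score >= threshold) matches."""
    gold_labels = [g >= threshold for g in gold]
    pred_labels = [p >= threshold for p in predicted]
    matches = sum(g == p for g, p in zip(gold_labels, pred_labels))
    return matches / len(gold)

# Hypothetical gold specificity scores for six sentences (1 = very specific).
gold = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

# Hypothetical model that only predicts values close to the 0.5 threshold.
hedging = [0.55, 0.55, 0.55, 0.45, 0.45, 0.45]

# Hypothetical model that commits to extreme scores and is sometimes wrong.
confident = [1.0, 1.0, 0.0, 0.0, 0.0, 1.0]

for name, scores in [("hedging", hedging), ("confident", confident)]:
    print(
        name,
        "MAE:", round(mean_absolute_error(gold, scores), 3),
        "accuracy@0.5:", round(accuracy(gold, scores, 0.5), 3),
        "accuracy@0.6:", round(accuracy(gold, scores, 0.6), 3),
    )
```

In this toy setting, the hedging model gets a lower error than the confident one and perfect accuracy at a threshold of 0.5, even though it never separates highly specific sentences from barely specific ones. Moving the threshold to 0.6 changes its accuracy sharply, which shows how strongly the classification score depends on where the threshold is set.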
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-413515 |
Date | January 2020 |
Creators | He, Tiantian |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |