In this thesis we present a new way of extracting relevant data from texts. We use the method presented in the paper by Patwardhan and Rilo (2007), with improvements of our own. Our approach modifes the input to the support vector machine, to construct a self-trained relevant sentence classi er. This classffer is used to identify relevant sentences on the MUC-4 terrorism corpus.We modify the input by removing stopwords, converting words to its stem and only using words that occur at least three times in the corpus. We also changed how each word is weighted, using TF x IDF as weighting function. By using the relevant sentence classiffer together with domain relevant extraction patterns, we achieved higher performance on the MUC-4 terrorism corpus than the original model.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-32661 |
Date | January 2011 |
Creators | Karlsson, Thomas |
Publisher | KTH, Skolan för informations- och kommunikationsteknik (ICT) |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Relation | Trita-ICT-EX ; 52 |
Page generated in 0.017 seconds