The client of the project has problems with complex queries and noise when querying their stream of five million news articles per day. This results in much manual work when sorting and pruning the search results of their queries. Instead of using direct text matching, the approach of the project was to use a topic model to describe articles in terms of the topics covered and to use this new information to sort the articles. An online version of the topic model Latent Dirichlet Allocation was implemented using online variational Bayes inference to handle streamed data. Using 100 dimensions, topics such as sports and politics emerged during training on a simulated stream of 1.7 million articles. These topics were used to sort articles based on context. The implementation was found to be accurate enough to be useful for the client, as well as fast and stable enough to be a feasible solution to the problem.
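The sorting step described in the abstract can be illustrated with a minimal sketch. It assumes each article has already been reduced to a vector of topic weights by the topic model. The three-topic vectors, the article names, the helper functions `cosine` and `rank_by_topic`, and the use of cosine similarity as the ranking score are all hypothetical choices made for illustration; the record does not state which ranking measure the thesis used, and the thesis itself used 100 topics.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def rank_by_topic(articles, query_topics):
    """Return article ids ordered from most to least similar to the query."""
    scored = [
        (cosine(topics, query_topics), article_id)
        for article_id, topics in articles.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article_id for _, article_id in scored]


# Hypothetical topic mixtures over (sports, politics, finance).
# These numbers are made up; a trained model would supply them.
articles = {
    "match-report": [0.90, 0.05, 0.05],
    "election-recap": [0.05, 0.85, 0.10],
    "stadium-funding-vote": [0.45, 0.45, 0.10],
}

# A query that is purely about the (assumed) sports topic.
query = [1.0, 0.0, 0.0]
ranking = rank_by_topic(articles, query)
print(ranking)
```

Ranking on topic weights rather than on matched words is what lets an article about a stadium funding vote rank between a pure sports article and a pure politics article, even if none of them share the literal query terms.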
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-222429 |
Date | January 2014 |
Creators | Wedenberg, Kim; Sjöberg, Alexander |
Publisher | Uppsala universitet, Institutionen för informationsteknologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Relation | UPTEC F, 1401-5757 ; 14010 |