Efficient techniques for streaming cross document coreference resolution

Large text streams are commonplace: news organisations are constantly producing stories and people are constantly writing social media posts. These streams should be analysed in real time so that useful information can be extracted and acted upon immediately. When natural disasters occur, people want to be informed; when companies announce new products, financial institutions want to know; and when celebrities do things, their legions of fans want to feel involved. In all these examples people care about getting information in real time (with low latency). These streams are massively varied, and people’s interests are typically classified by the entities they are interested in. Organising a stream by the entity being referred to would help people extract the information useful to them. This is a difficult task: fans of ‘Captain America’ films will not want to be incorrectly told that ‘Chris Evans’ (the main actor) was appointed to host ‘Top Gear’ when it was a different ‘Chris Evans’. People who use local idiosyncrasies, such as referring to their home county (‘Cornwall’) as ‘Kernow’ (the Cornish for ‘Cornwall’, which has entered the local lexicon), should not be forced to change their language when finding out information about their home. This thesis addresses a core problem for real-time entity-specific NLP: streaming cross document coreference resolution (CDC), that is, how to automatically identify all the entities mentioned in a stream in real time. It addresses two significant problems for streaming CDC: there is no representative dataset, and existing systems consume more resources over time. A new technique for creating datasets is introduced and applied to social media (Twitter) to create a large (6M mentions) and challenging new CDC dataset that contains a much more varied range of entities than typical newswire streams. Existing systems are not able to keep up with large data streams. This problem is addressed with a streaming CDC system that stores a constant-sized set of mentions. New techniques to maintain the sample are introduced; they significantly outperform existing ones, maintaining 95% of the performance of a non-streaming system while using only 20% of the memory.

https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.738811

Identifier	oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:738811
Date	January 2017
Creators	Shrimpton, Luke William
Contributors	Heafield, Kenneth ; Osborne, Miles
Publisher	University of Edinburgh
Source Sets	Ethos UK
Detected Language	English
Type	Electronic Thesis or Dissertation
Source	http://hdl.handle.net/1842/28895
