Automated event extraction from free text remains an open problem, particularly when the goal is to identify all relevant events. Manual extraction is currently the only alternative that is both comprehensive and reliable. What is needed, therefore, is a system that can comprehensively extract the events reported in news articles (high recall) and that also scales to a large number of articles.
In this dissertation, we explore various methods for developing an event extraction system that can mitigate these challenges. We investigate three major problems related to event extraction. (i) What are the strengths and weaknesses of automated event extractors? A thorough understanding of what can be automated with high success, and of what leads to common pitfalls, is crucial before we can develop a superior event extraction system. (ii) How can we build a hybrid event extraction system that bridges the gap between manual and automated event extraction? Hybrid extraction is a semi-automated approach that uses an ecosystem of machine learning models along with a carefully designed user interface for extracting events. Because this method is semi-automated, it also requires a meticulous understanding of user behavior in order to identify the tasks that humans can perform with ease while diverting the more tedious tasks to the machine learning methods. (iii) Finally, we explore methods for displaying extracted events that could simplify the analytical and inference generation processes for an analyst. In particular, we aim to develop visualizations that allow analysts to perform macro- and micro-level analysis of significant societal events.
Ph. D.
News articles provide information about who did what to whom, when, where, and why. Extracting this structured information from news articles can allow scientific evaluation of widely believed information. However, curating databases of this structured information is not a trivial task. Currently there are two main approaches: manual and automated. Manual curation is not scalable due to labor costs: adding more humans to perform analysis is prohibitively expensive and time consuming. The alternative approach is ‘Automated Extraction’, wherein machine learning algorithms extract events on their own without any human assistance. Even though this approach can easily scale to work with a large number of articles, it frequently misclassifies events.
In this dissertation, we present EMBERS AutoGSR, a framework for comprehensively extracting ‘protest’ events reported in news articles using Hybrid Event Extraction. In the hybrid approach, we use an ecosystem of Filtering, Ranking, and Recommendation models to determine whether an article is reporting a protest and, if so, to identify and encode specific characteristics of the event, such as who protested, when, where, and why. These extracted events are then displayed on an interactive web-based interface that allows manual validation. This manual validation, in turn, helps the automated event extractors learn and evolve from user feedback and error correction. The interface is carefully designed to minimize the manual effort required for user validation, thereby making it feasible to work with a large number of articles.
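The filter, rank, and recommend flow described above can be illustrated with a minimal Python sketch. This is not the EMBERS AutoGSR implementation: the keyword list, the scoring rule, the `Candidate` class, and all function names are hypothetical stand-ins for the trained models, chosen only to show how the three stages hand work to one another before a human validates the result.

```python
from dataclasses import dataclass, field

# Hypothetical keyword list standing in for a trained filtering model.
PROTEST_TERMS = ("protest", "strike", "march", "demonstration", "rally")


@dataclass
class Candidate:
    """An article that passed filtering, plus what later stages attach to it."""
    text: str
    score: float = 0.0
    suggestions: dict = field(default_factory=dict)


def filter_articles(articles):
    """Filtering: keep only articles that mention at least one protest term."""
    return [Candidate(a) for a in articles
            if any(term in a.lower() for term in PROTEST_TERMS)]


def rank(candidates):
    """Ranking: sort by a toy relevance score (count of protest-term hits)."""
    for c in candidates:
        lowered = c.text.lower()
        c.score = sum(lowered.count(term) for term in PROTEST_TERMS)
    return sorted(candidates, key=lambda c: c.score, reverse=True)


def recommend(candidate, known_places):
    """Recommendation: pre-fill event fields for a human to confirm or correct."""
    lowered = candidate.text.lower()
    candidate.suggestions = {
        "where": [p for p in known_places if p.lower() in lowered],
    }
    return candidate


articles = [
    "Teachers strike and protest in Caracas over pay.",
    "Stock markets closed higher on Friday.",
    "Students march in Bogota.",
]
queue = rank(filter_articles(articles))
first = recommend(queue[0], known_places=["Caracas", "Bogota"])
```

In a hybrid setting, the suggestions in `first.suggestions` would be shown to a validator rather than stored directly, and the validator's corrections would become training signal for the models.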
EMBERS AutoGSR operated 24x7 for a year, from October 2015 through September 2016, during which it extracted protest events from news articles collected from 19 countries across 8 languages. These extracted events were validated by 12 subject matter experts. The system was evaluated by an independent third party, the MITRE Corporation, which compared EMBERS AutoGSR events with events that were manually extracted by its team of political scientists. AutoGSR achieved a recall of 0.82 and reduced the manual effort required for event extraction by 72%, indicating that the system is both reliable and scalable.
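For readers unfamiliar with the metric, recall is the fraction of reference events that the system also found. The sketch below assumes events can be matched by exact identifier, which is a simplification. The abstract does not describe the matching procedure used in the evaluation, and the event identifiers here are made up for illustration.

```python
def recall(system_events, reference_events):
    """Fraction of reference events that also appear in the system output."""
    reference = set(reference_events)
    if not reference:
        raise ValueError("reference set is empty; recall is undefined")
    return len(reference & set(system_events)) / len(reference)


# Hypothetical identifiers: the system finds 2 of the 4 reference events.
example = recall({"e1", "e2", "e3"}, {"e1", "e2", "e4", "e5"})
```

Events the system reports that are absent from the reference set (such as `e3` here) do not affect recall; they would lower precision instead.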
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/82926 |
Date | 26 April 2018 |
Creators | Saraf, Parang |
Contributors | Computer Science, Ramakrishnan, Naren, House, Leanna L., Corley, Courtney, North, Christopher L., Lu, Chang-Tien |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Dissertation |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |