1 |
Investigating the Extractive Summarization of Literary Novels
Ceylan, Hakan 12 1900
Abstract
Due to the vast amount of information we are faced with, summarization has become a critical necessity of everyday human life. Given that a large fraction of the electronic documents available online and elsewhere consists of short texts such as Web pages, news articles, and scientific reports, the focus of natural language processing techniques to date has been on automating methods that target short documents. We are, however, witnessing a change: an increasingly large number of books is becoming available in electronic format. This means that language processing techniques able to handle very large documents such as books are becoming increasingly important. This thesis addresses the problem of summarizing novels, which are long and complex literary narratives. While a significant body of research has been carried out on automatic text summarization, most of this work has been concerned with short documents, with a particular focus on news stories. Novels differ from news stories in both length and genre, and consequently require different summarization techniques. This thesis attempts to close this gap by analyzing a new domain for summarization, and by building unsupervised and supervised systems that effectively take into account the properties of long documents and outperform the traditional extractive summarization systems that typically address the news genre.
|
2 |
A Study of Automatic Summarization Methods (自動摘要方法之研究與探討)
吳家威, Wu, Chia-wei Unknown Date
As the Internet has grown, so has the amount of information people can access. Improving the efficiency with which information is obtained has therefore become an important research topic. The purpose of an automatic summarization system is to help users read efficiently, and its fundamental problem is how to find the important information in an article and present it to the user. This thesis applies three summarization methods to English news documents. The first uses an ontology to build the likely topic information for articles in a domain, and uses that information to select important paragraphs as the summary. The second builds a domain ontology, uses the tags defined by the ontology to construct summary templates, then uses those templates to search for the required information and output the summary. The third uses a variety of features to find the more important paragraphs in an article. In addition, we classify articles by topic and exploit the fact that features behave differently across topics to improve the original feature-based summarization method. This thesis presents three methods that obtain an article's topic information in different ways and use it to present the article's more important content; experimental evaluation shows that each achieves reasonable results. / In the past decade, the explosively growing number of online articles has made efficient information gathering a challenging necessity. We need ways to absorb the information contained in news articles effectively. Automatically providing summaries of articles is one way to save people time. The essential problems of automatic summarization are how to identify the useful information and how to present the results to readers. I compare and analyze three kinds of summarization methods. The first method constructs domain-dependent knowledge using an ontology approach, then uses the ontology to gather the main topics, and chooses the desired proportion of paragraphs as the summary based on the gathered topical information. The second method is similar to that taken by information-extraction systems. I organize semantic tags into an ontological structure, and the summarization system learns tag patterns for creating summaries from tagged data. The summarization system creates summaries by extracting useful information from a news article and replacing the semantic tags in selected tag patterns with the extracts. The third method analyzes the effects of several previously proposed features for summarization under different situations. The most important observation is that the effectiveness of these features depends on the topics of the news articles. Hence, I collect statistical information about the features for different possible news topics, and apply this conditional probabilistic information when extracting summaries.
The effectiveness of the proposed methods varies from case to case, but is believed to be satisfactory based on the experimental results.
|
3 |
WHISK: Web Hosted Information Into Summarized Knowledge
Wu, Jiewen 01 July 2016
Today's online content is increasing at an alarming rate that exceeds users' ability to consume it. Modern search techniques allow users to enter keyword queries to find content they wish to see. However, such techniques break down when users freely browse the internet without knowing exactly what they want. Users may have to invest an unnecessarily long time reading content to see if they are interested in it. Automatic text summarization helps relieve this problem by creating synopses that significantly reduce the text while preserving the key points. Steffen Lyngbaek created the SPORK summarization pipeline to address content overload in Reddit comment threads. Lyngbaek adapted the Opinosis graph model for extractive summarization and combined it with agglomerative hierarchical clustering and the Smith-Waterman algorithm to perform multi-document summarization on Reddit comments. This thesis presents WHISK as a pipeline for general multi-document text summarization based on SPORK. A generic data model in WHISK allows new drivers to be created so that different platforms can work with the pipeline. In addition to the existing Opinosis graph model adapted in SPORK, WHISK introduces two simplified graph models for the pipeline. The simplified models remove unnecessary restrictions inherited from the Opinosis graph's abstractive summarization origins. Performance measurements and a study with Digital Democracy compare the two new graph models against the Opinosis graph model. Additionally, the study evaluates WHISK's ability to generate pull quotes from political discussions as summaries.
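The abstract names the Smith-Waterman algorithm without saying how the pipeline applies it. As a rough illustration only, the sketch below computes the classic Smith-Waterman local-alignment score between two token sequences. The function name, the word-level tokenization, and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions made for this example and are not taken from SPORK or WHISK.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between token sequences a and b.

    Scoring values are illustrative assumptions, not those used by SPORK/WHISK.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


first = "the new policy is bad".split()
second = "i think the new policy is great".split()
print(smith_waterman(first, second))
```

A high score indicates that two comments share a long, nearly identical run of words, which is one plausible signal of redundancy between sentences.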
|
4 |
Summarization of very large spatial dataset
Liu, Qing, Computer Science & Engineering, Faculty of Engineering, UNSW January 2006
Nowadays a large number of applications, such as digital library information retrieval, business data analysis, CAD/CAM, multimedia applications with images and sound, real-time process control and scientific computation, work with data sets of gigabytes, terabytes or even petabytes. Because data distributions are too large to be stored exactly, maintaining compact and accurate summarized information about the underlying data is of crucial importance. The summarization problem for Level 1 (disjoint and non-disjoint) topological relationships has been well studied over the past few years. However, spatial database users are often interested in a much richer set of spatial relations, such as contains. Little work has been done on summarization for Level 2 topological relationships, which include the contains, contained, overlap, equal and disjoint relations. We study the problem of effective summarization to represent the underlying data distribution in order to answer window queries for Level 2 topological relationships. The cell-density based approach has been demonstrated to be an effective way to address this problem. The challenges are the accuracy of the results and the storage space required, which should be linearly proportional to the number of cells to be practical. In this thesis, we present several novel techniques to effectively construct cell-density based spatial histograms. Based on the proposed framework, exact results can be obtained in constant time for aligned window queries. To minimize the storage space of the framework, an approximation algorithm with approximation ratio 19/12 is presented, while the problem is shown to be NP-hard in general. Because the framework requires only storage space linearly proportional to the number of cells, it is practical for many popular real datasets. To conform to a limited storage space, effective histogram construction and query algorithms are proposed which provide approximate results with high accuracy.
The problem for non-aligned window queries is also investigated, and techniques based on unevenly partitioned space are developed to support non-aligned window queries. Finally, we extend our techniques to 3D space. Our extensive experiments against both synthetic and real-world datasets demonstrate the efficiency of the algorithms developed in this thesis.
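To give a feel for how a cell-based histogram can answer an aligned window query in constant time, here is a deliberately simplified sketch. It assumes each object is a point counted in exactly one grid cell and uses a two-dimensional prefix sum; the thesis's histograms for Level 2 relations between objects with extent necessarily store more than this. The names `build_prefix` and `window_count` are hypothetical.

```python
def build_prefix(counts):
    """Build a 2-D prefix-sum table from per-cell object counts.

    prefix[r][c] holds the number of objects in cells with row < r and col < c.
    Assumes point objects, one cell each (a simplification for illustration).
    """
    rows, cols = len(counts), len(counts[0])
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            prefix[r + 1][c + 1] = (counts[r][c] + prefix[r][c + 1]
                                    + prefix[r + 1][c] - prefix[r][c])
    return prefix


def window_count(prefix, r1, c1, r2, c2):
    """Count objects in the cell-aligned window rows r1..r2-1, cols c1..c2-1.

    Four table lookups, so the cost is constant regardless of window size.
    """
    return prefix[r2][c2] - prefix[r1][c2] - prefix[r2][c1] + prefix[r1][c1]


counts = [[1, 0, 2],
          [0, 3, 1],
          [4, 0, 0]]
prefix = build_prefix(counts)
print(window_count(prefix, 1, 1, 3, 3))  # objects in the lower-right 2x2 block
```

The table has one entry per cell (plus a border), which matches the abstract's requirement that storage grow linearly with the number of cells.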
|
5 |
Shape-Time Photography
Freeman, William T., Zhang, Hao 10 January 2002
We introduce a new method to describe, in a single image, changes in shape over time. We acquire both range and image information with a stationary stereo camera. From the pictures taken, we display a composite image consisting of the image data from the surface closest to the camera at every pixel. This reveals the 3-d relationships over time by easy-to-interpret occlusion relationships in the composite image. We call the composite a shape-time photograph. Small errors in depth measurements cause artifacts in the shape-time images. We correct most of these using a Markov network to estimate the most probable front surface, taking into account the depth measurements, their uncertainties, and layer continuity assumptions.
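The compositing rule described above (at every pixel, show the image data from the surface closest to the camera) can be sketched in a few lines. This is a minimal illustration that assumes each frame is given as a pair of equally sized 2-D lists, an image and a depth map where smaller values are nearer; the function name and data layout are hypothetical, and the Markov-network correction of depth errors is not included.

```python
def shape_time_composite(frames):
    """Combine frames into one image by keeping, per pixel, the nearest surface.

    frames: list of (image, depth) pairs; image and depth are 2-D lists of the
    same size, and a smaller depth means closer to the camera (an assumption).
    """
    first_image, _ = frames[0]
    height, width = len(first_image), len(first_image[0])
    composite = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Pick the frame whose measured depth at (y, x) is smallest.
            nearest_image, _ = min(frames, key=lambda frame: frame[1][y][x])
            composite[y][x] = nearest_image[y][x]
    return composite


frames = [
    ([["a", "a"]], [[1.0, 5.0]]),  # frame 0: near at the left pixel
    ([["b", "b"]], [[2.0, 3.0]]),  # frame 1: near at the right pixel
]
print(shape_time_composite(frames))
```

Because the choice is made independently at each pixel, a small depth error can flip which frame wins, which is the kind of artifact the abstract says the Markov network is used to correct.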
|
6 |
Discovering and summarizing email conversations
Zhou, Xiaodong 05 1900
With the ever-increasing popularity of email, it is now very common for groups of people to discuss specific issues, events or tasks by email. Those discussions can be viewed as email conversations and are valuable to the user as a personal information repository. For instance, in the 10 minutes before a meeting, a user may want to quickly go through a previous email discussion that is about to be discussed in the meeting. In this case, rather than reading each email one by one, it is preferable to read a concise summary of the previous discussion that captures the major information. In this thesis, we study the problem of discovering and summarizing email conversations. We believe that our work can greatly help users manage their email folders. However, the characteristics of email conversations, e.g., lack of synchronization, conversational structure and informal writing style, make this task particularly challenging. We tackle this task by considering the following aspects: discovering the emails in one conversation, capturing the conversation structure and summarizing the email conversation. We first study how to discover all emails belonging to one conversation. Specifically, we study the hidden email problem, which is important for email summarization and other applications but has not been studied before. We propose a framework to discover and regenerate hidden emails. The empirical evaluation shows that this framework is accurate and scales to large folders. Second, we build a fragment quotation graph to capture email conversations. The hidden emails belonging to each conversation are also included in the corresponding graph. Based on the quotation graph, we develop a novel email conversation summarizer, ClueWordSummarizer. A comparison with a state-of-the-art email summarizer as well as with a popular multi-document summarizer shows that ClueWordSummarizer obtains higher accuracy in most cases.
Furthermore, to address the characteristics of email conversations, we study several ways to improve ClueWordSummarizer by considering more lexical features. The experiments show that many of those improvements can significantly increase accuracy, especially those based on subjective words and phrases.
|
7 |
Summarizing Spoken Documents Through Utterance Selection
Zhu, Xiaodan 02 September 2010
The inherently linear and sequential property of speech raises the need for
ways to better navigate through spoken documents. The strategy of navigation I
focus on in this thesis is summarization, which aims to identify important excerpts
in spoken documents.
A basic characteristic that distinguishes speech summarization from traditional
text summarization is the availability and utilization of speech-related features.
Most previous research, however, has addressed this source from the perspective of
descriptive linguistics, in considering only such prosodic features that appear in that
literature. The experiments in this dissertation suggest that incorporating prosody
does help but its usefulness is very limited—much less than has been suggested in
some previous research. We reassess the role of prosodic features vs. features arising
from speech recognition transcripts, as well as baseline selection in error-prone
and disfluency-filled spontaneous speech. These problems interact with each other,
and isolated observations have hampered a comprehensive understanding to date.
The effectiveness of these prosodic features is largely confined because of their
difficulty in predicting content relevance and redundancy. Nevertheless, untranscribed
audio does contain more information than just prosody. This dissertation
shows that collecting statistics from far more complex acoustic patterns does allow
for estimating state-of-the-art summarization models directly. To this end, we propose
an acoustics-based summarization model that is estimated directly on acoustic
patterns. We empirically determine the extent to which this acoustics-based model
can effectively replace ASR-based models.
The extent to which written sources can benefit speech summarization has
also been limited, namely to noisy speech recognition transcripts. Predicting the
salience of utterances can indeed benefit from more sources than raw audio only.
Since speaking and writing are two basic modes of communication and are by nature
closely related to each other, in many situations speech is accompanied by relevant
written text. Richer semantics conveyed in the relevant written text provides
additional information over speech by itself. This thesis utilizes such information
in content selection to help identify salient utterances in the corresponding speech
documents. We also employ such richer content to find the structure of spoken
documents—i.e., subtopic boundaries—which may in turn help summarization.
|
9 |
Multi-document Summarization System Using Rhetorical Information
Alliheedi, Mohammed 03 July 2012
Over the past 20 years, research in automated text summarization has grown significantly in the field of natural language processing. The massive availability of scientific and technical information on the Internet, including journals, conferences, and news articles, has attracted the interest of various groups of researchers working in text summarization. These researchers include linguists, biologists, database researchers, and information retrieval experts. However, because the information available on the web is ever expanding, reading the sheer volume of information is a significant challenge. To deal with this volume, users need appropriate summaries to help them manage their information needs more efficiently. Although many automated text summarization systems have been proposed in the past twenty years, none of these systems has incorporated the use of rhetoric. To date, most automated text summarization systems have relied only on statistical approaches. These approaches do not take into account other features of language such as antimetabole and epanalepsis. Our hypothesis is that rhetoric can provide this type of additional information. This thesis addresses these issues by investigating the role of rhetorical figuration in detecting the salient information in texts. We show that automated multi-document summarization can be improved using metrics based on rhetorical figuration. To test our proposed multi-document summarization system, a corpus of speeches by different U.S. presidents has been created; it includes campaign, State of the Union, and inaugural speeches. Various evaluation metrics have been used to test and compare the summaries produced by our proposed system and by other systems.
Our proposed multi-document summarization system using rhetorical figures improves the produced summaries and achieves better performance than the MEAD system in most cases, especially for antimetabole, polyptoton, and isocolon. Overall, the results of our system are promising and point to future progress in this research.
|