  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Summary-based document categorization with LSI

Liu, Hsiao-Wen 14 February 2007 (has links)
Text categorization, the automatic assignment of documents to appropriate pre-defined categories, is essential for retrieving desired documents efficiently and effectively from a huge text repository such as the World Wide Web. Most techniques, however, suffer from the feature selection problem and the vocabulary mismatch problem. A few studies have addressed text categorization via text summarization, which reduces the size of documents and consequently the number of features to consider, while others have proposed using latent semantic indexing (LSI) to reveal the meaning of a term via its association with other terms. Few works, however, have studied the joint effect of text summarization and semantic dimension reduction. The objective of this research is thus to propose a practical approach, SBDR, to deal with the above difficulties in text categorization tasks. Two experiments are conducted to validate the proposed approach. In the first experiment, the results show that text summarization does improve categorization performance; in addition, when identifying important sentences, the association of both noun-noun and noun-verb term pairs should be considered. Results of the second experiment indicate slightly better performance when adopting LSI exclusively (i.e., without summarization) than with SBDR (i.e., with summarization). Nonetheless, the minor loss in accuracy is largely compensated for by the computational time saved when LSI is applied to summarized text. The feasibility of the SBDR approach is thus justified.
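For readers unfamiliar with latent semantic indexing, the sketch below shows the general idea in miniature: TF-IDF features are projected into a low-dimensional latent space with a truncated SVD before a classifier is trained. This only illustrates the technique the abstract names, not the SBDR approach itself; the toy corpus, labels, and parameter values are invented for the example.

```python
# Minimal LSI-plus-categorization sketch (not the SBDR implementation).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

docs = [
    "stock markets fell sharply on inflation fears",
    "the team won the championship after extra time",
    "central bank raises interest rates again",
    "star striker scores twice in cup final",
]
labels = ["finance", "sports", "finance", "sports"]

# TF-IDF features, then truncated SVD projects documents into a low-dimensional
# latent semantic space, which eases the vocabulary mismatch problem.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2),   # number of LSI dimensions; tiny for the toy corpus
    GaussianNB(),
)
model.fit(docs, labels)
print(model.predict(["goalkeeper saves penalty in the final"]))
```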
2

Use of Text Summarization for Supporting Event Detection

Wu, Pao-Feng 12 August 2003 (has links)
Environmental scanning, which acquires and uses information about events, trends, and changes in an organization's external environment, is an important process in the strategic management of an organization and permits it to adapt quickly to changes in its external environment. Event detection, which identifies the onset of new events from news documents, is essential to facilitating an organization's environmental scanning activity. However, traditional feature-based event detection techniques detect events by comparing the similarity between features of news stories and incur several problems. For example, for illustration and comparison purposes, a news story may contain sentences or paragraphs that are not highly relevant to defining its event. Without removing such less relevant sentences or paragraphs before detection, the effectiveness of traditional event detection techniques may suffer. In this study, we developed a summary-based event detection (SED) technique that filters out less relevant sentences or paragraphs in a news story before performing feature-based event detection. Using a traditional feature-based event detection technique (i.e., INCR) as the benchmark, the empirical evaluation showed that the proposed SED technique can achieve comparable or even better detection effectiveness (measured by miss and false alarm rates) than the INCR technique for data corpora in which the percentage of news stories discussing old events is high.
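As a rough illustration of what feature-based event detection involves, the sketch below flags a news story as a new event when its maximum cosine similarity to earlier stories falls below a threshold. It is not the INCR or SED algorithm from the thesis; the stories and the threshold are placeholders.

```python
# Toy feature-based new-event detection: a story starts a new event if it is
# not similar enough to any previously seen story.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stories = [
    "earthquake strikes coastal city, thousands evacuated",
    "rescue teams continue search after coastal earthquake",
    "parliament passes new data privacy legislation",
]

threshold = 0.2   # below this similarity, no known event matches (illustrative value)
vectors = TfidfVectorizer(stop_words="english").fit_transform(stories)

for i in range(len(stories)):
    if i == 0:
        is_new = True                                   # first story is trivially new
    else:
        sims = cosine_similarity(vectors[i], vectors[:i])
        is_new = sims.max() < threshold
    print(f"story {i}: {'NEW event' if is_new else 'old event'}")
```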
3

A hybrid approach to automatic text summarization

Yuan, Li-An 18 October 2007 (has links)
Automatic text summarization can efficiently and effectively save users' time when reading text documents. The objective of automatic text summarization is to extract essential sentences that cover almost all the concepts of a document, so that users can comprehend the ideas the document addresses simply by reading the corresponding summary. This research focuses on developing a hybrid automatic text summarization approach, KCS, to enhance the quality of summaries. The approach consists of two major components: first, it employs the K-mixture probabilistic model to calculate term weights in a statistical sense; it then identifies the term relationships between nouns and nouns as well as nouns and verbs, which yields the connective strength (CS) of nouns. With the connective strengths available, sentence scores can be calculated and the top-ranked sentences extracted. We conduct three experiments to justify the proposed approach. The quality of a summary is examined by its ability to increase the accuracy of text classification, while the classifier employed, the Naïve Bayes classifier, is kept the same throughout all experiments. The results show that the K-mixture model contributes more to document classification than the traditional TF-IDF weighting scheme; it is, however, still no better than CS, a more complex linguistics-based approach. More importantly, our proposed approach, KCS, performs best among all the approaches considered. This implies that KCS can extract more representative sentences from a document, and its feasibility in text summarization applications is thus justified.
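The K-mixture model mentioned above is a standard term-occurrence model; the sketch below follows a common textbook formulation (Katz-style estimates with beta = (cf - df)/df and alpha = lambda/beta) and is only an assumption about how such term statistics might be computed. How KCS actually turns these probabilities into connective strengths and sentence scores is specific to the thesis and not shown.

```python
# K-mixture term-occurrence model (common formulation), estimated from a toy corpus.
from collections import Counter

documents = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats make good pets".split(),
]

N = len(documents)
cf = Counter()                       # collection frequency of each term
df = Counter()                       # document frequency of each term
for doc in documents:
    counts = Counter(doc)
    cf.update(counts)
    for term in counts:
        df[term] += 1

def k_mixture_prob(term, k):
    """P(term occurs exactly k times in a document) under the K-mixture model."""
    lam = cf[term] / N               # mean occurrences per document
    beta = (cf[term] - df[term]) / df[term]
    if beta == 0:                    # term never repeats within a document
        return (1 - lam) if k == 0 else (lam if k == 1 else 0.0)
    alpha = lam / beta
    p = (alpha / (beta + 1)) * (beta / (beta + 1)) ** k
    if k == 0:
        p += 1 - alpha
    return p

print(k_mixture_prob("the", 2))      # "the" repeats within documents
print(k_mixture_prob("cat", 1))      # "cat" never repeats; Bernoulli limit
```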
4

Investigating the Extractive Summarization of Literary Novels

Ceylan, Hakan 12 1900 (has links)
Due to the vast amount of information we are faced with, summarization has become a critical necessity of everyday human life. Given that a large fraction of the electronic documents available online and elsewhere consists of short texts such as Web pages, news articles, and scientific reports, the focus of natural language processing techniques to date has been on the automation of methods targeting short documents. We are witnessing a change, however: an increasingly large number of books are becoming available in electronic format. This means that the need for language processing techniques able to handle very large documents such as books is becoming increasingly important. This thesis addresses the problem of summarizing novels, which are long and complex literary narratives. While there is a significant body of research on automatic text summarization, most of this work has been concerned with the summarization of short documents, with a particular focus on news stories. Novels, however, differ in both length and genre, and consequently different summarization techniques are required. This thesis attempts to close this gap by analyzing a new domain for summarization, and by building unsupervised and supervised systems that effectively take into account the properties of long documents and outperform traditional extractive summarization systems, which typically target the news genre.
5

WHISK: Web Hosted Information Into Summarized Knowledge

Wu, Jiewen 01 July 2016 (has links) (PDF)
Today’s online content increases at an alarming rate that exceeds users’ ability to consume it. Modern search techniques allow users to enter keyword queries to find content they wish to see. However, such techniques break down when users freely browse the internet without knowing exactly what they want, and users may have to invest an unnecessarily long time reading content to decide whether they are interested in it. Automatic text summarization helps relieve this problem by creating synopses that significantly reduce the text while preserving its key points. Steffen Lyngbaek created the SPORK summarization pipeline to address content overload in Reddit comment threads. Lyngbaek adapted the Opinosis graph model for extractive summarization and combined it with agglomerative hierarchical clustering and the Smith-Waterman algorithm to perform multi-document summarization of Reddit comments. This thesis presents WHISK, a pipeline for general multi-document text summarization based on SPORK. A generic data model in WHISK allows new drivers to be created for different platforms to work with the pipeline. In addition to the existing Opinosis graph model adapted in SPORK, WHISK introduces two simplified graph models for the pipeline. The simplified models remove unnecessary restrictions inherited from the Opinosis graph’s abstractive summarization origins. Performance measurements and a study with Digital Democracy compare the two new graph models against the Opinosis graph model. Additionally, the study evaluates WHISK’s ability to generate pull quotes from political discussions as summaries.
6

SPORK: A Summarization Pipeline for Online Repositories of Knowledge

Lyngbaek, Steffen Slyngbae 01 June 2013 (has links) (PDF)
The Web 2.0 era has ushered in an unprecedented amount of interactivity on the Internet, resulting in a flood of user-generated content. This content is often unstructured and comes in the form of blog posts and comment discussions. Users can no longer keep up with the amount of content available, which has led developers to rely on natural language techniques to help mitigate the problem. Although many natural language processing techniques have been employed for years, automatic text summarization in particular has recently gained traction. This research proposes a graph-based, extractive text summarization system called SPORK (Summarization Pipeline for Online Repositories of Knowledge). The goal of SPORK is to identify the important key topics presented in multi-document texts, such as online comment threads. While most other automatic summarization systems simply focus on finding the top sentences in the text, SPORK separates the text into clusters and identifies the different topics and opinions presented. SPORK identified 72% of the key topics present in any discussion and up to 80% of the key topics in well-structured discussions.
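The "separate the text into clusters, then surface each topic" idea can be illustrated in much simplified form as below: comments are clustered and the comment closest to each cluster centroid is reported. This is not the actual SPORK pipeline (which builds Opinosis-style word graphs); the comments and cluster count are invented.

```python
# Toy "cluster, then extract a representative per topic" sketch.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

comments = [
    "the battery life on this phone is great",
    "battery easily lasts two days for me",
    "the camera struggles in low light",
    "low light photos come out grainy",
    "shipping was fast and the packaging was fine",
]

X = TfidfVectorizer(stop_words="english").fit_transform(comments).toarray()
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Report the comment closest to each cluster centroid as that topic's summary.
for c in range(3):
    idx = np.where(labels == c)[0]
    centroid = X[idx].mean(axis=0)
    best = idx[np.argmax(X[idx] @ centroid)]
    print(f"topic {c}: {comments[best]}")
```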
7

Evaluation of Automatic Text Summarization Using Synthetic Facts

Ahn, Jaewook 01 June 2022 (has links) (PDF)
Automatic text summarization has achieved remarkable success with the development of deep neural networks and the availability of standardized benchmark datasets, and it can now generate fluent, human-like summaries. However, the unreliability of existing evaluation metrics hinders its practical usage and slows its progress. To address this issue, we propose an automatic, reference-less text summarization evaluation system based on dynamically generated synthetic facts. We hypothesize that if a system guarantees a summary containing all of the facts that are 100% known in the synthetic document, it can provide natural interpretability and high feasibility in measuring factual consistency and comprehensiveness. To our knowledge, ours is the first system that measures the overarching quality of text summarization models in terms of factual consistency, comprehensiveness, and compression rate. We validate the system by comparing its correlation with human judgment against existing N-gram overlap-based metrics such as ROUGE and BLEU, as well as a BERT-based evaluation metric, BERTScore. In experimental evaluation of PEGASUS, BART, and T5, our system outperforms the current evaluation metrics in measuring factual consistency by a noticeable margin, and it demonstrates statistical significance in measuring comprehensiveness and overall summary quality.
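For context on the N-gram overlap-based metrics the system is compared against, a minimal ROUGE-1 computation looks like the following; production ROUGE implementations add stemming, stop-word options, and ROUGE-2/ROUGE-L variants.

```python
# Minimal ROUGE-1 (unigram overlap) precision/recall/F1 against a reference summary.
from collections import Counter

def rouge_1(candidate: str, reference: str):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(rouge_1("the cat sat on the mat", "a cat was sitting on the mat"))
```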
8

Training Neural Models for Abstractive Text Summarization

Kryściński, Wojciech January 2018 (has links)
Abstractive text summarization aims to condense long textual documents into a short, human-readable form while preserving the most important information from the source document. A common approach to training summarization models is maximum likelihood estimation with the teacher forcing strategy. Despite its popularity, this method has been shown to yield models with suboptimal performance at inference time. This work examines how using alternative, task-specific training signals affects the performance of summarization models. Two novel training signals are proposed and evaluated. The first is a novelty metric measuring the overlap between n-grams in the summary and the summarized article. The second utilizes a discriminator model to distinguish human-written summaries from generated ones on a word level. Empirical results show that using these metrics as rewards for policy gradient training yields significant performance gains as measured by ROUGE scores, novelty scores, and human evaluation.
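One plausible reading of the novelty signal described above is the fraction of summary n-grams that do not appear in the source article; the sketch below implements that reading as an assumption, since the thesis's exact definition and its combination with the policy-gradient reward may differ.

```python
# Assumed n-gram novelty signal: share of summary n-grams absent from the article.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(summary: str, article: str, n: int = 2) -> float:
    summ_ngrams = ngrams(summary.lower().split(), n)
    art_ngrams = ngrams(article.lower().split(), n)
    if not summ_ngrams:
        return 0.0
    return len(summ_ngrams - art_ngrams) / len(summ_ngrams)

article = "the city council approved the new budget after a long debate on tuesday"
summary = "the council approved the budget after lengthy debate"
print(novelty(summary, article))   # higher = more novel (abstractive) wording
```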
9

Multi Domain Semantic Information Retrieval Based on Topic Model

Lee, Sanghoon 07 May 2016 (has links)
Over the last decades, there have been remarkable shifts in the area of Information Retrieval (IR) as a huge amount of information has accumulated on the Web. This gigantic information explosion increases the need for new tools that retrieve meaningful knowledge from complex information sources. Thus, techniques for searching and extracting important information from numerous database sources have been a key challenge for current IR systems. Topic modeling is one of the most recent techniques for discovering hidden thematic structures in large data collections without human supervision. Several topic models have been proposed in various fields of study and have been utilized extensively in many applications. Latent Dirichlet Allocation (LDA) is the best-known topic model; it generates topics from large corpora of resources such as text, images, and audio, and it has been widely used in information retrieval and data mining as an efficient way of identifying latent topics among document collections. However, LDA has a drawback: topic cohesion within a concept is attenuated when estimating infrequently occurring words. Moreover, LDA does not consider the meaning of words, but rather infers hidden topics based on a statistical approach. As a result, LDA can suffer either a reduction in the quality of topic words or an increase in loose relations between topics. To solve these problems, we propose a domain-specific topic model that combines domain concepts with LDA. Two domain-specific algorithms are suggested to address the difficulties associated with LDA. The main strength of our proposed model is that it narrows semantic concepts from broad domain knowledge to a specific domain, which solves the unknown-domain problem. The proposed model is extensively tested on various applications, namely query expansion, classification, and summarization, to demonstrate its effectiveness. Experimental results show that the proposed model significantly increases the performance of these applications.
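As background for the plain LDA baseline the abstract builds on, the sketch below extracts topics from a toy corpus with scikit-learn's LatentDirichletAllocation; the proposed domain-specific extensions are not represented.

```python
# Plain LDA topic extraction on a tiny illustrative corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the patient was treated with a new antibiotic for the infection",
    "doctors reported fewer infections after the vaccine trial",
    "the spacecraft entered orbit and deployed its solar panels",
    "engineers tested the rocket engine before the orbital launch",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]   # five highest-weight terms
    print(f"topic {k}: {', '.join(top)}")
```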
10

Automatic text summarization of Swedish news articles

Lehto, Niko, Sjödin, Mikael January 2019 (has links)
With an increasing amount of textual information available, there is also an increased need to make this information more accessible. Our paper describes a modified TextRank model and investigates the different methods available for using automatic text summarization to create summaries of Swedish news articles. To evaluate our model we focused on intrinsic evaluation methods: in part through content evaluation, in the form of measuring referential clarity and non-redundancy, and in part through text quality evaluation measures, in the form of keyword retention and ROUGE evaluation. The results indicate that stemming and improved stop word handling can have a positive effect on ROUGE scores. The addition of redundancy checks also appears to help avoid repetition of information; keyword retention, however, decreased somewhat. Lastly, all methods had some trouble with dangling anaphora, showing a need for further work on anaphora resolution.
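A minimal TextRank-style extractor, in the spirit of the modified model described above but without the Swedish-specific stemming and stop-word handling, might look like the following; the sentences are invented and the graph is built from plain TF-IDF cosine similarity.

```python
# TextRank-style extractive summarization: sentences are graph nodes, cosine
# similarity gives edge weights, and PageRank ranks the sentences.
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The government presented its new climate plan on Monday.",
    "The plan sets a target of net zero emissions by 2045.",
    "Opposition parties criticized the lack of concrete funding.",
    "Funding details are expected in the autumn budget.",
]

X = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(X)
np.fill_diagonal(sim, 0.0)                  # no self-loops

graph = nx.from_numpy_array(sim)
scores = nx.pagerank(graph, weight="weight")

top = sorted(scores, key=scores.get, reverse=True)[:2]
for i in sorted(top):                        # keep original sentence order
    print(sentences[i])
```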
