51 |
Query-Driven Analysis and Visualization for Large-Scale Scientific Dataset using Geometry Summarization and Bitmap Indexing. Wei, Tzu-Hsuan, January 2017.
No description available.
|
52 |
Towards large-scale network analytics. Yang, Xintian, 27 August 2012.
No description available.
|
53 |
Summarizing Legal Depositions. Chakravarty, Saurabh, 18 January 2021.
Documents like legal depositions are used by lawyers and paralegals to ascertain the facts
pertaining to a case. These documents capture the conversation between a lawyer and a
deponent, which is in the form of questions and answers. Applying current automatic summarization
methods to these documents results in low-quality summaries. Though extensive
research has been performed in the area of summarization, not all methods succeed in all
domains. Accordingly, this research focuses on developing methods to generate high-quality
summaries of depositions. As part of our work related to legal deposition summarization, we
propose a solution in the form of a pipeline of components, each addressing a sub-problem;
we argue that a pipeline-based framework can be tuned to summarize documents from any
domain.
First, we developed methods to parse the depositions, accounting for different document
formats. We were able to successfully parse both a proprietary and a public dataset with
our methods. We next developed methods to anonymize the personal information present in
the deposition documents; we achieve 95% accuracy on the anonymization using a random
sampling-based evaluation. Third, we developed an ontology to define dialog acts for the
questions and answers present in legal depositions. Fourth, we developed classifiers based
on this ontology and achieved F1-scores of 0.84 and 0.87 on the public and proprietary
datasets, respectively. Fifth, we developed methods to transform a question-answer pair to
a canonical/simple form. In particular, based on the dialog acts for the question and answer
combination, we developed transformation methods using traditional NLP techniques and,
separately, deep learning techniques. We achieved good scores on the ROUGE and semantic similarity
metrics for most of the dialog act combinations. Sixth, we developed methods based
on deep learning, heuristics, and machine translation to correct the transformed declarative
sentences. The sentence correction improved the readability of the transformed sentences.
Seventh, we developed a methodology to break a deposition into its topical aspects. An
ontology for aspects was defined for legal depositions, and classifiers were developed that
achieved an F1-score of 0.89. Eighth, we developed methods to segment the deposition into
parts that have the same thematic context. The segments helped in augmenting candidate
summary sentences with surrounding context, which leads to a more readable summary.
Ninth, we developed a pipeline to integrate all of the methods, to generate summaries from
the depositions. We were able to outperform the baseline and state-of-the-art summarization
methods in a majority of the cases based on the F1, Recall, and ROUGE-2 scores. The performance
gains were statistically significant for all of the scores. The summaries generated
by our system can be arranged based on the same thematic context or aspect and hence
should be much easier to read and follow, compared to the baseline methods. As part of our
future work, we will improve upon these methods. We will refine our methods to identify
the important parts using additional documents related to a deposition. In addition, we will
work to improve the compression ratio of the generated summaries by reducing the number
of unimportant sentences. We will expand the training dataset to learn and tune the coverage
of the aspects for various deponent types using empirical methods.
Our system has demonstrated effectiveness in transforming a QA pair into a declarative
sentence. Having such a capability could enable us to generate a narrative summary from
the depositions, a first for legal depositions. We will also expand our dataset for evaluation
to ensure that our methods are indeed generalizable, and that they work well when experts
subjectively evaluate the quality of the deposition summaries. / Doctor of Philosophy / Documents in the legal domain are of various types. One set of documents includes trial and
deposition transcripts. These documents capture the proceedings of a trial or a deposition
by note-taking, often over many hours. They contain conversation sentences that are spoken
during the trial or deposition and involve multiple actors. One of the greatest challenges
with these documents is that generally, they are long. This is a source of pain for attorneys
and paralegals who work with the information contained in the documents.
Text summarization techniques have been successfully used to compress a document and capture
the salient parts from it. They have also been able to reduce redundancy in summary
sentences while focusing on coherence and proper sentence formation. Summarizing trial and
deposition transcripts would be immensely useful for law professionals, reducing the time to
identify and disseminate salient information in case-related documents, as well as reducing
costs and trial preparation time. Processing the deposition documents using traditional text
processing techniques is a challenge because of their form. Having the deposition conversations
transformed into a suitable declarative form where they can be easily comprehended
can pave the way for the usage of extractive and abstractive summarization methods. As
part of our work, we identified the different discourse structures present in the deposition
in the form of dialog acts. We developed methods based on those dialog acts to transform
the deposition into a declarative form. We were able to achieve an accuracy of 87% on the
dialog act classification. We also were able to transform the conversational question-answer
(QA) pairs into declarative forms for 10 of the top-11 dialog act combinations. Our transformation
methods performed better than the baselines in 8 of the 10 QA pair types. We also
developed methods to classify the deposition QA pairs according to their
topical aspects. We generated summaries using aspects by defining the relative coverage for
each aspect that should be present in a summary. Another set of methods developed can
segment the depositions into parts that have the same thematic context. These segments
aid in augmenting the candidate summary sentences, creating a summary where information
is surrounded by associated context. This makes the summary more readable and informative;
we were able to significantly outperform the state-of-the-art methods, based on our
evaluations.
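To make the QA-to-declarative transformation step above more concrete, the following Python sketch shows the rule-based flavor of the idea for a couple of dialog-act combinations. The dialog-act names and rewrite rules here are hypothetical stand-ins; the dissertation defines its own ontology and also uses deep-learning-based transformation and sentence correction on top of such rules.

```python
# Minimal, illustrative sketch only. The dialog-act labels ("yes_no_question",
# "confirm", "deny", "wh_question") and the rewrite rules are hypothetical; the
# actual system uses a richer ontology plus learned transformation models.

def to_declarative(question: str, answer: str, q_act: str, a_act: str) -> str:
    """Turn a deposition question-answer pair into a single declarative sentence."""
    q = question.strip().rstrip("?")
    a = answer.strip().rstrip(".")
    if q_act == "yes_no_question" and a_act == "confirm":
        # "Did you sign the contract?" + "Yes." -> "The deponent did sign the contract."
        return "The deponent " + q.replace("Did you", "did", 1).strip() + "."
    if q_act == "yes_no_question" and a_act == "deny":
        return "The deponent did not" + q.replace("Did you", "", 1) + "."
    if q_act == "wh_question":
        # Merge the question with its answer: "Where were you...?" + "At the office."
        return f"{q}: {a}."
    # Fallback: keep the pair in a lightly normalized form.
    return f"Q: {q}? A: {a}."

print(to_declarative("Did you sign the contract?", "Yes.", "yes_no_question", "confirm"))
```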
|
54 |
Arabic News Text Classification and Summarization: A Case of the Electronic Library Institute SeerQ (ELISQ). Kan'an, Tarek Ghaze, 21 July 2015.
Arabic news articles in heterogeneous electronic collections are difficult for users to work with. Two problems are that they are not categorized in a way that would aid browsing, and that there are no summaries or detailed metadata records that could be easier to work with than full articles. To address the first problem, schema mapping techniques were adapted to construct a simple taxonomy for Arabic news stories that is compatible with the subject codes of the International Press Telecommunications Council. So that each article would be labeled with the proper taxonomy category, automatic classification methods were researched to identify the most appropriate one. Experiments showed that the best features to use in classification resulted from a new tailored stemming approach (i.e., a new Arabic light stemmer called P-Stemmer). When coupled with binary classification using SVM, the newly developed approach proved to be superior to state-of-the-art techniques. To address the second problem, i.e., summarization, preliminary work was done with English corpora. This was in the context of a new Problem-Based Learning (PBL) course wherein students produced template summaries of big text collections. The techniques used in the course were extended to work with Arabic news. Due to the lack of high-quality tools for Named Entity Recognition (NER) and topic identification for Arabic, two new tools were constructed: RenA, for Arabic NER, and ALDA, for Arabic topic extraction (using Latent Dirichlet Allocation). Controlled experiments with each of RenA and ALDA, involving Arabic speakers and a randomly selected corpus of 1000 Qatari news articles, showed the tools produced very good results (i.e., names, organizations, locations, and topics). Then the categorization, NER, topic identification, and additional information extraction techniques were combined to produce approximately 120,000 summaries for Qatari news articles, which are searchable, along with the articles, using LucidWorks Fusion, which builds upon Solr software. Evaluation of the summaries showed high ratings based on the 1000-article test corpus. Contributions of this research with Arabic news articles thus include a new test corpus, taxonomy, light stemmer, classification approach, NER tool, topic identification tool, and template-based summarizer, all shown through experimentation to be highly effective. / Ph. D.
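As a rough illustration of the classification pipeline described in this abstract, a light stemmer feeding a one-vs-rest linear SVM could be sketched as below. The affix lists and stemming rules are toy placeholders, not the actual P-Stemmer, and the labels are assumed to be the IPTC-compatible taxonomy categories.

```python
# Hedged sketch: light stemming + binary (one-vs-rest) SVM classification.
# The prefix/suffix lists are illustrative only; P-Stemmer's real rules differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

PREFIXES = ("ال", "وال", "بال", "كال", "فال")   # example Arabic prefixes
SUFFIXES = ("ها", "ات", "ون", "ين", "ية")        # example Arabic suffixes

def light_stem(token: str) -> str:
    """Strip at most one prefix and one suffix, keeping a stem of length >= 3."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= 3:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= 3:
            token = token[:-len(s)]
            break
    return token

def stem_analyzer(doc: str):
    return [light_stem(t) for t in doc.split()]

# train_docs: Arabic news article strings; train_labels: taxonomy category names.
classifier = make_pipeline(
    TfidfVectorizer(analyzer=stem_analyzer),
    OneVsRestClassifier(LinearSVC()),
)
# classifier.fit(train_docs, train_labels)
# predicted = classifier.predict(test_docs)
```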
|
55 |
Improving Access to ETD Elements Through Chapter Categorization and Summarization. Banerjee, Bipasha, 07 August 2024.
The fields of natural language processing and information retrieval have made remarkable progress since the 1980s. However, most of the theoretical investigation and applied experimentation is focused on short documents like web pages, journal articles, or papers in conference proceedings. Electronic Theses and Dissertations (ETDs) contain a wealth of information. These book-length documents describe research conducted in a variety of academic disciplines. While current digital library systems can be directly used to find a document of interest, they do not facilitate discovering which specific parts or segments are of particular interest. This research aims to improve access to ETD components by providing users with chapter-level classification labels and summaries to help easily find portions of interest. We explore the challenges such documents pose, especially when dealing with a highly specialized academic vocabulary. We use large language models (LLMs) and fine-tune pre-trained models for these downstream tasks. We also develop a method to connect the ETD discipline and department information to an ETD-centric classification system. To help guide the summarization model toward better chapter summaries, for each chapter we try to identify relevant sentences from the document abstract, plus the titles of cited references from the bibliography. We leverage human feedback to evaluate models qualitatively, in addition to using traditional metrics. We provide users with chapter classification labels and summaries to improve access to ETD chapters. For each chapter, we generate the top three classification labels, which reflect the interdisciplinarity of the work in ETDs. Our evaluation shows that our ensemble methods yield summaries that are preferred by users. Our summaries also perform better than summaries generated using a single method when evaluated on several metrics using an LLM-based evaluation methodology. / Doctor of Philosophy / Natural language processing (NLP) is a field in computer science that focuses on creating artificially intelligent models capable of processing text and audio similarly to humans. We make use of various NLP techniques, ranging from machine learning to language models, to provide users with a much more granular view of the information stored in Electronic Theses and Dissertations (ETDs). ETDs are documents submitted by students conducting research at the culmination of their degree. Such documents comprise research work in various academic disciplines and thus contain a wealth of information. This work aims to make such information stored in chapters of ETDs more accessible to readers through the addition of chapter-level classification labels and summaries. We provide users with chapter classification labels and summaries to improve access to ETD chapters. For each chapter, we generate the top three classification labels, which reflect the interdisciplinarity of the work in ETDs. Alongside human evaluation of automatically generated summaries, we use an LLM-based approach that aims to score summaries on several metrics. Our evaluation shows that our methods yield summaries that users prefer to summaries generated using a single method.
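As a sketch of the chapter-labeling step, the snippet below runs a fine-tuned sequence classifier over a chapter and keeps the three highest-scoring labels. The checkpoint path is a hypothetical placeholder; the dissertation fine-tunes its own models against an ETD-centric classification system mapped from discipline and department information.

```python
# Hedged sketch of top-three chapter labeling; the checkpoint path is hypothetical.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "path/to/fine-tuned-etd-chapter-classifier"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def top_three_labels(chapter_text: str) -> list[str]:
    """Return the three most probable classification labels for a chapter."""
    inputs = tokenizer(chapter_text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k=3)
    return [model.config.id2label[int(i)] for i in top.indices]
```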
|
56 |
The effect of noise in the training of convolutional neural networks for text summarisation. Meechan-Maddon, Ailsa, January 2019.
In this thesis, we work towards bridging the gap between two distinct areas: noisy text handling and text summarisation. The overall goal of the thesis is to examine the effects of noise in the training of convolutional neural networks for text summarisation, with a view to understanding how to effectively create a noise-robust text-summarisation system. We look specifically at the problem of abstractive text summarisation of noisy data in the context of summarising error-containing documents from automatic speech recognition (ASR) output. We experiment with adding varying levels of noise (errors) to the 4 million-article Gigaword corpus and training an encoder-decoder CNN on it with the aim of producing a noise-robust text summarisation system. A total of six text summarisation models are trained, each with a different level of noise. We discover that the models trained with a high level of noise are indeed able to aptly summarise noisy data into clean summaries, despite a tendency for all models to overfit to the level of noise on which they were trained. Directions are given for future steps in order to create an even more noise-robust and flexible text summarisation system.
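The noise-injection idea lends itself to a short sketch. The snippet below corrupts a chosen fraction of words to mimic ASR-style errors before training; the specific error operations and noise levels are assumptions, not the exact procedure used in the thesis.

```python
# Hedged sketch of ASR-style noise injection; operations and rates are illustrative.
import random

def add_noise(text: str, noise_level: float, seed: int = 0) -> str:
    """Randomly corrupt roughly `noise_level` of the words in `text`."""
    rng = random.Random(seed)
    noisy = []
    for w in text.split():
        if rng.random() < noise_level and len(w) > 1:
            op = rng.choice(["drop", "swap", "dup"])
            if op == "drop":                      # deletion error
                continue
            if op == "swap" and len(w) > 2:       # garble the word internally
                i = rng.randrange(len(w) - 1)
                w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
            elif op == "dup":                     # repeated-word error
                noisy.append(w)
        noisy.append(w)
    return " ".join(noisy)

# One corrupted copy of the training corpus per noise level, e.g.:
levels = [0.0, 0.05, 0.1, 0.2, 0.3, 0.4]          # illustrative values only
```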
|
57 |
Proposition-based summarization with a coherence-driven incremental model. Fang, Yimai, January 2019.
Summarization models which operate on meaning representations of documents have been neglected in the past, although they are a very promising and interesting class of methods for summarization and text understanding. In this thesis, I present one such summarizer, which uses the proposition as its meaning representation. My summarizer is an implementation of Kintsch and van Dijk's model of comprehension, which uses a tree of propositions to represent the working memory. The input document is processed incrementally in iterations. In each iteration, new propositions are connected to the tree under the principle of local coherence, and then a forgetting mechanism is applied so that only a few important propositions are retained in the tree for the next iteration. A summary can be generated using the propositions which are frequently retained. Originally, this model was only worked through by hand by its inventors, using human-created propositions. In this work, I turned it into a fully automatic model using current NLP technologies. First, I create propositions by obtaining and then transforming a syntactic parse. Second, I devise algorithms to numerically evaluate alternative ways of adding a new proposition, as well as to predict necessary changes in the tree. Third, I compare different methods of modelling local coherence, including coreference resolution, distributional similarity, and lexical chains. In the first group of experiments, my summarizer realizes summary propositions by sentence extraction. These experiments show that my summarizer outperforms several state-of-the-art summarizers. The second group of experiments concerns abstractive generation from propositions, which is a collaborative project. I have investigated the option of compressing extracted sentences, but generation from propositions has been shown to provide better information packaging.
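A much-simplified sketch of the incremental comprehension loop is given below: propositions arrive in chunks, only a few survive each "forgetting" step, and the ones retained most often supply the summary content. The coherence score here is a toy word-overlap placeholder for the coreference, distributional-similarity, and lexical-chain models compared in the thesis, and the tree is flattened to a plain memory list.

```python
# Toy sketch of the incremental model; the coherence function is a word-overlap
# placeholder and the working-memory tree is reduced to a flat list.
from collections import Counter

def summarize(propositions, chunk_size=5, memory_size=3, coherence=None):
    coherence = coherence or (lambda a, b: len(set(a.split()) & set(b.split())))
    memory = []                      # propositions currently in working memory
    retained = Counter()             # how many cycles each proposition survived
    for start in range(0, len(propositions), chunk_size):
        candidates = memory + propositions[start:start + chunk_size]
        # "Forgetting": keep only the propositions most coherent with the rest.
        scored = [(sum(coherence(p, q) for q in candidates if q is not p), p)
                  for p in candidates]
        scored.sort(reverse=True)
        memory = [p for _, p in scored[:memory_size]]
        retained.update(memory)
    # Frequently retained propositions provide the summary content.
    return [p for p, _ in retained.most_common(memory_size)]
```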
|
58 |
Title-based video summarization using attention networks. Li, Changwei, 23 August 2022.
No description available.
|
59 |
Graph Models For Query Focused Text Summarization And Assessment Of Machine Translation Using Stopwords. Rama, B, 06 1900.
Text summarization is the task of generating a shortened version of the original text in which its core ideas are retained. In this work, we focus on query-focused summarization. The task is to generate, from a set of documents, a summary which answers the query. Query-focused summarization is a hard task because it expects the summary to be biased towards the query while, at the same time, important concepts in the original documents must be preserved with a high degree of novelty.
Graph-based ranking algorithms which use a biased random surfer model, such as Topic-sensitive LexRank, have been applied to query-focused summarization. In our work, we propose a look-ahead version of Topic-sensitive LexRank. We incorporate the option of look-ahead into the random walk model and show that it helps in generating better-quality summaries.
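For reference, the standard biased random-surfer score that Topic-sensitive (biased) LexRank builds on can be written as below, where S(u) is the score of sentence u, q the query, rel a query-relevance measure, sim an inter-sentence similarity, and d the bias parameter; the look-ahead modification proposed in this work is not reproduced here.

```latex
S(u) \;=\; d \,\frac{\mathrm{rel}(u \mid q)}{\sum_{v} \mathrm{rel}(v \mid q)}
\;+\; (1-d) \sum_{v} \frac{\mathrm{sim}(u, v)}{\sum_{z} \mathrm{sim}(z, v)}\, S(v)
```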
Next, we consider assessment of machine translation. Assessment of a machine translation output is important for establishing benchmarks for translation quality. An obvious way to assess the quality of machine translation is through the perception of human subjects. Though highly reliable, this approach is not scalable and is time-consuming. Hence, mechanisms have been devised to automate the assessment process. All such assessment methods are essentially a study of correlations between the human translation and the machine translation.
In this work, we present a scalable approach to assessing the quality of machine translation that borrows features from the study of writing styles, popularly known as stylometry. Towards this, we quantify the characteristic styles of individual machine translators and compare them with that of human-generated text. The translator whose style is closest to the human style is deemed to generate a higher-quality translation. We show that our approach is scalable and does not require actual source-text translations for evaluation.
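A minimal sketch of the stopword-based style comparison is shown below, under the assumption that stopword frequency profiles and cosine similarity serve as the stylometric features and distance; the thesis defines its own feature set and correlation analysis.

```python
# Hedged sketch: compare a translator's stopword-frequency profile to that of
# human-written text; the stopword list and cosine similarity are assumptions.
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "that", "it"}

def stopword_profile(text: str) -> Counter:
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in STOPWORDS)
    total = sum(counts.values()) or 1
    return Counter({w: c / total for w, c in counts.items()})

def style_similarity(candidate: str, human_reference: str) -> float:
    """Cosine similarity between the two stopword profiles (1.0 = identical style)."""
    p, q = stopword_profile(candidate), stopword_profile(human_reference)
    dot = sum(p[w] * q[w] for w in STOPWORDS)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# The machine translator whose output is most similar to the human profile would be
# deemed to produce the higher-quality translation, following the abstract's intuition.
```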
|
60 |
Methods for increasing cohesion in automatically extracted summaries of Swedish news articles: Using and extending multilingual sentence transformers in the data-processing stage of training BERT models for extractive text summarization / Metoder för att öka kohesionen i automatiskt extraherade sammanfattningar av svenska nyhetsartiklar. Andersson, Elsa, January 2022.
Developments in deep learning and machine learning overall have created a plethora of opportunities for easier training of automatic text summarization (ATS) models that produce summaries of higher quality. ATS can be split into extractive and abstractive tasks; extractive models extract sentences from the original text to create summaries. In contrast, abstractive models generate novel sentences to create summaries. While extractive summaries are often preferred over abstractive ones, summaries created by extractive models trained on Swedish texts often lack cohesion, which affects the readability and overall quality of the summary. Therefore, there is a need to improve the process of training ATS models in terms of cohesion, while maintaining other text qualities such as content coverage. This thesis explores and implements methods at the data-processing stage aimed at improving the cohesion of generated summaries. The methods are based around Sentence-BERT for creating advanced sentence embeddings that can be used to rank the sentences in a text in terms of whether they should be included in the extractive summary. Three models are trained using different methods and evaluated using ROUGE, BERTScore for measuring content coverage, and Coh-Metrix for measuring cohesion. The results of the evaluation suggest that the methods can indeed be used to create more cohesive summaries, although content coverage was reduced, which leaves considerable room for future exploration and further implementation.
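As a small illustration of how Sentence-BERT embeddings can rank sentences for extraction, the sketch below scores each sentence against the document centroid and keeps the top ones in their original order. The model choice, the centroid-based scoring, and the cut-off are assumptions; the thesis instead uses (and extends) multilingual sentence transformers in the data-processing stage of training BERT summarization models.

```python
# Hedged sketch: rank sentences by similarity to the document centroid using a
# multilingual Sentence-BERT model; model name and top_k are illustrative choices.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # covers Swedish

def rank_sentences(sentences: list[str], top_k: int = 3) -> list[str]:
    embeddings = model.encode(sentences, convert_to_tensor=True)
    centroid = embeddings.mean(dim=0, keepdim=True)
    scores = util.cos_sim(centroid, embeddings)[0]   # one score per sentence
    best = scores.argsort(descending=True)[:top_k]
    # Return the selected sentences in document order, which helps cohesion.
    return [sentences[i] for i in sorted(int(j) for j in best)]
```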
|