Spelling suggestions: "subject:"[een] SUMMARIZATION"" "subject:"[enn] SUMMARIZATION""
161 |
Sumarizace dokumentů na webu / Summarization of Documents from the WebŠkurla, Ján January 2012 (has links)
Topic of this master's thesis is a summarization of the documents on the web. First, it deals with the issues of acquiring text from the web using wrapper. An overview of wrappers used as an inspiration for the future implementation is stated. This paper also includes various methods for creating summary (Luhn`s, Edmundson`s and KPC) from the text data. Application design for the text data extraction and summarization is also part of this paper. Application is based on Java platform and Swing graphic library.
|
162 |
Analyse und Vergleich von Extraktionsalgorithmen für die Automatische TextzusammenfassungKrübel, Monique 18 May 2006 (has links)
Obwohl schon seit den 50er Jahren auf dem Gebiet der Automatischen Textzusammenfassung Forschung betrieben wird, wurden der Nutzen und die Notwendigkeit dieser Systeme erst mit dem Boom des Internets richtig erkannt.
Das World Wide Web stellt eine täglich wachsende Menge an Informationen zu nahezu jedem Thema zur Verfügung.
Um den Zeitaufwand zum Finden und auch zum Wiederfinden der richtigen Informationen zu minimieren, traten Suchmaschinen ihren Siegeszug an.
Doch um einen Überblick zu einem ausgewählten Thema zu erhalten, ist eine einfache Auflistung aller in Frage kommenden Seiten nicht mehr adäquat.
Zusätzliche Mechanismen wie Extraktionsalgorithmen für die automatische Generierung von Zusammenfassungen können hier helfen, Suchmaschinen oder Webkataloge zu optimieren, um so den Zeitaufwand bei der Recherche zu verringern und die Suche einfacher und komfortabler zu gestalten.
In dieser Diplomarbeit wurde eine Analyse von Extraktionsalgorithmen durchgeführt, welche für die automatische Textzusammenfassung genutzt werden können. Auf Basis dieser Analyse als viel versprechend eingestufte Algorithmen wurden in Java implementiert und die mit diesen Algorithmen erstellten Zusammenfassungen in einer Evaluation verglichen.
|
163 |
Evaluation Techniques and Graph-Based Algorithms for Automatic Summarization and Keyphrase ExtractionHamid, Fahmida 08 1900 (has links)
Automatic text summarization and keyphrase extraction are two interesting areas of research which extend along natural language processing and information retrieval. They have recently become very popular because of their wide applicability. Devising generic techniques for these tasks is challenging due to several issues. Yet we have a good number of intelligent systems performing the tasks. As different systems are designed with different perspectives, evaluating their performances with a generic strategy is crucial. It has also become immensely important to evaluate the performances with minimal human effort.
In our work, we focus on designing a relativized scale for evaluating different algorithms. This is our major contribution which challenges the traditional approach of working with an absolute scale. We consider the impact of some of the environment variables (length of the document, references, and system-generated outputs) on the performance. Instead of defining some rigid lengths, we show how to adjust to their variations. We prove a mathematically sound baseline that should work for all kinds of documents. We emphasize automatically determining the syntactic well-formedness of the structures (sentences). We also propose defining an equivalence class for each unit (e.g. word) instead of the exact string matching strategy. We show an evaluation approach that considers the weighted relatedness of multiple references to adjust to the degree of disagreements between the gold standards. We publish the proposed approach as a free tool so that other systems can use it. We have also accumulated a dataset (scientific articles) with a reference summary and keyphrases for each document. Our approach is applicable not only for evaluating single-document based tasks but also for evaluating multiple-document based tasks.
We have tested our evaluation method for three intrinsic tasks (taken from DUC 2004 conference), and in all three cases, it correlates positively with ROUGE. Based on our experiments for DUC 2004 Question-Answering task, it correlates with the human decision (extrinsic task) with 36.008% of accuracy. In general, we can state that the proposed relativized scale performs as well as the popular technique (ROUGE) with flexibility for the length of the output.
As part of the evaluation we have also devised a new graph-based algorithm focusing on sentiment analysis. The proposed model can extract units (e.g. words or sentences) from the original text belonging either to the positive sentiment-pole or to the negative sentiment-pole. It embeds both (positive and negative) types of sentiment-flow into a single text-graph. The text-graph is composed with words or phrases as nodes, and their relations as edges. By recursively calling two mutually exclusive relations the model builds the final rank of the nodes. Based on the final rank, it splits two segments from the article: one with highly positive sentiment and the other with highly negative sentiments. The output of this model was tested with the non-polar TextRank generated output to quantify how much of the polar summaries actually covers the fact along with sentiment.
|
164 |
Building high-quality datasets for abstractive text summarization : A filtering‐based method applied on Swedish news articlesMonsen, Julius January 2021 (has links)
With an increasing amount of information on the internet, automatic text summarization could potentially make content more readily available for a larger variety of people. Training and evaluating text summarization models require datasets of sufficient size and quality. Today, most such datasets are in English, and for minor languages such as Swedish, it is not easy to obtain corresponding datasets with handwritten summaries. This thesis proposes methods for compiling high-quality datasets suitable for abstractive summarization from a large amount of noisy data through characterization and filtering. The data used consists of Swedish news articles and their preambles which are here used as summaries. Different filtering techniques are applied, yielding five different datasets. Furthermore, summarization models are implemented by warm-starting an encoder-decoder model with BERT checkpoints and fine-tuning it on the different datasets. The fine-tuned models are evaluated with ROUGE metrics and BERTScore. All models achieve significantly better results when evaluated on filtered test data than when evaluated on unfiltered test data. Moreover, models trained on the most filtered dataset with the smallest size achieves the best results on the filtered test data. The trade-off between dataset size and quality and other methodological implications of the data characterization, the filtering and the model implementation are discussed, leading to suggestions for future research.
|
165 |
Extractive Text Summarization of Norwegian News Articles Using BERTBiniam, Thomas Indrias, Morén, Adam January 2021 (has links)
Extractive text summarization has over the years been an important research area in Natural Language Processing. Numerous methods have been proposed for extracting information from text documents. Recent works have shown great success for English summarization tasks by fine-tuning the language model BERT using large summarization datasets. However, less research has been made for low-resource languages. This work contributes by investigating how BERT can be used for Norwegian text summarization. Two models are developed by applying a modified BERT architecture, called BERTSum, on pre-trained Norwegian and Multilingual BERT. The results are models able to predict key sentences from articles to generate bullet-point summaries. These models are evaluated with the automatic metric ROUGE and in this evaluation, the Multilingual BERT model outperforms the Norwegian model. The multilingual model is further evaluated in a human evaluation by journalists, revealing that the generated summaries are not entirely satisfactory in some aspects. With some improvements, the model shows to be a valuable tool for journalists to edit and rewrite generated summaries, saving time and workload. / <p>Examensarbetet är utfört vid Institutionen för teknik och naturvetenskap (ITN) vid Tekniska fakulteten, Linköpings universitet</p>
|
166 |
Extractive Text Summarization of Greek News Articles Based on Sentence-ClustersKantzola, Evangelia January 2020 (has links)
This thesis introduces an extractive summarization system for Greek news articles based on sentence clustering. The main purpose of the paper is to evaluate the impact of three different types of text representation, Word2Vec embeddings, TF-IDF and LASER embeddings, on the summarization task. By taking these techniques into account, we build three different versions of the initial summarizer. Moreover, we create a new corpus of gold standard summaries to evaluate them against the system summaries. The new collection of reference summaries is merged with a part of the MultiLing Pilot 2011 in order to constitute our main dataset. We perform both automatic and human evaluation. Our automatic ROUGE results suggest that System A which employs Average Word2Vec vectors to create sentence embeddings, outperforms the other two systems by yielding higher ROUGE-L F-scores. Contrary to our initial hypotheses, System C using LASER embeddings fails to surpass even the Word2Vec embeddings method, showing sometimes a weak sentence representation. With regard to the scores obtained by the manual evaluation task, we observe that System A using Average Word2Vec vectors and System C with LASER embeddings tend to produce more coherent and adequate summaries than System B employing TF-IDF. Furthermore, the majority of system summaries are rated very high with respect to non-redundancy. Overall, System A utilizing Average Word2Vec embeddings performs quite successfully according to both evaluations.
|
167 |
Compressive Cross-Language Text Summarization / Génération automatique de résumé par abstraction dans un contexte multiculturelLinhares Pontes, Elvys 30 November 2018 (has links)
La popularisation des réseaux sociaux et des documents numériques a rapidement accru l'information disponible sur Internet. Cependant, cette quantité massive de données ne peut pas être analysée manuellement. Parmi les applications existantes du Traitement Automatique du Langage Naturel (TALN), nous nous intéressons dans cette thèse au résumé cross-lingue de texte, autrement dit à la production de résumés dans une langue différente de celle des documents sources. Nous analysons également d'autres tâches du TALN (la représentation des mots, la similarité sémantique ou encore la compression de phrases et de groupes de phrases) pour générer des résumés cross-lingues plus stables et informatifs. La plupart des applications du TALN, celle du résumé automatique y compris, utilisent une mesure de similarité pour analyser et comparer le sens des mots, des séquences de mots, des phrases et des textes. L’une des façons d'analyser cette similarité est de générer une représentation de ces phrases tenant compte de leur contenu. Le sens des phrases est défini par plusieurs éléments, tels que le contexte des mots et des expressions, l'ordre des mots et les informations précédentes. Des mesures simples, comme la mesure cosinus et la distance euclidienne, fournissent une mesure de similarité entre deux phrases. Néanmoins, elles n'analysent pas l'ordre des mots ou les séquences de mots. En analysant ces problèmes, nous proposons un modèle de réseau de neurones combinant des réseaux de neurones récurrents et convolutifs pour estimer la similarité sémantique d'une paire de phrases (ou de textes) en fonction des contextes locaux et généraux des mots. Sur le jeu de données analysé, notre modèle a prédit de meilleurs scores de similarité que les systèmes de base en analysant mieux le sens local et général des mots mais aussi des expressions multimots. Afin d'éliminer les redondances et les informations non pertinentes de phrases similaires, nous proposons de plus une nouvelle méthode de compression multiphrase, fusionnant des phrases au contenu similaire en compressions courtes. Pour ce faire, nous modélisons des groupes de phrases semblables par des graphes de mots. Ensuite, nous appliquons un modèle de programmation linéaire en nombres entiers qui guide la compression de ces groupes à partir d'une liste de mots-clés ; nous cherchons ainsi un chemin dans le graphe de mots qui a une bonne cohésion et qui contient le maximum de mots-clés. Notre approche surpasse les systèmes de base en générant des compressions plus informatives et plus correctes pour les langues française, portugaise et espagnole. Enfin, nous combinons les méthodes précédentes pour construire un système de résumé de texte cross-lingue. Notre système génère des résumés cross-lingue de texte en analysant l'information à la fois dans les langues source et cible, afin d’identifier les phrases les plus pertinentes. Inspirés par les méthodes de résumé de texte par compression en analyse monolingue, nous adaptons notre méthode de compression multiphrase pour ce problème afin de ne conserver que l'information principale. Notre système s'avère être performant pour compresser l'information redondante et pour préserver l'information pertinente, en améliorant les scores d'informativité sans perdre la qualité grammaticale des résumés cross-lingues du français vers l'anglais. En analysant les résumés cross-lingues depuis l’anglais, le français, le portugais ou l’espagnol, vers l’anglais ou le français, notre système améliore les systèmes par extraction de l'état de l'art pour toutes ces langues. En outre, une expérience complémentaire menée sur des transcriptions automatiques de vidéo montre que notre approche permet là encore d'obtenir des scores ROUGE meilleurs et plus stables, même pour ces documents qui présentent des erreurs grammaticales et des informations inexactes ou manquantes. / The popularization of social networks and digital documents increased quickly the informationavailable on the Internet. However, this huge amount of data cannot be analyzedmanually. Natural Language Processing (NLP) analyzes the interactions betweencomputers and human languages in order to process and to analyze natural languagedata. NLP techniques incorporate a variety of methods, including linguistics, semanticsand statistics to extract entities, relationships and understand a document. Amongseveral NLP applications, we are interested, in this thesis, in the cross-language textsummarization which produces a summary in a language different from the languageof the source documents. We also analyzed other NLP tasks (word encoding representation,semantic similarity, sentence and multi-sentence compression) to generate morestable and informative cross-lingual summaries.Most of NLP applications (including all types of text summarization) use a kind ofsimilarity measure to analyze and to compare the meaning of words, chunks, sentencesand texts in their approaches. A way to analyze this similarity is to generate a representationfor these sentences that contains the meaning of them. The meaning of sentencesis defined by several elements, such as the context of words and expressions, the orderof words and the previous information. Simple metrics, such as cosine metric andEuclidean distance, provide a measure of similarity between two sentences; however,they do not analyze the order of words or multi-words. Analyzing these problems,we propose a neural network model that combines recurrent and convolutional neuralnetworks to estimate the semantic similarity of a pair of sentences (or texts) based onthe local and general contexts of words. Our model predicted better similarity scoresthan baselines by analyzing better the local and the general meanings of words andmulti-word expressions.In order to remove redundancies and non-relevant information of similar sentences,we propose a multi-sentence compression method that compresses similar sentencesby fusing them in correct and short compressions that contain the main information ofthese similar sentences. We model clusters of similar sentences as word graphs. Then,we apply an integer linear programming model that guides the compression of theseclusters based on a list of keywords. We look for a path in the word graph that has goodcohesion and contains the maximum of keywords. Our approach outperformed baselinesby generating more informative and correct compressions for French, Portugueseand Spanish languages. Finally, we combine these previous methods to build a cross-language text summarizationsystem. Our system is an {English, French, Portuguese, Spanish}-to-{English,French} cross-language text summarization framework that analyzes the informationin both languages to identify the most relevant sentences. Inspired by the compressivetext summarization methods in monolingual analysis, we adapt our multi-sentencecompression method for this problem to just keep the main information. Our systemproves to be a good alternative to compress redundant information and to preserve relevantinformation. Our system improves informativeness scores without losing grammaticalquality for French-to-English cross-lingual summaries. Analyzing {English,French, Portuguese, Spanish}-to-{English, French} cross-lingual summaries, our systemsignificantly outperforms extractive baselines in the state of the art for all these languages.In addition, we analyze the cross-language text summarization of transcriptdocuments. Our approach achieved better and more stable scores even for these documentsthat have grammatical errors and missing information.
|
168 |
Temporal profile summarization and indexing for surveillance videosBagheri, Saeid 12 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Surveillance videos are recorded continually and the retrieval of such videos currently still relies on human operators. Automatic retrieval has not reached a satisfactory accuracy. As an intermediate representation, this work develops multiple original temporal profiles of video to convey accurate temporal information in the video while keeping certain spatial characteristics. These are effective methods to visualizes surveillance video contents efficiently in a 2D temporal image, suitable for indexing and retrieving a large video database.
We are aiming to provide a compact index that is intuitive and preserves most of the information in the video in order to avoid browsing extensive video clips frame by frame.
By considering some of the properties of static surveillance videos, we aim at accentuating the temporal dimension in our visualization.
We have introduced our framework as three unique methods that visualize different aspects of a surveillance video, plus an extension to non-static surveillance videos.
In our first method "Localized Temporal Profile", by knowing that most surveillance videos are monitoring specific locations, we try to emphasize the other dimension, time, in our solution. we focus on describing all the events only in critical locations of the video.
In our next method "Multi-Position Temporal Profile", we generate an all-inclusive profile that covers all the events in the video field of view.
In our last method "Motion Temporal Profile" we perform in-depth analysis of scene motion and try to handle targets with non-uniform, non-translational motion in our temporal profile.
We then further extend our framework by loosening the constraint that the video is static and including cameras with smooth panning motion as such videos are widely used in practice. By performing motion analysis on the camera, we stabilize the camera to create a panorama-like effect for the video, allowing us to utilize all of the aforementioned methods.
The resulting profiles allows temporal indexing to each video frame, and contains all spatial information in a continuous manner. It also shows the actions and progress of events in the temporal profile. Flexible browsing and effective manipulation of videos can be achieved using the resulting video profiles.
|
169 |
Distribution-based Summarization for Large Scale Simulation Data Visualization and AnalysisWang, Ko-Chih 11 July 2019 (has links)
No description available.
|
170 |
Structural Self-Supervised Objectives for TransformersDi Liello, Luca 21 September 2023 (has links)
In this Thesis, we leverage unsupervised raw data to develop more efficient pre-training objectives and self-supervised tasks that align well with downstream applications.
In the first part, we present three alternative objectives to BERT’s Masked Language Modeling (MLM), namely Random Token Substitution (RTS), Cluster-based Random Token Substitution C-RTS, and Swapped Language Modeling (SLM). Unlike MLM, all of these proposals involve token swapping rather than replacing tokens with BERT’s [MASK]. RTS and C-RTS involve pre- dicting the originality of tokens, while SLM tasks the model at predicting the original token values. Each objective is applied to several models, which are trained using the same computational budget and corpora. Evaluation results reveal RTS and C-RTS require up to 45% less pre-training time while achieving performance on par with MLM. Notably, SLM outperforms MLM on several Answer Sentence Selection and GLUE tasks, despite utilizing the same computational budget for pre-training.
In the second part of the Thesis, we propose self-supervised pre-training tasks that exhibit structural alignment with downstream applications, leading to improved performance and reduced reliance on labeled data to achieve comparable results. We exploit the weak supervision provided by large corpora like Wikipedia and CC-News, challenging the model to recognize whether spans of text originate from the same paragraph or document. To this end, we design (i) a pre-training objective that targets multi-sentence inference models by performing predictions over multiple spans of texts simultaneously, (ii) self-supervised objectives tailored to enhance performance in Answer Sentence Selection and its Contextual version, and (iii) a pre-training objective aimed at performance improvements in Summarization.
Through continuous pre-training, starting from renowned checkpoints such as RoBERTa, ELEC- TRA, DeBERTa, BART, and T5, we demonstrate that our models achieve higher performance on Fact Verification, Answer Sentence Selection, and Summarization. We extensively evaluate our proposals on different benchmarks, revealing significant accuracy gains, particularly when annotation in the target dataset is limited. Notably, we achieve state-of-the-art results on the development set of the FEVER dataset and results close to state-of-the-art models using much more parameters on the test set. Furthermore, our objectives enable us to attain state-of-the-art results on ASNQ, WikiQA, and TREC-QA test sets, across all evaluation metrics (MAP, MRR, and P@1). For Summarization, our objective enhances summary quality, as measured by various metrics like ROUGE and BLEURT. We maintain that our proposals can be seamlessly combined with other techniques from recently proposed works, as they do not require alterations to the internal structure of Transformer models but only involve modifications to the training tasks.
|
Page generated in 0.2302 seconds