131

Addressing Challenges of Modern News Agencies via Predictive Modeling, Deep Learning, and Transfer Learning

Keneshloo, Yaser 22 July 2019
Today's news agencies are moving from traditional journalism, where publishing just a few news articles per day was sufficient, to modern content generation mechanisms, which create thousands of news pieces every day. With the growth of these modern news agencies comes the arduous task of properly handling the massive amount of data that is generated for each news article. Therefore, news agencies are constantly seeking solutions to facilitate and automate some of the tasks that were previously done by humans. In this dissertation, we focus on two broad problems whose solutions help a news agency not only to gain a wider view of readers' behaviour around an article but also to provide automated tools that ease the job of editors in summarizing news articles. These two disjoint problems aim at improving the users' reading experience by helping content generators monitor and focus on poorly performing content while allowing them to promote well-performing content. We first focus on the task of popularity prediction of news articles via a combination of regression, classification, and clustering models. We next focus on the problem of generating automated text summaries for a long news article using deep learning models. The first problem aims at helping content developers understand how a news article is performing over the long run, while the second provides automated tools for content developers to generate summaries for each news article. / Doctor of Philosophy / Nowadays, each person is exposed to an immense amount of information from social media, blog posts, and online news portals. Among these sources, news agencies are one of the main content providers for people around the world. Contemporary news agencies are moving from traditional journalism to modern techniques from different angles. 
This is achieved either by building smart tools to track readers' reactions to a specific news article or by providing automated tools that help editors deliver higher-quality content to readers. These systems should not only scale well with the growth in readership but also be able to process ad-hoc requests, precisely because most of the policies and decisions in these agencies are based on the results of these analytical tools. As part of this movement towards adopting new technologies for smart journalism, we have worked with The Washington Post on various problems, building tools for predicting the popularity of a news article and a model for automated text summarization. We develop a model that monitors each news article after its publication and predicts the number of views that the article will receive within the next 24 hours. This model not only helps content creators promote potentially viral articles on the main page of the web portal or on social media, but also gives editors intuition about potentially poorly performing articles so that they can edit the content of those articles for better exposure. On the other hand, current news agencies generate more than a thousand news articles per day, and writing three to four summary sentences for each of these news pieces will not only become infeasible in the near future but is also very expensive and time-consuming. Therefore, we also develop a separate model for automated text summarization, which generates summary sentences for a news article. Our model generates summaries by selecting the most salient sentences in the news article and paraphrasing them into shorter sentences that can serve as a summary of the entire document.
132

Event-related Collections Understanding and Services

Li, Liuqing 18 March 2020
Event-related collections, including both tweets and webpages, have valuable information, and are worth exploring in interdisciplinary research and education. Unfortunately, such data is noisy, so this variety of information has not been adequately exploited. Further, for better understanding, more knowledge hidden behind events needs to be unearthed. Regarding these collections, different societies may have different requirements in particular scenarios. Some may need relatively clean datasets for data exploration and data mining. Social researchers require preprocessing of information, so they can conduct analyses. General societies are interested in the overall descriptions of events. However, few systems, tools, or methods exist to support the flexible use of event-related collections. In this research, we propose a new, integrated system to process and analyze event-related collections at different levels (i.e., data, information, and knowledge). It also provides various services and covers the most important stages in a system pipeline, including collection development, curation, analysis, integration, and visualization. Firstly, we propose a query likelihood model with pre-query design and post-query expansion to rank a webpage corpus by query generation probability, and retrieve relevant webpages from event-related tweet collections. We further preserve webpage data into WARC files and enrich original tweets with webpages in JSON format. As an application of data management, we conduct an empirical study of the embedded URLs in tweets based on collection development and data curation techniques. Secondly, we develop TwiRole, an integrated model for 3-way user classification on Twitter, which detects brand-related, female-related, and male-related tweeters through multiple features with both machine learning (i.e., random forest classifier) and deep learning (i.e., an 18-layer ResNet) techniques. 
As guidance to user-centered social research at the information level, we combine TwiRole with a pre-trained recurrent neural network-based emotion detection model, and carry out tweeting pattern analyses on disaster-related collections. Finally, we propose a tweet-guided multi-document summarization (TMDS) model, which generates summaries of the event-related collections by using tweets associated with those events. The TMDS model also considers three aspects of named entities (i.e., importance, relatedness, and diversity) as well as topics, to score sentences in webpages, and then rank selected relevant sentences in proper order for summarization. The entire system is realized using many technologies, such as collection development, natural language processing, machine learning, and deep learning. For each part, comprehensive evaluations are carried out, that confirm the effectiveness and accuracy of our proposed approaches. Regarding broader impact, the outcomes proposed in our study can be easily adopted or extended for further event analyses and service development. / Doctor of Philosophy / Event-related collections, including both tweets and webpages, have valuable information. They are worth exploring in interdisciplinary research and education. Unfortunately, such data is noisy. Many tweets and webpages are not relevant to the events. This leads to difficulties during data analysis of the datasets, as well as explanation of the results. Further, for better understanding, more knowledge hidden behind events needs to be unearthed. Regarding these collections, different groups of people may have different requirements. Some may need relatively clean datasets for data exploration. Some require preprocessing of information, so they can conduct analyses, e.g., based on tweeter type or content topic. General societies are interested in the overall descriptions of events. 
However, few systems, tools, or methods exist to support the flexible use of event-related collections. Accordingly, we describe our new framework and integrated system to process and analyze event-related collections. It provides varied services and covers the most important stages in a system pipeline. It has sub-systems to clean, manage, analyze, integrate, and visualize event-related collections. It takes an event-related tweet collection as input and generates an event-related webpage corpus by leveraging Wikipedia and the URLs embedded in tweets. It also combines and enriches original tweets with webpages. As an application of data management, we conduct an empirical study of tweets and their embedded URLs. We developed TwiRole for 3-way user classification on Twitter. It detects brand-related, female-related, and male-related tweeters through their profiles, tweets, and images. To aid user-centered social research, we combine TwiRole with an existing emotion detection tool, and carry out tweeting pattern analyses on disaster-related collections. Finally, we propose a tweet-guided multi-document summarization (TMDS) model and service, which generates summaries of the event-related collections by using tweets associated with those events. It extracts important sentences across different topics from webpages, and organizes them in proper order. The entire system is realized using many technologies, such as collection development, natural language processing, machine learning, and deep learning. For each part, comprehensive evaluations help confirm the effectiveness and accuracy of our proposed approaches. Regarding broader impact, our methods and system can be easily adopted or extended for further event analyses and service development.
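The abstract above mentions ranking a webpage corpus by query generation probability. As a rough illustration only, the following Python sketch scores pages with a basic query likelihood model. The Jelinek-Mercer smoothing, the `lam` weight, and all function names are assumptions made for this example; the dissertation's pre-query design and post-query expansion steps are not reproduced here.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_len, lam=0.5):
    """Log-probability that the document's language model generates the query.

    Uses Jelinek-Mercer smoothing (an assumption, not taken from the source):
    p(t | d) = (1 - lam) * tf(t, d) / |d| + lam * cf(t) / |C|.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts[term] / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The term never occurs in the collection at all.
            return float("-inf")
        score += math.log(p)
    return score


def rank_pages(query_terms, pages, lam=0.5):
    """Rank pages (dict: name -> token list) by query likelihood, best first."""
    collection_counts = Counter(t for tokens in pages.values() for t in tokens)
    collection_len = sum(collection_counts.values())
    if collection_len == 0:
        return []
    scored = [
        (name, query_log_likelihood(query_terms, tokens, collection_counts,
                                    collection_len, lam))
        for name, tokens in pages.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In this sketch the query terms could come from an event-related tweet collection, so that pages whose vocabulary best explains the query rise to the top of the ranking.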
133

Product Defect Discovery and Summarization from Online User Reviews

Zhang, Xuan 29 October 2018
Product defects concern various groups of people, such as customers, manufacturers, government officials, etc. Thus, defect-related knowledge and information are essential. In keeping with the growth of social media, online forums, and Internet commerce, people post a vast amount of feedback on products, which forms a good source for the automatic acquisition of knowledge about defects. However, considering the vast volume of online reviews, how to automatically identify critical product defects and summarize the related information from the huge number of user reviews is challenging, even when we target only the negative reviews. As a kind of opinion mining research, existing defect discovery methods mainly focus on how to classify the type of product issues, which is not enough for users. People expect to see defect information in multiple facets, such as product model, component, and symptom, which are necessary to understand the defects and quantify their influence. In addition, people are eager to seek problem resolutions once they spot defects. These challenges cannot be solved by existing aspect-oriented opinion mining models, which seldom consider the defect entities mentioned above. Furthermore, users also want to better capture the semantics of review text, and to summarize product defects more accurately in the form of natural language sentences. However, existing text summarization models including neural networks can hardly generalize to user review summarization due to the lack of labeled data. In this research, we explore topic models and neural network models for product defect discovery and summarization from user reviews. Firstly, a generative Probabilistic Defect Model (PDM) is proposed, which models the generation process of user reviews from key defect entities including product Model, Component, Symptom, and Incident Date. 
Using the joint topics in these aspects, which are produced by PDM, people can discover defects which are represented by those entities. Secondly, we devise a Product Defect Latent Dirichlet Allocation (PDLDA) model, which describes how negative reviews are generated from defect elements like Component, Symptom, and Resolution. The interdependency between these entities is modeled by PDLDA as well. PDLDA answers not only what the defects look like, but also how to address them using the crowd wisdom hidden in user reviews. Finally, the problem of how to summarize user reviews more accurately, and better capture the semantics in them, is studied using deep neural networks, especially Hierarchical Encoder-Decoder Models. For each of the research topics, comprehensive evaluations are conducted to justify the effectiveness and accuracy of the proposed models, on heterogeneous datasets. Further, on the theoretical side, this research contributes to the research stream on product defect discovery, opinion mining, probabilistic graphical models, and deep neural network models. Regarding impact, these techniques will benefit related users such as customers, manufacturers, and government officials. / Ph. D. / Product defects concern various groups of people, such as customers, manufacturers, and government officials. Thus, defect-related knowledge and information are essential. In keeping with the growth of social media, online forums, and Internet commerce, people post a vast amount of feedback on products, which forms a good source for the automatic acquisition of knowledge about defects. However, considering the vast volume of online reviews, how to automatically identify critical product defects and summarize the related information from the huge number of user reviews is challenging, even when we target only the negative reviews. 
People expect to see defect information in multiple facets, such as product model, component, and symptom, which are necessary to understand the defects and quantify their influence. In addition, people are eager to seek problem resolutions once they spot defects. Furthermore, users also want product defects to be summarized more accurately in the form of natural language sentences. These requirements cannot be satisfied by existing methods, which seldom consider the defect entities mentioned above or can hardly generalize to user review summarization. In this research, we develop novel Machine Learning (ML) algorithms for product defect discovery and summarization. Firstly, we study how to identify product defects and their related attributes, such as Product Model, Component, Symptom, and Incident Date. Secondly, we devise a novel algorithm, which can discover product defects and the related Component, Symptom, and Resolution, from online user reviews. This method tells not only what the defects look like, but also how to address them using the crowd wisdom hidden in user reviews. Finally, we address the problem of how to summarize user reviews in the form of natural language sentences using a paraphrase-style method. On the theoretical side, this research contributes to multiple research areas in Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning. Regarding impact, these techniques will benefit related users such as customers, manufacturers, and government officials.
134

Mobile-cloud assisted video summarization framework for efficient management of remote sensing data generated by wireless capsule sensors

Mehmood, Irfan, Sajjad, M., Baik, S.W. 18 July 2019
Wireless capsule endoscopy (WCE) has great advantages over traditional endoscopy because it is portable and easy to use, especially in remote health-monitoring services. However, during the WCE process, the large amount of captured video data demands a significant amount of computation to analyze and retrieve informative video frames. In order to facilitate efficient WCE data collection and browsing tasks, we present a resource- and bandwidth-aware WCE video summarization framework that extracts the representative keyframes of the WCE video contents by removing redundant and non-informative frames. For redundancy elimination, we use Jeffrey-divergence between color histograms and inter-frame Boolean series-based correlation of color channels. To remove non-informative frames, multi-fractal texture features are extracted to assist the classification using an ensemble-based classifier. Owing to the limited WCE resources, it is impossible for the WCE system to perform computationally intensive video summarization tasks. To resolve computational challenges, a mobile-cloud architecture is incorporated, which provides resizable computing capacities by adaptively offloading video summarization tasks between the client and the cloud server. The qualitative and quantitative results are encouraging and show that the proposed framework saves information transmission cost and bandwidth, as well as the valuable time of data analysts in browsing remote sensing data. / Supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2013R1A1A2012904).
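The redundancy-elimination step described above compares color histograms of frames using Jeffrey divergence. The Python sketch below shows one way such a filter might look. It assumes the symmetric Kullback-Leibler form of the divergence; the smoothing constant, the `threshold` value, and the function names are hypothetical and are not taken from the paper. The Boolean series-based correlation and the texture classifier are not covered.

```python
import math


def normalize(hist, eps=1e-9):
    """Turn raw bin counts into a smoothed probability distribution."""
    total = sum(hist) + eps * len(hist)
    return [(h + eps) / total for h in hist]


def jeffrey_divergence(hist_a, hist_b):
    """Symmetric KL divergence: sum over bins of (p - q) * ln(p / q)."""
    p, q = normalize(hist_a), normalize(hist_b)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def select_keyframes(histograms, threshold=0.1):
    """Keep a frame only if it differs enough from the last kept frame.

    `histograms` is a list of per-frame color histograms (equal-length
    lists of bin counts); `threshold` is a hypothetical tuning value.
    Returns the indices of the frames that are kept.
    """
    if not histograms:
        return []
    kept = [0]
    for i in range(1, len(histograms)):
        if jeffrey_divergence(histograms[kept[-1]], histograms[i]) > threshold:
            kept.append(i)
    return kept
```

Comparing each frame against the last kept frame, rather than its immediate predecessor, keeps slow gradual drift from slipping past the filter one small step at a time.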
135

Génération de résumés par abstraction

Genest, Pierre-Étienne 05 1900
Cette thèse présente le résultat de plusieurs années de recherche dans le domaine de la génération automatique de résumés. Trois contributions majeures, présentées sous la forme d'articles publiés ou soumis pour publication, en forment le coeur. Elles retracent un cheminement qui part des méthodes par extraction en résumé jusqu'aux méthodes par abstraction. L'expérience HexTac, sujet du premier article, a d'abord été menée pour évaluer le niveau de performance des êtres humains dans la rédaction de résumés par extraction de phrases. Les résultats montrent un écart important entre la performance humaine sous la contrainte d'extraire des phrases du texte source par rapport à la rédaction de résumés sans contrainte. Cette limite à la rédaction de résumés par extraction de phrases, observée empiriquement, démontre l'intérêt de développer d'autres approches automatiques pour le résumé. Nous avons ensuite développé un premier système selon l'approche Fully Abstractive Summarization, qui se situe dans la catégorie des approches semi-extractives, comme la compression de phrases et la fusion de phrases. Le développement et l'évaluation du système, décrits dans le second article, ont permis de constater le grand défi de générer un résumé facile à lire sans faire de l'extraction de phrases. Dans cette approche, le niveau de compréhension du contenu du texte source demeure insuffisant pour guider le processus de sélection du contenu pour le résumé, comme dans les approches par extraction de phrases. Enfin, l'approche par abstraction basée sur des connaissances nommée K-BABS est proposée dans un troisième article. Un repérage des éléments d'information pertinents est effectué, menant directement à la génération de phrases pour le résumé. Cette approche a été implémentée dans le système ABSUM, qui produit des résumés très courts mais riches en contenu. 
Ils ont été évalués selon les standards d'aujourd'hui et cette évaluation montre que des résumés hybrides formés à la fois de la sortie d'ABSUM et de phrases extraites ont un contenu informatif significativement plus élevé qu'un système provenant de l'état de l'art en extraction de phrases. / This Ph.D. thesis is the result of several years of research on automatic text summarization. Three major contributions are presented in the form of published and submitted papers. They follow a path that moves away from extractive summarization and toward abstractive summarization. The first article describes the HexTac experiment, which was conducted to evaluate the performance of humans summarizing text by extracting sentences. Results show a wide gap of performance between human summaries written by sentence extraction and those written without restriction. This empirical performance ceiling to sentence extraction demonstrates the need for new approaches to text summarization. We then developed and implemented a system, which is the subject of the second article, using the Fully Abstractive Summarization approach. Though the name suggests otherwise, this approach is better categorized as semi-extractive, along with sentence compression and sentence fusion. Building and evaluating this system brought to light the great challenge associated with generating easily readable summaries without extracting sentences. In this approach, text understanding is not deep enough to provide help in the content selection process, as is the case in extractive summarization. As the third contribution, a knowledge-based approach to abstractive summarization called K-BABS was proposed. Relevant content is identified by pattern matching on an analysis of the source text, and rules are applied to directly generate sentences for the summary. This approach is implemented in a system called ABSUM, which generates very short and content-rich summaries. 
An evaluation was performed according to today's standards. The evaluation shows that hybrid summaries generated by adding extracted sentences to ABSUM's output have significantly more content than a state-of-the-art extractive summarizer.
136

Information Theoretic Approach To Extractive Text Summarization

Ravindra, G 08 1900
Automatic text summarization techniques, which can reduce a source text to a summary text by content generalization or selection, have assumed significance in recent times due to the ever-expanding information explosion created by the World Wide Web. Summaries generated by generalization of information are called abstracts, and those generated by selection of portions of text (sentences, phrases, etc.) are called extracts. Further, summaries could be produced for each document separately, or multiple documents could be summarized together to produce a single summary. The challenges in making machines generate extracts or abstracts are primarily due to the lack of understanding of human cognitive processes. Summaries generated by humans seem to be influenced by their moral, emotional and ethical stance on the subject and their background knowledge of the content being summarized. These characteristics are hardly understood and difficult to model mathematically. Further, automatic summarization is very much handicapped by limitations of existing computing resources and the lack of good mathematical models of cognition. In view of these, the role of rigorous mathematical theory in summarization has been limited hitherto. The research reported in this thesis is a contribution towards bringing the power of well-established concepts of information theory to the field of summarization. Contributions of the Thesis: The specific focus of this thesis is on extractive summarization. Its domain spans multi-document summarization as well as single-document summarization. Throughout the thesis, the words "summarization" and "summary" imply extract generation and sentence extracts respectively. In this thesis, two novel summarizers referred to as ESCI (Extractive Summarization using Collocation Information) and De-ESCI (Dictionary enhanced ESCI) have been proposed. 
In addition, an automatic summary evaluation technique called DeFuSE (Dictionary enhanced Fuzzy Summary Evaluator) has also been introduced. The mathematical basis for the evolution of the scoring scheme proposed in this thesis and its relationship with other well-known summarization algorithms such as Latent Semantic Indexing (LSI) is also derived. The work detailed in this thesis is specific to the domain of extractive summarization of unstructured text without taking into account data set characteristics such as the positional importance of sentences. This is to ensure that the summarizer works well for a broad class of documents, and to keep the proposed models as generic as possible. Central to the proposed work is the concept of "Collocation Information of a word", its quantification and application to summarization. "Collocation Information" (CI) is the amount of information (Shannon’s measure) that a word and its collocations together contribute to the total information in the document(s) being summarized. The CI of a word has been computed using Shannon’s measure for information using a joint probability distribution. Further, a base value of CI called "Discrimination Threshold" (DT) has also been derived. To determine DT, sentences from a large collection of documents covering various topics, including the topic covered by the document(s) being summarized, were broken down into sequences of word collocations. The number of possible neighbors for a word within a specified collocation window was determined. This number has been called the "cardinality of the collocating set" and is represented as |ℵ(w)|. It is proved that if |ℵ(w)| determined from this large document collection for any word w is fixed, then the maximum value of the CI for a word w is proportional to |ℵ(w)|. This constrained maximum is the "Discrimination Threshold" and is used as the base value of CI. 
Experimental evidence detailed in this thesis shows that sentences containing words with CI greater than DT are most likely to be useful in an extract. Words in every sentence of the document(s) being summarized have been assigned scores based on the difference between their current value of CI and their respective DT. Individual word scores have been summed to derive a score for every sentence. Sentences are ranked according to their scores and the first few sentences in the rank order have been selected as the extract summary. Redundant and semantically similar sentences have been excluded from the selection process using a simple similarity detection algorithm. This novel method for extraction has been called ESCI in this thesis. In the second part of the thesis, the advantages of tagging words as nouns, verbs, adjectives and adverbs without the use of sense disambiguation have been explored. A hierarchical model for abstraction of knowledge has been proposed, and those cases where such a model can improve summarization accuracy have been explained. Knowledge abstraction has been achieved by converting collocations into their hypernymous versions. The number of levels of abstraction varies based on the sense tag given to each word in the collocation being abstracted. Once abstractions have been determined, the Expectation-Maximization algorithm is used to determine the probability value of each collocation at every level of abstraction. 
A combination of abstracted collocations from various levels is then chosen and sentences are assigned scores based on collocation information of these abstractions. This summarization scheme has been referred to as De-ESCI (Dictionary enhanced ESCI). It has been observed in many human summary data sets that the factual attribute of the human determines the choice of noun and verb pairs. Similarly, the emotional attribute of the human determines the choice of the number of noun and adjective pairs. In order to bring these attributes into the machine-generated summaries, two variants of De-ESCI have been proposed. The summarizer with the factual attribute has been called De-ESCI-F, while the summarizer with the emotional attribute has been called De-ESCI-E in this thesis. Both create summaries having two parts. The first part of the summary created by De-ESCI-F is obtained by scoring and selecting only those sentences where a fixed number of nouns and verbs occur. The second part of De-ESCI-F is obtained by ranking and selecting those sentences which do not qualify for the selection process in the first part. Assigning sentence scores and selecting sentences for the second part of the summary is exactly as in ESCI. Similarly, the first part of De-ESCI-E is generated by scoring and selecting only those sentences where a fixed number of nouns and adjectives occur. The second part of the summary produced by De-ESCI-E is exactly like the second part in De-ESCI-F. As the model summary generated by human summarizers may or may not contain sentences with preference given to qualifiers (adjectives), the automatic summarizer does not know a priori whether to choose sentences with qualifiers over those without qualifiers. As there are two versions of the summary produced by De-ESCI-F and De-ESCI-E, one of them should be closer to the human summarizer’s point of view (in terms of giving importance to qualifiers). 
This technique of choosing the best candidate summary has been referred to as De-ESCI-F/E. Performance Metrics: The focus of this thesis is to propose new models and sentence ranking techniques aimed at improving the accuracy of the extract in terms of sentences selected, rather than on the readability of the summary. As a result, the order of sentences in the summary is not given importance during evaluation. Automatic evaluation metrics have been used and the performance of the automatic summarizer has been evaluated in terms of precision, recall and f-scores obtained by comparing its output with model human-generated extract summaries. A novel summary evaluator called DeFuSE has been proposed in this thesis, and its scores are used along with the scores given by a standard evaluator called ROUGE. DeFuSE evaluates an extract in terms of precision, recall and f-score, relying on the use of the WordNet hypernymy structure to identify semantically similar sentences in a document. It also uses fuzzy set theory to compute the extent to which a sentence from the machine-generated extract belongs to the model summary. Performance of candidate summarizers has been discussed in terms of percentage improvement in f-score relative to the baselines. The average of the ROUGE and DeFuSE f-scores for every summary is computed, and the mean value of these scores is used to compare performance improvement. Performance: For illustrative purposes, DUC 2002 and DUC 2003 multi-document data sets have been used. From these data sets only the 400-word summaries of DUC 2002 and track-4 (novelty track) summaries of DUC 2003 are useful for evaluation of sentence extracts and hence only these have been used. The f-score has been chosen as a measure of performance. Standard baselines such as coverage, size and lead and also probabilistic baselines have been used to measure percentage improvement in f-score of candidate summarizers relative to these baselines. 
Further, summaries generated by MEAD using centroid and length as features for ranking (MEAD-CL), MEAD using positional, centroid and length as features for ranking (MEAD-CLP), Microsoft Word automatic summarizer (MS-Word) and Latent Semantic Indexing (LSI) based summarizer were used to compare the performance of the proposed summarization schemes.
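The ESCI scoring procedure described in this abstract (word scores as CI minus DT, summed per sentence, then ranked) can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis's formulation: CI is approximated here as the entropy contribution of a word's collocation pairs under the joint pair distribution, the `thresholds` mapping stands in for the Discrimination Threshold values that the thesis derives from a large background collection, and all names and defaults are hypothetical. The redundancy-removal step is omitted.

```python
import math
from collections import Counter, defaultdict


def collocation_counts(sentences, window=2):
    """Count (word, neighbor) pairs that co-occur within `window` tokens."""
    pairs = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(word, tokens[j])] += 1
    return pairs


def collocation_information(pairs):
    """Assumed CI(w) = -sum over collocates c of p(w, c) * log2 p(w, c)."""
    total = sum(pairs.values())
    ci = defaultdict(float)
    for (word, _), count in pairs.items():
        p = count / total
        ci[word] -= p * math.log2(p)
    return ci


def rank_sentences(sentences, thresholds, window=2, top_k=2):
    """Return indices of the top_k sentences by summed (CI - DT) word scores.

    `thresholds` maps a word to its (hypothetical) Discrimination Threshold;
    words missing from the mapping are treated as having a threshold of 0.
    """
    ci = collocation_information(collocation_counts(sentences, window))
    scores = [
        sum(ci[word] - thresholds.get(word, 0.0) for word in tokens)
        for tokens in sentences
    ]
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]
```

In this sketch a sentence is rewarded only for words whose collocation information exceeds their threshold, which mirrors the abstract's observation that such words mark sentences likely to be useful in an extract.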
137

Génération de résumés par abstraction

Genest, Pierre-Étienne 05 1900 (has links)
Cette thèse présente le résultat de plusieurs années de recherche dans le domaine de la génération automatique de résumés. Trois contributions majeures, présentées sous la forme d'articles publiés ou soumis pour publication, en forment le coeur. Elles retracent un cheminement qui part des méthodes par extraction en résumé jusqu'aux méthodes par abstraction. L'expérience HexTac, sujet du premier article, a d'abord été menée pour évaluer le niveau de performance des êtres humains dans la rédaction de résumés par extraction de phrases. Les résultats montrent un écart important entre la performance humaine sous la contrainte d'extraire des phrases du texte source par rapport à la rédaction de résumés sans contrainte. Cette limite à la rédaction de résumés par extraction de phrases, observée empiriquement, démontre l'intérêt de développer d'autres approches automatiques pour le résumé. Nous avons ensuite développé un premier système selon l'approche Fully Abstractive Summarization, qui se situe dans la catégorie des approches semi-extractives, comme la compression de phrases et la fusion de phrases. Le développement et l'évaluation du système, décrits dans le second article, ont permis de constater le grand défi de générer un résumé facile à lire sans faire de l'extraction de phrases. Dans cette approche, le niveau de compréhension du contenu du texte source demeure insuffisant pour guider le processus de sélection du contenu pour le résumé, comme dans les approches par extraction de phrases. Enfin, l'approche par abstraction basée sur des connaissances nommée K-BABS est proposée dans un troisième article. Un repérage des éléments d'information pertinents est effectué, menant directement à la génération de phrases pour le résumé. Cette approche a été implémentée dans le système ABSUM, qui produit des résumés très courts mais riches en contenu. 
Ils ont été évalués selon les standards d'aujourd'hui et cette évaluation montre que des résumés hybrides formés à la fois de la sortie d'ABSUM et de phrases extraites ont un contenu informatif significativement plus élevé qu'un système provenant de l'état de l'art en extraction de phrases. / This Ph.D. thesis is the result of several years of research on automatic text summarization. Three major contributions are presented in the form of published and submitted papers. They follow a path that moves away from extractive summarization and toward abstractive summarization. The first article describes the HexTac experiment, which was conducted to evaluate the performance of humans summarizing text by extracting sentences. Results show a wide performance gap between human summaries written by sentence extraction and those written without restriction. This empirical performance ceiling on sentence extraction demonstrates the need for new approaches to text summarization. We then developed and implemented a system, which is the subject of the second article, using the Fully Abstractive Summarization approach. Though the name suggests otherwise, this approach is better categorized as semi-extractive, along with sentence compression and sentence fusion. Building and evaluating this system brought to light the great challenge of generating easily readable summaries without extracting sentences. In this approach, text understanding is not deep enough to help in the content selection process, as is also the case in extractive summarization. As the third contribution, a knowledge-based approach to abstractive summarization called K-BABS was proposed. Relevant content is identified by pattern matching on an analysis of the source text, and rules are applied to directly generate sentences for the summary. This approach is implemented in a system called ABSUM, which generates very short and content-rich summaries. 
An evaluation was performed according to today's standards. The evaluation shows that hybrid summaries generated by adding extracted sentences to ABSUM's output have significantly more content than a state-of-the-art extractive summarizer.
138

Investigação de estratégias de seleção de conteúdo baseadas na UNL (Universal Networking Language)

Chaud, Matheus Rigobelo 03 March 2015 (has links)
Financiadora de Estudos e Projetos / The field of Natural Language Processing (NLP) has witnessed increased attention to Multilingual Multidocument Summarization (MMS), whose goal is to process a cluster of source documents in more than one language and generate a summary of this collection in one of the target languages. In MMS, the selection of sentences from source texts for summary generation may be based on either shallow or deep linguistic features. The purpose of this research was to investigate whether the use of deep knowledge, obtained from a conceptual representation of the source texts, could be useful for content selection in texts within the newspaper genre. In this study, we used a formal representation system, the UNL (Universal Networking Language). In order to investigate content selection strategies based on this interlingua, 3 clusters of texts were represented in UNL, each consisting of 1 text in Portuguese, 1 text in English and 1 human-written reference summary. Additionally, in each cluster, the sentences of the source texts were aligned to the sentences of their respective human summaries, in order to identify total or partial content overlap between these sentences. The data collected allowed a comparison between content selection strategies based on conceptual information and a traditional selection method based on a superficial feature: the position of the sentence in the source text. According to the results, content selection based on sentence position was more closely correlated with the selection made by the human summarizer than the conceptual methods investigated. 
Furthermore, the sentences at the beginning of the source texts, which, in newspaper articles, usually convey the most relevant information, did not necessarily contain the most frequent concepts in the text collection; on several occasions, the sentences with the most frequent concepts were in the middle or at the end of the text. These results indicate that, at least in the clusters analyzed, other criteria besides concept frequency help determine the relevance of a sentence. In other words, content selection in human multidocument summarization may not be limited to the selection of the sentences with the most frequent concepts. In fact, it seems to be a much more complex process. / Na área de Processamento Automático das Línguas Naturais (PLN), há um destaque crescente para a Sumarização Automática Multidocumento Multilíngue (SAMM), cujo objetivo é processar uma coleção de documentos-fonte em mais de uma língua e gerar um sumário correspondente a essa coleção em uma das línguas-alvo. Na SAMM, a seleção das sentenças dos textos-fonte para composição do sumário pode ser feita com base em atributos linguísticos superficiais ou profundos. O objetivo deste projeto foi investigar se a utilização de conhecimento profundo, obtido a partir de uma representação conceitual dos textos-fonte, pode ser útil na seleção de conteúdo em textos do gênero jornalístico. Para isso, utilizou-se um sistema de representação formal, a UNL (Universal Networking Language). Visando investigar estratégias de seleção de conteúdo baseadas nessa interlíngua, fez-se a representação em UNL de 3 coleções de textos, cada qual com 1 texto-fonte em português, 1 texto-fonte em inglês e 1 sumário humano de referência. Fez-se também o alinhamento das sentenças dos textos-fonte de cada coleção às sentenças de seus respectivos sumários humanos, objetivando identificar sobreposição total ou parcial de conteúdo entre essas sentenças. 
Esses dados permitiram a comparação entre estratégias de seleção de conteúdo baseadas em informações conceituais e um método de seleção tradicional baseado em um atributo superficial: a posição da sentença no texto-fonte. De acordo com os resultados obtidos, a seleção de conteúdo com base na posição no texto-fonte correlacionou-se mais adequadamente com a seleção realizada pelo sumarizador humano, comparado aos métodos conceituais investigados. Além disso, as sentenças iniciais dos textos-fonte, que, em textos jornalísticos, normalmente veiculam as informações mais relevantes, não necessariamente continham os conceitos mais frequentes da coleção; em diversas ocasiões, as sentenças com os conceitos mais frequentes estavam em posição intermediária ou final no texto. Esses resultados indicam que, ao menos nas coleções analisadas, outros critérios, além da frequência de conceitos, concorrem para determinar a relevância de uma sentença. Em outras palavras, na sumarização humana multidocumento, a seleção de conteúdo provavelmente não se resume a selecionar sentenças com os conceitos mais frequentes, tratando-se de um processo bem mais complexo.
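
The comparison described in this abstract (a position-based baseline against a frequency-based selection, each checked against the sentences a human summarizer used) can be sketched roughly as follows. This is only an illustration under loud assumptions: plain word tokens stand in for UNL concepts, the function names are hypothetical, and the agreement score is a simple overlap ratio, not the correlation analysis used in the thesis.

```python
import re
from collections import Counter


def select_by_position(sentences, k):
    """Baseline: pick the first k sentences of the source text."""
    return list(range(min(k, len(sentences))))


def select_by_frequency(sentences, k):
    """Pick the k sentences whose tokens are, on average, most frequent.

    Assumption: word tokens are a stand-in for UNL concepts.
    """
    token_lists = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(t for tokens in token_lists for t in tokens)

    def score(i):
        tokens = token_lists[i]
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Sort by descending score; ties are broken by original position.
    ranked = sorted(range(len(sentences)), key=lambda i: (-score(i), i))
    return sorted(ranked[:k])


def agreement(selected, reference):
    """Fraction of human-aligned sentences that a strategy also selected."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(reference & set(selected)) / len(reference)


if __name__ == "__main__":
    text = [
        "Floods hit the city.",
        "Rain rain rain.",
        "Rain and floods continue.",
    ]
    human_aligned = [0]  # hypothetical alignment to a human summary
    print(agreement(select_by_position(text, 1), human_aligned))
    print(agreement(select_by_frequency(text, 1), human_aligned))
```

In this toy input the lead sentence matches the hypothetical human choice while the sentence with the most frequent tokens does not, which mirrors the kind of mismatch the abstract reports.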
139

Système symbolique de création de résumés de mise à jour

Genest, Pierre-Étienne January 2009 (has links)
Thesis digitized by the Division de la gestion de documents et des archives (Records Management and Archives Division) of the Université de Montréal.
140

Extractive document summarization using complex networks / Sumarização extractiva de documentos usando redes complexas

Tohalino, Jorge Andoni Valverde 15 June 2018 (has links)
Due to the large amount of textual information available on the Internet, the task of automatic document summarization has gained significant importance. Document summarization became important because its focus is the development of techniques aimed at finding relevant and concise content in large volumes of information without changing its original meaning. The purpose of this Master's work is to use network theory concepts for extractive document summarization, for both Single Document Summarization (SDS) and Multi-Document Summarization (MDS). In this work, the documents are modeled as networks in which sentences are represented as nodes, with the aim of extracting the most relevant sentences through the use of ranking algorithms. The edges between nodes are established in different ways. The first approach for edge calculation is based on the number of common nouns between two sentences (network nodes). Another approach to creating an edge is through the similarity between two sentences. In order to calculate the similarity of such sentences, we used the vector space model based on Tf-Idf weighting and word embeddings for the vector representation of the sentences. Also, we make a distinction between edges linking sentences from different documents (inter-layer) and those connecting sentences from the same document (intra-layer) by using multilayer network models for the Multi-Document Summarization task. In this approach, each network layer represents a document of the document set that will be summarized. In addition to the measurements typically used in complex networks, such as node degree, clustering coefficient and shortest paths, the network characterization is also guided by dynamical measurements of complex networks, including symmetry, accessibility and absorption time. The generated summaries were evaluated using different corpora for both Portuguese and English. The ROUGE-1 metric was used for the validation of the generated summaries. 
The results suggest that simpler models, such as noun-based and Tf-Idf-based networks, performed better than the models based on word embeddings. Also, excellent results were achieved by using the multilayered representation of documents for MDS. Finally, we concluded that several measurements could be used to improve the characterization of networks for the summarization task. / Devido à grande quantidade de informações textuais disponíveis na Internet, a tarefa de sumarização automática de documentos ganhou importância significativa. A sumarização de documentos tornou-se importante porque seu foco é o desenvolvimento de técnicas destinadas a encontrar conteúdo relevante e conciso em grandes volumes de informação sem alterar seu significado original. O objetivo deste trabalho de Mestrado é usar os conceitos da teoria de grafos para o resumo extrativo de documentos para Sumarização mono-documento (SDS) e Sumarização multi-documento (MDS). Neste trabalho, os documentos são modelados como redes, onde as sentenças são representadas como nós com o objetivo de extrair as sentenças mais relevantes através do uso de algoritmos de ranqueamento. As arestas entre nós são estabelecidas de maneiras diferentes. A primeira abordagem para o cálculo de arestas é baseada no número de substantivos comuns entre duas sentenças (nós da rede). Outra abordagem para criar uma aresta é através da similaridade entre duas sentenças. Para calcular a similaridade de tais sentenças, foi usado o modelo de espaço vetorial baseado na ponderação Tf-Idf e word embeddings para a representação vetorial das sentenças. Além disso, fazemos uma distinção entre as arestas que vinculam sentenças de diferentes documentos (inter-camada) e aquelas que conectam sentenças do mesmo documento (intra-camada) usando modelos de redes multicamada para a tarefa de Sumarização multi-documento. Nesta abordagem, cada camada da rede representa um documento do conjunto de documentos que será resumido. 
Além das medições tipicamente usadas em redes complexas como grau dos nós, coeficiente de agrupamento, caminhos mais curtos, etc., a caracterização da rede também é guiada por medições dinâmicas de redes complexas, incluindo simetria, acessibilidade e tempo de absorção. Os resumos gerados foram avaliados usando diferentes corpus para Português e Inglês. A métrica ROUGE-1 foi usada para a validação dos resumos gerados. Os resultados sugerem que os modelos mais simples, como redes baseadas em Noun e Tf-Idf, obtiveram um melhor desempenho em comparação com os modelos baseados em word embeddings. Além disso, excelentes resultados foram obtidos usando a representação de redes multicamada de documentos para MDS. Finalmente, concluímos que várias medidas podem ser usadas para melhorar a caracterização de redes para a tarefa de sumarização.
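
As a rough illustration of the sentence-network idea in the abstract above (sentences as nodes, Tf-Idf similarity as edge weights, a network measurement for ranking), the following Python sketch may help. It is not the thesis's implementation: the function names are hypothetical, the tokenizer is a naive regular expression, the smoothed Idf formula is an assumption, and weighted node degree (strength) stands in for the richer measurements the author lists, such as accessibility or absorption time.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def tfidf_vectors(sentences):
    """Build one sparse Tf-Idf vector (a dict) per sentence.

    Each sentence is treated as a 'document' for the Idf statistics.
    """
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed Idf (an assumption) so shared terms keep some weight.
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sentences(sentences, k):
    """Return the indices of the k strongest nodes, in document order."""
    vectors = tfidf_vectors(sentences)
    n = len(sentences)
    # Node strength: sum of edge weights to every other sentence.
    strength = [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    ranked = sorted(range(n), key=lambda i: (-strength[i], i))
    return sorted(ranked[:k])


if __name__ == "__main__":
    doc = [
        "The cat sat on the mat.",
        "The cat chased the mouse on the mat.",
        "Stock markets fell sharply today.",
    ]
    for index in rank_sentences(doc, 2):
        print(doc[index])
```

For the multi-document case described in the abstract, one possible extension of this sketch would be to weight inter-document and intra-document edges differently before summing them; that choice is left open here.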
