21

Creating eye-catching headlines using BART / Skapa intressanta rubriker med hjälp av BART

Despinoy, Eva January 2022 (has links)
Social media is a significant factor in information distribution today, and this information landscape contains many different posts that compete for the user's attention. Different factors can help catch the user's interest, and one of them is the headline of the message. A headline can be more or less eye-catching, which can make the reader more or less interested in interacting with the post. The theme of this study is the automated creation of eye-catching headlines that stay truthful to the content of the articles, using automatic text summarization. The method consisted of fine-tuning BART, an existing model for text summarization. Other papers have approached this problem with different models and varying success; however, none have used this method. It was deemed an interesting method because it is less time- and energy-consuming than creating and training a new model entirely from scratch, and it could therefore be easily replicated if the results were positive. The BartForConditionalGeneration model implemented in the HuggingFace library was fine-tuned using the 'Popular News Articles' dataset by Web.io. The method showed positive results. The resulting headlines were deemed faithful to the original ones, with a ROUGE-2 recall score of 0.541. They were comparably eye-catching to the human-written headlines: respondents ranked the two almost the same, with an average rank of 1.692 for the human-written headlines and 1.821 for fine-tuned BART, and the generated headlines received an average score of 3.31 on a 1-to-5 attractiveness scale. They were also deemed very comprehensible, with an average score of 0.95 on a scale from 0 to 1.
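
For readers who want to see what such a setup looks like in practice, below is a minimal, hedged sketch of fine-tuning BartForConditionalGeneration with the HuggingFace transformers library. The checkpoint ("facebook/bart-base"), the CSV file name, and the 'article'/'headline' column names are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch of fine-tuning BART for headline generation (assumptions noted above).
from datasets import load_dataset
from transformers import (BartForConditionalGeneration, BartTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Hypothetical CSV with 'article' and 'headline' columns, standing in for the
# 'Popular News Articles' data used in the thesis.
dataset = load_dataset("csv", data_files="popular_news_articles.csv")["train"]

def preprocess(batch):
    model_inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["headline"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="bart-headlines",
                                  per_device_train_batch_size=4,
                                  num_train_epochs=3,
                                  learning_rate=3e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Generate a headline for an unseen article with beam search.
ids = tokenizer("Some article text ...", return_tensors="pt",
                max_length=1024, truncation=True).input_ids
print(tokenizer.decode(model.generate(ids, max_length=64, num_beams=4)[0],
                       skip_special_tokens=True))
```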
22

A comparative study of automatic text summarization using human evaluation and automatic measures / En jämförande studie av automatisk textsammanfattning med användning av mänsklig utvärdering och automatiska mått

Wennstig, Maja January 2023 (has links)
Automatic text summarization has emerged as a promising solution to manage the vast amount of information available on the internet, enabling a wider audience to access it. Nevertheless, further development and experimentation with different approaches are still needed. This thesis explores the potential of combining extractive and abstractive approaches into a hybrid method, generating three types of summaries: extractive, abstractive, and hybrid. The news articles used in the study are from the Swedish newspaper Dagens Nyheter (DN). The quality of the summaries is assessed using various automatic measures, including ROUGE, BERTScore, and Coh-Metrix. Additionally, human evaluations are conducted to compare the different types of summaries in terms of perceived fluency, adequacy, and simplicity. The results of the human evaluation show a statistically significant difference between extractive, abstractive, and hybrid summaries with regard to fluency, adequacy, and simplicity. Specifically, there is a significant difference between abstractive and hybrid summaries in terms of fluency and simplicity, but not in adequacy. The automatic measures, however, do not show significant differences between the different summaries but tend to give higher scores to the hybrid and abstractive summaries.
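
As a concrete illustration of the automatic measures mentioned above, the sketch below scores candidate summaries against a reference with ROUGE and BERTScore, using the rouge-score and bert-score Python packages. The package choice and the toy Swedish sentences are assumptions; Coh-Metrix, a standalone tool, is not covered here.

```python
# Hedged sketch: scoring candidate summaries with ROUGE and BERTScore.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Regeringen presenterade i dag en ny klimatplan."
candidates = {
    "extractive": "Regeringen presenterade en ny klimatplan i dag.",
    "abstractive": "En ny klimatplan lades fram av regeringen.",
}

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
for name, cand in candidates.items():
    rouge = scorer.score(reference, cand)  # dict of (precision, recall, fmeasure)
    # BERTScore handles Swedish via a multilingual model when lang="sv".
    P, R, F1 = bert_score([cand], [reference], lang="sv")
    print(name,
          {k: round(v.fmeasure, 3) for k, v in rouge.items()},
          "BERTScore F1:", round(F1.item(), 3))
```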
23

Addressing Challenges of Modern News Agencies via Predictive Modeling, Deep Learning, and Transfer Learning

Keneshloo, Yaser 22 July 2019 (has links)
Today's news agencies are moving from traditional journalism, where publishing just a few news articles per day was sufficient, to modern content generation mechanisms, which create thousands of news pieces every day. With the growth of these modern news agencies comes the arduous task of properly handling the massive amount of data generated for each news article. Therefore, news agencies are constantly seeking solutions to facilitate and automate some of the tasks previously done by humans. In this dissertation, we focus on two broad problems and provide solutions that help a news agency not only gain a wider view of reader behaviour around an article but also provide automated tools that ease the editors' job of summarizing news articles. These two disjoint problems aim at improving the reading experience by helping content generators monitor and focus on poorly performing content while allowing them to promote the well-performing pieces. We first focus on the task of popularity prediction of news articles via a combination of regression, classification, and clustering models. We next focus on the problem of generating automated text summaries for long news articles using deep learning models. The first problem helps the content developer understand how a news article performs over the long run, while the second provides automated tools for generating summaries of each news article. / Doctor of Philosophy / Nowadays, each person is exposed to an immense amount of information from social media, blog posts, and online news portals. Among these sources, news agencies are one of the main content providers for people around the world. Contemporary news agencies are moving from traditional journalism to modern techniques from different angles, either by building smart tools to track readers' reactions around a specific news article or by providing automated tools that help editors deliver higher-quality content to readers. These systems should not only scale well with the growth of readers but also handle ad-hoc requests precisely, since most policies and decisions in these agencies are made based on the results of these analytical tools. As part of this movement towards adopting new technologies for smart journalism, we have worked on various problems with The Washington Post on building tools for predicting the popularity of a news article and an automated text summarization model. We develop a model that monitors each news article after its publication and predicts the number of views the article will receive within the next 24 hours. This model helps the content creator not only promote potentially viral articles on the main page of the web portal or on social media, but also spot potentially poorly performing articles so that their content can be edited for better exposure. On the other hand, current news agencies generate more than a thousand news articles per day, and manually writing three to four summary sentences for each of these news pieces is not only infeasible in the near future but also very expensive and time-consuming. Therefore, we also develop a separate model for automated text summarization which generates summary sentences for a news article. Our model generates summaries by selecting the most salient sentences in the news article and paraphrasing them into shorter sentences that can serve as a summary of the entire document.
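
As a rough sketch of the popularity-prediction component described above (the regression part only; the dissertation combines regression, classification, and clustering), one could predict next-24-hour views from early engagement signals. The features and synthetic data below are purely illustrative assumptions.

```python
# Hedged sketch: regressing 24-hour views on hypothetical early signals.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical features: views, shares, comments in the first hour after publication.
X = rng.poisson(lam=[50, 5, 2], size=(n, 3)).astype(float)
# Synthetic target loosely correlated with the early signals.
y = 20 * X[:, 0] + 100 * X[:, 1] + 50 * X[:, 2] + rng.normal(0, 500, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out articles:", round(model.score(X_test, y_test), 3))
```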
24

Event-related Collections Understanding and Services

Li, Liuqing 18 March 2020 (has links)
Event-related collections, including both tweets and webpages, have valuable information, and are worth exploring in interdisciplinary research and education. Unfortunately, such data is noisy, so this variety of information has not been adequately exploited. Further, for better understanding, more knowledge hidden behind events needs to be unearthed. Regarding these collections, different societies may have different requirements in particular scenarios. Some may need relatively clean datasets for data exploration and data mining. Social researchers require preprocessing of information, so they can conduct analyses. General societies are interested in the overall descriptions of events. However, few systems, tools, or methods exist to support the flexible use of event-related collections. In this research, we propose a new, integrated system to process and analyze event-related collections at different levels (i.e., data, information, and knowledge). It also provides various services and covers the most important stages in a system pipeline, including collection development, curation, analysis, integration, and visualization. Firstly, we propose a query likelihood model with pre-query design and post-query expansion to rank a webpage corpus by query generation probability, and retrieve relevant webpages from event-related tweet collections. We further preserve webpage data in WARC files and enrich the original tweets with webpages in JSON format. As an application of data management, we conduct an empirical study of the URLs embedded in tweets, based on collection development and data curation techniques. Secondly, we develop TwiRole, an integrated model for 3-way user classification on Twitter, which detects brand-related, female-related, and male-related tweeters through multiple features with both machine learning (i.e., a random forest classifier) and deep learning (i.e., an 18-layer ResNet) techniques. As guidance for user-centered social research at the information level, we combine TwiRole with a pre-trained recurrent neural network-based emotion detection model, and carry out tweeting-pattern analyses on disaster-related collections. Finally, we propose a tweet-guided multi-document summarization (TMDS) model, which generates summaries of the event-related collections by using tweets associated with those events. The TMDS model also considers three aspects of named entities (i.e., importance, relatedness, and diversity) as well as topics, to score sentences in webpages, and then ranks the selected relevant sentences in proper order for summarization. The entire system is realized using many technologies, such as collection development, natural language processing, machine learning, and deep learning. For each part, comprehensive evaluations are carried out that confirm the effectiveness and accuracy of our proposed approaches. Regarding broader impact, the outcomes proposed in our study can be easily adopted or extended for further event analyses and service development. / Doctor of Philosophy / Event-related collections, including both tweets and webpages, have valuable information. They are worth exploring in interdisciplinary research and education. Unfortunately, such data is noisy. Many tweets and webpages are not relevant to the events. This leads to difficulties during data analysis of the datasets, as well as explanation of the results. Further, for better understanding, more knowledge hidden behind events needs to be unearthed.
Regarding these collections, different groups of people may have different requirements. Some may need relatively clean datasets for data exploration. Some require preprocessing of information, so they can conduct analyses, e.g., based on tweeter type or content topic. General societies are interested in the overall descriptions of events. However, few systems, tools, or methods exist to support the flexible use of event-related collections. Accordingly, we describe our new framework and integrated system to process and analyze event-related collections. It provides varied services and covers the most important stages in a system pipeline. It has sub-systems to clean, manage, analyze, integrate, and visualize event-related collections. It takes an event-related tweet collection as input and generates an event-related webpage corpus by leveraging Wikipedia and the URLs embedded in tweets. It also combines and enriches original tweets with webpages. As an application of data management, we conduct an empirical study of tweets and their embedded URLs. We developed TwiRole for 3-way user classification on Twitter. It detects brand-related, female-related, and male-related tweeters through their profiles, tweets, and images. To aid user-centered social research, we combine TwiRole with an existing emotion detection tool, and carry out tweeting pattern analyses on disaster-related collections. Finally, we propose a tweet-guided multi-document summarization (TMDS) model and service, which generates summaries of the event-related collections by using tweets associated with those events. It extracts important sentences across different topics from webpages, and organizes them in proper order. The entire system is realized using many technologies, such as collection development, natural language processing, machine learning, and deep learning. For each part, comprehensive evaluations help confirm the effectiveness and accuracy of our proposed approaches. Regarding broader impact, our methods and system can be easily adopted or extended for further event analyses and service development.
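
To make the query likelihood idea described above concrete, here is a minimal sketch that ranks webpages by the probability of generating the query under a Dirichlet-smoothed unigram language model. This is the textbook formulation applied to toy data; the dissertation's model additionally includes pre-query design and post-query expansion, which are not shown.

```python
# Hedged sketch: query likelihood ranking with Dirichlet smoothing.
import math
from collections import Counter

def ql_score(query_terms, doc_terms, collection_counts, collection_len, mu=2000):
    """Log P(query | document LM), smoothed against the collection LM."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_coll = collection_counts[t] / collection_len
        if p_coll == 0:
            continue  # skip terms unseen anywhere in the collection
        score += math.log((tf[t] + mu * p_coll) / (dl + mu))
    return score

docs = {
    "page1": "hurricane harvey flooding houston rescue".split(),
    "page2": "football schedule scores league".split(),
}
coll = Counter(t for d in docs.values() for t in d)
clen = sum(coll.values())

query = "hurricane flooding".split()
ranked = sorted(docs, key=lambda d: ql_score(query, docs[d], coll, clen), reverse=True)
print(ranked)  # page1 should rank first
```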
25

The effect of noise in the training of convolutional neural networks for text summarisation

Meechan-Maddon, Ailsa January 2019 (has links)
In this thesis, we work towards bridging the gap between two distinct areas: noisy text handling and text summarisation. The overall goal of the thesis is to examine the effects of noise in the training of convolutional neural networks for text summarisation, with a view to understanding how to effectively create a noise-robust text-summarisation system. We look specifically at the problem of abstractive text summarisation of noisy data in the context of summarising error-containing documents from automatic speech recognition (ASR) output. We experiment with adding varying levels of noise (errors) to the four-million-article Gigaword corpus and training an encoder-decoder CNN on it, with the aim of producing a noise-robust text-summarisation system. A total of six text-summarisation models are trained, each with a different level of noise. We discover that the models trained with a high level of noise are indeed able to aptly summarise noisy data into clean summaries, despite a tendency for all models to overfit to the level of noise on which they were trained. Directions are given for future steps towards an even more noise-robust and flexible text-summarisation system.
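
A hedged sketch of the noise-injection step is shown below: it corrupts a chosen fraction of tokens by random single-character substitution before training. The exact noise model applied to Gigaword in the thesis may differ; this is just one plausible scheme for simulating ASR-style errors.

```python
# Hedged sketch: injecting character-level noise into training text.
import random

def add_noise(tokens, noise_level=0.1, seed=0):
    """Corrupt roughly `noise_level` of the tokens by substituting one character."""
    rnd = random.Random(seed)
    noisy = []
    for tok in tokens:
        if rnd.random() < noise_level and len(tok) > 1:
            i = rnd.randrange(len(tok))
            c = rnd.choice("abcdefghijklmnopqrstuvwxyz")
            tok = tok[:i] + c + tok[i + 1:]
        noisy.append(tok)
    return noisy

sentence = "the prime minister announced a new economic policy today".split()
for level in (0.0, 0.1, 0.3):
    print(level, " ".join(add_noise(sentence, noise_level=level)))
```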
26

Data Mining Techniques to Understand Textual Data

Zhou, Wubai 04 October 2017 (has links)
More than ever, online information delivery and storage rely heavily on text. Billions of texts are produced every day in the form of documents, news, logs, search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Text understanding is a fundamental and essential task involving broad research topics, and it contributes to many applications in areas such as text summarization, search engines, recommendation systems, online advertising, and conversational bots. However, understanding text is never a trivial task for computers, especially for noisy and ambiguous text such as logs and search queries. This dissertation focuses on textual understanding tasks derived from two domains, i.e., disaster management and IT service management, both of which mainly use textual data as an information carrier. Improving situation awareness in disaster management and alleviating the human effort involved in IT service management dictate more intelligent and efficient solutions for understanding the textual data that acts as the main information carrier in the two domains. From the perspective of data mining, four directions are identified: (1) intelligently generating a storyline summarizing the evolution of a hurricane from a relevant online corpus; (2) automatically recommending resolutions according to the textual symptom description in a ticket; (3) gradually adapting the resolution recommendation system to time-correlated features derived from text; and (4) efficiently learning distributed representations for short and lousy ticket symptom descriptions and resolutions. Provided with different types of textual data, the data mining techniques proposed in these four research directions successfully address our tasks of understanding and extracting valuable knowledge from textual data. My dissertation addresses the research topics outlined above. Concretely, I focus on designing and developing data mining methodologies to better understand textual information, including (1) a storyline generation method for efficient summarization of natural hurricanes based on a crawled online corpus; (2) a recommendation framework for automated ticket resolution in IT service management; (3) an adaptive recommendation system for time-varying, temporally correlated features derived from text; and (4) a deep neural ranking model that not only recommends resolutions successfully but also efficiently outputs distributed representations for ticket descriptions and resolutions.
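
As a simple illustration of research direction (2), resolution recommendation can be sketched as nearest-neighbour retrieval over historical tickets. The TF-IDF baseline and toy tickets below are illustrative assumptions, not the dissertation's adaptive or deep neural ranking models.

```python
# Hedged sketch: recommending a resolution from the most similar historical ticket.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

historical = [
    ("disk usage above threshold on db server", "clean up archived logs"),
    ("application server not responding to ping", "restart network service"),
    ("user cannot log in after password reset", "unlock account in directory"),
]
symptoms = [s for s, _ in historical]

vec = TfidfVectorizer()
X = vec.fit_transform(symptoms)

new_ticket = "database host running out of disk space"
sims = cosine_similarity(vec.transform([new_ticket]), X)[0]
best = sims.argmax()
print("Recommended resolution:", historical[best][1])
```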
27

Sentence Compression by Removing Recursive Structure from Parse Trees / 構文木からの再帰構造の除去による文圧縮

MATSUBARA, Shigeki, KATO, Yoshihide, EGAWA, Seiji 18 July 2008 (has links)
No description available.
28

SearchViz: An Interactive Visual Interface to Navigate Search-Results in Online Discussion Forums

January 2015 (has links)
Online programming communities are widely used by programmers for troubleshooting and various problem-solving tasks. The large and ever-increasing volume of posts in these communities demands more effort to read and comprehend, making it harder to find relevant information. In my thesis, I designed and studied an alternate approach that uses interactive network visualization to represent relevant search results for online programming discussion forums. I conducted a user study to evaluate the effectiveness of this approach. Results show that users were able to identify relevant information more precisely via the visual interface than with the traditional list-based approach. Network visualization demonstrated effective search-result navigation support that facilitated users' tasks and improved query quality for successive queries. Subjective evaluation also showed that visualizing search results conveys more semantic information in an efficient manner and makes searching more effective. / Dissertation/Thesis / Masters Thesis Computer Science 2015
29

Indexace elektronických dokumentů a jejich částí / Indexing of text documents and their parts

Tomeš, Jiří January 2015 (has links)
The thesis describes the design and implementation of an application for processing electronic publications (collections of conference papers, comprehensive manuals, or even classical electronic books) in order to enrich their internal navigation with hyperlinks between related parts, and to produce summarizations of a given length that are as representative as possible. Unlike similar applications, summarizations can be based not only on sentences but also on elements of other categories, such as paragraphs, sections, and the like. The main emphasis was put on ease of use, platform independence, and multilingual support. The application provides a flexible environment that can be customized to the user's needs.
30

Graph Models For Query Focused Text Summarization And Assessment Of Machine Translation Using Stopwords

Rama, B 06 1900 (has links) (PDF)
Text summarization is the task of generating a shortened version of the original text in which the core ideas of the original are retained. In this work, we focus on query-focused summarization, where the task is to generate, from a set of documents, a summary that answers the query. Query-focused summarization is a hard task because it expects the summary to be biased towards the query while, at the same time, preserving the important concepts in the original documents with a high degree of novelty. Graph-based ranking algorithms which use a biased random-surfer model, like Topic-sensitive LexRank, have been applied to query-focused summarization. In our work, we propose a look-ahead version of Topic-sensitive LexRank. We incorporate the option of look-ahead in the random walk model and show that it helps in generating better-quality summaries. Next, we consider the assessment of machine translation. Assessing machine translation output is important for establishing benchmarks for translation quality. An obvious way to assess the quality of machine translation is through the perception of human subjects. Though highly reliable, this approach is not scalable and is time-consuming. Hence, mechanisms have been devised to automate the assessment process. All such assessment methods are essentially a study of correlations between human translation and machine translation. In this work, we present a scalable approach to assessing the quality of machine translation that borrows features from the study of writing styles, popularly known as stylometry. Towards this, we quantify the characteristic styles of individual machine translators and compare them with that of human-generated text. The translator whose style is closest to the human style is deemed to generate a higher-quality translation. We show that our approach is scalable and does not require actual source-text translations for evaluation.
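
To illustrate the biased random-surfer model behind Topic-sensitive LexRank, the sketch below runs personalized PageRank over a TF-IDF sentence-similarity graph, with the teleport distribution biased towards sentences similar to the query. The toy sentences are assumptions, and the look-ahead extension proposed in the thesis is not implemented.

```python
# Hedged sketch: query-biased sentence ranking in the spirit of Topic-sensitive LexRank.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The hurricane made landfall near the coast on Friday.",
    "Thousands of residents were evacuated before the storm.",
    "The local football team won its match on Saturday.",
    "Relief agencies delivered supplies to flooded neighbourhoods.",
]
query = "hurricane evacuation and relief"

tfidf = TfidfVectorizer().fit(sentences + [query])
S = cosine_similarity(tfidf.transform(sentences))  # sentence-sentence similarities
bias = cosine_similarity(tfidf.transform(sentences),
                         tfidf.transform([query]))[:, 0]

G = nx.from_numpy_array(S)  # weighted similarity graph over sentences
# Teleport distribution biased towards query-similar sentences (small epsilon
# keeps every sentence reachable; networkx normalizes the values).
personalization = {i: float(b) + 1e-6 for i, b in enumerate(bias)}
scores = nx.pagerank(G, alpha=0.85, personalization=personalization, weight="weight")

for i in sorted(scores, key=scores.get, reverse=True)[:2]:
    print(round(scores[i], 3), sentences[i])
```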
