  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
31

Automatic Video Categorization And Summarization

Demirtas, Kezban 01 September 2009 (has links) (PDF)
In this thesis, we perform automatic video categorization and summarization using the subtitles of videos. We propose two methods for video categorization. The first performs unsupervised categorization by applying natural language processing techniques to video subtitles, using the WordNet lexical database and WordNet Domains. The method starts with text preprocessing; then a keyword extraction algorithm and a word sense disambiguation method are applied, and the WordNet domains corresponding to the correct senses of the keywords are extracted. The video is assigned a category label based on the extracted domains. The second method extracts the WordNet domains of a video with the same steps but performs categorization with a learning module. Experiments with documentary videos give promising results in discovering the correct categories of videos. Video summarization algorithms present condensed versions of a full-length video by identifying its most significant parts. We propose a video summarization method that uses the subtitles of videos together with text summarization techniques: we identify significant sentences in a video's subtitles using text summarization techniques and then compose a video summary from the video segments corresponding to these summary sentences.
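As a rough illustration of the first method's final step, the sketch below assigns a category by majority vote over the domains of the disambiguated keywords. The domain lexicon here is a hypothetical stand-in for WordNet Domains, and the keywords are assumed to be already extracted and disambiguated:

```python
from collections import Counter

# Toy stand-in for the WordNet Domains lexicon: maps a disambiguated keyword
# to its domains. The entries are hypothetical, for illustration only.
DOMAIN_LEXICON = {
    "lion": ["biology", "animals"],
    "habitat": ["biology", "geography"],
    "savanna": ["geography"],
    "predator": ["biology", "animals"],
}

def categorize(keywords):
    """Assign a category label by majority vote over the keywords' domains."""
    votes = Counter()
    for word in keywords:
        for domain in DOMAIN_LEXICON.get(word, []):
            votes[domain] += 1
    return votes.most_common(1)[0][0] if votes else None

print(categorize(["lion", "habitat", "savanna", "predator"]))  # biology
```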
32

Improving Search Results with Automated Summarization and Sentence Clustering

Cotter, Steven 23 March 2012 (has links)
Have you ever searched for something on the web and been overloaded with irrelevant results? Many search engines cast a very wide net and rely on ranking to show you the relevant results first, but this doesn't always work. Perhaps the occurrence of irrelevant results could be reduced if we eliminated the unimportant content from each webpage while indexing: instead of casting a wide net, maybe we can make the net smarter. Here, I investigate the feasibility of using automated document summarization and clustering to do just that. The results indicate that such methods can make search engines more precise, more efficient, and faster, but not without costs. / McAnulty College and Graduate School of Liberal Arts / Computational Mathematics / MS / Thesis
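A minimal sketch of indexing only a page's important content: score each sentence by the in-page frequency of its non-stop-word terms and keep only the top-scoring sentences. The frequency scorer is an illustrative assumption; the thesis itself uses automated summarization and sentence clustering:

```python
import re
from collections import Counter

STOP = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "on"}

def top_sentences(text, k=2):
    """Keep only the k highest-scoring sentences of a page before indexing.
    A sentence scores the summed in-page frequency of its content terms."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(
        w for s in sentences
        for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP
    )
    return sorted(
        sentences,
        key=lambda s: -sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
    )[:k]
```

Off-topic sentences score low and are dropped from the index, shrinking the net without losing the page's main content.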
33

Sumarização Automática de Atualização para a língua portuguesa / Update Summarization for the portuguese language

Fernando Antônio Asevêdo Nóbrega 12 December 2017 (has links)
The enormous volume of textual data available on the web is an ideal scenario for many Natural Language Processing applications, such as Update Summarization (US), which aims to automatically generate summaries from a collection of texts under the assumption that the reader has some prior knowledge of the source texts. A good update summary must therefore contain the most relevant, new, and updated information relative to the reader's prior knowledge. This task poses several challenges, especially in the stages of content selection and synthesis of the summary. Although the literature offers many approaches, of varying theoretical and computational complexity, few of these investigations make use of deep linguistic knowledge, which can help identify more relevant and up-to-date content. Moreover, summarization methods commonly employ extractive synthesis, in which sentences from the source texts are selected and arranged to compose the summary without modification. This approach can limit the summary's informativeness, since some sentence segments may carry information that is redundant or irrelevant to the reader. Recent efforts have therefore turned to compressive synthesis, in which some segments of the selected sentences are removed before insertion into the summary. In this context, this PhD research investigated the use of linguistic knowledge, such as Cross-document Structure Theory (CST), subtopic segmentation, and named entity recognition, in distinct content selection approaches with both extractive and compressive synthesis, aiming to produce more informative update summaries.
With Portuguese as the main language of study, three new corpora were compiled: CSTNews-Update, which enables US experiments for Portuguese, and PCSC-Pares and G1-Pares, pairs of original and compressed sentences for developing and evaluating sentence compression methods. The summarization experiments were also carried out for English, for which more resources exist. The experiments showed that subtopic segmentation was the most effective for producing informative summaries, although only in a few content selection approaches. In addition, simplifications of the DualSum method based on the distribution of subtopics were proposed; these methods achieved very satisfactory results at lower computational cost. Aiming at compressive summaries, several sentence compression methods based on machine learning algorithms were developed; the best of them outperformed a state-of-the-art approach based on deep learning. Finally, it should be noted that before this work, most research on automatic summarization for Portuguese addressed the generation of extractive summaries from a single document (single-document) or from several related documents (multi-document), largely because of the lack of resources that would allow the field of automatic summarization to expand for this language. The contributions of this work thus span three fronts: the proposed linguistically informed US methods, the sentence compression methods, and the resources developed for Portuguese.
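The selection criterion of an update summary can be sketched as a greedy filter that rejects sentences too similar to the reader's prior knowledge or to sentences already chosen. This toy version uses Jaccard word overlap; the threshold and similarity measure are illustrative assumptions, not the models investigated in the thesis:

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def update_summary(candidates, known, size=2, redundancy=0.5):
    """Greedy update summarization: keep sentences dissimilar both to the
    reader's prior knowledge (`known`) and to sentences already chosen."""
    chosen = []
    for sent in candidates:
        if all(jaccard(sent, s) < redundancy for s in known + chosen):
            chosen.append(sent)
        if len(chosen) == size:
            break
    return chosen
```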
34

Towards the creation of a Clinical Summarizer

Gunnarsson, Axel January 2022 (has links)
While Electronic Medical Records provide extensive information about patients, the vast amounts of data make it difficult to quickly retrieve the information needed to make accurate assessments and decisions about a patient's health. This search process is naturally time-consuming and forces health professionals to focus on a labor-intensive task that diverts their attention from the main task of applying their knowledge to save lives. With the general aim of potentially relieving professionals of the task of finding the information needed for an operational decision, this thesis explores the use of a general BERT model for extractive summarization of Swedish medical records, investigating its ability to extract sentences that convey important information to MRI physicists. To this end, a domain expert evaluation of medical histories was performed, creating the reference summaries used for model evaluation. Three implementations are included in this study, one of which is TextRank, a prominent unsupervised approach to extractive summarization; the other two are based on clustering and rely on BERT to encode the text. The implementations are evaluated using ROUGE metrics. The results support the use of a general BERT model for extractive summarization of medical records. Furthermore, the results are discussed in relation to the collected reference summaries, leading to a discussion of potential improvements to the domain expert evaluation, as well as possibilities for future work on the summarization of clinical documents.
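Since the implementations above are scored with ROUGE, a minimal sketch of ROUGE-1 (unigram precision, recall, and F1 between a system summary and a reference) may help fix ideas. Real evaluations use a full ROUGE toolkit with stemming and support for multiple references:

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1: unigram overlap between a system summary and a reference.
    Returns (precision, recall, f1); overlap is clipped per token."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```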
35

Using semantic folding with TextRank for automatic summarization / Semantisk vikning med TextRank för automatisk sammanfattning

Karlsson, Simon January 2017 (has links)
This master's thesis deals with automatic summarization of text and with how semantic folding can be used as a similarity measure between sentences in the TextRank algorithm. The method was implemented and compared with two common similarity measures: cosine similarity of tf-idf vectors and the number of overlapping terms in two sentences. The linguistic features used in the construction were stop-word removal, part-of-speech filtering, and stemming; five different part-of-speech filters were used, with different mixtures of nouns, verbs, and adjectives. The three methods were evaluated by summarizing documents from the Document Understanding Conference (DUC) and comparing the output against gold-standard summaries created by human judges, using the ROUGE-1 measure. The algorithm with semantic folding performed worst of the three methods, but only 0.0096 lower in F-score than cosine similarity of tf-idf vectors, which performed best. For semantic folding, the average precision was 46.2% and the recall 45.7% with the best-performing part-of-speech filter.
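A compact sketch of TextRank with the overlapping-terms similarity (one of the two baselines compared above) may clarify the algorithm: sentences form a graph weighted by length-normalized term overlap, and PageRank-style power iteration ranks them. Constants and tokenization here are illustrative, not the thesis's exact setup:

```python
import math
import re

def textrank(sentences, damping=0.85, iters=50):
    """Rank sentences by TextRank with normalized term-overlap similarity.
    Returns sentence indices from highest-ranked to lowest-ranked."""
    tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    # Similarity: shared terms, normalized by the log of sentence lengths.
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and tokens[i] and tokens[j]:
                norm = math.log(len(tokens[i])) + math.log(len(tokens[j]))
                sim[i][j] = len(tokens[i] & tokens[j]) / norm if norm else 0.0
    # PageRank-style power iteration over the similarity graph.
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / sum(sim[j]) * scores[j]
                for j in range(n) if j != i and sum(sim[j]) > 0
            )
            for i in range(n)
        ]
    return sorted(range(n), key=lambda i: -scores[i])
```

Swapping the similarity function for cosine similarity of tf-idf vectors, or for semantic folding, leaves the ranking machinery unchanged; only the edge weights differ.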
36

Summarizing User-generated Discourse

Syed, Shahbaz 04 July 2024 (has links)
Automatic text summarization is a long-standing task whose origins lie in summarizing scholarly documents by generating their abstracts. While older approaches mainly focused on generating extractive summaries, recent approaches using neural architectures have helped the task advance towards more abstractive, human-like summaries. Yet the majority of research in automatic text summarization has focused on summarizing professionally written news articles, owing to the easier availability of large-scale datasets with ground-truth summaries in this domain. Moreover, the inverted-pyramid writing style enforced in news articles places crucial information in the top sentences, essentially summarizing the article; this allows for a more reliable identification of ground truth when constructing datasets. In contrast, user-generated discourse, such as social media forums or debate portals, has received comparatively little attention despite its evident importance. Possible reasons include the challenges posed by the informal nature of user-generated discourse, which, unlike news articles, often lacks a rigid structure, and the difficulty of obtaining high-quality ground-truth summaries for this text register.
This thesis aims to address this gap by delivering the following novel contributions in the form of datasets, methodologies, and evaluation strategies for automatically summarizing user-generated discourse: (1) three new datasets for the registers of social media posts and argumentative texts, containing author-provided ground-truth summaries as well as crowdsourced summaries for argumentative texts obtained by adapting theoretical definitions of high-quality summaries; (2) methodologies for creating informative as well as indicative summaries for long discussions of controversial topics; (3) user-centric evaluation processes that emphasize the purpose and provenance of the summary for qualitative assessment of summarization models; and (4) tools for facilitating the development and evaluation of summarization models that leverage visual analytics and interactive interfaces to enable a fine-grained inspection of automatically generated summaries in relation to their source documents.

Table of contents:
1 Introduction (1.1 Understanding User-Generated Discourse; 1.2 The Role of Automatic Summarization; 1.3 Research Questions and Contributions; 1.4 Thesis Structure; 1.5 Publication Record)
2 The Task of Text Summarization (2.1 Decoding Human Summarization Practices; 2.2 Exploring Automatic Summarization Methods; 2.3 Evaluation of Automatic Summarization and its Challenges; 2.4 Summary)
3 Defining Good Summaries: Examining News Editorials (3.1 Key Characteristics of News Editorials; 3.2 Operationalizing High-Quality Summaries; 3.3 Evaluating and Ensuring Summary Quality; 3.4 Automatic Extractive Summarization of News Editorials; 3.5 Summary)
4 Mining Social Media for Author-provided Summaries (4.1 Leveraging Human Signals for Summary Identification; 4.2 Constructing a Corpus of Abstractive Summaries; 4.3 Insights from the TL;DR Challenge; 4.4 Summary)
5 Generating Conclusions for Argumentative Texts (5.1 Identifying Author-provided Conclusions; 5.2 Enhancing Pretrained Models with External Knowledge; 5.3 Evaluating Informative Conclusion Generation; 5.4 Summary)
6 Frame-Oriented Extractive Summarization of Argumentative Discussions (6.1 Importance of Summaries for Argumentative Discussions; 6.2 Employing Argumentation Frames as Anchor Points; 6.3 Extractive Summarization of Argumentative Discussions; 6.4 Evaluation of Extractive Summaries via Relevance Judgments; 6.5 Summary)
7 Indicative Summarization of Long Discussions (7.1 Table of Contents as an Indicative Summary; 7.2 Unsupervised Summarization with Large Language Models; 7.3 Comprehensive Analysis of Prompt Engineering; 7.4 Purpose-driven Evaluation of Summary Usefulness; 7.5 Summary)
8 Summary Explorer: Visual Analytics for the Qualitative Assessment of the State of the Art in Text Summarization (8.1 Limitations of Automatic Evaluation Metrics; 8.2 Designing Interfaces for Visual Exploration of Summaries; 8.3 Corpora, Models, and Case Studies; 8.4 Summary)
9 SummaryWorkbench: Reproducible Models and Metrics for Text Summarization (9.1 Addressing the Requirements for Summarization Researchers; 9.2 A Unified Interface for Applying and Evaluating State-of-the-Art Models and Metrics; 9.3 Models and Measures; 9.4 Curated Artifacts and Interaction Scenarios; 9.5 Interaction Use Cases; 9.6 Summary)
10 Conclusion (10.1 Key Contributions of the Thesis; 10.2 Open Problems and Future Work)
37

IMPROVING UNDERSTANDABILITY AND UNCERTAINTY MODELING OF DATA USING FUZZY LOGIC SYSTEMS

Wijayasekara, Dumidu S 01 January 2016 (has links)
The need for automation, optimality, and efficiency has made modern control and monitoring systems extremely complex and data-abundant. However, the complexity of these systems and the abundance of raw data have reduced the understandability and interpretability of the data, which results in reduced state awareness of the system. Furthermore, different levels of uncertainty introduced by sensors and actuators make interpreting and accurately manipulating systems difficult. Classical mathematical methods lack the capability to capture human knowledge and increase understandability while modeling such uncertainty. Fuzzy logic has been shown to alleviate both of these problems by introducing logic based on vague terms that rely on human-understandable language. The use of linguistic terms and simple consequential rules increases the understandability of system behavior as well as of the data, while the use of vague terms and the modeling of data from non-discrete prototypes enable uncertainty modeling. However, recent trends have diverted primary fuzzy-logic research from the basic concept of understandability, and the high computational cost of robust uncertainty modeling has restricted the use of such fuzzy systems in real-world applications. Thus, the goal of this dissertation is to present algorithms and techniques that improve understandability and uncertainty modeling using fuzzy logic systems. To achieve this goal, the dissertation presents the following major contributions: 1) a novel methodology for generating fuzzy membership functions based on understandability, 2) linguistic summarization of data using if-then consequential rules, and 3) novel shadowed type-2 fuzzy logic systems for uncertainty modeling. Finally, the presented techniques are applied to real-world systems and data to exemplify their relevance and usage.
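The basic building block behind contribution 1) is the fuzzy membership function, which maps a crisp value to a degree of membership in a linguistic term. A minimal sketch with hypothetical triangular terms for a temperature variable (the dissertation's actual generation methodology is not reproduced here):

```python
def triangular(a, b, c):
    """Triangular fuzzy membership function with feet at a and c, peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Hypothetical linguistic terms for a temperature variable (degrees Celsius).
cold = triangular(-10.0, 0.0, 15.0)
warm = triangular(10.0, 20.0, 30.0)
hot = triangular(25.0, 35.0, 45.0)

x = 18.0
print({"cold": cold(x), "warm": warm(x), "hot": hot(x)})
```

A reading of 18 degrees is then "warm to degree 0.8" rather than a bare number, which is what makes the resulting if-then rules human-readable.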
38

Algoritmos rápidos para estimativas de densidade hierárquicas e suas aplicações em mineração de dados / Fast algorithms for hierarchical density estimates and its applications in data mining

Santos, Joelson Antonio dos 29 May 2018 (has links)
Clustering is an unsupervised learning task that describes a set of objects in groups (clusters) such that objects in the same cluster are more similar to each other than to objects in other clusters. Clustering techniques fall into two main categories: partitional and hierarchical. Partitional techniques divide a dataset into a given number of distinct clusters, while hierarchical techniques provide a nested sequence of partitional clusterings separated by different levels of granularity. Hierarchical density-based clustering is a particular clustering paradigm that detects clusters with different concentrations or densities of objects; one of the most popular techniques of this paradigm is HDBSCAN*. Besides providing hierarchies, HDBSCAN* is a framework that offers outlier detection, semi-supervised clustering, and visualization of results. However, most hierarchical techniques, including HDBSCAN*, have a high computational complexity, which makes them prohibitive for the analysis of large datasets. In this master's research, two approximate, computationally more scalable variations of HDBSCAN* were proposed for clustering large amounts of data. The first variation follows the parallel and distributed computing model known as MapReduce; the second follows a parallel computing model with shared memory. Both variations are based on an efficient data-partitioning scheme known as Recursive Sampling, which allows the data to be processed in parallel. Like HDBSCAN*, the proposed variations also provide a complete unsupervised analysis of patterns in data, including outlier detection. Experiments were carried out to evaluate the quality of the proposed variations: the MapReduce-based variation was compared with an exact parallel version of HDBSCAN* known as Random Blocks, and the shared-memory version was compared with the state of the art (HDBSCAN*). In terms of clustering quality and outlier detection, the MapReduce-based and shared-memory-based variations showed results close to the exact parallel version of HDBSCAN* and to the state of the art, respectively. In terms of computational time, the proposed variations showed greater scalability and speed in processing large amounts of data than the versions they were compared against.
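At the heart of HDBSCAN* is the mutual reachability distance, which inflates distances in sparse regions using each point's core distance. A pure-Python sketch of just this step (the minimum spanning tree and hierarchy extraction that follow are omitted):

```python
import math

def core_distances(points, k=2):
    """Core distance of each point: distance to its k-th nearest neighbor."""
    cores = []
    for p in points:
        ds = sorted(math.dist(p, q) for q in points if q is not p)
        cores.append(ds[k - 1])
    return cores

def mutual_reachability(points, k=2):
    """Mutual reachability distance used by HDBSCAN*:
    max(core(a), core(b), d(a, b)). Points in sparse regions end up
    far from everything, which is what makes the hierarchy density-aware."""
    cores = core_distances(points, k)
    n = len(points)
    return [[max(cores[i], cores[j], math.dist(points[i], points[j]))
             if i != j else 0.0
             for j in range(n)] for i in range(n)]
```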
39

Topic Retrospection with Storyline-based Summarization on News Reports

Liang, Chia-Hao 18 July 2005 (has links)
Electronic newspapers have become a main source for online news readers. Faced with numerous stories, news readers need support in order to review a topic in a short time. Because previous research in TDT (Topic Detection and Tracking) considered only how to identify events and presented the results as news titles and keywords, a summarized text presenting event evolution is needed for general news readers to retrospect the events under a news topic. This thesis proposes a topic retrospection process and implements the SToRe system, which identifies the various events under a news topic and constructs their relationships to compose a summary that gives readers a sketch of the event evolution in a topic. It consists of three main functions: event identification, main storyline construction, and storyline-based summarization. The constructed main storyline removes irrelevant events and presents a main theme. The summarization step extracts representative sentences and uses the main theme as the template for composing the summary. The summary not only provides enough information to comprehend the development of a topic, but also serves as an index that helps readers find more detailed information. A lab experiment was conducted to evaluate the SToRe system in a question-and-answer (Q&A) setting. The experimental results show that the SToRe system helps news readers capture the development of a topic more effectively and efficiently.
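The sentence-extraction step of a storyline-based summary can be sketched as picking, for each identified event, the sentence that shares the most vocabulary with the event's sentences as a whole, then concatenating the picks in chronological order. This overlap heuristic is an illustrative assumption, not the SToRe system's actual scoring:

```python
from collections import Counter

def representative(sentences):
    """Pick the sentence sharing the most vocabulary with the whole event --
    a minimal stand-in for representative-sentence extraction."""
    vocab = Counter(w for s in sentences for w in s.lower().split())
    return max(sentences,
               key=lambda s: sum(vocab[w] for w in set(s.lower().split())))

def storyline_summary(events):
    """events: a list of sentence lists, already in chronological order."""
    return " ".join(representative(e) for e in events)
```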
40

Semantic-Based Approach to Supporting Opinion Summarization

Chen, Yen-Ming 20 July 2006 (has links)
With the rapid expansion of e-commerce, the Web has become an excellent source for gathering customer opinions (so-called customer reviews). Customer reviews are essential for merchants and product manufacturers to understand customers' general responses to their products and to improve products or marketing campaigns. In addition, customer reviews can enable merchants to better understand the specific preferences of individual customers and facilitate effective marketing decisions. Prior data mining research has mainly concentrated on analyzing customer demographic, attitudinal, psychographic, transactional, and behavioral data to support customer relationship management and marketing decision making, and has paid little attention to the use of customer reviews as an additional source of marketing intelligence. The purpose of this research is therefore to develop an efficient and effective opinion summarization technique. Specifically, we propose a semantic-based product feature extraction technique (SPE) that aims to improve on existing product feature extraction techniques and is expected to enhance overall opinion summarization effectiveness.
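A toy sketch in the spirit of opinion-oriented feature extraction: treat words that frequently co-occur in the same sentence as an opinion word as candidate product features. The seed opinion lexicon and the frequency threshold are hypothetical, and a real system (including the SPE technique proposed here) would use linguistic analysis rather than bare co-occurrence:

```python
import re
from collections import Counter

# Hypothetical seed opinion words; a real system would use a sentiment lexicon.
OPINION_WORDS = {"great", "poor", "excellent", "terrible", "good", "bad"}

def extract_features(reviews, min_support=2):
    """Candidate product features: words co-occurring in the same sentence
    as an opinion word, kept only if they appear at least min_support times."""
    counts = Counter()
    for review in reviews:
        for sentence in re.split(r"[.!?]", review):
            words = re.findall(r"[a-z]+", sentence.lower())
            if OPINION_WORDS & set(words):
                counts.update(w for w in words if w not in OPINION_WORDS)
    return [w for w, c in counts.items() if c >= min_support]
```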
