1 |
Improving Search Results with Automated Summarization and Sentence ClusteringCotter, Steven 23 March 2012 (has links)
Have you ever searched for something on the web and been overloaded with irrelevant results? Many search engines tend to cast a very wide net and rely on ranking to show you the relevant results first. But, this doesn't always work. Perhaps the occurrence of irrelevant results could be reduced if we could eliminate the unimportant content from each webpage while indexing. Instead of casting a wide net, maybe we can make the net smarter. Here, I investigate the feasibility of using automated document summarization and clustering to do just that. The results indicate that such methods can make search engines more precise, more efficient, and faster, but not without costs. / McAnulty College and Graduate School of Liberal Arts / Computational Mathematics / MS / Thesis
|
2 |
Using semantic folding with TextRank for automatic summarization / Semantisk vikning med TextRank för automatisk sammanfattningKarlsson, Simon January 2017 (has links)
This master thesis deals with automatic summarization of text and how semantic folding can be used as a similarity measure between sentences in the TextRank algorithm. The method was implemented and compared with two common similarity measures. These two similarity measures were cosine similarity of tf-idf vectors and the number of overlapping terms in two sentences. The three methods were implemented and the linguistic features used in the construction were stop words, part-of-speech filtering and stemming. Five different part-of-speech filters were used, with different mixtures of nouns, verbs, and adjectives. The three methods were evaluated by summarizing documents from the Document Understanding Conference and comparing them to gold-standard summarization created by human judges. Comparison between the system summaries and gold-standard summaries was made with the ROUGE-1 measure. The algorithm with semantic folding performed worst of the three methods, but only 0.0096 worse in F-score than cosine similarity of tf-idf vectors that performed best. For semantic folding, the average precision was 46.2% and recall 45.7% for the best-performing part-of-speech filter. / Det här examensarbetet behandlar automatisk textsammanfattning och hur semantisk vikning kan användas som likhetsmått mellan meningar i algoritmen TextRank. Metoden implementerades och jämfördes med två vanliga likhetsmått. Dessa två likhetsmått var cosinus-likhet mellan tf-idf-vektorer samt antal överlappande termer i två meningar. De tre metoderna implementerades och de lingvistiska särdragen som användes vid konstruktionen var stoppord, filtrering av ordklasser samt en avstämmare. Fem olika filter för ordklasser användes, med olika blandningar av substantiv, verb och adjektiv. De tre metoderna utvärderades genom att sammanfatta dokument från DUC och jämföra dessa mot guldsammanfattningar skapade av mänskliga domare. Jämförelse mellan systemsammanfattningar och guldsammanfattningar gjordes med måttet ROUGE-1. Algoritmen med semantisk vikning presterade sämst av de tre jämförda metoderna, dock bara 0.0096 sämre i F-score än cosinus-likhet mellan tf-idf-vektorer som presterade bäst. För semantisk vikning var den genomsnittliga precisionen 46.2% och recall 45.7% för det ordklassfiltret som presterade bäst.
|
3 |
[en] ON THE PROCESSING OF COURSE SURVEY COMMENTS IN HIGHER EDUCATION INSTITUTIONS / [pt] PROCESSAMENTO DE COMENTÁRIOS DE PESQUISAS DE CURSOS EM INSTITUIÇÕES DE ENSINO SUPERIORHAYDÉE GUILLOT JIMÉNEZ 10 January 2022 (has links)
[pt] A avaliação sistemática de uma Instituição de Ensino Superior (IES) fornece à sua administração um feedback valioso sobre vários aspectos da vida acadêmica, como a reputação da instituição e o desempenho individual do corpo docente. Em particular, as pesquisas com alunos são uma fonte de informação de primeira mão que ajuda a avaliar o desempenho do professor e a adequação do curso. Os objetivos principais desta tese são criar e avaliar modelos de análise de sentimento dos comentários dos alunos e estratégias para resumir os comentários dos alunos. A tese primeiro descreve duas abordagens
para classificar a polaridade dos comentários dos alunos, ou seja, se eles são positivos, negativos ou neutros. A primeira abordagem depende de um dicionário criado manualmente que lista os termos que representam o sentimento a ser detectado nos comentários dos alunos. A segunda abordagem adota um
modelo de representação de linguagem, que não depende de um dicionário criado manualmente, mas requer algum conjunto de teste anotado manualmente. Os resultados indicaram que a primeira abordagem superou uma ferramenta de linha de base e que a segunda abordagem obteve um desempenho muito
bom, mesmo quando o conjunto de comentários anotados manualmente é pequeno.
A tese então explora várias estratégias para resumir um conjunto de comentários com interpretações semelhantes. O desafio está em resumir um conjunto de pequenas frases, escritas por pessoas diferentes, que podem transmitir ideias repetidas. Como estratégias, a tese testou Market Basket Analysis,
Topic Models, Text Similarity, TextRank e Entailment, adotando um método de inspeção humana para avaliar os resultados obtidos, uma vez que as métricas tradicionais de sumarização de textos se mostraram inadequadas. Os resultados sugerem que o agrupamento combinado com a estratégia baseada
em centróide atinge os melhores resultados. / [en] The systematic evaluation of a Higher Education Institution (HEI) provides its administration with valuable feedback about several aspects of academic life, such as the reputation of the institution and the individual performance of teachers. In particular, student surveys are a first-hand source of information that help assess teacher performance and course adequacy. The primary goals of this thesis are to create and evaluate sentiment analysis models of students comments, and strategies to summarize students comments. The thesis first describes two approaches to classify the polarity of students comments, that is, whether they are positive, negative, or neutral. The first approach depends on a manually created dictionary that lists terms that represent the sentiment to be detected in the students comments. The second approach adopts a language representation model, which does not depend on a manually created dictionary, but requires some manually annotated test set. The results indicated that the first approach outperformed a baseline tool, and that the second approach achieved very good performance, even when the set of manually annotated comments is small. The thesis then explores several strategies to summarize a set of comments with similar interpretations. The challenge lies in summarizing a set of small sentences, written by different people, which may convey repeated ideas. As strategies, the thesis tested Market
Basket Analysis, Topic Models, Text Similarity, TextRank, and Entailment, adopting a human inspection method to evaluate the results obtained, since traditional text summarization metrics proved inadequate. The results suggest that clustering combined with the centroid-based strategy achieves the best
results.
|
4 |
Clustering and Summarization of Chat Dialogues : To understand a company’s customer base / Klustring och Summering av Chatt-DialogerHidén, Oskar, Björelind, David January 2021 (has links)
The Customer Success department at Visma handles about 200 000 customer chats each year, the chat dialogues are stored and contain both questions and answers. In order to get an idea of what customers ask about, the Customer Success department has to read a random sample of the chat dialogues manually. This thesis develops and investigates an analysis tool for the chat data, using the approach of clustering and summarization. The approach aims to decrease the time spent and increase the quality of the analysis. Models for clustering (K-means, DBSCAN and HDBSCAN) and extractive summarization (K-means, LSA and TextRank) are compared. Each algorithm is combined with three different text representations (TFIDF, S-BERT and FastText) to create models for evaluation. These models are evaluated against a test set, created for the purpose of this thesis. Silhouette Index and Adjusted Rand Index are used to evaluate the clustering models. ROUGE measure together with a qualitative evaluation are used to evaluate the extractive summarization models. In addition to this, the best clustering model is further evaluated to understand how different data sizes impact performance. TFIDF Unigram together with HDBSCAN or K-means obtained the best results for clustering, whereas FastText together with TextRank obtained the best results for extractive summarization. This thesis applies known models on a textual domain of customer chat dialogues, something that, to our knowledge, has previously not been done in literature.
|
5 |
Extractive Multi-document Summarization of News ArticlesGrant, Harald January 2019 (has links)
Publicly available data grows exponentially through web services and technological advancements. To comprehend large data-streams multi-document summarization (MDS) can be used. In this research, the area of multi-document summarization is investigated. Multiple systems for extractive multi-document summarization are implemented using modern techniques, in the form of the pre-trained BERT language model for word embeddings and sentence classification. This is combined with well proven techniques, in the form of the TextRank ranking algorithm, the Waterfall architecture and anti-redundancy filtering. The systems are evaluated on the DUC-2002, 2006 and 2007 datasets using the ROUGE metric. Where the results show that the BM25 sentence representation implemented in the TextRank model using the Waterfall architecture and an anti-redundancy technique outperforms the other implementations, providing competitive results with other state-of-the-art systems. A cohesive model is derived from the leading system and tried in a user study using a real-world application. The user study is conducted using a real-time news detection application with users from the news-domain. The study shows a clear favour for cohesive summaries in the case of extractive multi-document summarization. Where the cohesive summary is preferred in the majority of cases.
|
Page generated in 0.0554 seconds