
Clustering Short Texts: Categorizing Initial Utterances from Customer Service Dialogue Agents

Text classification requires labeled data, which is not always available and can be expensive to produce manually. User-generated short texts are produced in abundance in customer service, through transcripts of phone calls and online chats. This kind of unstructured text can be noisy and thus poses challenges to unsupervised classification methods developed for standard documents such as news articles. This thesis project explores methods for unsupervised classification of user-generated short texts in Swedish, using a real-world dataset collected from first utterances in a Conversational Interactive Voice Response solution. Such texts represent a spectrum of subdomains that customer service representatives may handle, but they have not been extensively explored in the literature. We experiment with three types of pretrained word embeddings as text representations and two clustering algorithms, on two representative but different subsets of the data as well as on the full dataset. The experimental results show that static fastText embeddings are better suited than state-of-the-art contextual embeddings, such as those derived from BERT, to representing noisy short texts for clustering. In addition, we manually (re-)label selected subsets of the data as an exploratory analysis of the dataset, which shows that the provided labels are not reliable for meaningful evaluation. Furthermore, as the data often covers several overlapping concepts in a narrow domain, the existing pretrained embeddings are not effective at capturing the nuanced differences, and the clustering algorithms do not separate the data points in a way that fits the operational objectives according to the provided labels.
Nevertheless, our qualitative analysis shows that unsupervised clustering algorithms could contribute, to a certain degree, to the goal of minimizing manual effort in the data labeling process as a preprocessing step, but more could be achieved in a semi-supervised "human-in-the-loop" manner.
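
As a rough illustration of the kind of pipeline the abstract describes (static word embeddings pooled into utterance vectors, followed by clustering), here is a minimal sketch. It is not the thesis's implementation: the abstract does not name its clustering algorithms, so k-means is an assumption here, and `TOY_VECTORS`, `embed`, and `kmeans` are hypothetical names with made-up 2-d vectors standing in for pretrained fastText vectors.

```python
import math
import random

# Hypothetical 2-d "word vectors" for a few Swedish words. Real pretrained
# fastText vectors would be loaded from a model file and have many more
# dimensions; these values are invented for illustration only.
TOY_VECTORS = {
    "faktura": (1.0, 0.1),
    "betala": (0.9, 0.2),
    "räkning": (0.95, 0.0),
    "bredband": (0.1, 1.0),
    "internet": (0.0, 0.9),
    "router": (0.2, 0.95),
}


def embed(text):
    """Mean-pool the vectors of known tokens; unknown tokens are skipped."""
    vecs = [TOY_VECTORS[t] for t in text.lower().split() if t in TOY_VECTORS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels


utterances = [
    "jag vill betala faktura",
    "fråga om räkning",
    "mitt bredband fungerar inte",
    "router och internet",
]
labels = kmeans([embed(u) for u in utterances], k=2)
for utterance, label in zip(utterances, labels):
    print(label, utterance)
```

On this toy input the two billing utterances end up in one cluster and the two broadband utterances in the other. Real first utterances are noisier and cover overlapping concepts, which is the difficulty the abstract reports.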

Identifier oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-453814
Date January 2021
Creators Hang, Sijia
Publisher Uppsala universitet, Institutionen för lingvistik och filologi
Source Sets DiVA Archive at Uppsala University
Language English
Detected Language English
Type Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format application/pdf
Rights info:eu-repo/semantics/openAccess
