11 |
Comparing performance of K-Means and DBSCAN on customer support queries
Kästel, Arne Morten; Vestergaard, Christian. January 2019
Customer support receives many repeat questions, and many questions that do not need novel answers. To raise productivity in question answering within a business, automatic answering can take over part of the customer support workload. We examine clustering of corpora of older queries and texts as a method for identifying groups of semantically similar questions and texts, so that a system can recognize a new query that fits a specific cluster and return the automatic response connected to it. We compare the performance of K-means and density-based clustering algorithms on three different corpora, using document embeddings encoded with BERT. We also discuss the digital transformation process, why companies are unsuccessful in implementing it, and the possible room for a new, more iterative model.
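The routing idea in this abstract (match a new query to an existing cluster and return its connected answer) can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors stand in for BERT embeddings, and the names `route_query`, `clusters`, `answers`, and the `min_similarity` threshold are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def route_query(query_vec, clusters, answers, min_similarity=0.8):
    """Return the canned answer of the most similar cluster, or None.

    clusters: dict cluster_id -> list of embedding vectors (hypothetical data)
    answers:  dict cluster_id -> canned response text (hypothetical data)
    """
    best_id, best_sim = None, -1.0
    for cluster_id, vectors in clusters.items():
        sim = cosine(query_vec, centroid(vectors))
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    # Below the threshold the query is left for a human agent.
    return answers[best_id] if best_sim >= min_similarity else None

# Toy stand-ins for document embeddings (assumption, not real BERT output).
clusters = {
    "billing": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "login": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
}
answers = {"billing": "See the invoice FAQ.", "login": "Reset your password here."}
```

A query whose embedding lies far from every centroid returns `None`, which is one simple way to keep novel questions out of the automatic path.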
|
12 |
Representação de coleções de documentos textuais por meio de regras de associação / Representation of textual document collections through association rules
Rossi, Rafael Geraldeli. 16 August 2011
The number of textual documents available in digital format has been growing steadily. Text Mining techniques are increasingly used to organize and extract knowledge from large collections of textual documents. To use these techniques, the documents must be represented in an appropriate format that allows the construction of a model of the knowledge embedded in them. Most Text Mining research uses the bag-of-words approach to represent the documents of a collection. This representation uses each word in the collection as a possible feature, ignoring word order, punctuation, and structural information, and it is characterized by high dimensionality and sparse data. On the other hand, most concepts are composed of more than one word, such as Artificial Intelligence, Neural Network, and Text Mining. Approaches that generate features composed of more than one word suffer from other problems beyond those of bag-of-words, such as the generation of features with little meaning and a much higher dimensionality. This master's project proposes an approach to representing textual documents named bag-of-related-words. The approach generates features composed of related words by means of association rules. The association rules are expected to identify relationships among the words of a document and to reduce dimensionality, since only words that occur or co-occur above a given frequency are used to generate the rules. Different ways of mapping a document into transactions to enable the generation of association rules are analyzed. Several objective interest measures applied to the association rules, used to extract more meaningful features and to reduce the number of features, are also analyzed.
To evaluate how much the bag-of-related-words representation can aid the organization of, and knowledge extraction from, textual document collections, as well as the interpretability of the results, three groups of experiments were carried out: 1) textual document classification, to assess how well the bag-of-related-words features distinguish document categories; 2) textual document clustering, to assess the quality of the clusters obtained with bag-of-related-words and thereby help obtain the structure of a topic hierarchy; and 3) construction and evaluation of topic hierarchies by domain experts. All results and dimensionalities were compared with the bag-of-words representation. The results show that the bag-of-related-words features have predictive power as good as that of the bag-of-words features. The quality of the document clustering was also as good as with bag-of-words. In the evaluation of topic hierarchies by domain experts, the bag-of-related-words representation gave better results on all criteria analyzed.
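As an illustration of how association rules can yield multi-word features, the following is a minimal sketch rather than the thesis's actual procedure. It assumes each sentence is one transaction, considers only word pairs, and uses support and confidence as the interest measures; the function name `related_word_features` and both thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations

def related_word_features(transactions, min_support=2, min_confidence=0.6):
    """Derive two-word features from association rules between word pairs.

    transactions: list of word sets (one per sentence; an assumption).
    Returns the sorted list of word pairs (a, b) for which the rule
    a -> b or b -> a meets both the support and confidence thresholds.
    """
    # Number of transactions containing each word.
    word_count = Counter(w for t in transactions for w in set(t))
    # Number of transactions containing each unordered word pair.
    pair_count = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    features = []
    for (a, b), support in pair_count.items():
        if support < min_support:
            continue  # the pair co-occurs too rarely to generate a rule
        confidence = max(support / word_count[a], support / word_count[b])
        if confidence >= min_confidence:
            features.append((a, b))
    return sorted(features)

sentences = [
    {"neural", "network", "training"},
    {"neural", "network", "model"},
    {"text", "mining", "model"},
    {"text", "mining", "rules"},
]
features = related_word_features(sentences)
# features == [("mining", "text"), ("network", "neural")]
```

Only pairs that co-occur often enough survive, which is how the frequency threshold keeps the feature count below that of an unrestricted n-gram representation.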
|
14 |
Text mining : μια νέα προτεινόμενη μέθοδος με χρήση κανόνων συσχέτισης / Text mining: a new proposed method using association rules
Νασίκας, Ιωάννης. 14 September 2007
Text mining is a new research field that tries to solve the problem of information overload by using techniques from data mining, machine learning, natural language processing, information retrieval, information extraction, and knowledge management.
The first part of this diploma thesis describes this new research field in detail and distinguishes it from related fields.
The main goal of text mining is to help users extract information from large text resources. Two of the most important tasks are document categorization and document clustering.
Interest in document clustering is growing because of the explosive growth of the WWW, digital libraries, technical documentation, medical data, etc. The most critical problems for document clustering are the high dimensionality of natural language text and the choice of features used to represent a domain.
Thus, an increasing number of researchers have concentrated on investigating the relative effectiveness of various dimension reduction techniques and the relationship between the features selected to represent text and the quality of the final clustering. There are two important types of dimension reduction techniques: transformation methods and selection methods.
The second part of this thesis presents a new selection method that tries to tackle these problems. The proposed methodology is based on Association Rule Mining. We also present and analyze empirical tests that demonstrate the performance of the proposed methodology. The results show that the dimension was reduced. However, the more we tried to reduce it with our methodology, the more precision was lost. An attempt was made to improve the results through a different selection of features. Such efforts are still ongoing.
The choice of similarity measure is also important in document clustering. This thesis reviews several such measures from the literature and compares them in a related application.
The thesis has seven chapters in total. The first chapter gives a brief review of text mining. The second chapter describes the tasks, methods, and tools used in text mining. The third chapter presents document representation, the various similarity measures, and an application that compares them. The fourth chapter covers the different dimension reduction methods, and the fifth presents our own methodology for the problem. The sixth chapter applies our methodology to experimental data. The thesis closes with our conclusions and directions for future research.
|
15 |
Stream Clustering And Visualization Of Geotagged Text Data For Crisis Management
Crossman, Nathaniel C. 08 June 2020
No description available.
|
16 |
Duplicate Detection and Text Classification on Simplified Technical English / Dublettdetektion och textklassificering på Förenklad Teknisk Engelska
Lund, Max. January 2019
This thesis investigates the most effective way of performing classification of text labels and clustering of duplicate texts in technical documentation written in Simplified Technical English. Pre-trained language models from transformers (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs on the classification task. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that using tf-idf vectors with a low distance threshold in DBSCAN is preferable for duplicate detection.
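The duplicate-detection setup described above (tf-idf vectors clustered by DBSCAN with a low distance threshold) can be sketched in plain Python. This is an illustrative sketch, not the thesis's code: the whitespace tokenizer, the smoothed idf formula, the cosine distance, and the values `eps=0.1` and `min_samples=2` are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each whitespace-tokenised document to a dict of tf-idf weights."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenised for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed idf so terms present in every document keep a non-zero weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity for sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return 1.0 - dot / norm if norm else 1.0

def dbscan(vectors, eps, min_samples=2):
    """Plain DBSCAN; returns one label per vector, -1 meaning noise."""
    n = len(vectors)
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point previously marked as noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels

docs = [
    "remove the four bolts from the cover",
    "remove the four bolts from the cover",
    "remove four bolts from the cover",
    "apply grease to the bearing",
]
labels = dbscan(tfidf_vectors(docs), eps=0.1)
```

With a low `eps`, only exact and near-exact rewrites end up in the same cluster, and unrelated instructions are labelled as noise (`-1`).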
|
17 |
Cluster Analysis with Meaning : Detecting Texts that Convey the Same Message / Klusteranalys med mening : Detektering av texter som uttrycker samma sak
Öhrström, Fredrik. January 2018
Textual duplicates can be hard to detect because they differ in wording but have similar semantic meaning. Etteplan, a technical documentation company, has many writers who accidentally rewrite existing instructions explaining procedures. These "duplicates" clutter the database, which is undesirable because it is duplicated work. The condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst and to estimate how many duplicates there are. The corpus is small, but written in a controlled natural language called Simplified Technical English. To chart the problem, the method uses document embeddings from doc2vec, clustering with HDBSCAN*, and validation with the Density-Based Clustering Validation index (DBCV). A survey was sent out to determine a threshold value at which documents stop being duplicates, and a theoretical duplicate count was then calculated using this value.
|
18 |
Investigating the Correlation Between Marketing Emails and Receivers Using Unsupervised Machine Learning on Limited Data : A comprehensive study using state of the art methods for text clustering and natural language processing / Undersökning av samband mellan marknadsföringsemail och dess mottagare med hjälp av oövervakad maskininlärning på begränsad data
Pettersson, Christoffer. January 2016
The goal of this project is to investigate any correlation between marketing emails and their receivers using machine learning and only a limited amount of initial data. The data consists of roughly 1,200 emails and 98,000 receivers of these. Initially, the emails are grouped by content using text clustering. They carry no prior labeling or categorization, which creates a need for an unsupervised learning approach that uses solely the raw text content as data. The project investigates state-of-the-art concepts such as bag-of-words for calculating term importance and the gap statistic for determining an optimal number of clusters. The data is vectorized using term frequency - inverse document frequency to determine the importance of terms relative to the document and to all documents combined. An inherent problem of this approach is high dimensionality, which is reduced using latent semantic analysis in conjunction with singular value decomposition. Once the resulting clusters have been obtained, the most frequently occurring terms for each cluster are analyzed and compared. Because initial labeling is absent, an alternative approach is required to evaluate the clusters' validity. To do this, the receivers of all emails in each cluster who actively opened an email are collected and investigated. Each receiver has different attributes regarding their purpose for using the service, along with some personal information. Once these were gathered and analyzed, the conclusion could be drawn that it is possible to find distinguishable connections between the resulting email clusters and their receivers, but only to a limited extent. Receivers from the same cluster showed attributes similar to each other and distinguishable from those of receivers in other clusters. Hence, the resulting email clusters and their receivers are specific enough to be distinguished from each other but too general to handle more detailed information. With more data, this could become a useful tool for determining which users of a service should receive a particular email, to increase the conversion rate and thereby reach more relevant people based on previous trends.
|
19 |
Regroupement de textes avec des approches simples et efficaces exploitant la représentation vectorielle contextuelle SBERT / Text clustering with simple and effective approaches exploiting the SBERT contextual vector representation
Petricevic, Uros. 12 1900
Clustering is an unsupervised task that consists of grouping similar elements into the same cluster and different elements into distinct clusters. Text clustering is performed by representing texts in a vector space and studying their similarity in this space. The best results are obtained using neural models that fine-tune contextual embeddings in an unsupervised manner. However, these techniques can require a significant amount of training time, and their performance has not been compared with simpler techniques that do not require training neural models.
In this master's thesis, we propose a study of the current state of the field. First, we study the best evaluation metrics for text clustering. Then, we evaluate the state-of-the-art methods and take a critical look at their training protocol. We also propose an analysis of some implementation choices in text clustering, such as the choice of clustering algorithm, similarity measure, contextual embeddings, or unsupervised fine-tuning of the contextual embeddings. Finally, we test the combination of contextual embeddings with some techniques that do not require training, such as data preprocessing, dimensionality reduction, or Tf-idf inclusion.
Our experiments demonstrate some shortcomings in the state of the art regarding the choice of evaluation metrics and the training protocol. Furthermore, we demonstrate that simple techniques yield results better than or similar to those of sophisticated methods that require training neural models. Our experiments are evaluated on eight benchmark datasets from different domains.
|