Indiana University-Purdue University Indianapolis (IUPUI) / Efficient topic modeling is needed to support applications that aim to identify the main themes in a collection of documents. In this thesis, a reduced vector embedding representation and particle swarm optimization (PSO) are combined into a topic modeling strategy that identifies representative themes from a large collection of documents. Documents are encoded using a reduced, contextual vector embedding from a general-purpose pre-trained language model (sBERT). A modified PSO algorithm (pPSO) that tracks particle fitness on a dimension-by-dimension basis is then applied to these embeddings to create clusters of related documents. The proposed methodology is demonstrated on three datasets across different domains. The first dataset consists of posts from the online health forum r/Cancer. The second dataset is a collection of NY Times abstracts and is used to compare
Identifier | oai:union.ndltd.org:IUPUI/oai:scholarworks.iupui.edu:1805/29167 |
Date | 05 1900 |
Creators | Miles, Samuel |
Contributors | Ben Miled, Zina, Salama, Paul, El-Sharkawy, Mohamed |
Source Sets | Indiana University-Purdue University Indianapolis |
Language | en_US |
Detected Language | English |
Type | Thesis |