Incremental Re-tokenization in BPE-trained SentencePiece Models

This bachelor's thesis in Computer Science examines the efficiency of an incremental re-tokenization algorithm for BPE-trained SentencePiece models used in natural language processing (NLP). The thesis begins by underscoring the critical role of tokenization in NLP and the complications that arise when already-tokenized text is modified. It then presents an incremental re-tokenization algorithm, describes its development, and evaluates its performance against full re-tokenization of the text. Experimental results show that the incremental approach is more time-efficient than full re-tokenization, with the advantage most evident on large text datasets. This efficiency is attributed to the algorithm's localized re-tokenization strategy, which limits processing to the areas of text around modifications. The thesis concludes by suggesting that incremental re-tokenization could significantly improve the responsiveness and resource efficiency of text-based applications such as chatbots and virtual assistants. Future work may focus on predictive models that anticipate the impact of text changes on token stability, and on optimizing the algorithm for different text contexts.
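
The localized strategy summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm, which this record does not describe in detail. It assumes a toy BPE tokenizer whose pieces never cross a whitespace boundary, so only the whitespace-delimited chunks touching an edit need to be re-tokenized. The merge table and the names `MERGES`, `bpe_word`, `tokenize`, and `retokenize_incremental` are hypothetical and chosen for illustration only.

```python
from bisect import bisect_left
from itertools import accumulate, groupby

# Hypothetical toy merge table, highest priority first (a trained model has many more).
MERGES = [("t", "o"), ("k", "e"), ("to", "ke"), ("toke", "n"), ("i", "z"), ("e", "r")]
RANKS = {pair: rank for rank, pair in enumerate(MERGES)}


def bpe_word(word):
    """Apply the ranked merges to a single whitespace-free chunk."""
    pieces = list(word)
    while len(pieces) > 1:
        # Pick the leftmost adjacent pair with the best (lowest) rank.
        best = min(range(len(pieces) - 1),
                   key=lambda k: RANKS.get((pieces[k], pieces[k + 1]), len(RANKS)))
        if (pieces[best], pieces[best + 1]) not in RANKS:
            break  # no mergeable pair is left
        pieces[best:best + 2] = [pieces[best] + pieces[best + 1]]
    return pieces


def tokenize(text):
    """Full tokenization: whitespace runs stay whole, other runs are BPE-split."""
    tokens = []
    for is_space, run in groupby(text, key=str.isspace):
        chunk = "".join(run)
        tokens.extend([chunk] if is_space else bpe_word(chunk))
    return tokens


def _is_boundary(text, i):
    """True if position i separates a whitespace run from a non-whitespace run."""
    return i == 0 or i == len(text) or text[i - 1].isspace() != text[i].isspace()


def retokenize_incremental(old_text, old_tokens, start, end, replacement):
    """Replace old_text[start:end] and re-tokenize only the affected window.

    Assumes 0 <= start <= end <= len(old_text) and old_tokens == tokenize(old_text).
    """
    # Widen the window to chunk boundaries strictly outside the edited span,
    # so that those boundaries still exist after the edit.
    lo = max(start - 1, 0)
    while not _is_boundary(old_text, lo):
        lo -= 1
    hi = min(end + 1, len(old_text))
    while not _is_boundary(old_text, hi):
        hi += 1
    # Map character offsets to token indices. Offsets are recomputed here for
    # brevity; a real implementation would store and update them instead.
    offsets = [0, *accumulate(len(t) for t in old_tokens)]
    i, j = bisect_left(offsets, lo), bisect_left(offsets, hi)
    window = old_text[lo:start] + replacement + old_text[end:hi]
    new_text = old_text[:lo] + window + old_text[hi:]
    return new_text, old_tokens[:i] + tokenize(window) + old_tokens[j:]


text = "the tokenizer splits text"
tokens = tokenize(text)
new_text, new_tokens = retokenize_incremental(text, tokens, 4, 13, "token")
assert new_tokens == tokenize(new_text)  # matches a full re-tokenization
```

In this sketch only the window between `lo` and `hi` is passed to `tokenize`; tokens outside it are reused unchanged. That reuse is the property a localized approach relies on, and it holds here only because the toy tokenizer never merges across whitespace.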

http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-221890

Natural Language Processing

Tokenization

Re-tokenization

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:umu-221890
Date	January 2024
Creators	Hellsten, Simon
Publisher	Umeå universitet, Institutionen för datavetenskap
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	UMNAD ; 1452
