Global ETD Search

Return to search

The Student Becomes The Teacher: Training High-Performance Language Models More Sample-Efficiently From Small Models Via Superstilling

Recent advances including the Transformer architecture have revolutionized the Natural Language Processing community by providing immense performance improvements across many tasks, including the development of Large Language Models (LLMs). LLMs show enormous promise as few-shot learners, common-sense knowledge repositories, conversational agents, writing assistants, and coding tools, and are gaining widespread traction in commercial industry. However, LLMs are expensive and time-consuming to train, requiring many passes over terabytes of data for the largest models. In this paper, we present Superstilling, a method for reducing the sample complexity of language model training by distilling the knowledge from a previously-trained model (the teacher) into a new, larger model (the student). This method does not require conformity between the architectures of the two models, and can be applied even when the weights and training data of the teacher model are not available, for example in federated learning scenarios. We apply Superstilling to train models of various sizes and show this method can decrease sample complexity by more than 10\% on models with over 160M parameters. We also show that in certain scenarios, Superstilling can be used to speed up training despite the need to run the teacher and student models simultaneously.

large language models

knowledge distillation

sample efficiency

transformer

deep learning

Physical Sciences and Mathematics

Identifer	oai:union.ndltd.org:BGMYU2/oai:scholarsarchive.byu.edu:etd-11536
Date	14 August 2023
Creators	Gundry, Chaz Allen
Publisher	BYU ScholarsArchive
Source Sets	Brigham Young University
Detected Language	English
Type	text
Format	application/pdf
Source	Theses and Dissertations
Rights	https://lib.byu.edu/about/copyright/

Page generated in 0.0019 seconds

The Student Becomes The Teacher: Training High-Performance Language Models More Sample-Efficiently From Small Models Via Superstilling

Description

Links & Downloads

Tags

Additional Fields