Knowledge Distillation of DNABERT for Prediction of Genomic Elements / Kunskapsdestillation av DNABERT för prediktion av genetiska attribut

Understanding the information encoded in the human genome and the influence of each part of the DNA sequence is a fundamental problem of our society that can be key to unveiling the mechanisms of common diseases. With the latest technological developments in genomics, many research institutes have the tools to collect massive amounts of genomic data. Nevertheless, there is a lack of tools that can process and analyse these datasets in a biologically reliable and efficient manner. Many deep learning solutions have been proposed for current genomic tasks, but most of the time the main research interest lies in the underlying biological mechanisms rather than in high scores on the predictive metrics themselves. Recently, the state of the art in deep learning has shifted towards large transformer models, which use an attention mechanism that can be leveraged for interpretability. The main drawbacks of these large models are that they require a lot of memory and have high inference times, which may make their use infeasible in practical applications. In this work, we test the appropriateness of knowledge distillation for obtaining more usable and equally performing models that genomic researchers can easily fine-tune to solve their scientific problems. DNABERT, a transformer model pre-trained on DNA data, is distilled following two strategies: DistilBERT and MiniLM. Four student models of different sizes are obtained and fine-tuned for promoter identification. They are evaluated on three key aspects: classification performance, usability and biological relevance of the predictions. The latter is assessed by visually inspecting the attention maps of TATA-promoter predictions, which are expected to show a peak of attention at the well-known TATA motif present in these sequences.
Results show that it is indeed possible to obtain significantly smaller models that perform equally well in the promoter identification task, with no major differences between the two techniques tested. The smallest distilled model shows a decrease of less than 1% in all performance metrics evaluated (accuracy, F1 score and Matthews correlation coefficient) and a 7.3x increase in inference speed, while having only 15% of the parameters of DNABERT. The attention maps of the student models show that they successfully learn to mimic the general understanding of DNA that DNABERT possesses.
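For readers unfamiliar with the three classification metrics named above, the following is a minimal sketch of how accuracy, F1 score and the Matthews correlation coefficient follow from the counts of a binary confusion matrix (promoter versus non-promoter). The function name `binary_metrics` and the example counts are hypothetical and are not taken from the thesis, which may compute these metrics differently.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, F1, MCC) from binary confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives. Each metric falls back to 0.0
    when its denominator is zero (a convention assumed here).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    # F1 is the harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC ranges from -1 (total disagreement) to +1 (perfect prediction).
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return accuracy, f1, mcc


if __name__ == "__main__":
    # Illustrative counts only, not results from the thesis.
    acc, f1, mcc = binary_metrics(tp=45, fp=5, tn=40, fn=10)
    print(f"accuracy={acc:.3f} f1={f1:.3f} mcc={mcc:.3f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, so it remains informative when the two classes are imbalanced.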

http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-314806

Knowledge distillation

Transformers

BERT

Genomics

Promoter identification

Explainability

Medical Engineering

Medicinteknik

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-314806
Date	January 2022
Creators	Palés Huix, Joana
Publisher	KTH, Skolan för kemi, bioteknologi och hälsa (CBH)
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	TRITA-CBH-GRU ; 2022:089
