This is a brief summary of the paper "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations" (Lan et al., ICLR 2020), written to study and organize what I read.
They propose a lighter version of a BERT-based model by using two parameter-reduction techniques.
There are several ways to reduce parameters, such as:
- Pruning
- Weight sharing
- Quantization
- Low-rank approximation
- Sparse regularization
- Distillation
In ALBERT, they use weight sharing and a factorization of the embedding parameters (see the sketch after this list):
- Factorized embedding parameterization
- Cross-layer parameter sharing
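A minimal PyTorch sketch of these two ideas, with illustrative dimensions (not the paper's exact configuration): the embedding matrix is factorized from V x H into V x E plus E x H with E << H, and a single transformer layer's weights are reused at every depth.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not ALBERT's exact configuration.
vocab_size, embed_size, hidden_size, num_layers = 30000, 128, 768, 12

# 1) Factorized embedding parameterization:
#    V x H  ->  V x E  +  E x H, so the embedding table no longer
#    grows with the hidden size.
token_embedding = nn.Embedding(vocab_size, embed_size)      # V x E
embedding_projection = nn.Linear(embed_size, hidden_size)   # E x H

# 2) Cross-layer parameter sharing:
#    one set of transformer-layer weights applied at every layer.
shared_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size, nhead=12, dim_feedforward=3072, batch_first=True
)

def encode(token_ids):
    h = embedding_projection(token_embedding(token_ids))
    for _ in range(num_layers):   # same weights reused num_layers times
        h = shared_layer(h)
    return h

ids = torch.randint(0, vocab_size, (2, 16))   # (batch, seq_len)
print(encode(ids).shape)                      # torch.Size([2, 16, 768])
```

With these example sizes, the embedding parameters drop from about 30000 x 768 ≈ 23.0M to 30000 x 128 + 128 x 768 ≈ 3.9M, and the transformer stack stores one layer's weights instead of twelve.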
They also replace the next sentence prediction (NSP) loss used in the BERT paper with a different loss:
an inter-sentence coherence loss, called sentence-order prediction (SOP), sketched below.
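A minimal sketch (my own illustration, not the paper's code) of how SOP examples can be built: a positive example is two consecutive segments in their original order, and a negative example is the same two segments with their order swapped.

```python
import random

def make_sop_example(segment_a, segment_b):
    """Return (first_segment, second_segment, label); label 1 = original order."""
    if random.random() < 0.5:
        return segment_a, segment_b, 1   # original order -> positive
    return segment_b, segment_a, 0       # swapped order  -> negative

pair = make_sop_example("He went to the store.", "He bought a gallon of milk.")
print(pair)
```

Because both classes use the same two consecutive segments, the model cannot solve the task with topic cues alone and must learn discourse-level coherence.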
Note (Abstract):
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, they present two parameter reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that their proposed methods lead to models that scale much better compared to the original BERT. They also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs.
Download URL:
The paper: ALBERT- A Lite BERT for Self-supervised Learning of Language Representations (Lan et al., ICLR 2020)
Reference
- Paper
- How to use html for alert
- For your information