This is a brief summary of the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Lan et al., ICLR 2020), which I read and am writing up to organize it for my own study.

They propose a lite version of the BERT-based model by using two parameter-reduction techniques.

There are several ways to reduce the number of parameters, such as:

  • Pruning
  • Weight sharing
  • Quantization
  • Low-rank approximation
  • Sparse regularization
  • Distillation

In the ALBERT model they propose, they used weight sharing and a factorization of the embedding parameters (a rough code sketch follows the list below):

  • Factorized embedding parameterization: the large vocabulary embedding matrix (V × H) is decomposed into two smaller matrices (V × E and E × H), with E much smaller than H.

  • Cross-layer parameter sharing: all Transformer layers share the same parameters.
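
As a rough illustration of these two ideas, here is a minimal PyTorch sketch (the class names, sizes, and the use of nn.TransformerEncoderLayer are my own assumptions, not the paper's actual code):

```python
import torch
import torch.nn as nn


class FactorizedEmbedding(nn.Module):
    """Factorized embedding parameterization.

    Instead of one large (vocab_size x hidden_size) embedding table,
    use a small (vocab_size x embedding_size) lookup followed by a
    projection (embedding_size x hidden_size), with E much smaller than H.
    """

    def __init__(self, vocab_size=30000, embedding_size=128, hidden_size=768):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_size)  # V x E
        self.projection = nn.Linear(embedding_size, hidden_size)         # E x H

    def forward(self, input_ids):
        return self.projection(self.word_embeddings(input_ids))


class SharedLayerEncoder(nn.Module):
    """Cross-layer parameter sharing: one Transformer layer reused N times."""

    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        # Only ONE set of layer parameters exists; it is applied num_layers times.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, hidden_states):
        for _ in range(self.num_layers):
            hidden_states = self.shared_layer(hidden_states)
        return hidden_states


# Rough size of the word-embedding table alone (bias terms ignored):
#   BERT-style:   30000 * 768               ~= 23.0M parameters
#   ALBERT-style: 30000 * 128 + 128 * 768   ~=  3.9M parameters
embeddings = FactorizedEmbedding()
encoder = SharedLayerEncoder()
hidden = encoder(embeddings(torch.randint(0, 30000, (2, 16))))  # (batch, seq, hidden)
```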

They also replaced the next-sentence prediction (NSP) loss used in the BERT paper with a different loss.

It is an inter-sentence coherence loss, implemented as sentence-order prediction (SOP): the positive example is two consecutive segments in their original order, and the negative example is the same two segments with their order swapped.
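
As a small sketch of the idea (the function name and label values are my own, not from the paper), SOP training pairs might be built like this:

```python
import random


def make_sop_example(segment_a, segment_b):
    """Build one sentence-order prediction (SOP) example.

    segment_a and segment_b are two consecutive text segments from the
    same document.  Positive example (label 1): keep the original order.
    Negative example (label 0): swap the two segments.
    """
    if random.random() < 0.5:
        return (segment_a, segment_b), 1   # original order -> coherent
    return (segment_b, segment_a), 0       # swapped order  -> incoherent


pair, label = make_sop_example("He went to the store.", "He bought some milk.")
print(pair, label)
```

Unlike NSP, both the positive and the negative example come from the same document, so the model cannot solve the task with topic cues alone and has to learn discourse-level coherence.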

Reference

  • Lan et al., ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, ICLR 2020. https://arxiv.org/abs/1909.11942