This is a brief summary of the paper Deep Contextualized Word Representations (Peters et al., NAACL 2018), which I read and studied, written to organize my notes.

Their architecture can be illustrated as follows:

(Figure: architecture diagram, from the Transfer Learning in Natural Language Processing tutorial.)

Pretrain a deep bidirectional language model (biLM), then extract contextual word vectors as a learned linear combination of its hidden states.
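Concretely, the paper computes a task-specific vector for token k as ELMo_k = γ · Σ_j s_j · h_{k,j}, where the s_j are softmax-normalized layer weights and γ is a learned scalar. Below is a minimal PyTorch sketch of that scalar mixture; the class name `ScalarMix` and the tensor shapes are my own illustrative choices, not taken from the paper's released code.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learned scalar mixture of biLM layer outputs:
    ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))  # s_j (pre-softmax)
        self.gamma = nn.Parameter(torch.ones(1))                     # task-specific scale

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, dim)
        weights = torch.softmax(self.scalar_weights, dim=0)
        mixed = (weights.view(-1, 1, 1, 1) * layer_states).sum(dim=0)
        return self.gamma * mixed

# Toy usage: 3 layers (embedding + 2 biLSTM layers), batch 2, length 5, dim 8
states = torch.randn(3, 2, 5, 8)
elmo_vectors = ScalarMix(num_layers=3)(states)
print(elmo_vectors.shape)  # torch.Size([2, 5, 8])
```

Because the mixture weights are trained jointly with the downstream task, each task can emphasize different biLM layers (e.g., lower layers for syntax, higher layers for semantics).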

For details of their method, see the videos below.

ELMo (YouTube) & language model (YouTube)

Reference