This is a brief summary, written to help me study and organize what I read, of Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss (Plank et al., ACL 2016).

They performed POS tagging experiments across multiple languages (22 languages in total).

They used character embeddings and byte embeddings to handle rare words, as follows:

[Figure: model architecture, from Plank et al., ACL 2016]
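Below is a minimal sketch of the sub-token idea: each word's character (and byte) sequence is summarized by its own bi-LSTM, and the resulting vectors are concatenated with the word embedding. This assumes PyTorch and is not the authors' implementation; every class name, vocabulary size, and dimension here is illustrative:

```python
import torch
import torch.nn as nn

class SubTokenWordEncoder(nn.Module):
    """Builds a word vector from word, character, and byte embeddings."""
    def __init__(self, word_vocab=10000, char_vocab=200, byte_vocab=256,
                 word_dim=64, sub_dim=100):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, sub_dim)
        self.byte_emb = nn.Embedding(byte_vocab, sub_dim)
        # Bi-LSTMs run over the character / byte sequence of a single word;
        # the final state of each direction summarizes the word's spelling.
        self.char_lstm = nn.LSTM(sub_dim, sub_dim, bidirectional=True,
                                 batch_first=True)
        self.byte_lstm = nn.LSTM(sub_dim, sub_dim, bidirectional=True,
                                 batch_first=True)

    def _summary(self, lstm, emb):
        # h_n has shape (num_directions, batch, hidden); concatenate the
        # last forward and last backward hidden states.
        _, (h_n, _) = lstm(emb)
        return torch.cat([h_n[0], h_n[1]], dim=-1)

    def forward(self, word_id, char_ids, byte_ids):
        # word_id: (batch,); char_ids, byte_ids: (batch, seq_len)
        w = self.word_emb(word_id)
        c = self._summary(self.char_lstm, self.char_emb(char_ids))
        b = self._summary(self.byte_lstm, self.byte_emb(byte_ids))
        # Even an unseen word still gets informative vectors from c and b.
        return torch.cat([w, c, b], dim=-1)
```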

In their model, they train the bi-LSTM tagger to predict both the tags of the sequence and, as an auxiliary task, a label that represents the log frequency of the token being tagged, as estimated from the training data:

log frequency label: \(\mathrm{int}(\log(\mathrm{freq}_{\mathrm{train}}(w)))\)
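As a worked example, here is a small sketch of the auxiliary label and the joint objective; it assumes PyTorch, and the toy corpus, tag/bin counts, and variable names are made up for illustration:

```python
import math
from collections import Counter

import torch
import torch.nn as nn

# 1) Frequency label: int(log(freq_train(w))), computed from training counts.
train_tokens = ["the", "dog", "saw", "the", "cat", "the", "dog"]
freq_train = Counter(train_tokens)

def freq_label(word):
    # Unseen words fall back to count 1, i.e. bin 0.
    return int(math.log(freq_train.get(word, 1)))

print({w: freq_label(w) for w in freq_train})
# -> {'the': 1, 'dog': 0, 'saw': 0, 'cat': 0}

# 2) Multi-task heads: the shared bi-LSTM state at each time step feeds two
# classifiers, one for the POS tag and one for the frequency bin, and the
# two cross-entropy losses are summed.
HIDDEN, NUM_TAGS, NUM_BINS, SEQ_LEN = 128, 17, 6, 5
states = torch.randn(SEQ_LEN, HIDDEN)        # stand-in for bi-LSTM outputs
tag_head = nn.Linear(HIDDEN, NUM_TAGS)
freq_head = nn.Linear(HIDDEN, NUM_BINS)
ce = nn.CrossEntropyLoss()

gold_tags = torch.randint(0, NUM_TAGS, (SEQ_LEN,))
gold_bins = torch.randint(0, NUM_BINS, (SEQ_LEN,))
loss = ce(tag_head(states), gold_tags) + ce(freq_head(states), gold_bins)
```

Because many rare words share the same frequency bin, the auxiliary loss encourages the model to learn representations that carry over to rare tokens.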

They measured the performance with respect to label noise, training data size, and rare words.

The results showed:

  • For rare words, rare tokens benefit from the sub-token (character and byte) representations.
  • For data size, the bi-LSTM model is better with more data, but TnT, based on a second-order HMM, is better with little data. The bi-LSTM model always wins over the CRF.
  • For label noise, bi-LSTMs are less robust, showing larger drops in accuracy compared to TnT.

Reference

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016).