This paper, Bidirectional LSTM-CRF Models for Sequence Tagging (Huang et al., arXiv 2015), describes how to use a BiLSTM+CRF model for sequence tagging in NLP tasks.

Normally, if you run into a sequence tagging problem, you would think of an RNN.

That is because the key point of the problem is the sequence itself.

So this paper implements four models: an LSTM network, a BiLSTM network, an LSTM-CRF network, and a BiLSTM-CRF network.

First, an LSTM network processes information from left to right:

[Figure: an LSTM network (Huang et al., arXiv 2015)]

As you may already know from the RNN structure, an LSTM utilizes the information from previous steps together with the current input.
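
Below is a minimal sketch of this left-to-right processing in PyTorch. The layer sizes and the emission projection are illustrative assumptions of mine, not the paper's actual setup.

```python
import torch
import torch.nn as nn

# A minimal sketch of a unidirectional LSTM tagger; sizes are
# illustrative, not the paper's hyperparameters.
embedding_dim, hidden_dim, num_tags = 100, 128, 9

lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
emission = nn.Linear(hidden_dim, num_tags)   # per-token tag scores

x = torch.randn(1, 5, embedding_dim)         # (batch, seq_len, embedding_dim)
outputs, (h_n, c_n) = lstm(x)                # a single left-to-right pass
tag_scores = emission(outputs)               # (1, 5, num_tags)
```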

So an LSTM utilizes only past information at each time step, but a BiLSTM is different, as follows:

Second, a bidirectional LSTM network looks like this:

[Figure: a bidirectional LSTM network (Huang et al., arXiv 2015)]

A BiLSTM has two LSTMs: one is a forward LSTM and the other is a backward LSTM.

The operation of the two LSTMs is the same; only the direction of information flow differs.

Let’s see how to take advantage of a BiLSTM to extract information.

There are two ways to extract information: one uses only the final states, and the other uses the sequence of outputs at each time step.

First, use the final states (outputs), which summarize the information of the forward and backward passes respectively:

OR

Second, use the contextual representations of the forward and backward passes at each time step.

As you can see, for a sequence labeling problem, we need to use the contextual representations.
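
Here is a small PyTorch sketch of both extraction methods; all sizes and names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 100, 128
bilstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
                 bidirectional=True)

x = torch.randn(1, 5, embedding_dim)           # (batch, seq_len, embedding_dim)
outputs, (h_n, c_n) = bilstm(x)

# (1) Final states only: one summary vector per direction, useful for
#     whole-sentence tasks such as classification.
summary = torch.cat([h_n[0], h_n[1]], dim=-1)  # (1, 2 * hidden_dim)

# (2) Contextual representation at every time step: forward and backward
#     outputs concatenated per token, which is what sequence labeling needs.
contextual = outputs                           # (1, 5, 2 * hidden_dim)
```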

Conditional Random Field

A conditional random field (CRF) is a useful probabilistic graphical model.

It considers sentence-level tag sequence information.

[Figure: a CRF network (Huang et al., arXiv 2015)]
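
To make "sentence level" concrete, here is a toy sketch of how a CRF scores an entire tag sequence: per-token emission scores plus learned tag-to-tag transition scores. The names and sizes are mine, not the paper's.

```python
import torch

num_tags, seq_len = 5, 4
emissions = torch.randn(seq_len, num_tags)     # per-token tag scores
transitions = torch.randn(num_tags, num_tags)  # transitions[i][j]: score of tag i -> tag j

def sequence_score(tags):
    # Score a whole tag sequence: emissions at each step plus the
    # transition between every pair of adjacent tags.
    score = emissions[0, tags[0]]
    for t in range(1, seq_len):
        score = score + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

# Two candidate tag sequences for the same sentence can score differently
# purely because of the transitions between their tags.
print(sequence_score([0, 1, 1, 2]), sequence_score([0, 2, 1, 2]))
```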

Let’s see the combination of LSTM and CRF.

First, an LSTM with a CRF layer:

[Figure: an LSTM-CRF network (Huang et al., arXiv 2015)]

Second, a bidirectional LSTM with a CRF layer:

[Figure: a BiLSTM-CRF network (Huang et al., arXiv 2015)]
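
Putting the pieces together, here is a rough sketch of how a BiLSTM-CRF could be wired up. It is only an illustration under my own assumptions; the CRF forward algorithm for training and Viterbi decoding are omitted, and the names are not the paper's.

```python
import torch
import torch.nn as nn

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.bilstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.emission = nn.Linear(2 * hidden_dim, num_tags)
        # CRF part: tag-to-tag transition scores, trained jointly with
        # the BiLSTM (loss and decoding omitted in this sketch).
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags))

    def forward(self, tokens):
        # The BiLSTM supplies per-token emission scores; a full CRF layer
        # would combine them with self.transitions at the sentence level.
        outputs, _ = self.bilstm(self.embed(tokens))
        return self.emission(outputs)

model = BiLSTMCRF(vocab_size=1000, embedding_dim=100, hidden_dim=128, num_tags=9)
scores = model(torch.randint(0, 1000, (1, 5)))   # (1, 5, 9)
```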

Note that in this paper, the LSTM is a variant, like a peephole LSTM.

They use the cell state as an input to the input, output, and forget gates.

In particular, the weight matrices from the cell state to the gate vectors are diagonal, so each element of a gate vector only receives input from the corresponding element of the cell vector.
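
Based on the equations in the paper, one step of this peephole LSTM could look like the numpy sketch below. Because the cell-to-gate weight matrices are diagonal, they reduce to elementwise products with vectors (w_ci, w_cf, w_co here); all sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W_xi, W_xf, W_xc, W_xo = (rng.standard_normal((d_h, d_in)) for _ in range(4))
W_hi, W_hf, W_hc, W_ho = (rng.standard_normal((d_h, d_h)) for _ in range(4))
w_ci, w_cf, w_co = (rng.standard_normal(d_h) for _ in range(3))  # diagonal peepholes
b_i, b_f, b_c, b_o = (np.zeros(d_h) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    i = sigmoid(W_xi @ x + W_hi @ h_prev + w_ci * c_prev + b_i)   # input gate
    f = sigmoid(W_xf @ x + W_hf @ h_prev + w_cf * c_prev + b_f)   # forget gate
    c = f * c_prev + i * np.tanh(W_xc @ x + W_hc @ h_prev + b_c)  # new cell state
    o = sigmoid(W_xo @ x + W_ho @ h_prev + w_co * c + b_o)        # output gate sees c_t
    return o * np.tanh(c), c

h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h))
```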

Additionally, they used the BIO2 annotation standard for the chunking and NER tasks.
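
For example, under BIO2 every entity starts with a B- tag, even when it does not follow another entity of the same type; this is my own illustrative sentence, not one from the paper:

```python
tokens = ["John", "lives", "in", "New", "York", "."]
tags   = ["B-PER", "O",    "O",  "B-LOC", "I-LOC", "O"]
```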

They also use a connection trick for the features, like this:

[Figure: the features connection trick (Huang et al., arXiv 2015)]
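
As a sketch of my reading of this trick (an assumption, not the paper's code): the word features go through the BiLSTM, while the engineered spelling and context features are connected directly to the output layer, and both contribute to the per-token tag scores.

```python
import torch
import torch.nn as nn

hidden_dim, feat_dim, num_tags = 128, 20, 9   # illustrative dimensions

to_tags_from_lstm = nn.Linear(2 * hidden_dim, num_tags)
to_tags_from_feats = nn.Linear(feat_dim, num_tags)  # direct feature -> output connection

bilstm_out = torch.randn(1, 5, 2 * hidden_dim)      # BiLSTM outputs for 5 tokens
features = torch.randn(1, 5, feat_dim)              # spelling/context features

tag_scores = to_tags_from_lstm(bilstm_out) + to_tags_from_feats(features)
```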

They evaluated the robustness of the models with respect to engineered features (spelling and context features).

So they trained their models with word features only (spelling and context features removed).

They argued that:

  • The CRF model relies heavily on engineered features to obtain good performance.

  • On the other hand, the LSTM, BiLSTM, and BiLSTM-CRF models are more robust and are less affected by the removal of the engineered features.

Reference

Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv:1508.01991.