This post is a brief summary of a paper I read out of study and curiosity: Music Transformer: Generating Music with Long-Term Structure (Huang et al., ICLR 2019). Below I briefly arrange its main content.

The original Transformer paper (Vaswani et al., NIPS 2017) uses absolute positional encoding added at the input layer.
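For reference, here is a minimal NumPy sketch of that sinusoidal input-layer encoding (the function name and shapes are my own illustration, not code from either paper):

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Absolute sinusoidal positional encoding added to the input
    embeddings in the original Transformer (Vaswani et al., NIPS 2017).
    d_model is assumed to be even."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model / 2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# The encoding is simply added to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_position_encoding(seq_len, d_model)
```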

However, this paper proposes an efficient relative position embedding for music generation, building on the method of Shaw et al. (NAACL 2018).

The authors argue that because music repeats its timing and pitch patterns, self-attention with relative position embeddings is well suited to the music generation task.

As a result, they report that a Transformer with their relative attention mechanism maintains the regular timing grid present in the JSB Chorales dataset.

For the details of how music is represented as a token sequence so that generation can be framed as an autoregressive problem, refer to the paper.

In this article, I focus on how the relative position embedding is formulated within the self-attention mechanism.

As described in the paper, Shaw et al. (NAACL 2018) introduced relative position embeddings to allow attention to be informed by how far apart two positions are in a sequence.

Following this intuition, the model learns a separate relative position embedding, which is used as a relative bias score added to the logits of the self-attention mechanism as follows:

\[RelativeAttention = Softmax\left(\frac{QK^T + S^{rel}}{\sqrt{D_h}}\right)V\]

The equation above is the key takeaway of the relative position embedding they propose: $S^{rel}$ is a relative position bias added to the attention logits before the softmax.
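To make the equation concrete, here is a minimal NumPy sketch of single-head relative attention. It uses a direct gather over relative distances in the style of Shaw et al. rather than the memory-efficient "skewing" trick the Music Transformer paper introduces, it omits the causal mask, and all function and variable names are my own:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_attention(Q, K, V, E_rel):
    """Single-head self-attention with a relative position bias S^rel
    added to the logits, following the equation above.

    Q, K, V : (L, D_h) query / key / value matrices for one head.
    E_rel   : (2L - 1, D_h) embeddings for relative distances -(L-1) .. (L-1).
    """
    L, D_h = Q.shape

    # Content-based logits Q K^T, shape (L, L).
    content_logits = Q @ K.T

    # Score every query against every relative distance, shape (L, 2L - 1),
    # then gather the entry for distance (j - i) to form S^rel[i, j].
    all_rel = Q @ E_rel.T
    rel_idx = np.arange(L)[None, :] - np.arange(L)[:, None] + (L - 1)  # (L, L)
    S_rel = np.take_along_axis(all_rel, rel_idx, axis=1)               # (L, L)

    weights = softmax((content_logits + S_rel) / np.sqrt(D_h))
    return weights @ V

# Example usage with random inputs.
L, D_h = 6, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(L, D_h)) for _ in range(3))
E_rel = rng.normal(size=(2 * L - 1, D_h))
out = relative_attention(Q, K, V, E_rel)   # (L, D_h)
```

The paper's "skewing" procedure computes the same $S^{rel}$ term more memory-efficiently; see the paper for details.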

For the detailed experiments and further explanation, refer to the paper.

References

Huang et al., "Music Transformer: Generating Music with Long-Term Structure," ICLR 2019.
Vaswani et al., "Attention Is All You Need," NIPS 2017.
Shaw et al., "Self-Attention with Relative Position Representations," NAACL 2018.