This post is a brief summary of a paper I read out of study and curiosity: Visual Instruction Tuning (Liu et al., arXiv 2023).

The primary goal is to effectively leverage the capabilities of both the pre-trained LLM and the visual model. The network architecture is illustrated in Figure 1 of Liu et al. (arXiv 2023).

For the visual tokens \(H_v\), they use a simple linear projection to map the visual encoder features into the LLM's word embedding space:

\[
H_v = W \cdot Z_v, \quad \text{where } Z_v = g(X_v)
\]

Here \(g(\cdot)\) is the frozen visual encoder, \(X_v\) is the input image, and \(W\) is the trainable projection matrix (Liu et al., arXiv 2023).
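The projection is just a single matrix multiply per patch feature. A minimal NumPy sketch of this step, with hypothetical dimensions chosen only for illustration (the paper's actual encoder and LLM sizes may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration, not the paper's exact dimensions.
num_patches, d_vision, d_llm = 256, 1024, 4096

# Z_v: patch features from the frozen visual encoder g(X_v)
Z_v = rng.standard_normal((num_patches, d_vision))

# W: the trainable projection matrix
W = rng.standard_normal((d_llm, d_vision)) * 0.01

# H_v = W · Z_v : each patch feature is mapped into the LLM embedding space
H_v = Z_v @ W.T

print(H_v.shape)  # (256, 4096) — one LLM-dimensional token per patch
```

Because the projection is linear, it is cheap to train while both the visual encoder and (in the first stage) the LLM stay frozen.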

For the instruction \(X^t_{\mathrm{instruct}}\) at turn \(t\), it is comprised of the following:

\[
X^t_{\mathrm{instruct}} =
\begin{cases}
\text{randomly choose } [X^1_q, X_v] \text{ or } [X_v, X^1_q], & \text{the first turn } t = 1\\
X^t_q, & \text{the remaining turns } t > 1
\end{cases}
\]

That is, the image \(X_v\) appears only in the first turn, randomly placed before or after the first question \(X^1_q\); later turns use the question alone (Liu et al., arXiv 2023).
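The rule above can be sketched as a small helper. The function name, the `<image>` placeholder token, and the newline joining are my own illustrative assumptions; only the random first-turn ordering and question-only later turns come from the paper:

```python
import random

def build_instruction(turn, question, image_token="<image>", seed=None):
    """Build X_instruct^t for one conversation turn.

    Turn 1: the image token is randomly placed before or after the
    first question. Turns > 1: the question alone.
    """
    if turn == 1:
        rng = random.Random(seed)
        if rng.random() < 0.5:
            return f"{question}\n{image_token}"   # [X_q^1, X_v]
        return f"{image_token}\n{question}"       # [X_v, X_q^1]
    return question                               # X_q^t for t > 1

# Usage: first turn carries the image, later turns do not.
print(build_instruction(1, "What is shown in the image?", seed=0))
print(build_instruction(2, "What color is it?"))
```

Randomizing the image position in the first turn prevents the model from relying on a fixed image placement.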

For detailed experiments and analysis, refer to the paper, Visual Instruction Tuning (Liu et al., arXiv 2023).

Reference

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. arXiv:2304.08485.