This post is a brief summary of Visual Instruction Tuning (Liu et al., arXiv 2023), a paper I read for study and out of curiosity.

The primary goal of the architecture is to effectively leverage the capabilities of both the pre-trained LLM and the pre-trained visual model. The network architecture is illustrated in Figure 1 (Liu et al., arXiv 2023).

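For the visual tokens H_v, they apply a trainable linear projection W to the visual features Z_v produced by the vision encoder, H_v = W · Z_v, so that the visual tokens have the same dimensionality as the word embeddings of the LLM (Liu et al., arXiv 2023).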
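
To make the linear projection concrete, here is a minimal dependency-free sketch. It is not the authors' implementation: the function name `project_visual_features`, the toy sizes (3 visual tokens, feature width 4, embedding width 2), and the integer weights are all assumptions chosen for illustration. It uses the row-vector convention, so each row of visual features is multiplied by the projection matrix.

```python
def project_visual_features(z_v, w):
    """Project visual features into the LLM embedding space (illustrative only).

    z_v: N rows of length d_v, a stand-in for the vision encoder output.
    w:   d_v rows of length d_h, a stand-in for the trainable projection matrix.
    Returns N rows of length d_h, one per visual token.
    """
    d_v, d_h = len(w), len(w[0])
    if any(len(row) != d_v for row in z_v):
        raise ValueError("each feature row must have as many entries as W has rows")
    # Row-vector convention: every visual token is z @ W.
    return [
        [sum(row[k] * w[k][j] for k in range(d_v)) for j in range(d_h)]
        for row in z_v
    ]


# Toy values (hypothetical): 3 visual tokens, d_v = 4, d_h = 2.
z_v = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1]]
w = [[1, 2], [3, 4], [5, 6], [7, 8]]
h_v = project_visual_features(z_v, w)
print(h_v)  # [[1, 2], [3, 4], [16, 20]]
```

The point of the sketch is the shape change: the number of visual tokens stays the same, while each token moves from the vision encoder's feature width to the LLM's embedding width.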

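The instruction X_instruct^t at turn t is composed as follows: in the first turn (t = 1), it is either [X_q^1, X_v] or [X_v, X_q^1], chosen at random, where X_q^1 is the first question and X_v is the image; in every later turn (t > 1), it is just the question X_q^t (Liu et al., arXiv 2023).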
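
The per-turn instruction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_instruction`, the `<image>` placeholder string, the newline separator, and the 50/50 ordering probability are hypothetical choices, not details taken from this post. In the first turn the question and the image placeholder are placed in a random order; in later turns only the question is used.

```python
import random


def build_instruction(turn, question, image_token="<image>", rng=random):
    """Build the instruction for a 1-based conversation turn (illustrative only).

    image_token is a hypothetical placeholder standing in for the image.
    rng is any object with a random() method, e.g. random.Random(seed).
    """
    if turn < 1:
        raise ValueError("turn is 1-based")
    if turn == 1:
        # First turn: question and image, in a randomly chosen order.
        parts = [question, image_token]
        if rng.random() < 0.5:
            parts.reverse()
        return "\n".join(parts)
    # Later turns: the question alone.
    return question


rng = random.Random(0)  # seeded only to make the sketch reproducible
first = build_instruction(1, "What is in the picture?", rng=rng)
later = build_instruction(2, "What color is it?", rng=rng)
print(first)
print(later)
```

Passing a seeded `random.Random` instance keeps the ordering reproducible, which is convenient when inspecting generated training samples.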

For detailed experiments and explanations, refer to the paper, Visual Instruction Tuning (Liu et al., arXiv 2023).

Reference