This post is a brief summary of a paper I read for study and out of curiosity: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Bai et al. arXiv 2023).

Bai et al. arXiv 2023

For detailed experiments and explanations, refer to the paper, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Bai et al. arXiv 2023).

Reference