Shortcoming in multilingual visual translation ability

#4
by liziming - opened

I tried jina-vlm on multilingual visual translation tasks, which involve translating text in multiple languages from an image into Mandarin. However, the model seems to have some shortcomings on this task. Here are some examples:
1.
image
I asked the model to translate the text in the image, but the model only did OCR and did not provide a translation result.
2.
image
I asked the model to translate the small text under the main title of the book on the left. The model did not follow the instruction and instead answered: "This book is an introductory book on Python machine learning, including deep learning, neural networks, reinforcement learning, natural language processing, computer vision, etc."
In addition, the model has other issues, such as translation errors and outputting a large amount of repetitive text.
I'm curious to know if any of these points resonate with your experience. Any perspective or analysis you could offer would be greatly valued.

Jina AI org

I think a prompt like "Describe the image in {language}" works better.
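For anyone who wants to compare the two prompt styles side by side, here is a minimal sketch that fills in the `{language}` placeholder. The helper name `build_prompt` and the exact wording of the "translate" template are assumptions for illustration only; they are not part of jina-vlm. Pass the resulting string to whatever inference call you already use.

```python
# Hypothetical helper: only builds prompt strings, it does not call any model.
PROMPT_TEMPLATES = {
    # Style suggested in this thread.
    "describe": "Describe the image in {language}.",
    # Assumed wording for the direct-translation style from the original post.
    "translate": "Translate the text in the image into {language}.",
}


def build_prompt(language: str, style: str = "describe") -> str:
    """Return a prompt for the given target language and prompt style."""
    if style not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown prompt style: {style!r}")
    return PROMPT_TEMPLATES[style].format(language=language)


if __name__ == "__main__":
    # Print both variants so they can be tried on the same image.
    for style in PROMPT_TEMPLATES:
        print(build_prompt("Mandarin Chinese", style))
```

Running both variants on the same image makes it easier to tell whether the OCR-only output comes from the prompt wording or from the model itself.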

image

image
