Abstract: Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be ...
Abstract: In a globalized world where people speak different languages and create data in multiple languages, it can become challenging to share information. Traditional methods of using textual ...