Abstract: Current audio-visual representation learning can capture coarse object categories (e.g., "animals" and "instruments"), but it cannot recognize fine-grained details, such as ...
The baseball is a curious object. One killed a man once, Ray Chapman in 1920. These days dozens are used in a single ...