
Do textual descriptions help action recognition?

URICCHIO, TIBERIO;
2016-01-01

Abstract

In this paper we present a novel method that improves action recognition by leveraging a set of captioned videos. By learning linear projections that map videos and text onto a common space, our approach obtains improved results on unseen videos. We also propose a novel structure-preserving loss that further improves the quality of the projections. We evaluated our method on the challenging, realistic Hollywood2 action recognition dataset, where it obtains a considerable gain in performance. We show that the gain is proportional to the number of training samples used to learn the projections.
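The abstract does not specify the projection-learning objective beyond "linear projections onto a common space". As a rough, hypothetical illustration of that general idea (not the authors' method, and without their structure-preserving term), a classical CCA formulation with toy stand-in features could look like:

```python
import numpy as np

# Toy stand-in data: N paired samples of video features (dv-dim) and
# text features (dt-dim) sharing a latent structure. In the paper the
# inputs would be real video descriptors and caption representations.
rng = np.random.default_rng(0)
N, dv, dt, k = 200, 50, 30, 10
Z = rng.normal(size=(N, 8))                              # shared latent factors
Xv = Z @ rng.normal(size=(8, dv)) + 0.1 * rng.normal(size=(N, dv))
Xt = Z @ rng.normal(size=(8, dt)) + 0.1 * rng.normal(size=(N, dt))

def cca_projections(Xv, Xt, k, reg=1e-3):
    """Learn linear projections mapping both modalities into a k-dim
    common space (classical CCA with a small ridge regularizer)."""
    Xv = Xv - Xv.mean(0)
    Xt = Xt - Xt.mean(0)
    Cvv = Xv.T @ Xv / len(Xv) + reg * np.eye(Xv.shape[1])
    Ctt = Xt.T @ Xt / len(Xt) + reg * np.eye(Xt.shape[1])
    Cvt = Xv.T @ Xt / len(Xv)
    # Whiten each view, then take the SVD of the whitened cross-covariance.
    Ev = np.linalg.inv(np.linalg.cholesky(Cvv))
    Et = np.linalg.inv(np.linalg.cholesky(Ctt))
    U, s, Vt = np.linalg.svd(Ev @ Cvt @ Et.T)
    Wv = Ev.T @ U[:, :k]
    Wt = Et.T @ Vt[:k].T
    return Wv, Wt

Wv, Wt = cca_projections(Xv, Xt, k)

# In the shared space, paired projections should correlate strongly,
# which is what makes text useful for recognizing unseen videos.
proj_corr = np.corrcoef((Xv - Xv.mean(0)) @ Wv[:, 0],
                        (Xt - Xt.mean(0)) @ Wt[:, 0])[0, 1]
```

Once the projections are learned, an unseen video can be projected with `Wv` and matched against projected action descriptions, which is one plausible way a common space supports the transfer the abstract describes.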
2016
9781450336031
Files in this item:
File: p645-bruni.pdf
Size: 798.62 kB
Format: Adobe PDF
Access: authorized users only
License: publisher's copyright

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11393/313350
Warning! The displayed data have not been validated by the university.

Citations
  • PMC: not available
  • Scopus: 4
  • Web of Science (ISI): not available