Enabling Detailed Action Recognition Evaluation Through Video Dataset Augmentation

SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding

Multi-query Video Retrieval

Quantized GAN for Complex Music Generation from Dance Videos

Learning to Learn by Jointly Optimizing Neural Architecture and Weights

Large-scale Video Panoptic Segmentation in the Wild: A Benchmark

Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing

VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild

Learning Audio-Visual Correlations from Variational Cross-Modal Generation

Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents

Gated Channel Transformation for Visual Recognition

Unsupervised Person Re-identification via Softened Similarity Learning

Imitative Non-Autoregressive Modeling for Trajectory Forecasting and Imputation

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

Cascaded Revision Network for Novel Object Captioning