Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing

VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild

Learning Audio-Visual Correlations from Variational Cross-Modal Generation

Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents

Gated Channel Transformation for Visual Recognition

Unsupervised Person Re-identification via Softened Similarity Learning

Imitative Non-Autoregressive Modeling for Trajectory Forecasting and Imputation

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

Cascaded Revision Network for Novel Object Captioning

Dual Attention Matching for Audio-Visual Event Localization

Pose-Guided Feature Alignment for Occluded Person Re-identification

Auto-ReID: Searching for a Part-aware ConvNet for Person Re-Identification

Baidu-UTS Submission to the EPIC-Kitchens Action Recognition Challenge 2019

Decoupled Novel Object Captioner

Exploit the Unknown Gradually: One-Shot Video-Based Person Re-Identification by Stepwise Learning