
Upcoming Event
Feature Representations for Visual and Language: Towards Deeper Video Understanding
This talk covers three research topics centered on deeper video understanding with Transformer-based models for feature representation. The research proposes improved systems for video question answering (QA) and humor prediction, and validates them on video-related datasets. For video QA, BERT is used to represent visual and subtitle semantics, improving accuracy on the TVQA and Pororo datasets. A comparative study of Transformer models then links their performance differences to their pre-training methods. For humor prediction, a novel multimodal method combining pose, face, and subtitle features over a sliding window outperforms previous approaches on a new comedy dataset. The work highlights the importance of selecting optimal features and models for deeper video analysis.
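
For readers unfamiliar with the feature-representation step mentioned above, the short sketch below shows one common way to encode subtitle lines with a pretrained BERT model (via the Hugging Face transformers library) and group consecutive features with a sliding window. The model name, [CLS] pooling, window size, and example lines are illustrative assumptions, not details of the systems presented in the talk.

    # Minimal sketch: subtitle feature extraction with pretrained BERT,
    # plus a simple sliding window over per-line features.
    # Assumptions (not from the talk): bert-base-uncased, [CLS] pooling, window size 3.
    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    subtitles = [
        "Why did you bring a ladder?",
        "I heard the drinks were on the house.",
        "That is not what that means.",
    ]

    with torch.no_grad():
        feats = []
        for line in subtitles:
            inputs = tokenizer(line, return_tensors="pt", truncation=True)
            outputs = model(**inputs)
            # Use the [CLS] token embedding as a fixed-size sentence feature.
            feats.append(outputs.last_hidden_state[:, 0, :])
        feats = torch.cat(feats, dim=0)  # shape: (num_lines, 768)

    # Slide a window of size 3 (step 1) over consecutive subtitle features,
    # e.g. as temporal context for a downstream humor-prediction classifier.
    windows = feats.unfold(0, 3, 1)      # shape: (num_windows, 768, 3)
    print(windows.shape)

In practice the visual stream (pose and face features in the humor-prediction work) would be encoded separately and fused with these text features; the sliding window simply gives the classifier context from neighboring moments in the video.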

Presenter
Prof. Zekun YANG, Tokyo University of Science

Date
March 12, 2026 (Thursday)

Time
11:00 am (HK Time)

Venue
CPD-3.01, Run Run Shaw Tower, Centennial Campus

Presenter's biography
Zekun YANG is an Assistant Professor at Tokyo University of Science. He graduated from the University of Osaka in 2021 and has worked at Donghua University (China) and Nagoya University (Japan). His research areas include Machine Learning, Intelligent Systems and Applications, and Multimedia Information Processing.



