Paper reading: Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos

Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos

We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., "dressing") to the action (e.g., "mix yogurt") that produced it. The key challenge is the inevitable visual-linguistic ambiguities arising from the changes in both visual appearance and referring expression of an entity in the video. This challenge is amplified by the fact that we aim to resolve references with no supervision. We address these challenges by learning a joint visual-linguistic model, where linguistic cues can help resolve visual ambiguities and vice versa. We verify our approach by learning our model unsupervisedly using more than two thousand unstructured cooking videos from YouTube, and show that our visual-linguistic model can substantially improve upon the state-of-the-art linguistic only model on reference resolution in instructional videos.

(Translated summary of the abstract.) The authors propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., "dressing") to the action (e.g., "mix yogurt") that produced it. The key challenge is the unavoidable visual-linguistic ambiguity caused by changes in both the visual appearance and the referring expression of an entity over the course of a video, and this challenge is amplified because the references are resolved without any supervision. They address it by learning a joint visual-linguistic model in which linguistic cues help resolve visual ambiguities and vice versa. The approach is validated by training the model in an unsupervised manner on more than two thousand unstructured cooking videos from YouTube, where the joint visual-linguistic model substantially improves over the state-of-the-art linguistic-only model on reference resolution in instructional videos.
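The abstract only describes the task at a high level. As a rough illustration of what "temporally linking an entity to the action that produced it" could look like, here is a minimal, hypothetical Python sketch (not the paper's actual model): it scores earlier action steps for a given entity mention by combining a toy linguistic prior with a toy visual similarity, mirroring the idea that the two cues can disambiguate each other. All names, features, and weights below are made up for illustration.

```python
# Hypothetical sketch of joint visual-linguistic reference scoring.
# Not the paper's model: the lookup table, features, and weights are toy values.
from dataclasses import dataclass
from typing import List

@dataclass
class ActionStep:
    verb: str                    # e.g., "mix"
    objects: List[str]           # e.g., ["yogurt", "garlic"]
    visual_feature: List[float]  # placeholder appearance feature of the step's output

def linguistic_score(entity: str, step: ActionStep) -> float:
    # Toy linguistic prior: an entity is more likely produced by a step
    # whose objects are plausible ingredients of it (hand-written lookup).
    plausible = {"dressing": {"yogurt", "oil", "vinegar"}}
    return 1.0 if plausible.get(entity, set()) & set(step.objects) else 0.1

def visual_score(entity_feature: List[float], step: ActionStep) -> float:
    # Toy appearance cue: negative squared L2 distance between features.
    return -sum((a - b) ** 2 for a, b in zip(entity_feature, step.visual_feature))

def resolve(entity: str, entity_feature: List[float],
            history: List[ActionStep], alpha: float = 0.5) -> ActionStep:
    # Link the entity to the past step maximizing a weighted sum of both cues.
    return max(history, key=lambda s: alpha * linguistic_score(entity, s)
                                      + (1 - alpha) * visual_score(entity_feature, s))

# Example: "dressing" is linked to the "mix yogurt" step, not "chop cucumber".
steps = [ActionStep("chop", ["cucumber"], [0.9, 0.1]),
         ActionStep("mix", ["yogurt", "garlic"], [0.2, 0.8])]
print(resolve("dressing", [0.25, 0.75], steps).verb)  # -> "mix"
```

In the actual paper the two cues are learned jointly and without supervision, rather than hand-specified as in this sketch.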

Original post: https://www.cnblogs.com/feifanrensheng/p/14020333.html