Learning to understand visual context in images and videos is a challenging task in computer vision. Given an image, we can learn to classify objects and further localize them by exploiting the spatial information in the image. For videos, there are also tasks that aim to classify different sequences and predict their temporal intervals within a video. In this case, not only does spatial information matter, but we also need to exploit the temporal dependence among sequences in order to better understand the video. In this thesis, we first learn to recognize brake and turn signals of vehicles via a spatial-temporal model. Second, we construct a dataset and a deep model to explore how machines can help us understand the commands demonstrated in Photoshop tutorial videos.
Many of these visual understanding approaches involve supervised learning models that rely on large annotated datasets. These models do not generalize well when the testing domain differs from the training one, and annotating another dataset for each new domain incurs a high labor cost. We address this issue in the thesis by proposing a progressive adaptation method that aligns the labeled domain with the testing one, and we conduct experiments on the object detection task. Our progressive adaptation alleviates the difficulty posed by large domain discrepancies by introducing an intermediate domain that sits between the two domains and gradually adapting to the testing data distribution.