Computer vision encompasses a host of tasks, such as boundary detection, semantic segmentation, surface estimation, object detection, image classification, and action localization. A holistic understanding of a scene, which many real-world applications require, demands that several of these tasks be combined. For instance, an autonomous car should not only detect other cars (objects) but also recognize whether a pedestrian is walking (action). The former requires localizing the object, either at the pixel level or at the bounding-box level. The latter requires localizing the action, and by extension the actor, in both space and time. These problems are typically addressed with supervised learning models that rely on large annotated datasets, so they become even more challenging when labeled data is scarce.
In this thesis, we first tackle the problem of spatio-temporal action localization in an unsupervised setting. As the name suggests, it requires modeling both spatial and temporal features. We therefore propose an end-to-end learning framework for an adaptation method that aligns both spatial and temporal features, and we evaluate it on the action localization task. To highlight the potential benefits for autonomous cars, we also construct and benchmark a new dataset of pedestrian actions collected in driving scenes. Then, for a holistic understanding of the scene, we shift our attention from localizing actions to recognizing objects, particularly in city street scenes. We do this by jointly addressing the tasks of object detection and semantic segmentation. While the former localizes individual object instances at the bounding-box level, the latter provides pixel-level distinctions, but only at the category level. We explore a novel observation that connects the two tasks and provide an end-to-end learning framework to exploit this connection.
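To make the idea of aligning spatial and temporal features concrete, the sketch below shows one common way such alignment can be realized: adversarial training of per-stream domain discriminators through a gradient reversal layer. This is only an illustrative sketch under assumed conventions (PyTorch, hypothetical 512-dimensional spatial and temporal features, placeholder names such as `DomainDiscriminator` and `alignment_loss`), not necessarily the framework proposed in this thesis.

```python
# Illustrative sketch only: adversarial alignment of spatial and temporal
# features via a gradient reversal layer (PyTorch); dimensions are hypothetical.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class DomainDiscriminator(nn.Module):
    """Predicts whether a feature comes from the source or the target domain."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feat, lambd=1.0):
        # Gradient reversal pushes the feature extractor to fool the
        # discriminator, encouraging domain-invariant (aligned) features.
        return self.net(GradReverse.apply(feat, lambd))


# Hypothetical setup: spatial (per-frame) and temporal (clip-level) features,
# both 512-dimensional, produced by some backbone not shown here.
spatial_disc = DomainDiscriminator(512)
temporal_disc = DomainDiscriminator(512)
bce = nn.BCEWithLogitsLoss()


def alignment_loss(spatial_feat, temporal_feat, is_source):
    """Domain-classification loss applied to both feature streams."""
    label = torch.full((spatial_feat.size(0), 1), 1.0 if is_source else 0.0)
    return bce(spatial_disc(spatial_feat), label) + bce(temporal_disc(temporal_feat), label)


# Usage: during training, this loss would be added to the supervised loss on
# source data, e.g.
#   total = task_loss + alignment_loss(sp_src, tp_src, True) \
#                     + alignment_loss(sp_tgt, tp_tgt, False)
```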