Semantic segmentation is one of the fundamental and challenging problems in computer vision, with a wide range of applications such as autonomous driving and image editing. The goal of semantic segmentation is to assign a semantic class to each pixel in an image. With the recent success of Convolutional Neural Networks (CNNs), most existing works model the task as pixel-wise classification, treating each pixel as an independent prediction. However, pixels are semantically correlated, e.g., car pixels are more likely to neighbor road pixels than sky pixels, and this structured context information is not fully exploited in these works. In this thesis, we aim to design algorithms for semantic segmentation with the help of structured context information.
First, we address the problem of scene parsing (semantic segmentation with rich scene classes) with the help of global context. We propose a convolutional neural network to predict a global context feature embedding of an image, e.g., beach, city, or indoor. With this global context feature, we design a fully convolutional network that learns from global contexts and priors. We show that the proposed method can eliminate false positives that are incompatible with the global context representation.
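As an illustrative sketch only, and not the exact architecture of this thesis, the following PyTorch-style code shows one way a pooled global context vector can be tiled over the spatial grid and fused with per-pixel features before classification; the module and parameter names (GlobalContextFusion, ctx_dim, etc.) are assumptions made for illustration.

```python
# A minimal sketch (not the thesis implementation) of fusing a global context
# embedding with per-pixel features for scene parsing.
import torch
import torch.nn as nn

class GlobalContextFusion(nn.Module):
    def __init__(self, feat_dim=256, ctx_dim=64, num_classes=21):
        super().__init__()
        # Global branch: pool the feature map into a single context vector
        # (standing in for a scene-level embedding such as beach/city/indoor).
        self.ctx_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(feat_dim, ctx_dim, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # Per-pixel classifier on the concatenated local + global features.
        self.classifier = nn.Conv2d(feat_dim + ctx_dim, num_classes, kernel_size=1)

    def forward(self, feat):                       # feat: (B, feat_dim, H, W)
        ctx = self.ctx_head(feat)                  # (B, ctx_dim, 1, 1)
        ctx = ctx.expand(-1, -1, feat.size(2), feat.size(3))   # tile over space
        return self.classifier(torch.cat([feat, ctx], dim=1))  # (B, C, H, W)

# Usage (hypothetical backbone features of shape (B, 256, H, W)):
# scores = GlobalContextFusion()(backbone_features)
```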
Second, we aim to exploit structured contexts under semi-supervised semantic segmentation, where only part of the training images are labeled. Inspired by Generative Adversarial Networks (GANs), we propose an adversarial learning method to leverage the ground truth contexts as well as the unlabeled data. We design a fully convolutional discriminator to learn the context differences between the model prediction and the ground truth, i.e., the difference in the structured configuration of classes.
With the proposed adversarial training, our model learns to minimize this context gap using both labeled and unlabeled data. In addition, the fully convolutional discriminator enables semi-supervised learning by discovering trustworthy regions in the predictions on unlabeled images, which provide additional supervisory signals. We show that the proposed method outperforms purely supervised and baseline semi-supervised approaches when using the same amount of ground truth data.
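The following is a minimal sketch, assuming a PyTorch-style segmentation network and a fully convolutional discriminator, of how a per-pixel confidence map from the discriminator can gate pseudo-labels on unlabeled images; the names seg_net, discriminator, and the threshold tau are illustrative assumptions rather than the thesis code.

```python
# A minimal sketch of the semi-supervised term: only pixels the discriminator
# deems "ground-truth-like" contribute a pseudo-label loss on unlabeled images.
import torch
import torch.nn.functional as F

def semi_supervised_loss(seg_net, discriminator, unlabeled_images, tau=0.2):
    logits = seg_net(unlabeled_images)                 # (B, C, H, W)
    probs = torch.softmax(logits, dim=1)
    # Assumed: discriminator maps a label-probability map to per-pixel logits
    # of how ground-truth-like it looks, shape (B, 1, H, W).
    confidence = torch.sigmoid(discriminator(probs))
    pseudo_labels = probs.argmax(dim=1)                # (B, H, W)
    mask = confidence.squeeze(1) > tau                 # trusted regions only
    if mask.sum() == 0:
        return logits.new_zeros(())
    ce = F.cross_entropy(logits, pseudo_labels, reduction='none')  # (B, H, W)
    return (ce * mask.float()).sum() / mask.float().sum()
```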
Third, we extend the adversarial learning to address the problem of domain adaptation, where there is a significant gap between the training and test datasets, e.g., synthetic versus real data. While most existing works apply adversarial learning to perform feature alignment, we propose a multi-level adversarial learning scheme that operates directly on the structured output space. We show that the proposed method performs favorably against existing feature alignment methods by exploiting the structured information directly in the output space.
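A minimal sketch, under assumed names, of the output-space adversarial term: the segmentation network is updated so that its softmax output on target-domain images fools a per-pixel discriminator trained to tell source outputs from target outputs. In the multi-level variant, the same loss would also be applied to an auxiliary output from an intermediate feature level.

```python
# A minimal sketch (illustrative names, not the thesis implementation) of
# output-space adversarial alignment for the segmentation network update.
import torch
import torch.nn.functional as F

def output_space_adv_loss(seg_net, discriminator, target_images):
    probs = torch.softmax(seg_net(target_images), dim=1)   # structured output
    d_out = discriminator(probs)                            # (B, 1, H, W) logits
    # Label every pixel as "source" (1) so gradients push the target-domain
    # output toward a source-like structured layout.
    source_label = torch.ones_like(d_out)
    return F.binary_cross_entropy_with_logits(d_out, source_label)
```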
Lastly, we extend the research to reason about an object's structural context. Specifically, we propose a self-supervised deep learning approach for part segmentation, in which we devise several loss functions that encourage part segments to be geometrically concentrated, robust to object variations, and semantically consistent across different object instances. Extensive experiments on different types of image collections demonstrate that our approach produces part segments that adhere to object boundaries and are more semantically consistent across object instances than existing self-supervised techniques.
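As a hedged illustration of one such loss, and not necessarily the exact formulation used in the thesis, a geometric concentration term can penalize the spread of each part's soft assignment around its spatial centroid:

```python
# A minimal sketch of a geometric concentration loss: pixels assigned to a
# part should cluster around that part's centroid. Shapes and weighting are
# illustrative assumptions.
import torch

def concentration_loss(part_probs):
    # part_probs: (B, K, H, W) soft part assignments (softmax over K parts).
    B, K, H, W = part_probs.shape
    ys = torch.linspace(0, 1, H, device=part_probs.device).view(1, 1, H, 1)
    xs = torch.linspace(0, 1, W, device=part_probs.device).view(1, 1, 1, W)
    mass = part_probs.sum(dim=(2, 3)).clamp(min=1e-6)       # (B, K)
    cy = (part_probs * ys).sum(dim=(2, 3)) / mass           # part centroids
    cx = (part_probs * xs).sum(dim=(2, 3)) / mass
    # Squared distance of each pixel to its part centroid, weighted by the
    # pixel's assignment probability to that part.
    dy = ys - cy.view(B, K, 1, 1)
    dx = xs - cx.view(B, K, 1, 1)
    return ((dy ** 2 + dx ** 2) * part_probs).sum(dim=(2, 3)).mean()
```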