Recent years have witnessed the success of deep learning models such as convolutional neural networks (ConvNets) on numerous vision tasks. However, ConvNets have a significant limitation: they lack effective internal structures to explicitly learn pairwise relations between image pixels. This yields two fundamental bottlenecks for many vision problems involving label or map regression, as well as image reconstruction: (a) the pixels of an image contain a large amount of redundancy that ConvNets cannot exploit efficiently, because they predict each pixel independently, and (b) the convolution operation cannot effectively solve problems that rely on the similarities of pixel pairs, e.g., image pixel propagation and shape/mask refinement.
This thesis focuses on how to learn pairwise relations between image pixels within jointly trained, end-to-end learnable neural networks. Specifically, this is achieved through two different approaches: (a) formulating the conditional random field (CRF) objective as a non-structured objective that can be implemented in ConvNets as an additional loss, and (b) developing spatial-propagation-based structures, amenable to deep learning, that learn the pairwise relations explicitly.
In the first approach, we develop a novel multi-objective learning method that optimizes a single unified deep convolutional network with two distinct non-structured loss functions: one encoding the unary label likelihoods and the other encoding the pairwise label dependencies. We apply this framework to face parsing; experiments on both the LFW and Helen datasets demonstrate that the additional pairwise loss significantly improves labeling performance compared to a single-loss ConvNet with the same architecture.
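The following is a minimal PyTorch sketch of this idea, not the thesis' exact formulation: the pairwise term here is an illustrative stand-in that rewards agreement between neighboring predictions where the ground-truth labels agree and disagreement where they do not, and the names `multi_objective_loss` and `pairwise_weight` are hypothetical.

```python
import torch
import torch.nn.functional as F

def multi_objective_loss(logits, labels, pairwise_weight=0.5):
    """Combine a unary per-pixel loss with a simplified pairwise loss.

    logits: (N, C, H, W) raw network outputs.
    labels: (N, H, W) integer ground-truth labels.
    """
    # Unary term: standard per-pixel cross-entropy.
    unary = F.cross_entropy(logits, labels)

    probs = F.softmax(logits, dim=1)

    def pairwise_term(shift_dim):
        # Compare each pixel with its neighbor along one spatial axis.
        n = probs.size(shift_dim)
        p0 = probs.narrow(shift_dim, 0, n - 1)
        p1 = probs.narrow(shift_dim, 1, n - 1)
        m = labels.size(shift_dim - 1)
        l0 = labels.narrow(shift_dim - 1, 0, m - 1)
        l1 = labels.narrow(shift_dim - 1, 1, m - 1)
        same = (l0 == l1).float()         # 1 where neighbors share a label
        agree = (p0 * p1).sum(dim=1)      # predicted probability of agreement
        # Penalize disagreement on same-label pairs, agreement across edges.
        return (same * (1 - agree) + (1 - same) * agree).mean()

    pairwise = pairwise_term(2) + pairwise_term(3)  # vertical + horizontal
    return unary + pairwise_weight * pairwise
```

Because both terms are non-structured, the combined objective can be minimized with ordinary stochastic gradient descent, without any structured inference inside the network.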
In the second approach, we explore how to learn pairwise relations using spatial propagation networks instead of additional loss functions. Unlike a ConvNet, the propagation module is a spatially recurrent network with a linear transformation between adjacent rows or columns. We propose two typical structures: a one-way connection using one-dimensional propagation, and a three-way connection using two-dimensional propagation. For both models, the linear weights are spatially variant maps that can be output by any ConvNet. Since such modules are fully differentiable, they are flexible enough to be inserted into any type of neural network. We prove that while both structures can formulate global affinities, the one-way connection constructs a sparse affinity matrix, whereas the three-way connection forms a much denser one. While both structures demonstrate their effectiveness over a wide range of vision problems, the three-way connection is more powerful on challenging tasks (e.g., general object segmentation). We show that a well-learned affinity can benefit numerous computer vision applications, including but not limited to image filtering and denoising, pixel/color interpolation, face parsing, and general semantic segmentation. Compared to graphical-model-based pairwise learning, the spatial propagation network is a good alternative within deep-learning-based frameworks.
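To make the recurrence concrete, below is a minimal PyTorch sketch of a single left-to-right pass of the one-way connection. The function name, the tensor shapes, and restricting the gates to [0, 1] are simplifying assumptions: a full module would run four directional passes, merge them, and normalize the weights for stability, with the gate maps predicted by a separate guidance ConvNet.

```python
import torch

def one_way_propagation(x, gates):
    """One-dimensional (left-to-right) linear propagation across columns.

    x:     (N, C, H, W) feature map to be refined.
    gates: (N, C, H, W) spatially variant propagation weights in [0, 1],
           assumed to come from a guidance ConvNet (e.g., via a sigmoid).
    """
    cols = [x[..., 0]]  # the first column seeds the recurrence
    for t in range(1, x.size(-1)):
        g = gates[..., t]
        # Linear recurrence between adjacent columns:
        #   h_t = (1 - g_t) * x_t + g_t * h_{t-1}
        cols.append((1.0 - g) * x[..., t] + g * cols[-1])
    return torch.stack(cols, dim=-1)
```

The three-way connection would instead let each pixel combine three pixels from the previous column (the rows above, at, and below its position), which is what yields the denser global affinity matrix.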