The goal of semantic segmentation is to assign a semantic category to each pixel in an image. It is one of the most important tasks in computer vision, with a wide range of applications such as image editing and scene understanding. Recently, methods based on deep convolutional neural networks (CNNs) have achieved significant progress in semantic segmentation. However, such approaches rely on supervised learning that requires pixel-wise annotations, which take extensive effort and time to collect. To reduce the effort of annotating pixel-wise ground truth labels, numerous weakly-supervised methods have been proposed that use various types of labels, such as image-level, bounding box, point-level, and scribble-based labels. In this thesis, we focus on image-level labels, which can be obtained with little effort yet pose a more challenging case under the weakly-supervised setting.
Existing weakly-supervised semantic segmentation methods that use image-level annotations typically rely on initial response maps to locate object regions. However, response maps generated by a classification network usually focus on the most discriminative object parts, because the network does not need the entire object to optimize its objective function. To address this issue, we improve the generated response map by encouraging the network to attend to other parts of an object via self-regularization techniques. First, we apply mixup data augmentation to calibrate the model's uncertainty on overconfident predictions, which enables the model to attend to more object regions. Second, we introduce a self-supervised task that discovers sub-categories in an unsupervised manner. By imposing a more challenging task, the model learns better representations, thereby improving the response map. With these two self-regularization methods, the initial responses are more complete and more balanced across object regions, which facilitates the later steps of weakly-supervised semantic segmentation, i.e., response refinement and segmentation model training.
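As an illustration of the first technique, the following is a minimal sketch of standard mixup applied to a batch with multi-hot image-level labels. It is not the implementation used in this thesis: the function name `mixup_batch`, the default `alpha`, and the use of a single mixing coefficient per batch are assumptions made for the example.

```python
import numpy as np


def mixup_batch(images, labels, alpha=1.0, rng=None):
    """Blend each sample with a randomly chosen partner from the same batch.

    images: float array of shape (N, ...), e.g. (N, C, H, W).
    labels: float array of shape (N, K), multi-hot image-level labels.
    alpha:  Beta distribution parameter (assumed value, not from the thesis).
    """
    rng = np.random.default_rng() if rng is None else rng

    # One mixing coefficient for the whole batch, drawn from Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha)

    # Pair every sample with another sample via a random permutation.
    perm = rng.permutation(len(images))

    # Convex combination of inputs and of labels with the same coefficient.
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels, lam
```

Because the mixed labels are soft targets in [0, 1] rather than hard 0/1 values, training against them discourages overconfident predictions, which is the calibration effect referred to above.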