Visual enhancement aims to improve the visual quality and viewing experience of images and videos. The area has attracted sustained research interest for both its theoretical and practical value. However, high visual quality often comes at the cost of computational efficiency. With the growth of mobile applications and cloud services, it is crucial to develop effective and efficient algorithms for generating visually attractive images and videos. In this thesis, we address visual enhancement in three domains: spatial, temporal, and joint spatial-temporal. We propose efficient algorithms based on deep convolutional neural networks for solving a range of visual enhancement problems.
First, we address spatial enhancement for single-image super-resolution. We propose a deep Laplacian Pyramid Network that reconstructs a high-resolution image from a low-resolution (LR) input in a coarse-to-fine manner. Our model extracts features directly from the LR input and progressively reconstructs the sub-band residuals. We train the proposed model with multi-scale training, deep supervision, and robust loss functions to achieve state-of-the-art performance. Furthermore, we exploit recursive learning to share parameters across and within pyramid levels, which significantly reduces the number of model parameters. As most operations are performed in the low-resolution space, our model requires less memory and runs faster than state-of-the-art methods.
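To make the coarse-to-fine idea concrete, the following is a minimal sketch (not the implementation from this thesis) of a pyramid network that upsamples features level by level, predicts a sub-band residual at each level, and reuses a single shared block across levels to illustrate recursive parameter sharing; all layer names and sizes are illustrative assumptions.

```python
# Minimal sketch of coarse-to-fine super-resolution with sub-band residual
# prediction and parameter sharing across pyramid levels (illustrative only).
import torch
import torch.nn as nn

class PyramidSRSketch(nn.Module):
    def __init__(self, channels=64, levels=2):
        super().__init__()
        self.levels = levels
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        # One shared feature block reused at every level (recursive sharing).
        self.shared_block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.feat_up = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.to_residual = nn.Conv2d(channels, 3, 3, padding=1)
        self.img_up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, lr):
        feat = self.head(lr)
        img = lr
        outputs = []
        for _ in range(self.levels):
            feat = self.shared_block(feat)      # same weights at every level
            feat = self.feat_up(feat)           # upsample features by 2x
            residual = self.to_residual(feat)   # predict sub-band residual
            img = self.img_up(img) + residual   # coarse upsample + residual
            outputs.append(img)                 # deep supervision at each level
        return outputs

x = torch.rand(1, 3, 32, 32)
sr_2x, sr_4x = PyramidSRSketch()(x)
print(sr_2x.shape, sr_4x.shape)  # (1, 3, 64, 64), (1, 3, 128, 128)
```

Because the feature extraction runs on the LR grid and only the residual prediction touches the upsampled resolution, most computation stays in the low-resolution space, which is the source of the memory and speed advantage described above.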
Second, we address the temporal enhancement problem by learning temporal consistency in videos. Given an input video and a per-frame processed video (produced by an existing image-based algorithm), we learn a recurrent network that reduces temporal flickering and generates a temporally consistent video. We train the proposed network by minimizing short-term and long-term temporal losses as well as a perceptual loss, striking a balance between temporal coherence and perceptual similarity to the processed frames. At test time, our model does not require optical flow and thus runs at over 400 FPS on a GPU for high-resolution videos. Our model is task-independent: a single model handles multiple, even unseen, tasks, including artistic style transfer, enhancement, colorization, image-to-image translation, and intrinsic image decomposition.
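The temporal losses can be illustrated with a small sketch. The code below assumes that, at training time, a backward optical flow and an occlusion/confidence mask between consecutive frames are available; the warping routine and all names are illustrative and not the exact formulation used in the thesis. The long-term loss takes the same form but compares frame t with a distant earlier frame (e.g., the first frame of the sequence).

```python
# Sketch of a warping-based short-term temporal loss (illustrative only).
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (N,C,H,W) with optical flow (N,2,H,W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)  # (2,H,W)
    coords = grid.unsqueeze(0) + flow                             # add displacement
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)         # (N,H,W,2)
    return F.grid_sample(frame, grid_norm, align_corners=True)

def short_term_temporal_loss(out_t, out_prev, flow, mask):
    """Penalize flicker: the output at time t should match the warped previous
    output wherever the flow is reliable (mask close to 1)."""
    warped_prev = warp(out_prev, flow)
    return (mask * (out_t - warped_prev).abs()).mean()

out_t = torch.rand(1, 3, 64, 64)
out_prev = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)   # zero flow: a static scene
mask = torch.ones(1, 1, 64, 64)
print(short_term_temporal_loss(out_t, out_prev, flow, mask))
```

Note that the flow and mask are only needed to evaluate the loss during training; at test time the recurrent network processes frames directly, which is why inference does not require optical flow.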
Third, we address the spatial-temporal enhancement problem of video stitching. Inspired by pushbroom cameras, we cast video stitching as a spatial interpolation problem. We propose a pushbroom stitching network that learns dense flow fields to smoothly align the input videos, and the stitched videos are generated by an efficient pushbroom interpolation layer. Our approach produces more temporally stable and visually pleasing results than existing video stitching approaches and commercial software. Furthermore, our algorithm has immediate applications in areas such as virtual reality, immersive telepresence, autonomous driving, and video surveillance.
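As a rough illustration of the pushbroom idea, the sketch below performs a column-wise interpolation over the overlap region of two views that are assumed to be already aligned; in the proposed method the alignment comes from learned dense flow fields and the column-dependent interpolation acts on those flows rather than on pixel intensities. All names are illustrative.

```python
# Highly simplified sketch of column-wise (pushbroom-style) interpolation
# between two pre-aligned views (illustrative only).
import torch

def pushbroom_blend(left_view, right_view):
    """Column-wise interpolation of two pre-aligned views (N,C,H,W)."""
    n, c, h, w = left_view.shape
    # The weight ramps from 0 at the leftmost column to 1 at the rightmost,
    # mimicking a pushbroom sensor sweeping across the overlap region.
    alpha = torch.linspace(0.0, 1.0, w, device=left_view.device).view(1, 1, 1, w)
    return (1.0 - alpha) * left_view + alpha * right_view

left = torch.rand(1, 3, 240, 320)
right = torch.rand(1, 3, 240, 320)
stitched = pushbroom_blend(left, right)
print(stitched.shape)  # torch.Size([1, 3, 240, 320])
```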