In many domains, such as artificial intelligence, computer vision, speech, and bioinformatics, feature representation learning is a critical step that facilitates subsequent classification, retrieval, detection, and recommendation tasks. Researchers have been actively working in this area because of its theoretical and practical importance, and feature representation learning has become inseparable from modern machine learning. The general goal of feature representation learning is to generate discriminative features that make information accessible and understandable for downstream tasks. However, finding a proper representation for a given downstream task remains a challenging problem that has attracted considerable interest from researchers. In this thesis, we aim to learn cross-modal feature representations for synthesizing learning targets, and to disentangle the feature representations of a generative model in an unsupervised manner.
First, we address the problem of self-supervised audio spatialization. We propose a cross-modal synthesizer that learns audio-visual feature representations to generate spatial audio in a self-supervised manner. Our model directly extracts audio features from mono spectrograms and concatenates them with visual features in the feature domain to synthesize spatial audio. Furthermore, to improve the realism of the generated audio, we train the synthesizer with an auxiliary classifier.
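To make the pipeline concrete, below is a minimal PyTorch sketch of such a cross-modal synthesizer, assuming a convolutional encoder-decoder over spectrograms with pooled visual features tiled and concatenated at the audio bottleneck. All module names, layer sizes, and the particular form of the auxiliary classifier are illustrative assumptions, not the exact architecture of the thesis.

```python
import torch
import torch.nn as nn

class CrossModalSynthesizer(nn.Module):
    """Sketch: encoder-decoder over mono spectrograms; visual features are
    tiled and concatenated with the audio bottleneck (fusion in the feature
    domain) before decoding a two-channel spatial spectrogram."""

    def __init__(self, visual_dim=512):
        super().__init__()
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256 + visual_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1),
        )

    def forward(self, mono_spec, visual_feat):
        # mono_spec: (B, 1, F, T) mono spectrogram, F and T divisible by 8
        # visual_feat: (B, visual_dim) pooled visual features from video frames
        a = self.audio_encoder(mono_spec)                    # (B, 256, F/8, T/8)
        v = visual_feat[:, :, None, None].expand(-1, -1, a.size(2), a.size(3))
        fused = torch.cat([a, v], dim=1)                     # feature-domain concatenation
        return self.decoder(fused)                           # (B, 2, F, T) spatial spectrogram

class AuxClassifier(nn.Module):
    """One plausible auxiliary classifier (an assumption): it scores whether a
    two-channel spectrogram looks like real recorded spatial audio, providing
    the synthesizer with an extra realism signal during training."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),  # realism logit
        )

    def forward(self, spatial_spec):
        return self.net(spatial_spec)
```

Under these assumptions, the synthesizer would be trained with a reconstruction loss against ground-truth spatial audio plus the classifier's realism term.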
Second, we address learning disentangled representations for image generation. Given an arbitrary pre-trained generative adversarial network, our model learns to discover interpretable directions in the latent space. We propose a learning strategy and architecture that explore directions in the latent space along which semantic attributes of generated images change gradually. We also train our model with a centroid loss that helps it cluster the shifted feature representations in the latent space. Our approach generates images with diverse attributes while preserving visual quality between the synthesized image and the shifted image.
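As an illustration of this second component, the following is a minimal PyTorch sketch of learnable latent directions combined with a centroid loss. It assumes a frozen pre-trained generator G and a feature extractor E; the direction parameterization, the centroid bookkeeping, and all names here are hypothetical, not the exact formulation of the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDirections(nn.Module):
    """A learnable matrix of candidate directions in the latent space of a
    frozen, pre-trained GAN generator. Shifting a code z along direction k by
    magnitude eps should gradually change one semantic attribute."""

    def __init__(self, num_directions=10, latent_dim=512):
        super().__init__()
        self.directions = nn.Parameter(torch.randn(num_directions, latent_dim))

    def forward(self, z, k, eps):
        # z: (B, latent_dim) latent codes, k: (B,) direction indices,
        # eps: (B,) signed shift magnitudes
        d = F.normalize(self.directions[k], dim=1)  # unit-norm direction vectors
        return z + eps[:, None] * d                 # shifted latent codes

def centroid_loss(shifted_feats, k, centroids):
    """Pull each shifted sample's feature representation toward the centroid
    of its direction, clustering shifted representations so that each
    direction remains semantically consistent."""
    # centroids: (num_directions, feat_dim), e.g. detached running means
    return F.mse_loss(shifted_feats, centroids[k])

# Usage sketch (G: frozen generator, E: feature extractor, both assumed):
#   z = torch.randn(8, 512)
#   k = torch.randint(0, 10, (8,))
#   eps = torch.empty(8).uniform_(-3.0, 3.0)
#   model = LatentDirections()
#   z_shift = model(z, k, eps)
#   loss = centroid_loss(E(G(z_shift)), k, centroids)
```

In this reading, the centroid loss complements the direction-discovery objective: directions are optimized so that images shifted along the same direction map to nearby points in feature space.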