Abstract

In this paper, we tackle the problem of unsupervised domain adaptation for semantic segmentation, where we attempt to transfer the knowledge learned from synthetic datasets with ground-truth labels to real-world images without any annotation. Based on the hypothesis that the structural content of images is the most informative and decisive factor for semantic segmentation and can be readily shared across domains, we propose a Domain Invariant Structure Extraction (DISE) framework that disentangles images into domain-invariant structure and domain-specific texture representations. This disentanglement further enables image translation across domains and label transfer, both of which improve segmentation performance. Extensive experiments verify the effectiveness of our proposed DISE model and demonstrate its superiority over several state-of-the-art approaches.

Introduction Video

CVPR 2019 Paper

@inproceedings{chang2019all,
 title={All about Structure: Adapting Structural Information across Domains for Boosting Semantic Segmentation},
 author={Chang, Wei-Lun and Wang, Hui-Po and Peng, Wen-Hsiao and Chiu, Wei-Chen},
 booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 year={2019}
}

Method


An overview of the proposed Domain Invariant Structure Extraction (DISE) framework for semantic segmentation. The framework consists of a common encoder $E_c$ shared across domains, two domain-specific private encoders $E_p^s$ and $E_p^t$, a pixel-wise classifier $T$, and a shared decoder $D$. It encodes an image from either the source or the target domain into a domain-specific texture component $z_p$ and a domain-invariant structure component $z_c$, as shown in part (a). With this disentanglement, it can translate an image $x^s$ (respectively, $x^t$) from one domain into an image $\hat{x}^{s2t}$ (respectively, $\hat{x}^{t2s}$) in the other domain by combining the structure content of $x^s$ (respectively, $x^t$) with the texture appearance of $x^t$ (respectively, $x^s$), as shown in parts (b) and (c). This in turn enables the transfer of ground-truth labels from the source domain to the target domain, as illustrated in part (d).
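The data flow of parts (a), (b), and (d) can be illustrated with a deliberately simplified toy. This is not the DISE implementation: the learned networks $E_c$, $E_p$, and $D$ are replaced here by hand-written stand-ins that treat normalized pixel intensities as "structure" and global mean and standard deviation as "texture". All function names and the flattened toy images are hypothetical and chosen for illustration only.

```python
from statistics import mean, pstdev

# Toy stand-ins (assumptions, NOT the learned DISE networks):
#   structure z_c = pixels normalized to zero mean / unit std
#   texture   z_p = global (mean, std) of the image

def encode_structure(image):
    """Hypothetical stand-in for the common encoder E_c."""
    mu = mean(image)
    sigma = pstdev(image) or 1.0  # avoid division by zero on flat images
    return [(p - mu) / sigma for p in image]

def encode_texture(image):
    """Hypothetical stand-in for a private encoder E_p^s or E_p^t."""
    return mean(image), (pstdev(image) or 1.0)

def decode(z_c, z_p):
    """Hypothetical stand-in for the shared decoder D."""
    mu, sigma = z_p
    return [c * sigma + mu for c in z_c]

def translate_s2t(x_s, x_t, y_s):
    """Combine the structure of x_s with the texture of x_t.

    Because the structure (and hence the layout) comes from x_s,
    the source label map y_s is reused for the translated image.
    """
    x_s2t = decode(encode_structure(x_s), encode_texture(x_t))
    return x_s2t, y_s

# Flattened toy "images" and a per-pixel label map for the source.
x_s = [0.0, 2.0, 4.0, 6.0]
y_s = [0, 0, 1, 1]
x_t = [10.0, 10.0, 30.0, 30.0]

x_s2t, y_s2t = translate_s2t(x_s, x_t, y_s)
x_s_rec = decode(encode_structure(x_s), encode_texture(x_s))
```

In this toy, `x_s2t` takes on the intensity statistics of `x_t` while keeping the pixel ordering of `x_s`, which is why `y_s` can serve as its label. In the actual framework these roles are played by trained encoders and a trained decoder, so the analogy holds only at the level of data flow.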

Example Results (from GTA5 to Cityscapes)

Qualitative comparisons of segmentation results: Source Only vs. DISE (ours), and Tsai et al. vs. DISE (ours).