UMSS

Multi-modal semantic segmentation (MSS) is essential for robust perception in complex environments, yet its potential remains largely untapped due to the prohibitive cost of human annotations. While unsupervised semantic segmentation (USS) has seen success on the single RGB modality, its naive extension to multi-modal data is hampered by fusion degradation. In the absence of explicit supervision, existing frameworks struggle to reconcile the heterogeneous structural patterns captured by different sensors and therefore fail to exploit their complementary information. In this paper, we make the first attempt to address the novel problem of Unsupervised Multi-modal Semantic Segmentation (UMSS), which aims to exploit complementary sensor information in a fully label-free setting. To this end, we propose UniM2 (Unified Multi-Modal), a novel framework built upon DINOv3 that turns the degradation seen with conventional fusion methods into consistent performance gains. Our key idea is to learn a unified latent space driven by Cross-modal Correspondence Synergy (CMCS) to extract intrinsic shared semantic cues, bypassing the need for label-guided adaptive fusion. To mitigate inherent inter-modal conflicts, we introduce a Cross-modal Harmonizer (CMH) that designates RGB as a stable reference, suppressing inconsistent relational supervision while guiding the model to exploit complementary structural features. Extensive experimental results on NYU-Depth-v2 and MFNet show that UniM2 improves mIoU by 6.4% and 9.8%, respectively, demonstrating clear advantages over existing frameworks on the UMSS task.

Core Question

Can heterogeneous sensor modalities be fused effectively for semantic segmentation without using any human annotations?

UMSS starts from a practical bottleneck: multi-modal semantic segmentation is valuable in complex scenes, but dense human annotations are expensive to obtain. Existing supervised fusion modules learn when to trust RGB, depth, thermal, or polarization cues from labels; once those labels disappear, the same fusion strategy can amplify inter-modal conflicts instead of extracting complementary structure.

Our preliminary analysis shows that naively importing conventional image-level, feature-level, or supervised fusion into unsupervised segmentation often degrades performance below the RGB-only baseline. UniM2 is designed to answer this question by learning a unified latent space from cross-modal correspondences, while filtering unreliable relations before they become harmful supervision.

Figure 1. Without labels, conventional multimodal fusion can introduce cross-sensor conflicts and even underperform the RGB-only baseline.

Framework Summary

UniM2 builds a shared representation space from multimodal DINOv3 features. Cross-modal Correspondence Synergy extracts shared semantic cues, while the Cross-modal Harmonizer uses RGB as a stable reference to suppress contradictory relational supervision.

Component 01

Cross-modal Correspondence Synergy

CMCS drives the unified latent space to preserve shared semantic relations across RGB, depth, thermal, NIR, AoLP, and DoLP modalities.
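One way to picture "shared semantic relations" is as patch-to-patch similarity matrices computed separately for each modality and then compared. The sketch below is only an illustration under that assumption and is not UniM2's actual objective: the function names (`correspondence_matrix`, `correspondence_gap`) are hypothetical, and the toy 2-D vectors stand in for DINOv3 patch features.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors stay zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def correspondence_matrix(feats_a, feats_b):
    """Cosine similarity between every patch in feats_a and every patch in feats_b."""
    a = [l2_normalize(f) for f in feats_a]
    b = [l2_normalize(f) for f in feats_b]
    return [[sum(x * y for x, y in zip(fa, fb)) for fb in b] for fa in a]


def correspondence_gap(rgb_feats, aux_feats):
    """Mean absolute difference between the two modalities' relation matrices.

    Hypothetical measure: 0.0 means both modalities relate the same patches
    in the same way; larger values mean their relational structure disagrees.
    """
    rel_rgb = correspondence_matrix(rgb_feats, rgb_feats)
    rel_aux = correspondence_matrix(aux_feats, aux_feats)
    n = len(rel_rgb)
    total = sum(
        abs(rel_rgb[i][j] - rel_aux[i][j]) for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy patch features: two patches per modality, two channels each.
rgb = [[1.0, 0.0], [0.0, 1.0]]
depth_consistent = [[2.0, 0.0], [0.0, 3.0]]   # same relations, different scale
depth_conflicting = [[1.0, 0.0], [1.0, 0.0]]  # treats both patches as identical

gap_consistent = correspondence_gap(rgb, depth_consistent)
gap_conflicting = correspondence_gap(rgb, depth_conflicting)
```

Because the features are normalized before comparison, the gap depends on how each modality relates patches to one another, not on the raw feature scale, which differs across sensors.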

Component 02

Cross-modal Harmonizer

CMH reduces conflicts from heterogeneous sensors by filtering inconsistent cross-modal relational supervision and preserving complementary structure.
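The filtering idea can be sketched as a mask over relation entries: an auxiliary-modality relation is kept as supervision only where it roughly agrees with the RGB reference. This is a minimal illustration, not the CMH module itself. The threshold rule, the `tolerance` value, and the function names are all assumptions introduced here.

```python
def consistency_mask(rel_rgb, rel_aux, tolerance=0.3):
    """Mark relation entries where the auxiliary modality agrees with RGB.

    `tolerance` is a hypothetical hyperparameter: entries whose absolute
    difference from the RGB reference exceeds it are treated as conflicting.
    """
    return [
        [abs(r - a) <= tolerance for r, a in zip(row_rgb, row_aux)]
        for row_rgb, row_aux in zip(rel_rgb, rel_aux)
    ]


def masked_relation_loss(rel_pred, rel_target, mask):
    """Mean squared error over kept entries only; 0.0 if nothing is kept."""
    terms = [
        (p - t) ** 2
        for row_p, row_t, row_m in zip(rel_pred, rel_target, mask)
        for p, t, keep in zip(row_p, row_t, row_m)
        if keep
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Toy relation matrices for two patches.
rel_rgb = [[1.0, 0.1], [0.1, 1.0]]
rel_thermal = [[1.0, 0.9], [0.2, 1.0]]  # entry [0][1] conflicts with RGB
rel_pred = [[1.0, 0.0], [0.0, 1.0]]     # relations predicted by the model

mask = consistency_mask(rel_rgb, rel_thermal)
loss = masked_relation_loss(rel_pred, rel_thermal, mask)
```

In this toy case the conflicting thermal entry is excluded from the loss, while the entries that agree with RGB still contribute their complementary signal.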

UniM2 Pipeline

01. We define the UMSS task for fully label-free multimodal semantic segmentation.

02. UniM2 turns multimodal fusion degradation into consistent gains through unified latent learning.

03. The framework scales across NYU-Depth-v2, MFNet, and MCubeS with multiple auxiliary modalities.

04. We release simplified code, prepared datasets, checkpoints, and training diagnostics for reproducibility.

NYU-Depth-v2

RGB + HHA/depth

Dataset · Base

MFNet

RGB + thermal

Dataset · Small · Base

MCubeS

RGB + NIR + AoLP + DoLP

Dataset · IA · ID · IN · IND · INAD

We recommend using Weights & Biases to monitor the contrastive loss and validation mIoU curves during hyperparameter search and training.
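For reference, validation mIoU can be computed from flat lists of predicted and ground-truth class ids as sketched below. This is a generic implementation, not the project's evaluation script: the `ignore_index` default of 255 is an assumption, and the sketch assumes predicted cluster ids have already been matched to ground-truth classes (unsupervised methods commonly do this with Hungarian matching before scoring).

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union across classes that appear at least once.

    pred, target: flat sequences of integer class ids of equal length.
    Target ids are assumed to be in [0, num_classes) or equal to ignore_index.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel, excluded from the score
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1]
target = [0, 1, 1, 255]
score = mean_iou(pred, target, num_classes=2)

# With Weights & Biases installed, the value could be logged per step, e.g.
#   wandb.log({"val/mIoU": score, "train/contrastive_loss": loss_value})
# The metric key names and `loss_value` are hypothetical placeholders.
```

Classes that never appear in either the prediction or the ground truth are skipped rather than counted as zero, so the mean is not dragged down on small validation splits.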
