We propose Modality-Adaptive Reconstruction (MARs), a unified framework that enables robust learning from misaligned PAN-MS image pairs by dynamically generating both HRMS and PAN images.
We introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel alignment mechanism that adaptively refines textures and structures between PAN and MS images, improving spatial-spectral consistency.
We achieve state-of-the-art (SOTA) performance across multiple benchmark datasets and show strong robustness on unseen satellite datasets, demonstrating the effectiveness of our PAN-Crafter in handling real-world cross-modality misalignment.
Example of PAN, LRMS, and their overlaid visualization. PAN images typically have four times the spatial resolution of MS images, necessitating up-sampling of the MS images before fusion. However, this up-sampling step introduces interpolation artifacts and spatial shifts, further amplifying alignment discrepancies.
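As a toy illustration of this up-sampling step (not code from the paper), the sketch below resizes a single MS band by 4x in NumPy. The half-pixel coordinate convention used here is one common choice; a different convention (e.g. corner-aligned) shifts every sample by a sub-pixel offset, which is exactly the kind of alignment discrepancy the caption refers to:

```python
import numpy as np

def upsample_nearest(ms_band: np.ndarray, scale: int = 4) -> np.ndarray:
    """Zero-order 4x up-sampling of one MS band (blocky, no interpolation)."""
    return np.kron(ms_band, np.ones((scale, scale)))

def upsample_bilinear(ms_band: np.ndarray, scale: int = 4) -> np.ndarray:
    """Bilinear 4x up-sampling using the half-pixel coordinate convention."""
    h, w = ms_band.shape
    H, W = h * scale, w * scale
    # Map each HR pixel centre back to LR coordinates (half-pixel convention).
    ys = (np.arange(H) + 0.5) / scale - 0.5
    xs = (np.arange(W) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]   # vertical blend weights
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]   # horizontal blend weights
    tl = ms_band[np.ix_(y0, x0)]       # top-left neighbours
    tr = ms_band[np.ix_(y0, x0 + 1)]   # top-right neighbours
    bl = ms_band[np.ix_(y0 + 1, x0)]   # bottom-left neighbours
    br = ms_band[np.ix_(y0 + 1, x0 + 1)]
    return (1 - wy) * ((1 - wx) * tl + wx * tr) + wy * ((1 - wx) * bl + wx * br)

lr = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 MS band
hr_nn = upsample_nearest(lr)                    # 16x16, blocky
hr_bl = upsample_bilinear(lr)                   # 16x16, smoothed edges
```

Bilinear interpolation blurs edges that the PAN image resolves sharply, so even a perfectly registered pair acquires local mismatches after this step.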
(a) High-resolution PAN image.
(b) Low-resolution multi-spectral image.
(c) Overlay of PAN and LRMS images to highlight differences.
Overview of the proposed PAN-Crafter architecture. The network processes the input PAN and LRMS images using the Modality-Adaptive Reconstruction (MARs) mode, which enables adaptive generation of HRMS and PAN outputs. By exploiting the spatial structures of PAN images and the spectral fidelity of MS images, MARs preserves high-frequency details while minimizing spectral distortion. The architecture follows an encoder-decoder design, incorporating residual blocks and Cross-Modality Alignment-Aware Attention (CM3A) at multiple scales to mitigate modality misalignment while preserving the spectral fidelity of MS images and the structural fidelity of PAN images.
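The mode-adaptive idea can be sketched minimally in NumPy. This is purely illustrative (the class name, layer shapes, and mode labels are our own assumptions, not the paper's implementation): a shared backbone maps the concatenated PAN and up-sampled MS inputs to features, and a mode flag selects which output head, HRMS or PAN, reconstructs from them:

```python
import numpy as np

rng = np.random.default_rng(0)

class MARsSketch:
    """Toy mode-adaptive reconstructor: shared backbone, two output heads.
    All names and shapes are illustrative, not the paper's network."""
    def __init__(self, c_ms: int = 4, feat: int = 8):
        c_in = c_ms + 1                                     # MS bands + 1 PAN band
        self.w_backbone = rng.normal(size=(c_in, feat)) * 0.1
        self.w_hrms = rng.normal(size=(feat, c_ms)) * 0.1   # HRMS output head
        self.w_pan = rng.normal(size=(feat, 1)) * 0.1       # PAN output head

    def __call__(self, pan: np.ndarray, ms_up: np.ndarray, mode: str) -> np.ndarray:
        x = np.concatenate([ms_up, pan], axis=-1)           # (H, W, c_ms+1)
        h = np.maximum(x @ self.w_backbone, 0.0)            # shared features (ReLU)
        if mode == "hrms":                                  # reconstruct HRMS
            return h @ self.w_hrms
        return h @ self.w_pan                               # back-reconstruct PAN

net = MARsSketch()
pan = rng.random((16, 16, 1))
ms_up = rng.random((16, 16, 4))          # LRMS already up-sampled to the PAN grid
hrms = net(pan, ms_up, mode="hrms")      # (16, 16, 4)
pan_rec = net(pan, ms_up, mode="pan")    # (16, 16, 1)
```

Sharing the backbone across both reconstruction targets is what lets the network treat PAN back-reconstruction as an auxiliary signal for the HRMS task.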
Cross-Modality Alignment-Aware Attention (CM3A). CM3A enables bidirectional alignment by transferring MS texture to PAN structure during HRMS reconstruction, and PAN structure to MS texture during PAN back-reconstruction. This mechanism not only mitigates cross-modality misalignment but also ensures structural and spectral fidelity in the reconstructed images.
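The bidirectional transfer can be sketched as plain single-head cross-attention in NumPy (an illustrative stand-in, without the learned projections or the alignment-aware weighting of the actual CM3A module): queries from the target modality attend to keys/values from the source modality, so PAN-structure queries pull MS texture during HRMS reconstruction, and MS-texture queries pull PAN structure during PAN back-reconstruction:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feat: np.ndarray, kv_feat: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: each row of q_feat (N, d) aggregates
    rows of kv_feat (M, d) weighted by scaled dot-product similarity."""
    d = q_feat.shape[-1]
    scores = q_feat @ kv_feat.T / np.sqrt(d)   # (N, M) similarity matrix
    return softmax(scores, axis=-1) @ kv_feat  # convex combination of kv rows

rng = np.random.default_rng(0)
pan_tokens = rng.normal(size=(64, 16))  # flattened PAN features (tokens x dim)
ms_tokens = rng.normal(size=(64, 16))   # flattened MS features

# HRMS reconstruction: PAN-structure queries pull MS texture.
ms_to_pan = cross_attention(pan_tokens, ms_tokens)
# PAN back-reconstruction: MS-texture queries pull PAN structure.
pan_to_ms = cross_attention(ms_tokens, pan_tokens)
```

Because each output token is a convex combination of source-modality tokens, misaligned content can be re-gathered from the correct spatial locations rather than read off a fixed pixel grid, which is the intuition behind attention-based alignment.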
Quantitative comparison of deep learning-based PS methods on the WV3 dataset. Our PAN-Crafter consistently outperforms existing approaches across most evaluation metrics while maintaining low memory consumption and fast inference time.
Quantitative comparison of deep learning-based PS methods on the GF2 and QB datasets. Our PAN-Crafter surpasses diffusion-based models (PanDiff and TMDiff) in both full- and reduced-resolution evaluations, demonstrating that our CM3A module effectively handles local misalignment without the computational burden of an iterative diffusion process.
Left. Quantitative comparison of deep learning-based PS methods on the unseen WV2 satellite dataset.
Right. Ablation studies on CM3A and MARs on the WV3 dataset.