PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening

Jeonghyeok Do1         Sungpyo Kim1         Geunhyuk Youk1         Jaehyup Lee2†         Munchurl Kim1†
†Co-corresponding authors
1Korea Advanced Institute of Science and Technology, South Korea
2Kyungpook National University, South Korea

Full-resolution pan-sharpening (PS) results on the (unseen) WV2, WV3, QB, and GF2 datasets.

Abstract

We propose Modality-Adaptive Reconstruction (MARs), a unified reconstruction framework that enables robust learning from misaligned PAN-MS image pairs by adaptively generating both high-resolution multi-spectral (HRMS) and PAN images.

We introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel alignment mechanism that adaptively refines textures and structures between PAN and MS images, improving spatial-spectral consistency.

We achieve state-of-the-art (SOTA) performance across multiple benchmark datasets and show strong robustness on unseen satellite datasets, demonstrating the effectiveness of our PAN-Crafter in handling real-world cross-modality misalignment.

Motivation of PAN-Crafter


Example of PAN, LRMS, and their overlaid visualization. PAN images typically have four times the spatial resolution of MS images, necessitating up-sampling of the MS images before fusion. However, this up-sampling step introduces interpolation artifacts and spatial shifts, further amplifying alignment discrepancies.
(a) High-resolution PAN image.
(b) Low-resolution multi-spectral image.
(c) Overlay of PAN and LRMS images to highlight differences.
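The interpolation blur and half-pixel shift described above can be reproduced with a minimal NumPy sketch. This is an illustrative toy example, not the paper's pipeline: a sharp edge in a synthetic PAN image stays crisp, while the same edge in a simulated 4×-lower-resolution MS band becomes several pixels wide after bilinear up-sampling.

```python
import numpy as np

def upsample_bilinear(img: np.ndarray, scale: int) -> np.ndarray:
    """Naive bilinear up-sampling of a single-band image by an integer scale."""
    h, w = img.shape
    ys = (np.arange(h * scale) + 0.5) / scale - 0.5   # sample centers in source coords
    xs = (np.arange(w * scale) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]          # interpolation weights
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]
    tl, tr = img[y0][:, x0], img[y0][:, x0 + 1]       # four neighbors
    bl, br = img[y0 + 1][:, x0], img[y0 + 1][:, x0 + 1]
    return (1 - wy) * ((1 - wx) * tl + wx * tr) + wy * ((1 - wx) * bl + wx * br)

# Synthetic example: a sharp step edge in the PAN image is blurred (and its
# center shifted) after 4x up-sampling of the corresponding low-res MS band.
pan = np.zeros((64, 64))
pan[:, 32:] = 1.0                 # sharp vertical edge at column 32
lrms = pan[::4, ::4]              # simulated 4x-lower-resolution MS band
ms_up = upsample_bilinear(lrms, 4)
edge_width = int(np.sum((ms_up[32] > 0.05) & (ms_up[32] < 0.95)))
print(ms_up.shape, edge_width)    # the once-sharp edge now spans several pixels
```

The PAN edge has zero transition pixels, while the up-sampled MS edge spreads over four, which is exactly the kind of discrepancy visualized in the overlay above.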

Overview of PAN-Crafter


Overview of the proposed PAN-Crafter architecture. The network processes the input PAN and LRMS images in a Modality-Adaptive Reconstruction (MARs) mode, which enables adaptive generation of either HRMS or PAN outputs. By leveraging the spatial structure of PAN images and the spectral fidelity of MS images, MARs preserves high-frequency details while minimizing spectral distortion. The architecture follows an encoder-decoder design, incorporating residual blocks and Cross-Modality Alignment-Aware Attention (CM3A) at multiple scales to mitigate modality misalignment while preserving the spectral fidelity of the MS images and the structural fidelity of the PAN images.
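The mode-switching idea behind MARs can be sketched as follows. This is a hypothetical NumPy toy, not the authors' implementation: a shared "encoder" produces features from both modalities, and a mode embedding plus a per-mode output head steers the same decoder toward either an HRMS or a PAN reconstruction. All layer shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_encoder(pan: np.ndarray, lrms_up: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Toy shared encoder: concatenate modalities, mix channels with one 1x1 conv."""
    x = np.concatenate([pan, lrms_up], axis=0)      # (1+C, H, W)
    return np.einsum('dc,chw->dhw', w, x)           # (D, H, W) shared features

def mars_decode(feat, mode, heads, mode_emb):
    """Modality-adaptive decoding: mode embedding conditions the shared features,
    and a mode-specific head maps them to the requested reconstruction target."""
    feat = feat + mode_emb[mode][:, None, None]     # FiLM-style mode conditioning
    return np.einsum('od,dhw->ohw', heads[mode], feat)

C, D, H, W = 4, 8, 16, 16
pan = rng.standard_normal((1, H, W))
lrms_up = rng.standard_normal((C, H, W))            # LRMS after 4x up-sampling
w_enc = rng.standard_normal((D, 1 + C)) * 0.1
heads = {'hrms': rng.standard_normal((C, D)) * 0.1, # HRMS head: C output bands
         'pan':  rng.standard_normal((1, D)) * 0.1} # PAN head: 1 output band
mode_emb = {'hrms': rng.standard_normal(D), 'pan': rng.standard_normal(D)}

feat = shared_encoder(pan, lrms_up, w_enc)
hrms = mars_decode(feat, 'hrms', heads, mode_emb)   # HRMS reconstruction
pan_rec = mars_decode(feat, 'pan', heads, mode_emb) # PAN back-reconstruction
print(hrms.shape, pan_rec.shape)                    # (4, 16, 16) (1, 16, 16)
```

Training the same backbone on both reconstruction targets is what lets the network learn from misaligned pairs: neither modality is treated as a fixed, trusted ground truth.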

Cross-Modality Alignment-Aware Attention (CM3A)


Cross-Modality Alignment-Aware Attention (CM3A). CM3A enables bidirectional alignment by transferring MS texture onto PAN structure during HRMS reconstruction, and PAN structure onto MS texture during PAN back-reconstruction. This mechanism not only mitigates cross-modality misalignment but also ensures structural and spectral fidelity in the reconstructed images.
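The bidirectional transfer can be illustrated with a single-head cross-attention sketch in NumPy. This is a generic cross-attention block under assumed shapes, not the paper's CM3A module: queries come from the modality being reconstructed, keys and values from the other modality, so each position can attend to its (possibly spatially shifted) counterpart; swapping the roles gives the reverse direction.

```python
import numpy as np

def cross_attention(q_feat, kv_feat, wq, wk, wv):
    """Single-head cross-attention over flattened spatial positions.
    q_feat drives the queries (modality being reconstructed); kv_feat
    supplies keys/values (the other modality)."""
    q = q_feat @ wq                                   # (N, d) queries
    k = kv_feat @ wk                                  # (N, d) keys
    v = kv_feat @ wv                                  # (N, d) values
    logits = q @ k.T / np.sqrt(q.shape[-1])           # scaled dot-product scores
    logits -= logits.max(axis=-1, keepdims=True)      # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v                                   # features gathered across modality

rng = np.random.default_rng(0)
N, c, d = 64, 8, 8                                    # 64 spatial positions, 8 channels
pan_feat = rng.standard_normal((N, c))
ms_feat = rng.standard_normal((N, c))
wq, wk, wv = (rng.standard_normal((c, d)) * 0.1 for _ in range(3))

# HRMS reconstruction: MS queries borrow structure from PAN keys/values ...
ms_aligned = cross_attention(ms_feat, pan_feat, wq, wk, wv)
# ... and PAN back-reconstruction swaps the roles (bidirectional alignment).
pan_aligned = cross_attention(pan_feat, ms_feat, wq, wk, wv)
print(ms_aligned.shape, pan_aligned.shape)            # (64, 8) (64, 8)
```

Because each query softly selects matching positions rather than assuming pixel-wise correspondence, this kind of attention can absorb small local shifts between the two modalities.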

Quantitative Evaluation


Quantitative comparison of deep learning-based PS methods on the WV3 dataset. Our PAN-Crafter consistently outperforms existing approaches across most evaluation metrics while maintaining low memory consumption and fast inference time.


Quantitative comparison of deep learning-based PS methods on the GF2 and QB datasets. Our PAN-Crafter surpasses diffusion-based models (PanDiff and TMDiff) on both the full- and reduced-resolution datasets, demonstrating that our CM3A module effectively handles local misalignment without the computational burden of an iterative diffusion process.


Left. Quantitative comparison of deep learning-based PS methods on the unseen WV2 satellite dataset.
Right. Ablation studies on CM3A and MARs on the WV3 dataset.