We propose Modality-Adaptive Reconstruction (MARs), a unified framework that enables robust learning from misaligned PAN-MS image pairs by dynamically generating both HRMS and PAN images.
We introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel alignment mechanism that adaptively refines textures and structures between PAN and MS images, improving spatial-spectral consistency.
We achieve state-of-the-art (SOTA) performance across multiple benchmark datasets and show strong robustness on unseen satellite datasets, demonstrating the effectiveness of our PAN-Crafter in handling real-world cross-modality misalignment.
Example of PAN, LRMS, and their overlaid visualization. PAN images typically have four times the spatial resolution of MS images, necessitating up-sampling of the MS images before fusion. However, this up-sampling step introduces interpolation artifacts and spatial shifts, further amplifying alignment discrepancies.
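As a toy illustration of this up-sampling step (not code from the paper), the sketch below resizes a single MS band by 4x in NumPy. The half-pixel coordinate convention used here is one common choice; a different convention (e.g. corner-aligned) shifts every sample by a sub-pixel offset, which is exactly the kind of alignment discrepancy the caption refers to:

```python
import numpy as np

def upsample_nearest(ms_band: np.ndarray, scale: int = 4) -> np.ndarray:
    """Zero-order 4x up-sampling of one MS band (blocky, no interpolation)."""
    return np.kron(ms_band, np.ones((scale, scale)))

def upsample_bilinear(ms_band: np.ndarray, scale: int = 4) -> np.ndarray:
    """Bilinear 4x up-sampling using the half-pixel coordinate convention."""
    h, w = ms_band.shape
    H, W = h * scale, w * scale
    # Map each HR pixel centre back to LR coordinates (half-pixel convention).
    ys = (np.arange(H) + 0.5) / scale - 0.5
    xs = (np.arange(W) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]   # vertical blend weights
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]   # horizontal blend weights
    tl = ms_band[np.ix_(y0, x0)]       # top-left neighbours
    tr = ms_band[np.ix_(y0, x0 + 1)]   # top-right neighbours
    bl = ms_band[np.ix_(y0 + 1, x0)]   # bottom-left neighbours
    br = ms_band[np.ix_(y0 + 1, x0 + 1)]
    return (1 - wy) * ((1 - wx) * tl + wx * tr) + wy * ((1 - wx) * bl + wx * br)

lr = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 MS band
hr_nn = upsample_nearest(lr)                    # 16x16, blocky
hr_bl = upsample_bilinear(lr)                   # 16x16, smoothed edges
```

Bilinear interpolation blurs edges that the PAN image resolves sharply, so even a perfectly registered pair acquires local mismatches after this step.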
(a) High-resolution PAN image.
(b) Low-resolution multi-spectral image.
(c) Overlay of PAN and LRMS images to highlight differences.
Overview of the proposed PAN-Crafter architecture. The network processes the input PAN and LRMS images using the Modality-Adaptive Reconstruction (MARs) mode, which enables adaptive generation of HRMS and PAN outputs. By exploiting the spatial structures of PAN images and the spectral fidelity of MS images, MARs preserves high-frequency details while minimizing spectral distortion. The architecture follows an encoder-decoder design, incorporating residual blocks and Cross-Modality Alignment-Aware Attention (CM3A) at multiple scales to mitigate modality misalignment while preserving the spectral fidelity of MS images and the structural fidelity of PAN images.
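The mode-adaptive idea can be sketched minimally in NumPy. This is purely illustrative (the class name, layer shapes, and mode labels are our own assumptions, not the paper's implementation): a shared backbone maps the concatenated PAN and up-sampled MS inputs to features, and a mode flag selects which output head, HRMS or PAN, reconstructs from them:

```python
import numpy as np

rng = np.random.default_rng(0)

class MARsSketch:
    """Toy mode-adaptive reconstructor: shared backbone, two output heads.
    All names and shapes are illustrative, not the paper's network."""
    def __init__(self, c_ms: int = 4, feat: int = 8):
        c_in = c_ms + 1                                     # MS bands + 1 PAN band
        self.w_backbone = rng.normal(size=(c_in, feat)) * 0.1
        self.w_hrms = rng.normal(size=(feat, c_ms)) * 0.1   # HRMS output head
        self.w_pan = rng.normal(size=(feat, 1)) * 0.1       # PAN output head

    def __call__(self, pan: np.ndarray, ms_up: np.ndarray, mode: str) -> np.ndarray:
        x = np.concatenate([ms_up, pan], axis=-1)           # (H, W, c_ms+1)
        h = np.maximum(x @ self.w_backbone, 0.0)            # shared features (ReLU)
        if mode == "hrms":                                  # reconstruct HRMS
            return h @ self.w_hrms
        return h @ self.w_pan                               # back-reconstruct PAN

net = MARsSketch()
pan = rng.random((16, 16, 1))
ms_up = rng.random((16, 16, 4))          # LRMS already up-sampled to the PAN grid
hrms = net(pan, ms_up, mode="hrms")      # (16, 16, 4)
pan_rec = net(pan, ms_up, mode="pan")    # (16, 16, 1)
```

Sharing the backbone across both reconstruction targets is what lets the network treat PAN back-reconstruction as an auxiliary signal for the HRMS task.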
Cross-Modality Alignment-Aware Attention (CM3A). CM3A enables bidirectional alignment by transferring MS texture to PAN structure during HRMS reconstruction, and PAN structure to MS texture during PAN back-reconstruction. This mechanism not only mitigates cross-modality misalignment but also ensures structural and spectral fidelity in the reconstructed images.
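The bidirectional transfer can be sketched as plain single-head cross-attention in NumPy (an illustrative stand-in, without the learned projections or the alignment-aware weighting of the actual CM3A module): queries from the target modality attend to keys/values from the source modality, so PAN-structure queries pull MS texture during HRMS reconstruction, and MS-texture queries pull PAN structure during PAN back-reconstruction:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feat: np.ndarray, kv_feat: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: each row of q_feat (N, d) aggregates
    rows of kv_feat (M, d) weighted by scaled dot-product similarity."""
    d = q_feat.shape[-1]
    scores = q_feat @ kv_feat.T / np.sqrt(d)   # (N, M) similarity matrix
    return softmax(scores, axis=-1) @ kv_feat  # convex combination of kv rows

rng = np.random.default_rng(0)
pan_tokens = rng.normal(size=(64, 16))  # flattened PAN features (tokens x dim)
ms_tokens = rng.normal(size=(64, 16))   # flattened MS features

# HRMS reconstruction: PAN-structure queries pull MS texture.
ms_to_pan = cross_attention(pan_tokens, ms_tokens)
# PAN back-reconstruction: MS-texture queries pull PAN structure.
pan_to_ms = cross_attention(ms_tokens, pan_tokens)
```

Because each output token is a convex combination of source-modality tokens, misaligned content can be re-gathered from the correct spatial locations rather than read off a fixed pixel grid, which is the intuition behind attention-based alignment.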
Quantitative comparison of deep learning-based PS methods on the WV3 dataset. Our PAN-Crafter consistently outperforms existing approaches across most evaluation metrics while maintaining low memory consumption and fast inference time.
Quantitative comparison of deep learning-based PS methods on the GF2 and QB datasets. Our PAN-Crafter surpasses diffusion-based models (PanDiff and TMDiff) in both full- and reduced-resolution evaluations, demonstrating that our CM3A module effectively handles local misalignment without the computational burden of an iterative diffusion process.
Left. Quantitative comparison of deep learning-based PS methods on the unseen WV2 satellite dataset.
Right. Ablation studies on CM3A and MARs on the WV3 dataset.