Advances on Multimodal Remote Sensing Foundation Models for Earth Observation Downstream Tasks: A Survey

Academic Article
Publication Date:
2025
Abstract:
Highlights:
What are the main findings? This review develops an innovative taxonomy for vision–X (vision, language, audio, and position) multimodal remote sensing foundation models (MM-RSFMs) according to their backbones, encompassing CNN, Transformer, Mamba, Diffusion, vision–language model (VLM), multimodal large language model (MLLM), and hybrid backbones. A thorough analysis of the problems and challenges confronting MM-RSFMs reveals a scarcity of high-quality multimodal datasets, limited capability for multimodal feature extraction, weak cross-task generalization, an absence of unified evaluation criteria, and insufficient security measures.
What is the implication of the main findings? The taxonomy helps readers develop a systematic understanding of the intrinsic characteristics of, and interrelationships between, cross-modal alignment and multimodal fusion in MM-RSFMs from a technical perspective. By analyzing the key issues and challenges, targeted improvements can be made to enhance the generalization, interpretability, and security of MM-RSFMs, thereby advancing their research progress and innovative applications.
Remote sensing foundation models (RSFMs) have demonstrated excellent feature extraction and reasoning capabilities under the self-supervised learning paradigm of “unlabeled datasets—model pre-training—downstream tasks”, achieving superior accuracy and performance compared with existing models across numerous open benchmark datasets. However, when confronted with multimodal data such as optical, LiDAR, SAR, text, video, and audio, RSFMs exhibit limitations in cross-modal generalization and multi-task learning. Although several reviews have addressed RSFMs, there is currently no comprehensive survey dedicated to vision–X (vision, language, audio, position) multimodal RSFMs (MM-RSFMs). To fill this gap, this article provides a systematic review of MM-RSFMs from a novel perspective. First, the key technologies underlying MM-RSFMs are reviewed and analyzed, and the available multimodal RS pre-training datasets are summarized. Then, recent advances in MM-RSFMs are classified according to the development of backbone networks and the cross-modal interaction methods of vision–X, such as vision–vision, vision–language, vision–audio, vision–position, and vision–language–audio. Finally, potential challenges are analyzed and perspectives for MM-RSFMs are outlined. The survey reveals that current MM-RSFMs face five key challenges: (1) a scarcity of high-quality multimodal datasets, (2) limited capability for multimodal feature extraction, (3) weak cross-task generalization, (4) an absence of unified evaluation criteria, and (5) insufficient security measures.
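To make the vision–language alignment mentioned in the abstract concrete, the sketch below shows a CLIP-style symmetric contrastive (InfoNCE) objective over paired image and text embeddings. It is a minimal illustration under my own assumptions, not the method of this survey or of any specific MM-RSFM; all function names, shapes, and the temperature value are illustrative.

```python
# Minimal sketch (illustrative assumption, not from the paper): a CLIP-style
# symmetric contrastive objective that pulls matched image/text embedding
# pairs together and pushes mismatched pairs apart.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over image->text and text->image similarities."""
    img = l2_normalize(image_emb)          # (N, D) image embeddings
    txt = l2_normalize(text_emb)           # (N, D) paired text embeddings
    logits = img @ txt.T / temperature     # (N, N) scaled cosine similarities
    # log-softmax along each direction; the matched pair sits on the diagonal
    log_p_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    loss_i2t = -np.mean(np.diag(log_p_i2t))
    loss_t2i = -np.mean(np.diag(log_p_t2i))
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: 4 hypothetical optical-image / caption embedding pairs (dim 8).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_emb = image_emb + 0.1 * rng.normal(size=(4, 8))  # nearly aligned pairs
print(round(clip_style_loss(image_emb, text_emb), 4))
```

The same contrastive recipe generalizes to other vision–X pairings (e.g., vision–audio or vision–position) by swapping the text encoder for the corresponding modality encoder.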
Iris type:
1.1 Articolo in rivista (journal article)
Keywords:
Earth observation; generative pre-trained Transformer (GPT); multimodal data; remote sensing foundation model (RSFM); self-supervised learning
List of contributors:
Zhou, G.; Qian, L.; Gamba, P.
Authors of the University:
Gamba, Paolo Ettore
Handle:
https://iris.unipv.it/handle/11571/1542504
Published in:
Remote Sensing (Journal)