SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Bibliographic Details
Title: SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
Authors: Guo, Xin; Lao, Jiangwei; Dang, Bo; Zhang, Yingying; Yu, Lei; Ru, Lixiang; Zhong, Liheng; Huang, Ziyuan; Wu, Kang; Hu, Dingxiang; He, Huimei; Wang, Jian; Chen, Jingdong; Yang, Ming; Zhang, Yongjun; Li, Yansheng
Publication Year: 2023
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition
Description: Prior studies on Remote Sensing Foundation Models (RSFMs) reveal immense potential towards a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks. In this study, we present SkySense, a generic billion-scale model pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder that takes temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained with our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations with geo-context cues, we introduce Geo-Context Prototype Learning to learn region-aware prototypes from RSI's multi-modal spatiotemporal features. To the best of our knowledge, SkySense is the largest multi-modal RSFM to date, and its modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities in a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios; in particular, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and 3.61% on average, respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.
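For readers of the abstract's Multi-Granularity Contrastive Learning component, the following is a minimal sketch, not the authors' released code: it contrasts optical and SAR feature maps at image-, region-, and pixel-like granularities with an InfoNCE loss. The function names, feature shapes, and pooling scales (1x1, 4x4, full map) are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of a multi-granularity cross-modal contrastive objective.
    # Assumption: opt_feat and sar_feat are spatially aligned encoder outputs
    # of shape (B, D, H, W); nothing here mirrors SkySense's actual modules.
    import torch
    import torch.nn.functional as F

    def info_nce(a, b, tau=0.07):
        # InfoNCE between matched rows of a and b, each of shape (N, D):
        # row i of a should score highest against row i of b.
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / tau
        targets = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, targets)

    def flatten_pooled(x, size):
        # Pool a (B, D, H, W) map to (B, D, size, size), or keep full
        # resolution when size is None, then flatten to (B*size*size, D).
        if size is not None:
            x = F.adaptive_avg_pool2d(x, size)
        d = x.size(1)
        return x.flatten(2).transpose(1, 2).reshape(-1, d)

    def multi_granularity_loss(opt_feat, sar_feat):
        # Contrast optical vs. SAR features at three assumed granularities:
        # 1x1 (image-level), 4x4 (region-level), full map (pixel-level).
        loss = 0.0
        for size in (1, 4, None):
            loss = loss + info_nce(flatten_pooled(opt_feat, size),
                                   flatten_pooled(sar_feat, size))
        return loss / 3

    # Toy usage with random tensors standing in for encoder outputs.
    opt = torch.randn(2, 256, 16, 16)
    sar = torch.randn(2, 256, 16, 16)
    print(multi_granularity_loss(opt, sar))

The design intuition, under these assumptions, is that averaging one InfoNCE term per pooling scale encourages the two modalities to agree both globally and at finer spatial locations.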
Comment: Accepted by CVPR 2024
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2312.10115
Accession Number: edsarx.2312.10115
Database: arXiv