How to Efficiently Adapt Foundation Models to Remote Sensing for Object Detection?

Abstract

Vision-Language Models (VLMs) trained on millions of images generalize well across a wide range of domains. However, a significant distribution shift between remote sensing and natural images limits their effectiveness without adaptation. We investigate both the ability of VLMs to adapt to satellite images and the ability of previously adapted VLMs to detect new classes of objects in these images from only a few examples. We address this challenge by exploring different parameter-efficient finetuning strategies for transferring the knowledge these models acquire during large-scale pre-training to remote sensing. We evaluate these strategies both in few-shot settings and in scenarios without data constraints. We also show on the DIOR dataset that a GLIP model previously trained on satellite images achieves state-of-the-art performance in the few-shot detection of new object classes, without requiring a dedicated training strategy.

Publication
IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium