VG-SSL provides a comprehensive benchmark for self-supervised representation learning in visual geo-localization, demonstrating that, with the novel GeoPair strategy, SSL methods such as contrastive learning can match or surpass supervised techniques.
Visual Geo-localization (VG) is a critical research area for identifying geo-locations from visual inputs, particularly in autonomous navigation for robotics and vehicles. Current VG methods often learn feature extractors from geo-labeled images to create dense, geographically relevant representations. Recent advances in Self-Supervised Learning (SSL) have demonstrated its capability to achieve performance on par with supervised techniques with unlabeled images. This study presents a novel VG-SSL framework, designed for versatile integration and benchmarking of diverse SSL methods for representation learning in VG, featuring a unique geo-related pair strategy, GeoPair. Through extensive performance analysis, we adapt SSL techniques to improve VG on datasets from hand-held and car-mounted cameras used in robotics and autonomous vehicles. Our results show that contrastive learning and information maximization methods yield superior geo-specific representation quality, matching or surpassing the performance of state-of-the-art VG techniques. To our knowledge, this is the first benchmarking study of SSL in VG, highlighting its potential in enhancing geo-specific visual representations for robotics and autonomous vehicles. The code is publicly available.
Contrastive and information maximization SSL methods match or surpass ImageNet-supervised features for VG, even without geo-labels during pre-training.
The novel GeoPair strategy, which creates positive pairs from geographically nearby images, consistently improves SSL representations across all methods.
SSL-trained models attend more to geo-relevant structures like buildings and landmarks, while supervised models often focus on transient objects like vehicles.
The VG-SSL framework integrates diverse self-supervised learning methods for visual geo-localization. It features a modular design supporting contrastive learning (MoCo v2, SimCLR), information maximization (Barlow Twins, VICReg), and self-distillation (BYOL, SimSiam) approaches. The novel GeoPair strategy constructs geographically relevant positive pairs by sampling nearby images within a controllable radius, enabling SSL pre-training that captures geo-specific visual patterns.
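The pair construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_geo_pairs`, the use of planar coordinates in meters, the uniform choice among neighbors, and the fallback of pairing an image with itself when it has no neighbor are all assumptions made for illustration.

```python
import numpy as np


def sample_geo_pairs(coords, radius, rng=None):
    """Pick one positive partner per image among images within `radius`.

    coords: (N, 2) planar positions in meters (assumed, e.g. UTM coordinates).
    Returns an (N,) integer array; entry i is the index of the partner chosen
    for image i, or i itself when no other image lies within the radius
    (in that case the pair would rely on image augmentation alone).
    """
    rng = np.random.default_rng() if rng is None else rng
    coords = np.asarray(coords, dtype=float)
    n = len(coords)

    # Pairwise Euclidean distances, shape (N, N). This is O(N^2) in memory,
    # so a spatial index would be preferable for large datasets.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    indices = np.arange(n)
    partners = indices.copy()
    for i in range(n):
        # Candidates: every other image no farther than `radius` from image i.
        candidates = np.flatnonzero((dist[i] <= radius) & (indices != i))
        if candidates.size > 0:
            partners[i] = rng.choice(candidates)
    return partners
```

Each pair `(i, partners[i])` would then be fed to the chosen SSL objective as two views of the same place. A larger `radius` admits more varied viewpoints, while a smaller one keeps the two views visually closer.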
Recall@N (%) for one-stage methods with ResNet-50-GeM (Dg=1024) and DeiT-S (Dg=256) backbones, trained on Pitts30k.
| Method | Category | R@1 | R@5 | R@10 |
|---|---|---|---|---|
| ResNet-50-GeM (Dg=1024) | | | | |
| Triplet Loss (Baseline) | Supervised | 76.7 | 89.1 | 92.3 |
| SimCLR + GeoPair | Contrastive | 82.8 | 91.9 | 94.6 |
| MoCo v2 + GeoPair | Contrastive | 82.6 | 92.4 | 95.1 |
| Barlow Twins + GeoPair | Info. Max. | 80.8 | 91.7 | 94.2 |
| VICReg + GeoPair | Info. Max. | 80.2 | 91.3 | 94.1 |
| BYOL + GeoPair | Distillation | 80.2 | 91.5 | 94.4 |
| SimSiam + GeoPair | Distillation | 78.6 | 89.8 | 92.7 |
| DeiT-S (Dg=256) | | | | |
| Triplet Loss (Baseline) | Supervised | 72.9 | 88.5 | 92.6 |
| SimCLR + GeoPair | Contrastive | 84.7 | 93.9 | 96.0 |
| MoCo v2 + GeoPair | Contrastive | 80.8 | 92.4 | 95.0 |
| Barlow Twins + GeoPair | Info. Max. | 82.6 | 92.1 | 95.0 |
| VICReg + GeoPair | Info. Max. | 81.7 | 92.3 | 95.2 |
| BYOL + GeoPair | Distillation | 76.6 | 89.4 | 92.9 |
| SimSiam + GeoPair | Distillation | 79.7 | 91.0 | 93.5 |
| Method | Category | R@1 | R@5 | R@10 |
|---|---|---|---|---|
| ResNet-50-GeM (Dg=1024) | | | | |
| Triplet Loss (Baseline) | Supervised | 76.9 | 86.1 | 89.5 |
| SimCLR + GeoPair | Contrastive | 84.2 | 92.3 | 94.2 |
| MoCo v2 + GeoPair | Contrastive | 81.5 | 90.5 | 92.8 |
| Barlow Twins + GeoPair | Info. Max. | 79.5 | 89.5 | 91.9 |
| VICReg + GeoPair | Info. Max. | 77.4 | 89.3 | 91.2 |
| BYOL + GeoPair | Distillation | 72.7 | 85.5 | 87.7 |
| SimSiam + GeoPair | Distillation | 75.0 | 85.8 | 88.6 |
| DeiT-S (Dg=256) | | | | |
| Triplet Loss (Baseline) | Supervised | 79.3 | 90.5 | 92.7 |
| SimCLR + GeoPair | Contrastive | 81.1 | 91.8 | 93.1 |
| MoCo v2 + GeoPair | Contrastive | 76.1 | 88.5 | 91.1 |
| Barlow Twins + GeoPair | Info. Max. | 79.7 | 91.4 | 93.1 |
| VICReg + GeoPair | Info. Max. | 75.8 | 89.5 | 91.9 |
| BYOL + GeoPair | Distillation | 58.2 | 75.3 | 79.6 |
| SimSiam + GeoPair | Distillation | 56.2 | 76.2 | 80.1 |
| Method | Category | R@1 | R@5 | R@10 |
|---|---|---|---|---|
| ResNet-50-GeM (Dg=1024) | | | | |
| Triplet Loss (Baseline) | Supervised | 50.2 | 67.9 | 76.8 |
| SimCLR + GeoPair | Contrastive | 54.6 | 74.9 | 81.9 |
| MoCo v2 + GeoPair | Contrastive | 53.4 | 68.3 | 76.5 |
| Barlow Twins + GeoPair | Info. Max. | 45.7 | 61.9 | 70.8 |
| VICReg + GeoPair | Info. Max. | 50.2 | 65.4 | 74.3 |
| BYOL + GeoPair | Distillation | 44.8 | 63.8 | 70.8 |
| SimSiam + GeoPair | Distillation | 51.1 | 67.6 | 71.4 |
| DeiT-S (Dg=256) | | | | |
| Triplet Loss (Baseline) | Supervised | 43.5 | 65.7 | 72.4 |
| SimCLR + GeoPair | Contrastive | 59.4 | 76.2 | 80.0 |
| MoCo v2 + GeoPair | Contrastive | 50.8 | 69.8 | 77.1 |
| Barlow Twins + GeoPair | Info. Max. | 58.4 | 75.2 | 80.6 |
| VICReg + GeoPair | Info. Max. | 51.7 | 66.7 | 74.6 |
| BYOL + GeoPair | Distillation | 43.2 | 62.2 | 68.6 |
| SimSiam + GeoPair | Distillation | 47.3 | 63.8 | 74.0 |
Rows labeled "+ GeoPair" are our methods. In the original tables, the best result is shown in blue bold and the second best is underlined.
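For readers unfamiliar with the metric, the Recall@N values in the tables above can be computed along the following lines. This is a hedged sketch, not the benchmark's evaluation code: the function name `recall_at_n`, cosine similarity for retrieval, planar positions in meters, and the 25 m success threshold (a common convention in visual geo-localization) are assumptions not stated in this page.

```python
import numpy as np


def recall_at_n(query_desc, db_desc, query_pos, db_pos,
                ns=(1, 5, 10), threshold=25.0):
    """Percentage of queries with a correct match among the top-N retrievals.

    query_desc, db_desc: (Q, D) and (M, D) global descriptors.
    query_pos, db_pos:   (Q, 2) and (M, 2) planar positions in meters (assumed).
    A retrieval counts as correct when the database image lies within
    `threshold` meters of the query (25 m is an assumed default).
    """
    query_desc = np.asarray(query_desc, dtype=float)
    db_desc = np.asarray(db_desc, dtype=float)
    query_pos = np.asarray(query_pos, dtype=float)
    db_pos = np.asarray(db_pos, dtype=float)

    # L2-normalize so the dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    sim = q @ d.T  # shape (Q, M)

    # Database indices sorted by decreasing similarity, truncated to max N.
    max_n = max(ns)
    ranked = np.argsort(-sim, axis=1)[:, :max_n]

    # Geographic distance from each query to each of its retrieved images.
    geo = np.linalg.norm(query_pos[:, None, :] - db_pos[ranked], axis=-1)
    hit = geo <= threshold  # shape (Q, max_n)

    # A query succeeds at N if any of its first N retrievals is a hit.
    return {n: 100.0 * hit[:, :n].any(axis=1).mean() for n in ns}
```

With descriptors of dimension Dg produced by a trained backbone, calling `recall_at_n(...)` with `ns=(1, 5, 10)` would return a dictionary mapping each N to a percentage.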
GradCAM activation maps reveal how different methods attend to image regions. SSL methods with GeoPair focus more on geo-relevant structures (buildings, landmarks) rather than transient objects, leading to more robust place recognition.
@inproceedings{xiao2025vg,
title={VG-SSL: Benchmarking Self-Supervised Representation Learning Approaches for Visual Geo-Localization},
author={Xiao, Jiuhong and Zhu, Gao and Loianno, Giuseppe},
booktitle={2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
pages={6667--6677},
year={2025},
organization={IEEE}
}