VG-SSL: Benchmarking Self-supervised Representation Learning Approaches for Visual Geo-localization

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

New York University

TL;DR

VG-SSL is a comprehensive benchmark for self-supervised representation learning in visual geo-localization. It shows that SSL methods such as contrastive learning can match or surpass supervised techniques when combined with the novel GeoPair strategy.

Abstract

Visual Geo-localization (VG) is a critical research area for identifying geo-locations from visual inputs, particularly in autonomous navigation for robotics and vehicles. Current VG methods often learn feature extractors from geo-labeled images to create dense, geographically relevant representations. Recent advances in Self-Supervised Learning (SSL) have demonstrated its capability to achieve performance on par with supervised techniques with unlabeled images. This study presents a novel VG-SSL framework, designed for versatile integration and benchmarking of diverse SSL methods for representation learning in VG, featuring a unique geo-related pair strategy, GeoPair. Through extensive performance analysis, we adapt SSL techniques to improve VG on datasets from hand-held and car-mounted cameras used in robotics and autonomous vehicles. Our results show that contrastive learning and information maximization methods yield superior geo-specific representation quality, matching or surpassing the performance of state-of-the-art VG techniques. To our knowledge, this is the first benchmarking study of SSL in VG, highlighting its potential in enhancing geo-specific visual representations for robotics and autonomous vehicles. The code is publicly available.

Key Findings

SSL Matches Supervised

Contrastive and information maximization SSL methods match or surpass ImageNet-supervised features for VG, even without geo-labels during pre-training.

GeoPair Boosts Performance

The novel GeoPair strategy, which creates positive pairs from geographically nearby images, consistently improves SSL representations across all methods.

Better Feature Focus

SSL-trained models attend more to geo-relevant structures like buildings and landmarks, while supervised models often focus on transient objects like vehicles.

Method Overview

The VG-SSL framework integrates diverse self-supervised learning methods for visual geo-localization. It features a modular design supporting contrastive learning (MoCo v2, SimCLR), information maximization (Barlow Twins, VICReg), and self-distillation (BYOL, SimSiam) approaches. The novel GeoPair strategy constructs geographically relevant positive pairs by sampling nearby images within a controllable radius, enabling SSL pre-training that captures geo-specific visual patterns.
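
The pair construction described above can be illustrated with a minimal sketch. It assumes planar coordinates in meters (for example UTM) and uniform sampling among neighbors inside the radius; the function name `build_geo_pairs`, the `radius_m` parameter, and the toy coordinates are hypothetical and are not taken from the released code, whose exact sampling rule may differ.

```python
import math
import random


def build_geo_pairs(coords, radius_m, rng=None):
    """Pair each image with one randomly chosen neighbor within radius_m.

    coords: list of (x, y) positions in meters, one per image.
    Returns a list of (anchor_index, positive_index) tuples.
    Images with no neighbor inside the radius are skipped.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    pairs = []
    for i, (xi, yi) in enumerate(coords):
        # Candidate positives: every other image within the radius.
        neighbors = [
            j
            for j, (xj, yj) in enumerate(coords)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius_m
        ]
        if neighbors:
            pairs.append((i, rng.choice(neighbors)))
    return pairs


# Toy coordinates (made up): three images close together, one far away.
coords = [(0.0, 0.0), (5.0, 0.0), (8.0, 3.0), (200.0, 200.0)]
pairs = build_geo_pairs(coords, radius_m=10.0)
```

In this toy setup the isolated fourth image yields no pair, so a larger radius admits more (and visually more diverse) positives while a smaller one keeps pairs closer to the same viewpoint.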

Figure: Overview of the VG-SSL framework. It supports multiple SSL methods with a unified architecture. GeoPair creates positive pairs from geographically nearby images, enabling geo-specific representation learning without explicit geo-labels.
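
To make the role of positive pairs concrete, here is a hedged sketch of one contrastive objective applied to paired embeddings. It is a simplified, one-directional InfoNCE loss in plain Python, not the exact NT-Xent loss of SimCLR or the queue-based variant of MoCo v2, and not the authors' implementation; the function names, the temperature value, and the toy embeddings are assumptions for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired embeddings.

    anchors[i] should be most similar to positives[i]; the other
    positives in the batch act as negatives for anchor i.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every positive, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching positive as the target class.
        total += log_sum_exp - logits[i]
    return total / len(a)


# Toy embeddings (made up): each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(anchors, positives)
shuffled_loss = info_nce(anchors, positives[::-1])
```

The loss is lower when each anchor is matched with its own positive than when the positives are shuffled, which is the signal that pulls embeddings of nearby places together.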

Quantitative Results

Recall@N (%) for one-stage methods with ResNet-50-GeM (Dg=1024) and DeiT-S (Dg=256) backbones, trained on Pitts30k.

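| Backbone | Method | Category | R@1 | R@5 | R@10 |
|---|---|---|---|---|---|
| ResNet-50-GeM (Dg=1024) | Triplet Loss (Baseline) | Supervised | 76.7 | 89.1 | 92.3 |
| ResNet-50-GeM (Dg=1024) | SimCLR + GeoPair | Contrastive | 82.8 | 91.9 | 94.6 |
| ResNet-50-GeM (Dg=1024) | MoCo v2 + GeoPair | Contrastive | 82.6 | 92.4 | 95.1 |
| ResNet-50-GeM (Dg=1024) | Barlow Twins + GeoPair | Info. Max. | 80.8 | 91.7 | 94.2 |
| ResNet-50-GeM (Dg=1024) | VICReg + GeoPair | Info. Max. | 80.2 | 91.3 | 94.1 |
| ResNet-50-GeM (Dg=1024) | BYOL + GeoPair | Distillation | 80.2 | 91.5 | 94.4 |
| ResNet-50-GeM (Dg=1024) | SimSiam + GeoPair | Distillation | 78.6 | 89.8 | 92.7 |
| DeiT-S (Dg=256) | Triplet Loss (Baseline) | Supervised | 72.9 | 88.5 | 92.6 |
| DeiT-S (Dg=256) | SimCLR + GeoPair | Contrastive | 84.7 | 93.9 | 96.0 |
| DeiT-S (Dg=256) | MoCo v2 + GeoPair | Contrastive | 80.8 | 92.4 | 95.0 |
| DeiT-S (Dg=256) | Barlow Twins + GeoPair | Info. Max. | 82.6 | 92.1 | 95.0 |
| DeiT-S (Dg=256) | VICReg + GeoPair | Info. Max. | 81.7 | 92.3 | 95.2 |
| DeiT-S (Dg=256) | BYOL + GeoPair | Distillation | 76.6 | 89.4 | 92.9 |
| DeiT-S (Dg=256) | SimSiam + GeoPair | Distillation | 79.7 | 91.0 | 93.5 |

| Backbone | Method | Category | R@1 | R@5 | R@10 |
|---|---|---|---|---|---|
| ResNet-50-GeM (Dg=1024) | Triplet Loss (Baseline) | Supervised | 76.9 | 86.1 | 89.5 |
| ResNet-50-GeM (Dg=1024) | SimCLR + GeoPair | Contrastive | 84.2 | 92.3 | 94.2 |
| ResNet-50-GeM (Dg=1024) | MoCo v2 + GeoPair | Contrastive | 81.5 | 90.5 | 92.8 |
| ResNet-50-GeM (Dg=1024) | Barlow Twins + GeoPair | Info. Max. | 79.5 | 89.5 | 91.9 |
| ResNet-50-GeM (Dg=1024) | VICReg + GeoPair | Info. Max. | 77.4 | 89.3 | 91.2 |
| ResNet-50-GeM (Dg=1024) | BYOL + GeoPair | Distillation | 72.7 | 85.5 | 87.7 |
| ResNet-50-GeM (Dg=1024) | SimSiam + GeoPair | Distillation | 75.0 | 85.8 | 88.6 |
| DeiT-S (Dg=256) | Triplet Loss (Baseline) | Supervised | 79.3 | 90.5 | 92.7 |
| DeiT-S (Dg=256) | SimCLR + GeoPair | Contrastive | 81.1 | 91.8 | 93.1 |
| DeiT-S (Dg=256) | MoCo v2 + GeoPair | Contrastive | 76.1 | 88.5 | 91.1 |
| DeiT-S (Dg=256) | Barlow Twins + GeoPair | Info. Max. | 79.7 | 91.4 | 93.1 |
| DeiT-S (Dg=256) | VICReg + GeoPair | Info. Max. | 75.8 | 89.5 | 91.9 |
| DeiT-S (Dg=256) | BYOL + GeoPair | Distillation | 58.2 | 75.3 | 79.6 |
| DeiT-S (Dg=256) | SimSiam + GeoPair | Distillation | 56.2 | 76.2 | 80.1 |

| Backbone | Method | Category | R@1 | R@5 | R@10 |
|---|---|---|---|---|---|
| ResNet-50-GeM (Dg=1024) | Triplet Loss (Baseline) | Supervised | 50.2 | 67.9 | 76.8 |
| ResNet-50-GeM (Dg=1024) | SimCLR + GeoPair | Contrastive | 54.6 | 74.9 | 81.9 |
| ResNet-50-GeM (Dg=1024) | MoCo v2 + GeoPair | Contrastive | 53.4 | 68.3 | 76.5 |
| ResNet-50-GeM (Dg=1024) | Barlow Twins + GeoPair | Info. Max. | 45.7 | 61.9 | 70.8 |
| ResNet-50-GeM (Dg=1024) | VICReg + GeoPair | Info. Max. | 50.2 | 65.4 | 74.3 |
| ResNet-50-GeM (Dg=1024) | BYOL + GeoPair | Distillation | 44.8 | 63.8 | 70.8 |
| ResNet-50-GeM (Dg=1024) | SimSiam + GeoPair | Distillation | 51.1 | 67.6 | 71.4 |
| DeiT-S (Dg=256) | Triplet Loss (Baseline) | Supervised | 43.5 | 65.7 | 72.4 |
| DeiT-S (Dg=256) | SimCLR + GeoPair | Contrastive | 59.4 | 76.2 | 80.0 |
| DeiT-S (Dg=256) | MoCo v2 + GeoPair | Contrastive | 50.8 | 69.8 | 77.1 |
| DeiT-S (Dg=256) | Barlow Twins + GeoPair | Info. Max. | 58.4 | 75.2 | 80.6 |
| DeiT-S (Dg=256) | VICReg + GeoPair | Info. Max. | 51.7 | 66.7 | 74.6 |
| DeiT-S (Dg=256) | BYOL + GeoPair | Distillation | 43.2 | 62.2 | 68.6 |
| DeiT-S (Dg=256) | SimSiam + GeoPair | Distillation | 47.3 | 63.8 | 74.0 |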

Blue bold = best result. Underlined = second best. Highlighted rows = our methods (with GeoPair).
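
For readers unfamiliar with the metric, the following sketch shows how Recall@N is commonly computed in visual geo-localization: a query counts as correct if at least one of its top-N retrieved database images is a true positive, usually meaning it lies within a fixed distance of the query. That distance threshold is not stated on this page, so the sketch takes the positive sets as given; the function name `recall_at_n` and the toy rankings are hypothetical.

```python
def recall_at_n(ranked_db_indices, positives_per_query, ns=(1, 5, 10)):
    """Percentage of queries with at least one true positive in the top N.

    ranked_db_indices: per query, database indices sorted by descending similarity.
    positives_per_query: per query, the set of database indices that count as correct.
    """
    hits = {n: 0 for n in ns}
    for ranked, positives in zip(ranked_db_indices, positives_per_query):
        for n in ns:
            if any(idx in positives for idx in ranked[:n]):
                hits[n] += 1
    num_queries = len(ranked_db_indices)
    return {n: 100.0 * hits[n] / num_queries for n in ns}


# Toy retrieval results (made up) for three queries.
ranked = [[3, 1, 2], [0, 2, 1], [4, 5, 6]]
positives = [{3}, {1}, {9}]
recalls = recall_at_n(ranked, positives, ns=(1, 3))
```

Here one of three queries is correct at rank 1 and two of three within the top 3, so Recall@N is non-decreasing in N, as in the tables above.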

Qualitative Results

GradCAM activation maps reveal how different methods attend to image regions. SSL methods with GeoPair focus more on geo-relevant structures (buildings, landmarks) rather than transient objects, leading to more robust place recognition. Each dataset below shows one scenario; compare the maps across methods to see the differences.

Pitts30k

[Figure: GradCAM activation maps for a Pitts30k query and database image. Panels: Query, Database, Triplet Loss, MoCo v2, SimCLR, VICReg, Barlow Twins, BYOL, SimSiam. Each method panel shows a query map and a database map.]

MSLS

[Figure: GradCAM activation maps for an MSLS query and database image. Panels: Query, Database, Triplet Loss, MoCo v2, SimCLR, VICReg, Barlow Twins, BYOL, SimSiam. Each method panel shows a query map and a database map.]

24/7 Tokyo

[Figure: GradCAM activation maps for a 24/7 Tokyo query and database image. Panels: Query, Database, Triplet Loss, MoCo v2, SimCLR, VICReg, Barlow Twins, BYOL, SimSiam. Each method panel shows a query map and a database map.]

BibTeX

@inproceedings{xiao2025vg,
  title={VG-SSL: Benchmarking Self-Supervised Representation Learning Approaches for Visual Geo-Localization},
  author={Xiao, Jiuhong and Zhu, Gao and Loianno, Giuseppe},
  booktitle={2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={6667--6677},
  year={2025},
  organization={IEEE}
}