Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition

IEEE International Conference on Robotics & Automation (ICRA) 2026

¹New York University, ²University of California, Berkeley

TL;DR

QAA uses learned queries as an extensive reference codebook to enhance model capacity, replacing complex score-based aggregation with a single matrix multiply for universal visual place recognition with multi-dataset training.

Abstract

Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA's mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Code and models are publicly available.

Key Contributions

📚

Independent Reference Codebook

Learned queries form an image-independent reference codebook via self-attention: a universal anchor that drives generalizable performance across diverse datasets in multi-dataset joint training.

⚡

Cross-query Similarity

CS replaces complex score-based aggregation (softmax, optimal transport) with a single matrix multiply. Coding rate analysis confirms that CS retains roughly 2× the information of conventional approaches.

🛡️

Robust Under Bottlenecks

Even with severely compressed image features (as few as 8 channels), the high-dimensional reference codebook compensates, actively carrying representational structure for robust performance.

Method Overview

QAA Framework Overview

Top: The training framework for multi-dataset joint training. Images from multiple datasets are processed by a shared DINOv2 backbone, and the resulting features are aggregated by QAA into compact descriptors trained with Multi-Similarity Loss.

Bottom: The QAA architecture. Feature queries attend to backbone features via cross-attention to produce query-level image features (P̂). Reference queries pass through self-attention to form an independent codebook (F̂). The final descriptor is the Cross-query Similarity matrix S = F̂ᵀP̂, followed by normalization.
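The CS step described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the attention layers that produce the query-level features and the codebook are stubbed with random arrays, and the dimensions (C, Nf, Nr) are assumed values chosen for readability.

```python
import numpy as np

# Minimal sketch of the Cross-query Similarity (CS) step.
# Assumed (illustrative) shapes, not the paper's exact configuration:
#   P_hat: query-level image features, shape (C, Nf), from cross-attention
#   F_hat: reference codebook,         shape (C, Nr), from self-attention
rng = np.random.default_rng(0)
C, Nf, Nr = 32, 16, 16            # channel dim, feature queries, reference queries

P_hat = rng.standard_normal((C, Nf))   # stand-in for the cross-attention output
F_hat = rng.standard_normal((C, Nr))   # stand-in for the self-attention codebook

# A single matrix multiply replaces score-based aggregation: S = F̂ᵀ P̂
S = F_hat.T @ P_hat               # (Nr, Nf) cross-query similarity matrix

# Flatten and L2-normalize to obtain the global descriptor
descriptor = S.flatten()
descriptor /= np.linalg.norm(descriptor)

print(descriptor.shape)           # (Nr * Nf,), here (256,)
```

The key point is that no softmax or optimal-transport assignment appears between the features and the descriptor; the codebook side of the product is image-independent, so it acts as a fixed set of anchors at inference time.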

Results

Recall@1 comparison with state-of-the-art VPR methods. All QAA results use a DINOv2-B backbone. Reduced descriptor-dimension (Cd) variants (4096, 2048, 1024) demonstrate robust performance with smaller descriptors.

| Method | Backbone | Cd | AmsterTime | Eynsham | Pitts250k | Pitts30k | SPED | SF-XL v1 | SF-XL v2 | Tokyo24/7 |
|---|---|---|---|---|---|---|---|---|---|---|
| NetVLAD | VGG-16 | 4096 | 16.3 | 77.7 | 85.9 | 85.0 | - | 40.0 | 76.9 | 69.8 |
| MixVPR | ResNet-50 | 4096 | 40.2 | 89.4 | 94.2 | 91.5 | 85.2 | 71.1 | 88.5 | 85.1 |
| EigenPlace | ResNet-50 | 2048 | 48.9 | 90.7 | 94.1 | 92.5 | 82.4 | 84.1 | 90.8 | 93.0 |
| BoQ | DINOv2-B | 12288 | 63.0 | 92.2 | 96.6 | 93.7 | 92.5 | 91.8 | 95.2 | 98.1 |
| SALAD CM | DINOv2-B | 8448 | 58.1 | 91.9 | 95.2 | 92.6 | 89.1 | 85.6 | 94.6 | 96.8 |
| QAA (Ours) | DINOv2-B | 8192 | 63.7 | 92.9 | 96.6 | 94.4 | 91.8 | 94.4 | 94.6 | 98.4 |
| QAA (Ours) | DINOv2-B | 4096 | 61.8 | 92.9 | 96.3 | 93.8 | 91.1 | 94.2 | 94.0 | 97.8 |
| QAA (Ours) | DINOv2-B | 2048 | 61.5 | 92.7 | 96.4 | 94.0 | 91.1 | 94.0 | 94.1 | 96.5 |
| QAA (Ours) | DINOv2-B | 1024 | 59.8 | 92.5 | 96.3 | 93.9 | 90.8 | 92.4 | 94.5 | 97.1 |
| Method | Backbone | Cd | MSLS Val | MSLS Challenge | Nordland* | Nordland** | SVOX Night | SVOX Overcast | SVOX Rain | SVOX Snow | SVOX Sun |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NetVLAD | VGG-16 | 4096 | 58.9 | - | - | 13.1 | 8.0 | 66.4 | 51.5 | 54.4 | 35.4 |
| MixVPR | ResNet-50 | 4096 | 88.0 | 64.0 | 58.4 | 76.2 | 64.4 | 96.2 | 91.5 | 96.8 | 84.8 |
| EigenPlace | ResNet-50 | 2048 | 89.2 | 67.4 | 54.2 | 71.2 | 58.9 | 93.1 | 90.0 | 93.1 | 86.4 |
| BoQ | DINOv2-B | 12288 | 93.8 | 79.0 | 81.3 | 90.6 | 97.7 | 98.5 | 98.8 | 99.4 | 97.5 |
| SALAD CM | DINOv2-B | 8448 | 94.2 | 82.7 | 90.7 | 95.2 | 95.6 | 98.5 | 98.4 | 99.2 | 98.1 |
| QAA (Ours) | DINOv2-B | 8192 | 97.6 | 85.7 | 91.8 | 96.7 | 97.2 | 98.4 | 98.4 | 99.1 | 97.3 |
| QAA (Ours) | DINOv2-B | 4096 | 98.1 | 84.8 | 91.6 | 96.6 | 97.2 | 98.5 | 97.9 | 99.0 | 98.2 |
| QAA (Ours) | DINOv2-B | 2048 | 97.8 | 84.2 | 91.4 | 95.6 | 96.4 | 98.4 | 97.5 | 98.3 | 97.1 |
| QAA (Ours) | DINOv2-B | 1024 | 97.7 | 82.1 | 88.3 | 92.2 | 95.0 | 98.6 | 97.7 | 99.0 | 95.9 |

Blue bold = best result. Highlighted row = our method. Italic rows = reduced-Cd variants. QAA maintains robust performance even at 4–8× smaller descriptor dimensions.

Inference Efficiency

| Method | Params | GFLOPS |
|---|---|---|
| SALAD | 1.4M | 0.94 |
| BoQ (64 queries) | 8.6M | 8.22 |
| QAA (256 queries) | 5.1M | 2.29 |

Despite using 4× more queries than BoQ, QAA requires 3.6× fewer GFLOPS and 40% fewer parameters.
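One reason the CS step is cheap is that a single matrix multiply has a closed-form cost. The back-of-envelope count below uses the standard 2·m·k·n FLOP rule for an (m,k)×(k,n) product; the dimensions are illustrative assumptions, not the paper's configuration, and the reported GFLOPS cover the full aggregation module, not just this step.

```python
# Rough FLOP count for the CS matrix multiply S = F̂ᵀ P̂,
# using the standard 2*m*k*n count (one multiply + one add per term).
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m, k) @ (k, n) matrix product."""
    return 2 * m * k * n

# Assumed (hypothetical) sizes: channel dim C and query count Nq.
C, Nq = 256, 256
flops = matmul_flops(Nq, C, Nq)   # F̂ᵀ is (Nq, C), P̂ is (C, Nq)

print(f"{flops / 1e6:.1f} MFLOPs")  # small next to a ViT backbone's cost
```

Under these assumed sizes the product costs tens of MFLOPs, which is why replacing iterative or softmax-based assignment with one matmul keeps the aggregation head light even with many queries.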

Qualitative Results

Attention maps for different query vectors across front-view and multi-view datasets.

Attention Maps for Different Query Vectors

Attention maps corresponding to different query vectors (Q_f^i, Q_f^j, Q_f^k) from the Feature Prediction model for MSLS Val (front-view) and Pitts250k / Tokyo24/7 (multi-view). Each pair shows the same location from different viewpoints. The learned queries exhibit diverse attention patterns, some focusing on distant structures and others on nearby roads, while maintaining consistency across viewpoints for the same landmarks.

BibTeX

@article{xiao2025query,
  title={Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition},
  author={Xiao, Jiuhong and Zhou, Yang and Loianno, Giuseppe},
  journal={arXiv preprint arXiv:2507.03831},
  year={2025}
}