QAA uses learned queries as an extensive reference codebook to enhance model capacity, replacing complex score-based aggregation with a single matrix multiplication for universal visual place recognition with multi-dataset training.
Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA's mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Code and models are publicly available.
Learned queries form an image-independent reference codebook via self-attention: a universal anchor that drives generalizable performance across diverse datasets in multi-dataset joint training.
CS replaces complex score-based aggregation (softmax, optimal transport) with a single matrix multiplication. Coding rate analysis confirms CS retains ~2× the information of conventional approaches.
Even with severely compressed image features (just 8 channels), the high-dimensional reference codebook compensates, actively carrying representational structure for robust performance.
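The coding-rate analysis mentioned above measures how much information a set of descriptors spans: descriptors spread across many directions score higher than descriptors collapsed onto a few. A minimal sketch of the standard rate-distortion coding-rate estimate (the distortion level `eps` and the toy data below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Rate-distortion estimate: 0.5 * logdet(I + d/(n*eps^2) * Z^T Z)."""
    n, d = Z.shape
    gram = (d / (n * eps ** 2)) * (Z.T @ Z)
    _, logdet = np.linalg.slogdet(np.eye(d) + gram)
    return 0.5 * logdet

rng = np.random.default_rng(0)

# Descriptors spread across many directions carry more information...
Z_rich = rng.standard_normal((1000, 64))
Z_rich /= np.linalg.norm(Z_rich, axis=1, keepdims=True)

# ...than descriptors collapsed onto a single direction.
Z_flat = np.tile(rng.standard_normal((1, 64)), (1000, 1))
Z_flat /= np.linalg.norm(Z_flat, axis=1, keepdims=True)

print(coding_rate(Z_rich) > coding_rate(Z_flat))  # True
```

A higher coding rate indicates descriptors that occupy more of the embedding space, which is the sense in which CS is said to retain more information than conventional aggregation.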
Top: The training framework for multi-dataset joint training. Images from multiple
datasets are processed by a shared DINOv2 backbone, and the resulting features are aggregated by QAA
into compact descriptors trained with Multi-Similarity Loss.
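Multi-Similarity Loss weighs hard positive and hard negative pairs through two log-sum-exp terms over pairwise descriptor similarities. A minimal sketch of the loss on L2-normalized descriptors (the hyperparameters `alpha`, `beta`, `lam` are common defaults from the metric-learning literature, not necessarily the paper's settings, and pair mining is omitted):

```python
import numpy as np

def multi_similarity_loss(desc, labels, alpha=2.0, beta=50.0, lam=0.5):
    """desc: (n, d) L2-normalized descriptors; labels: (n,) place IDs."""
    sim = desc @ desc.T          # pairwise cosine similarities
    n = len(labels)
    losses = []
    for i in range(n):
        pos = (labels == labels[i]) & (np.arange(n) != i)
        neg = labels != labels[i]
        # Soft penalty on positives that are not yet similar enough...
        lp = np.log1p(np.sum(np.exp(-alpha * (sim[i, pos] - lam)))) / alpha
        # ...and on negatives that are too similar.
        ln = np.log1p(np.sum(np.exp(beta * (sim[i, neg] - lam)))) / beta
        losses.append(lp + ln)
    return float(np.mean(losses))

rng = np.random.default_rng(0)
desc = rng.standard_normal((8, 16))
desc /= np.linalg.norm(desc, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # two views per place
loss = multi_similarity_loss(desc, labels)
```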
Bottom: The QAA architecture. Feature queries attend to backbone features via
cross-attention to produce query-level image features (P̄). Reference queries pass through
self-attention to form an independent codebook (F̄). The final descriptor is the Cross-query
Similarity matrix S = F̄ᵀP̄, followed by normalization.
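The pipeline in the caption can be sketched in a few lines of NumPy. The dimensions below are illustrative choices consistent with the reported configuration (256 queries, 8-channel compressed image features, Cd = 1024 × 8 = 8192), and the random matrices stand in for parameters that are learned in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Nq = 256    # number of queries
C  = 768    # backbone channel dim (DINOv2-B)
Cp = 8      # compressed image-feature channels
Cr = 1024   # reference-codebook channels (Cd = Cr * Cp = 8192)
Np = 1369   # number of backbone patch tokens (e.g. a 37x37 grid)

# Random stand-ins for learned parameters.
Qf = rng.standard_normal((Nq, C)) / np.sqrt(C)     # feature queries
Qr = rng.standard_normal((Nq, Cr)) / np.sqrt(Cr)   # reference queries
Wv = rng.standard_normal((C, Cp)) / np.sqrt(C)     # value projection to Cp

feats = rng.standard_normal((Np, C))               # backbone patch features

# Cross-attention: feature queries attend to backbone features -> P (Nq, Cp).
attn = softmax(Qf @ feats.T / np.sqrt(C), axis=-1)
P = attn @ (feats @ Wv)

# Self-attention over reference queries -> image-independent codebook F (Nq, Cr).
self_attn = softmax(Qr @ Qr.T / np.sqrt(Cr), axis=-1)
F = self_attn @ Qr

# Cross-query Similarity: one matrix multiplication, flatten, L2-normalize.
S = F.T @ P                        # (Cr, Cp)
desc = S.reshape(-1)
desc = desc / np.linalg.norm(desc)
print(desc.shape)                  # (8192,)
```

Note that F depends only on the learned reference queries, never on the input image, which is what makes the codebook a fixed anchor shared across all datasets at inference time.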
Recall@1 comparison with state-of-the-art VPR methods. All QAA results use DINOv2-B backbone. Reduced Cd variants (4096, 2048, 1024) demonstrate robust performance with smaller descriptors.
| Method | Backbone | Cd | AmsterTime | Eynsham | Pitts250k | Pitts30k | SPED | SF-XL v1 | SF-XL v2 | Tokyo24/7 |
|---|---|---|---|---|---|---|---|---|---|---|
| NetVLAD | VGG-16 | 4096 | 16.3 | 77.7 | 85.9 | 85.0 | - | 40.0 | 76.9 | 69.8 |
| MixVPR | ResNet-50 | 4096 | 40.2 | 89.4 | 94.2 | 91.5 | 85.2 | 71.1 | 88.5 | 85.1 |
| EigenPlace | ResNet-50 | 2048 | 48.9 | 90.7 | 94.1 | 92.5 | 82.4 | 84.1 | 90.8 | 93.0 |
| BoQ | DINOv2-B | 12288 | 63.0 | 92.2 | 96.6 | 93.7 | 92.5 | 91.8 | 95.2 | 98.1 |
| SALAD CM | DINOv2-B | 8448 | 58.1 | 91.9 | 95.2 | 92.6 | 89.1 | 85.6 | 94.6 | 96.8 |
| QAA (Ours) | DINOv2-B | 8192 | 63.7 | 92.9 | 96.6 | 94.4 | 91.8 | 94.4 | 94.6 | 98.4 |
| QAA (Ours) | DINOv2-B | 4096 | 61.8 | 92.9 | 96.3 | 93.8 | 91.1 | 94.2 | 94.0 | 97.8 |
| QAA (Ours) | DINOv2-B | 2048 | 61.5 | 92.7 | 96.4 | 94.0 | 91.1 | 94.0 | 94.1 | 96.5 |
| QAA (Ours) | DINOv2-B | 1024 | 59.8 | 92.5 | 96.3 | 93.9 | 90.8 | 92.4 | 94.5 | 97.1 |
| Method | Backbone | Cd | MSLS Val | MSLS Chall. | Nordland* | Nordland** | SVOX Night | SVOX Over. | SVOX Rain | SVOX Snow | SVOX Sun |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NetVLAD | VGG-16 | 4096 | 58.9 | - | - | 13.1 | 8.0 | 66.4 | 51.5 | 54.4 | 35.4 |
| MixVPR | ResNet-50 | 4096 | 88.0 | 64.0 | 58.4 | 76.2 | 64.4 | 96.2 | 91.5 | 96.8 | 84.8 |
| EigenPlace | ResNet-50 | 2048 | 89.2 | 67.4 | 54.2 | 71.2 | 58.9 | 93.1 | 90.0 | 93.1 | 86.4 |
| BoQ | DINOv2-B | 12288 | 93.8 | 79.0 | 81.3 | 90.6 | 97.7 | 98.5 | 98.8 | 99.4 | 97.5 |
| SALAD CM | DINOv2-B | 8448 | 94.2 | 82.7 | 90.7 | 95.2 | 95.6 | 98.5 | 98.4 | 99.2 | 98.1 |
| QAA (Ours) | DINOv2-B | 8192 | 97.6 | 85.7 | 91.8 | 96.7 | 97.2 | 98.4 | 98.4 | 99.1 | 97.3 |
| QAA (Ours) | DINOv2-B | 4096 | 98.1 | 84.8 | 91.6 | 96.6 | 97.2 | 98.5 | 97.9 | 99.0 | 98.2 |
| QAA (Ours) | DINOv2-B | 2048 | 97.8 | 84.2 | 91.4 | 95.6 | 96.4 | 98.4 | 97.5 | 98.3 | 97.1 |
| QAA (Ours) | DINOv2-B | 1024 | 97.7 | 82.1 | 88.3 | 92.2 | 95.0 | 98.6 | 97.7 | 99.0 | 95.9 |
Bold = best result. QAA maintains robust performance even at 4-8× smaller descriptor dimensions.
Parameter and compute comparison of aggregation modules:

| Method | Params | GFLOPS |
|---|---|---|
| SALAD | 1.4M | 0.94 |
| BoQ (64 queries) | 8.6M | 8.22 |
| QAA (256 queries) | 5.1M | 2.29 |
Despite using 4× more queries than BoQ, QAA requires 3.6× fewer GFLOPS and 40% fewer parameters.
Attention maps for different query vectors across front-view and multi-view datasets.
Attention maps corresponding to different query vectors (Qfi, Qfj, Qfk) from the Feature Prediction model for MSLS Val (front-view) and Pitts250k / Tokyo24/7 (multi-view). Each pair shows the same location from different viewpoints. The learned queries exhibit diverse attention patterns (some focus on distant structures, others on nearby roads) while maintaining consistency across viewpoints for the same landmarks.
@article{xiao2025query,
title={Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition},
author={Xiao, Jiuhong and Zhou, Yang and Loianno, Giuseppe},
journal={arXiv preprint arXiv:2507.03831},
year={2025}
}