QAA uses learned queries as an extensive reference codebook to enhance model capacity, replacing complex score-based aggregation with a single matrix multiplication for universal visual place recognition with multi-dataset training.
Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA's mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Code and models are publicly available.
Learned queries form an image-independent reference codebook via self-attention: a universal anchor that drives generalizable performance across diverse datasets in multi-dataset joint training.
CS replaces complex score-based aggregation (softmax, optimal transport) with a single matrix multiplication. Coding rate analysis confirms CS retains ~2× the information of conventional approaches.
Even with severely compressed image features (just 8 channels), the high-dimensional reference codebook compensates, actively carrying representational structure for robust performance.
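The coding-rate analysis mentioned above measures how much information a set of descriptors spans: descriptors spread across many directions score higher than descriptors collapsed onto a few. A minimal sketch of the standard rate-distortion coding-rate estimate (the distortion level `eps` and the toy data below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Rate-distortion estimate: 0.5 * logdet(I + d/(n*eps^2) * Z^T Z)."""
    n, d = Z.shape
    gram = (d / (n * eps ** 2)) * (Z.T @ Z)
    _, logdet = np.linalg.slogdet(np.eye(d) + gram)
    return 0.5 * logdet

rng = np.random.default_rng(0)

# Descriptors spread across many directions carry more information...
Z_rich = rng.standard_normal((1000, 64))
Z_rich /= np.linalg.norm(Z_rich, axis=1, keepdims=True)

# ...than descriptors collapsed onto a single direction.
Z_flat = np.tile(rng.standard_normal((1, 64)), (1000, 1))
Z_flat /= np.linalg.norm(Z_flat, axis=1, keepdims=True)

print(coding_rate(Z_rich) > coding_rate(Z_flat))  # True
```

A higher coding rate indicates descriptors that occupy more of the embedding space, which is the sense in which CS is said to retain more information than conventional aggregation.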
Top: The training framework for multi-dataset joint training. Images from multiple
datasets are processed by a shared DINOv2 backbone, and the resulting features are aggregated by QAA
into compact descriptors trained with Multi-Similarity Loss.
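Multi-Similarity Loss weighs hard positive and hard negative pairs through two log-sum-exp terms over pairwise descriptor similarities. A minimal sketch of the loss on L2-normalized descriptors (the hyperparameters `alpha`, `beta`, `lam` are common defaults from the metric-learning literature, not necessarily the paper's settings, and pair mining is omitted):

```python
import numpy as np

def multi_similarity_loss(desc, labels, alpha=2.0, beta=50.0, lam=0.5):
    """desc: (n, d) L2-normalized descriptors; labels: (n,) place IDs."""
    sim = desc @ desc.T          # pairwise cosine similarities
    n = len(labels)
    losses = []
    for i in range(n):
        pos = (labels == labels[i]) & (np.arange(n) != i)
        neg = labels != labels[i]
        # Soft penalty on positives that are not yet similar enough...
        lp = np.log1p(np.sum(np.exp(-alpha * (sim[i, pos] - lam)))) / alpha
        # ...and on negatives that are too similar.
        ln = np.log1p(np.sum(np.exp(beta * (sim[i, neg] - lam)))) / beta
        losses.append(lp + ln)
    return float(np.mean(losses))

rng = np.random.default_rng(0)
desc = rng.standard_normal((8, 16))
desc /= np.linalg.norm(desc, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])   # two views per place
loss = multi_similarity_loss(desc, labels)
```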
Bottom: The QAA architecture. Feature queries attend to backbone features via
cross-attention to produce query-level image features (P̄). Reference queries pass through
self-attention to form an independent codebook (F̄). The final descriptor is the Cross-query
Similarity matrix S = F̄ᵀP̄, followed by normalization.
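The pipeline in the caption can be sketched in a few lines of NumPy. The dimensions below are illustrative choices consistent with the reported configuration (256 queries, 8-channel compressed image features, Cd = 1024 × 8 = 8192), and the random matrices stand in for parameters that are learned in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Nq = 256    # number of queries
C  = 768    # backbone channel dim (DINOv2-B)
Cp = 8      # compressed image-feature channels
Cr = 1024   # reference-codebook channels (Cd = Cr * Cp = 8192)
Np = 1369   # number of backbone patch tokens (e.g. a 37x37 grid)

# Random stand-ins for learned parameters.
Qf = rng.standard_normal((Nq, C)) / np.sqrt(C)     # feature queries
Qr = rng.standard_normal((Nq, Cr)) / np.sqrt(Cr)   # reference queries
Wv = rng.standard_normal((C, Cp)) / np.sqrt(C)     # value projection to Cp

feats = rng.standard_normal((Np, C))               # backbone patch features

# Cross-attention: feature queries attend to backbone features -> P (Nq, Cp).
attn = softmax(Qf @ feats.T / np.sqrt(C), axis=-1)
P = attn @ (feats @ Wv)

# Self-attention over reference queries -> image-independent codebook F (Nq, Cr).
self_attn = softmax(Qr @ Qr.T / np.sqrt(Cr), axis=-1)
F = self_attn @ Qr

# Cross-query Similarity: one matrix multiplication, flatten, L2-normalize.
S = F.T @ P                        # (Cr, Cp)
desc = S.reshape(-1)
desc = desc / np.linalg.norm(desc)
print(desc.shape)                  # (8192,)
```

Note that F depends only on the learned reference queries, never on the input image, which is what makes the codebook a fixed anchor shared across all datasets at inference time.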
Recall@1 comparison with state-of-the-art VPR methods. All QAA results use DINOv2-B backbone. Reduced Cd variants (4096, 2048, 1024) demonstrate robust performance with smaller descriptors.
| Method | Backbone | Cd | AmsterTime | Eynsham | Pitts250k | Pitts30k | SPED | SF-XL v1 | SF-XL v2 | Tokyo24/7 |
|---|---|---|---|---|---|---|---|---|---|---|
| NetVLAD | VGG-16 | 4096 | 16.3 | 77.7 | 85.9 | 85.0 | - | 40.0 | 76.9 | 69.8 |
| MixVPR | ResNet-50 | 4096 | 40.2 | 89.4 | 94.2 | 91.5 | 85.2 | 71.1 | 88.5 | 85.1 |
| EigenPlace | ResNet-50 | 2048 | 48.9 | 90.7 | 94.1 | 92.5 | 82.4 | 84.1 | 90.8 | 93.0 |
| BoQ | DINOv2-B | 12288 | 63.0 | 92.2 | 96.6 | 93.7 | 92.5 | 91.8 | 95.2 | 98.1 |
| SALAD CM | DINOv2-B | 8448 | 58.1 | 91.9 | 95.2 | 92.6 | 89.1 | 85.6 | 94.6 | 96.8 |
| QAA (Ours) | DINOv2-B | 8192 | 63.7 | 92.9 | 96.6 | 94.4 | 91.8 | 94.4 | 94.6 | 98.4 |
| QAA (Ours) | DINOv2-B | 4096 | 61.8 | 92.9 | 96.3 | 93.8 | 91.1 | 94.2 | 94.0 | 97.8 |
| QAA (Ours) | DINOv2-B | 2048 | 61.5 | 92.7 | 96.4 | 94.0 | 91.1 | 94.0 | 94.1 | 96.5 |
| QAA (Ours) | DINOv2-B | 1024 | 59.8 | 92.5 | 96.3 | 93.9 | 90.8 | 92.4 | 94.5 | 97.1 |
| Method | Backbone | Cd | MSLS Val | MSLS Chall. | Nordland* | Nordland** | SVOX Night | SVOX Over. | SVOX Rain | SVOX Snow | SVOX Sun |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NetVLAD | VGG-16 | 4096 | 58.9 | - | - | 13.1 | 8.0 | 66.4 | 51.5 | 54.4 | 35.4 |
| MixVPR | ResNet-50 | 4096 | 88.0 | 64.0 | 58.4 | 76.2 | 64.4 | 96.2 | 91.5 | 96.8 | 84.8 |
| EigenPlace | ResNet-50 | 2048 | 89.2 | 67.4 | 54.2 | 71.2 | 58.9 | 93.1 | 90.0 | 93.1 | 86.4 |
| BoQ | DINOv2-B | 12288 | 93.8 | 79.0 | 81.3 | 90.6 | 97.7 | 98.5 | 98.8 | 99.4 | 97.5 |
| SALAD CM | DINOv2-B | 8448 | 94.2 | 82.7 | 90.7 | 95.2 | 95.6 | 98.5 | 98.4 | 99.2 | 98.1 |
| QAA (Ours) | DINOv2-B | 8192 | 97.6 | 85.7 | 91.8 | 96.7 | 97.2 | 98.4 | 98.4 | 99.1 | 97.3 |
| QAA (Ours) | DINOv2-B | 4096 | 98.1 | 84.8 | 91.6 | 96.6 | 97.2 | 98.5 | 97.9 | 99.0 | 98.2 |
| QAA (Ours) | DINOv2-B | 2048 | 97.8 | 84.2 | 91.4 | 95.6 | 96.4 | 98.4 | 97.5 | 98.3 | 97.1 |
| QAA (Ours) | DINOv2-B | 1024 | 97.7 | 82.1 | 88.3 | 92.2 | 95.0 | 98.6 | 97.7 | 99.0 | 95.9 |
Bold = best result. QAA maintains robust performance even at 4-8× smaller descriptor dimensions.
Parameter and compute comparison of aggregation modules:

| Method | Params | GFLOPS |
|---|---|---|
| SALAD | 1.4M | 0.94 |
| BoQ (64 queries) | 8.6M | 8.22 |
| QAA (256 queries) | 5.1M | 2.29 |
Despite using 4× more queries than BoQ, QAA requires 3.6× fewer GFLOPS and 40% fewer parameters.
Attention maps for different query vectors across front-view and multi-view datasets.
Attention maps corresponding to different query vectors (Qfi, Qfj, Qfk) from the Feature Prediction model for MSLS Val (front-view) and Pitts250k / Tokyo24/7 (multi-view). Each pair shows the same location from different viewpoints. The learned queries exhibit diverse attention patterns (some focus on distant structures, others on nearby roads) while maintaining consistency across viewpoints for the same landmarks.
@article{xiao2025query,
title={Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition},
author={Xiao, Jiuhong and Zhou, Yang and Loianno, Giuseppe},
journal={arXiv preprint arXiv:2507.03831},
year={2025}
}