ThermalGen is a style-disentangled, flow-based generative model for RGB-to-thermal image translation that synthesizes diverse thermal images from RGB data, supported by three newly collected satellite-aerial RGB-thermal paired datasets.
Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets—DJI-day, BosonPlus-day, and BosonPlus-night—captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions.
We present RGB inputs alongside generated thermal images using ThermalGen across a range of challenging variations, including viewpoint variation, day-night change, sensor variation, and environmental change. Variations are illustrated between the two rows in each group.
In addition to our previously released Boson-night dataset, we release three new datasets: DJI-day, Bosonplus-day, and Bosonplus-night. The datasets differ in thermal sensor type, lighting conditions, and geographic region. Columns show paired satellite RGB and corresponding 8-bit thermal images.
ThermalGen is built on a flow-based generative paradigm with a Scalable Interpolant Transformer (SiT) backbone. It models the conditional distribution of thermal images given an RGB image and a dataset-specific style embedding. The style-disentangled mechanism uses learnable embeddings that capture diverse RGB-T mapping relationships influenced by thermal sensor characteristics, camera viewpoints, and environmental conditions. Classifier-Free Guidance (CFG) is supported through an unconditional style embedding, enabling flexible control over the generation style at inference time.
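The style-conditioned guidance described above can be sketched in a few lines. This is a minimal illustration under our own naming (not the authors' code): a learnable embedding per dataset style plus a learned "null" style serving as the unconditional branch, combined via classifier-free guidance over the predicted velocity field.

```python
import torch
import torch.nn as nn

class StyleCFG(nn.Module):
    """Sketch of style-conditioned classifier-free guidance for a
    flow-matching model: one learnable embedding per dataset style,
    plus a learned null style used as the unconditional branch."""

    def __init__(self, velocity_net: nn.Module, num_styles: int, dim: int):
        super().__init__()
        self.velocity_net = velocity_net       # stands in for the SiT backbone
        self.style_emb = nn.Embedding(num_styles + 1, dim)
        self.null_id = num_styles              # index of the unconditional style

    def forward(self, x_t, t, rgb, style_id, cfg_scale=1.0):
        v_cond = self.velocity_net(x_t, t, rgb, self.style_emb(style_id))
        if cfg_scale == 1.0:
            return v_cond
        null = torch.full_like(style_id, self.null_id)
        v_uncond = self.velocity_net(x_t, t, rgb, self.style_emb(null))
        # CFG: extrapolate the velocity away from the style-unconditional branch
        return v_uncond + cfg_scale * (v_cond - v_uncond)

# Toy backbone so the sketch runs end to end; the real model is a SiT.
class ToyVelocity(nn.Module):
    def forward(self, x_t, t, rgb, style):
        return x_t + rgb + style               # all shapes: (B, dim)

torch.manual_seed(0)
model = StyleCFG(ToyVelocity(), num_styles=3, dim=8)
x = torch.randn(2, 8); t = torch.rand(2); rgb = torch.randn(2, 8)
sid = torch.tensor([0, 1])
v_plain = model(x, t, rgb, sid)                 # cfg_scale = 1: conditional only
v_guided = model(x, t, rgb, sid, cfg_scale=2.5) # stronger style adherence
```

Raising `cfg_scale` above 1 pushes samples toward the selected style; this is the knob tuned at inference to modulate the appearance of generated thermal maps.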
Comparison with GAN-based and diffusion-based baselines across nine RGB-T benchmarks. All ThermalGen results use a single jointly-trained model.
| Method | Category | Boson-night PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ | Bosonplus-day PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ | Bosonplus-night PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pix2Pix | Paired GAN | 23.71 | 0.79 | 149.55 | 0.31 | 14.04 | 0.30 | 170.45 | 0.44 | 19.93 | 0.70 | 137.74 | 0.40 |
| CycleGAN | Unpaired GAN | 17.27 | 0.50 | 119.62 | 0.42 | 12.62 | 0.21 | 279.16 | 0.52 | 11.36 | 0.47 | 105.36 | 0.48 |
| Pix2PixHD | Paired GAN | 21.46 | 0.75 | 106.33 | 0.26 | 12.85 | 0.21 | 157.65 | 0.43 | 16.79 | 0.71 | 89.26 | 0.35 |
| VQGAN | Paired GAN | 24.55 | 0.81 | 207.12 | 0.29 | 14.10 | 0.28 | 185.41 | 0.46 | 18.49 | 0.76 | 286.74 | 0.33 |
| DDIM | Paired Diffusion | 18.31 | 0.72 | 203.05 | 0.50 | 12.50 | 0.20 | 261.03 | 0.71 | 15.22 | 0.72 | 112.38 | 0.49 |
| BBDM | Paired Diffusion | 17.85 | 0.62 | 141.27 | 0.37 | 12.42 | 0.18 | 137.68 | 0.46 | 13.88 | 0.62 | 101.08 | 0.43 |
| DiffV2IR | Paired Diffusion | 15.47 | 0.50 | 150.11 | 0.47 | 11.01 | 0.17 | 215.20 | 0.59 | 13.76 | 0.55 | 96.42 | 0.50 |
| ThermalGen-L/2 | Paired Flow | 21.88 | 0.71 | 161.22 | 0.32 | 14.66 | 0.31 | 76.91 | 0.35 | 20.47 | 0.76 | 75.80 | 0.34 |
| Method | Category | LLVIP PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ | NII-CU PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ | AVIID PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pix2Pix | Paired GAN | 12.09 | 0.37 | 326.14 | 0.53 | 17.31 | 0.81 | 168.77 | 0.37 | 21.41 | 0.61 | 146.26 | 0.30 |
| CycleGAN | Unpaired GAN | 10.39 | 0.24 | 227.15 | 0.61 | 16.43 | 0.77 | 125.37 | 0.37 | 15.54 | 0.46 | 91.37 | 0.32 |
| Pix2PixHD | Paired GAN | 11.51 | 0.33 | 281.89 | 0.51 | 19.46 | 0.80 | 118.60 | 0.32 | 20.01 | 0.56 | 127.63 | 0.27 |
| VQGAN | Paired GAN | 11.75 | 0.36 | 273.06 | 0.58 | 15.53 | 0.81 | 173.37 | 0.37 | 21.71 | 0.60 | 96.46 | 0.23 |
| DDIM | Paired Diffusion | 10.94 | 0.41 | 297.26 | 0.71 | 17.79 | 0.77 | 180.14 | 0.48 | 10.96 | 0.36 | 290.46 | 0.76 |
| BBDM | Paired Diffusion | 9.98 | 0.22 | 313.54 | 0.67 | 14.36 | 0.71 | 118.59 | 0.42 | 18.53 | 0.49 | 141.06 | 0.31 |
| DiffV2IR† | Paired Diffusion | 22.17 | 0.77 | 50.10 | 0.11 | 12.28 | 0.63 | 159.99 | 0.50 | 18.17 | 0.53 | 51.61 | 0.21 |
| ThermalGen-L/2 | Paired Flow | 11.12 | 0.34 | 238.60 | 0.51 | 26.44 | 0.92 | 69.30 | 0.21 | 24.89 | 0.75 | 29.05 | 0.13 |
† DiffV2IR uses the LLVIP test set for training.
| Method | Category | M³FD PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ | MSRS PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ | FLIR PSNR↑ | SSIM↑ | FID↓ | LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pix2Pix | Paired GAN | 21.10 | 0.72 | 127.62 | 0.34 | 21.78 | 0.68 | 174.81 | 0.37 | 17.13 | 0.54 | 224.11 | 0.44 |
| CycleGAN | Unpaired GAN | 11.28 | 0.42 | 171.37 | 0.56 | 11.42 | 0.35 | 99.14 | 0.53 | 11.15 | 0.32 | 137.97 | 0.49 |
| Pix2PixHD | Paired GAN | 19.37 | 0.67 | 112.31 | 0.28 | 18.21 | 0.61 | 121.05 | 0.35 | 15.63 | 0.49 | 164.53 | 0.37 |
| VQGAN | Paired GAN | 21.02 | 0.71 | 79.21 | 0.26 | 22.27 | 0.69 | 106.51 | 0.38 | 16.95 | 0.50 | 141.00 | 0.40 |
| DDIM | Paired Diffusion | 11.15 | 0.45 | 229.24 | 0.70 | 7.35 | 0.21 | 262.93 | 0.79 | 11.24 | 0.39 | 296.81 | 0.71 |
| BBDM | Paired Diffusion | 17.26 | 0.61 | 120.79 | 0.37 | 20.27 | 0.62 | 145.86 | 0.37 | 16.10 | 0.45 | 177.81 | 0.42 |
| DiffV2IR* | Paired Diffusion | 22.76 | 0.79 | 37.74 | 0.13 | 10.16 | 0.30 | 104.01 | 0.57 | 11.29 | 0.44 | 106.12 | 0.45 |
| DiffV2IR+ | Paired Diffusion | 12.68 | 0.45 | 78.45 | 0.43 | 6.80 | 0.17 | 113.26 | 0.64 | 22.39 | 0.57 | 37.83 | 0.18 |
| ThermalGen-L/2 | Paired Flow | 23.73 | 0.81 | 35.82 | 0.14 | 24.38 | 0.76 | 52.31 | 0.21 | 17.10 | 0.52 | 70.09 | 0.33 |
* Fine-tuned on M³FD; + Fine-tuned on FLIR.
In the rendered tables, blue bold marks the best result, underlining the second/third best, and the highlighted row is our method (a single jointly-trained model).
Visual comparisons across ground, aerial, and satellite-aerial datasets. GAN-based methods produce distorted or grid-artifact-laden outputs, while DiffV2IR tends to generate excessively sharp boundaries. ThermalGen produces high-fidelity thermal images accurately matching ground truth distributions across diverse conditions.
Freiburg Day (Ground)
Freiburg Night (Ground)
M³FD (Ground)
MSRS (Ground)
AVIID (Aerial)
NII-CU (Aerial)
Bosonplus-day (Satellite-Aerial)
Bosonplus-night (Satellite-Aerial)
t-SNE visualizations of DINOv2 features demonstrate that ThermalGen's style embeddings effectively capture distinct RGB-T mapping relationships across different conditions. The generated thermal images cluster closely with their corresponding ground truth distributions, confirming that the style-disentangled mechanism successfully encodes variations in viewpoints, sensor types, day-night conditions, and lighting environments.
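The clustering check above can be reproduced in spirit with a short sketch. Here random vectors stand in for DINOv2 features (`make_features` and the two synthetic "conditions" are our illustrative stand-ins, not the paper's data); the point is the 2-D t-SNE embedding step used for the visualization.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for DINOv2 features: two conditions (e.g. day vs. night), with
# generated samples drawn near their ground-truth cluster. Purely synthetic.
def make_features(center, n=15, dim=16):
    return center + 0.1 * rng.standard_normal((n, dim))

day_center = rng.standard_normal(16)
night_center = rng.standard_normal(16)
feats = np.concatenate([
    make_features(day_center),    # ground-truth day
    make_features(day_center),    # generated day
    make_features(night_center),  # ground-truth night
    make_features(night_center),  # generated night
])

# 2-D t-SNE embedding, as used for the qualitative visualization;
# same-condition generated/ground-truth points should land near each other.
emb = TSNE(n_components=2, perplexity=10, init="pca",
           random_state=0).fit_transform(feats)
```

With real features one would color the 2-D points by condition and by generated-vs-ground-truth status to check that the style embeddings separate the mapping relationships.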
Our model generates thermal maps from satellite imagery. Thanks to the style-disentangled generative framework, the CFG scale can be tuned to modulate the style and appearance of the synthesized thermal maps.
@inproceedings{xiao2025thermalgen,
title={ThermalGen: Style-Disentangled Flow-Based Generative Models for {RGB}-to-Thermal Image Translation},
author={Jiuhong Xiao and Roshan Nayak and Ning Zhang and Daniel Toertei and Giuseppe Loianno},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=o0JSYq1TQ4}
}