ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation

Advances in Neural Information Processing Systems (NeurIPS) 2025

1New York University, 2Technology Innovation Institute, 3University of California, Berkeley

TL;DR

ThermalGen is a style-disentangled flow-based generative model for RGB-to-thermal image translation that synthesizes diverse thermal images from RGB data, supported by three newly collected satellite-thermal paired datasets.

Abstract

Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets—DJI-day, BosonPlus-day, and BosonPlus-night—captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions.

ThermalGen exhibits robust performance under diverse conditions

We present RGB inputs alongside thermal images generated by ThermalGen under a range of challenging variations, including viewpoint variation, day-night change, sensor variation, and environmental change. The two rows in each group illustrate the corresponding variation.


Satellite-Thermal Paired Datasets

In addition to our previous Boson-night dataset, we release three new datasets: DJI-day, Bosonplus-day, and Bosonplus-night. The datasets differ in thermal sensor, lighting conditions, and geography. Columns show paired satellite RGB and corresponding 8-bit thermal images.

Datasets (top to bottom): Boson-night, DJI-day, Bosonplus-day, Bosonplus-night.

Method Overview

ThermalGen is built on a flow-based generative paradigm with a Scalable Interpolant Transformer (SiT) backbone. It models the conditional distribution of thermal images given an RGB image and a dataset-specific style embedding. The style-disentangled mechanism uses learnable embeddings that capture diverse RGB-T mapping relationships influenced by thermal sensor characteristics, camera viewpoints, and environmental conditions. Classifier-Free Guidance (CFG) is supported through an unconditional style embedding, enabling flexible control over the generation style at inference time.

ThermalGen Architecture Overview
Overview of ThermalGen. Paired RGB-T data from satellite-aerial, aerial, and ground datasets are used for training. The generative model predicts the velocity for the thermal latent, conditioned on timestep t, dataset-specific style embedding y, and RGB latent zRGB. After T steps of denoising, the thermal decoder generates the final thermal image.
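The training objective described above can be sketched as conditional flow matching on a linear interpolant between noise and the thermal latent. The sketch below is illustrative only: the latent shapes, the style-embedding dimension, and the placeholder `model` function are assumptions, not the actual SiT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shapes (batch, channels, height, width)
z_thermal = rng.standard_normal((2, 4, 32, 32))   # target thermal latent
z_rgb     = rng.standard_normal((2, 4, 32, 32))   # conditioning RGB latent
noise     = rng.standard_normal(z_thermal.shape)  # x_0 ~ N(0, I)
t         = rng.uniform(size=(2, 1, 1, 1))        # per-sample timestep in [0, 1]
y_style   = rng.standard_normal((2, 64))          # dataset-specific style embedding (learnable in practice)

# Linear interpolant between noise and data
x_t = (1.0 - t) * noise + t * z_thermal
# Velocity of the linear path, d(x_t)/dt, used as the regression target
v_target = z_thermal - noise

def model(x_t, t, y, z_rgb):
    # Placeholder for the SiT backbone: any function of (x_t, t, style y, RGB latent)
    return x_t - z_rgb

# Flow-matching loss: mean squared error between predicted and target velocity
loss = np.mean((model(x_t, t, y_style, z_rgb) - v_target) ** 2)
```

At inference, the learned velocity field is integrated over T steps from pure noise to a thermal latent, which the thermal decoder then maps to an image.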

Quantitative Results

Comparison with GAN-based and diffusion-based baselines across nine RGB-T benchmarks. All ThermalGen results use a single jointly-trained model.

Each dataset cell reports PSNR↑ / SSIM↑ / FID↓ / LPIPS↓.

| Method | Category | Boson-night | Bosonplus-day | Bosonplus-night |
|---|---|---|---|---|
| Pix2Pix | Paired GAN | 23.71 / 0.79 / 149.55 / 0.31 | 14.04 / 0.30 / 170.45 / 0.44 | 19.93 / 0.70 / 137.74 / 0.40 |
| CycleGAN | Unpaired GAN | 17.27 / 0.50 / 119.62 / 0.42 | 12.62 / 0.21 / 279.16 / 0.52 | 11.36 / 0.47 / 105.36 / 0.48 |
| Pix2PixHD | Paired GAN | 21.46 / 0.75 / 106.33 / 0.26 | 12.85 / 0.21 / 157.65 / 0.43 | 16.79 / 0.71 / 89.26 / 0.35 |
| VQGAN | Paired GAN | 24.55 / 0.81 / 207.12 / 0.29 | 14.10 / 0.28 / 185.41 / 0.46 | 18.49 / 0.76 / 286.74 / 0.33 |
| DDIM | Paired Diffusion | 18.31 / 0.72 / 203.05 / 0.50 | 12.50 / 0.20 / 261.03 / 0.71 | 15.22 / 0.72 / 112.38 / 0.49 |
| BBDM | Paired Diffusion | 17.85 / 0.62 / 141.27 / 0.37 | 12.42 / 0.18 / 137.68 / 0.46 | 13.88 / 0.62 / 101.08 / 0.43 |
| DiffV2IR | Paired Diffusion | 15.47 / 0.50 / 150.11 / 0.47 | 11.01 / 0.17 / 215.20 / 0.59 | 13.76 / 0.55 / 96.42 / 0.50 |
| ThermalGen-L/2 | Paired Flow | 21.88 / 0.71 / 161.22 / 0.32 | 14.66 / 0.31 / 76.91 / 0.35 | 20.47 / 0.76 / 75.80 / 0.34 |
Each dataset cell reports PSNR↑ / SSIM↑ / FID↓ / LPIPS↓.

| Method | Category | LLVIP | NII-CU | AVIID |
|---|---|---|---|---|
| Pix2Pix | Paired GAN | 12.09 / 0.37 / 326.14 / 0.53 | 17.31 / 0.81 / 168.77 / 0.37 | 21.41 / 0.61 / 146.26 / 0.30 |
| CycleGAN | Unpaired GAN | 10.39 / 0.24 / 227.15 / 0.61 | 16.43 / 0.77 / 125.37 / 0.37 | 15.54 / 0.46 / 91.37 / 0.32 |
| Pix2PixHD | Paired GAN | 11.51 / 0.33 / 281.89 / 0.51 | 19.46 / 0.80 / 118.60 / 0.32 | 20.01 / 0.56 / 127.63 / 0.27 |
| VQGAN | Paired GAN | 11.75 / 0.36 / 273.06 / 0.58 | 15.53 / 0.81 / 173.37 / 0.37 | 21.71 / 0.60 / 96.46 / 0.23 |
| DDIM | Paired Diffusion | 10.94 / 0.41 / 297.26 / 0.71 | 17.79 / 0.77 / 180.14 / 0.48 | 10.96 / 0.36 / 290.46 / 0.76 |
| BBDM | Paired Diffusion | 9.98 / 0.22 / 313.54 / 0.67 | 14.36 / 0.71 / 118.59 / 0.42 | 18.53 / 0.49 / 141.06 / 0.31 |
| DiffV2IR† | Paired Diffusion | 22.17 / 0.77 / 50.10 / 0.11 | 12.28 / 0.63 / 159.99 / 0.50 | 18.17 / 0.53 / 51.61 / 0.21 |
| ThermalGen-L/2 | Paired Flow | 11.12 / 0.34 / 238.60 / 0.51 | 26.44 / 0.92 / 69.30 / 0.21 | 24.89 / 0.75 / 29.05 / 0.13 |

† DiffV2IR uses LLVIP test set for training.

Each dataset cell reports PSNR↑ / SSIM↑ / FID↓ / LPIPS↓.

| Method | Category | M³FD | MSRS | FLIR |
|---|---|---|---|---|
| Pix2Pix | Paired GAN | 21.10 / 0.72 / 127.62 / 0.34 | 21.78 / 0.68 / 174.81 / 0.37 | 17.13 / 0.54 / 224.11 / 0.44 |
| CycleGAN | Unpaired GAN | 11.28 / 0.42 / 171.37 / 0.56 | 11.42 / 0.35 / 99.14 / 0.53 | 11.15 / 0.32 / 137.97 / 0.49 |
| Pix2PixHD | Paired GAN | 19.37 / 0.67 / 112.31 / 0.28 | 18.21 / 0.61 / 121.05 / 0.35 | 15.63 / 0.49 / 164.53 / 0.37 |
| VQGAN | Paired GAN | 21.02 / 0.71 / 79.21 / 0.26 | 22.27 / 0.69 / 106.51 / 0.38 | 16.95 / 0.50 / 141.00 / 0.40 |
| DDIM | Paired Diffusion | 11.15 / 0.45 / 229.24 / 0.70 | 7.35 / 0.21 / 262.93 / 0.79 | 11.24 / 0.39 / 296.81 / 0.71 |
| BBDM | Paired Diffusion | 17.26 / 0.61 / 120.79 / 0.37 | 20.27 / 0.62 / 145.86 / 0.37 | 16.10 / 0.45 / 177.81 / 0.42 |
| DiffV2IR* | Paired Diffusion | 22.76 / 0.79 / 37.74 / 0.13 | 10.16 / 0.30 / 104.01 / 0.57 | 11.29 / 0.44 / 106.12 / 0.45 |
| DiffV2IR+ | Paired Diffusion | 12.68 / 0.45 / 78.45 / 0.43 | 6.80 / 0.17 / 113.26 / 0.64 | 22.39 / 0.57 / 37.83 / 0.18 |
| ThermalGen-L/2 | Paired Flow | 23.73 / 0.81 / 35.82 / 0.14 | 24.38 / 0.76 / 52.31 / 0.21 | 17.10 / 0.52 / 70.09 / 0.33 |

* Fine-tuned on M³FD; + Fine-tuned on FLIR.

Blue bold = best result. Underlined = second/third best. Highlighted row = our method (single jointly-trained model).
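For reference, PSNR (reported in the tables above) can be computed with a few lines of numpy; the `psnr` helper below is a hypothetical minimal implementation, not the paper's evaluation code.

```python
import numpy as np

def psnr(pred, target, data_range=255.0):
    """Peak signal-to-noise ratio in dB over the given dynamic range."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```

Higher PSNR indicates closer pixel-wise agreement with the ground-truth thermal image; FID and LPIPS instead compare feature distributions and are computed with pretrained networks.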

Qualitative Comparison

Visual comparisons across ground, aerial, and satellite-aerial datasets. GAN-based methods produce distorted or grid-artifact-laden outputs, while DiffV2IR tends to generate excessively sharp boundaries. ThermalGen produces high-fidelity thermal images accurately matching ground truth distributions across diverse conditions.

Columns (left to right): Input, Pix2Pix, CycleGAN, Pix2PixHD, VQGAN, DiffV2IR, ThermalGen, GT.
Rows (top to bottom): Freiburg Day (Ground), Freiburg Night (Ground), M³FD (Ground), MSRS (Ground), AVIID (Aerial), NII-CU (Aerial), Bosonplus-day (Satellite-Aerial), Bosonplus-night (Satellite-Aerial).

Style Disentanglement Analysis

t-SNE visualizations of DINOv2 features demonstrate that ThermalGen's style embeddings effectively capture distinct RGB-T mapping relationships across different conditions. The generated thermal images cluster closely with their corresponding ground truth distributions, confirming that the style-disentangled mechanism successfully encodes variations in viewpoints, sensor types, day-night conditions, and lighting environments.
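The visualization pipeline can be sketched as follows, assuming scikit-learn's t-SNE; the random vectors below stand in for real DINOv2 features of generated and ground-truth thermal images, and the feature dimension (384) is only illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-ins for DINOv2 features (the paper extracts real embeddings)
feat_gen = rng.standard_normal((60, 384))  # "generated" images
feat_gt  = rng.standard_normal((60, 384))  # "ground truth" images
feats = np.vstack([feat_gen, feat_gt])

# Project to 2D for plotting; points from matching conditions should cluster together
emb = TSNE(n_components=2, perplexity=15, init="pca", random_state=0).fit_transform(feats)
```

Coloring the 2D points by dataset, sensor, or day/night condition then reveals whether the style embeddings separate the different RGB-T mapping regimes.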

t-SNE panels: Day vs. Night, Sensor Variation, Viewpoint Variation, LLVIP Train/Test Distribution.

Thermal Map Generation Across CFG Scales

Our model generates thermal maps from satellite imagery. By leveraging the style-disentangled generative framework, the CFG scale can be tuned to modulate the style and appearance of the synthesized thermal maps.
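A minimal sketch of the guidance rule, assuming the standard CFG extrapolation applied to predicted velocities (variable names are illustrative): the conditional velocity uses the dataset-specific style embedding, the unconditional one uses the learned unconditional embedding.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    velocity toward the conditional one by the guidance scale."""
    return v_uncond + scale * (v_cond - v_uncond)

v_cond = np.array([1.0, 2.0])    # velocity predicted with the style embedding
v_uncond = np.array([0.0, 0.5])  # velocity predicted with the unconditional embedding
guided = cfg_velocity(v_cond, v_uncond, 4.0)  # larger scale = stronger style pull
```

With scale = 1.0 the guided velocity equals the conditional prediction (no extra guidance), which is why the leftmost generated panel is labeled "no CFG"; increasing the scale amplifies the style signal.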

Panels: Satellite Map, followed by thermal maps generated with CFG = 1.0 (no guidance), CFG = 2.0, CFG = 4.0, CFG = 8.0, and CFG = 16.0.

BibTeX


@inproceedings{xiao2025thermalgen,
  title={ThermalGen: Style-Disentangled Flow-Based Generative Models for {RGB}-to-Thermal Image Translation},
  author={Jiuhong Xiao and Roshan Nayak and Ning Zhang and Daniel Toertei and Giuseppe Loianno},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=o0JSYq1TQ4}
}