ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation

Advances in Neural Information Processing Systems (NeurIPS) 2025

1New York University, 2Technology Innovation Institute, 3University of California, Berkeley

TL;DR

ThermalGen is a style-disentangled flow-based generative model for RGB-to-thermal image translation that synthesizes diverse thermal images from RGB data, supported by three newly collected satellite-thermal paired datasets.

Abstract

Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets—DJI-day, BosonPlus-day, and BosonPlus-night—captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions.

ThermalGen exhibits robust performance under diverse conditions

We present RGB inputs alongside thermal images generated by ThermalGen under a range of challenging variations, including viewpoint variation, day-night change, sensor variation, and environmental change. The two rows in each group illustrate the corresponding variation.


Satellite-Thermal Paired Datasets

In addition to our previous Boson-night dataset, we release three new datasets: DJI-day, Bosonplus-day, and Bosonplus-night. The datasets differ in thermal sensor, lighting conditions, and geography. Columns show paired satellite RGB and corresponding 8-bit thermal images.

Datasets (top to bottom): Boson-night, DJI-day, Bosonplus-day, Bosonplus-night.

Method Overview

ThermalGen is built on a flow-based generative paradigm with a Scalable Interpolant Transformer (SiT) backbone. It models the conditional distribution of thermal images given an RGB image and a dataset-specific style embedding. The style-disentangled mechanism uses learnable embeddings that capture diverse RGB-T mapping relationships influenced by thermal sensor characteristics, camera viewpoints, and environmental conditions. Classifier-Free Guidance (CFG) is supported through an unconditional style embedding, enabling flexible control over the generation style at inference time.

ThermalGen Architecture Overview
Overview of ThermalGen. Paired RGB-T data from satellite-aerial, aerial, and ground datasets are used for training. The generative model predicts the velocity for the thermal latent, conditioned on timestep t, dataset-specific style embedding y, and RGB latent zRGB. After T steps of denoising, the thermal decoder generates the final thermal image.
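The training objective described above can be sketched as conditional flow matching on a linear interpolant between noise and the thermal latent. The sketch below is illustrative only: the latent shapes, the style-embedding dimension, and the placeholder `model` function are assumptions, not the actual SiT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shapes (batch, channels, height, width)
z_thermal = rng.standard_normal((2, 4, 32, 32))   # target thermal latent
z_rgb     = rng.standard_normal((2, 4, 32, 32))   # conditioning RGB latent
noise     = rng.standard_normal(z_thermal.shape)  # x_0 ~ N(0, I)
t         = rng.uniform(size=(2, 1, 1, 1))        # per-sample timestep in [0, 1]
y_style   = rng.standard_normal((2, 64))          # dataset-specific style embedding (learnable in practice)

# Linear interpolant between noise and data
x_t = (1.0 - t) * noise + t * z_thermal
# Velocity of the linear path, d(x_t)/dt, used as the regression target
v_target = z_thermal - noise

def model(x_t, t, y, z_rgb):
    # Placeholder for the SiT backbone: any function of (x_t, t, style y, RGB latent)
    return x_t - z_rgb

# Flow-matching loss: mean squared error between predicted and target velocity
loss = np.mean((model(x_t, t, y_style, z_rgb) - v_target) ** 2)
```

At inference, the learned velocity field is integrated over T steps from pure noise to a thermal latent, which the thermal decoder then maps to an image.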

Quantitative Results

Comparison with GAN-based and diffusion-based baselines across nine RGB-T benchmarks. All ThermalGen results use a single jointly-trained model.

Each dataset cell reports PSNR↑ / SSIM↑ / FID↓ / LPIPS↓.

| Method | Category | Boson-night | Bosonplus-day | Bosonplus-night |
|---|---|---|---|---|
| Pix2Pix | Paired GAN | 23.71 / 0.79 / 149.55 / 0.31 | 14.04 / 0.30 / 170.45 / 0.44 | 19.93 / 0.70 / 137.74 / 0.40 |
| CycleGAN | Unpaired GAN | 17.27 / 0.50 / 119.62 / 0.42 | 12.62 / 0.21 / 279.16 / 0.52 | 11.36 / 0.47 / 105.36 / 0.48 |
| Pix2PixHD | Paired GAN | 21.46 / 0.75 / 106.33 / 0.26 | 12.85 / 0.21 / 157.65 / 0.43 | 16.79 / 0.71 / 89.26 / 0.35 |
| VQGAN | Paired GAN | 24.55 / 0.81 / 207.12 / 0.29 | 14.10 / 0.28 / 185.41 / 0.46 | 18.49 / 0.76 / 286.74 / 0.33 |
| DDIM | Paired Diffusion | 18.31 / 0.72 / 203.05 / 0.50 | 12.50 / 0.20 / 261.03 / 0.71 | 15.22 / 0.72 / 112.38 / 0.49 |
| BBDM | Paired Diffusion | 17.85 / 0.62 / 141.27 / 0.37 | 12.42 / 0.18 / 137.68 / 0.46 | 13.88 / 0.62 / 101.08 / 0.43 |
| DiffV2IR | Paired Diffusion | 15.47 / 0.50 / 150.11 / 0.47 | 11.01 / 0.17 / 215.20 / 0.59 | 13.76 / 0.55 / 96.42 / 0.50 |
| ThermalGen-L/2 | Paired Flow | 21.88 / 0.71 / 161.22 / 0.32 | 14.66 / 0.31 / 76.91 / 0.35 | 20.47 / 0.76 / 75.80 / 0.34 |
Each dataset cell reports PSNR↑ / SSIM↑ / FID↓ / LPIPS↓.

| Method | Category | LLVIP | NII-CU | AVIID |
|---|---|---|---|---|
| Pix2Pix | Paired GAN | 12.09 / 0.37 / 326.14 / 0.53 | 17.31 / 0.81 / 168.77 / 0.37 | 21.41 / 0.61 / 146.26 / 0.30 |
| CycleGAN | Unpaired GAN | 10.39 / 0.24 / 227.15 / 0.61 | 16.43 / 0.77 / 125.37 / 0.37 | 15.54 / 0.46 / 91.37 / 0.32 |
| Pix2PixHD | Paired GAN | 11.51 / 0.33 / 281.89 / 0.51 | 19.46 / 0.80 / 118.60 / 0.32 | 20.01 / 0.56 / 127.63 / 0.27 |
| VQGAN | Paired GAN | 11.75 / 0.36 / 273.06 / 0.58 | 15.53 / 0.81 / 173.37 / 0.37 | 21.71 / 0.60 / 96.46 / 0.23 |
| DDIM | Paired Diffusion | 10.94 / 0.41 / 297.26 / 0.71 | 17.79 / 0.77 / 180.14 / 0.48 | 10.96 / 0.36 / 290.46 / 0.76 |
| BBDM | Paired Diffusion | 9.98 / 0.22 / 313.54 / 0.67 | 14.36 / 0.71 / 118.59 / 0.42 | 18.53 / 0.49 / 141.06 / 0.31 |
| DiffV2IR† | Paired Diffusion | 22.17 / 0.77 / 50.10 / 0.11 | 12.28 / 0.63 / 159.99 / 0.50 | 18.17 / 0.53 / 51.61 / 0.21 |
| ThermalGen-L/2 | Paired Flow | 11.12 / 0.34 / 238.60 / 0.51 | 26.44 / 0.92 / 69.30 / 0.21 | 24.89 / 0.75 / 29.05 / 0.13 |

† DiffV2IR uses LLVIP test set for training.

Each dataset cell reports PSNR↑ / SSIM↑ / FID↓ / LPIPS↓.

| Method | Category | M³FD | MSRS | FLIR |
|---|---|---|---|---|
| Pix2Pix | Paired GAN | 21.10 / 0.72 / 127.62 / 0.34 | 21.78 / 0.68 / 174.81 / 0.37 | 17.13 / 0.54 / 224.11 / 0.44 |
| CycleGAN | Unpaired GAN | 11.28 / 0.42 / 171.37 / 0.56 | 11.42 / 0.35 / 99.14 / 0.53 | 11.15 / 0.32 / 137.97 / 0.49 |
| Pix2PixHD | Paired GAN | 19.37 / 0.67 / 112.31 / 0.28 | 18.21 / 0.61 / 121.05 / 0.35 | 15.63 / 0.49 / 164.53 / 0.37 |
| VQGAN | Paired GAN | 21.02 / 0.71 / 79.21 / 0.26 | 22.27 / 0.69 / 106.51 / 0.38 | 16.95 / 0.50 / 141.00 / 0.40 |
| DDIM | Paired Diffusion | 11.15 / 0.45 / 229.24 / 0.70 | 7.35 / 0.21 / 262.93 / 0.79 | 11.24 / 0.39 / 296.81 / 0.71 |
| BBDM | Paired Diffusion | 17.26 / 0.61 / 120.79 / 0.37 | 20.27 / 0.62 / 145.86 / 0.37 | 16.10 / 0.45 / 177.81 / 0.42 |
| DiffV2IR* | Paired Diffusion | 22.76 / 0.79 / 37.74 / 0.13 | 10.16 / 0.30 / 104.01 / 0.57 | 11.29 / 0.44 / 106.12 / 0.45 |
| DiffV2IR+ | Paired Diffusion | 12.68 / 0.45 / 78.45 / 0.43 | 6.80 / 0.17 / 113.26 / 0.64 | 22.39 / 0.57 / 37.83 / 0.18 |
| ThermalGen-L/2 | Paired Flow | 23.73 / 0.81 / 35.82 / 0.14 | 24.38 / 0.76 / 52.31 / 0.21 | 17.10 / 0.52 / 70.09 / 0.33 |

* Fine-tuned on M³FD; + Fine-tuned on FLIR.

Blue bold = best result. Underlined = second/third best. Highlighted row = our method (single jointly-trained model).
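For reference, PSNR (reported in the tables above) can be computed with a few lines of numpy; the `psnr` helper below is a hypothetical minimal implementation, not the paper's evaluation code.

```python
import numpy as np

def psnr(pred, target, data_range=255.0):
    """Peak signal-to-noise ratio in dB over the given dynamic range."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```

Higher PSNR indicates closer pixel-wise agreement with the ground-truth thermal image; FID and LPIPS instead compare feature distributions and are computed with pretrained networks.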

Qualitative Comparison

Visual comparisons across ground, aerial, and satellite-aerial datasets. GAN-based methods produce distorted or grid-artifact-laden outputs, while DiffV2IR tends to generate excessively sharp boundaries. ThermalGen produces high-fidelity thermal images accurately matching ground truth distributions across diverse conditions.

Columns (left to right): Input, Pix2Pix, CycleGAN, Pix2PixHD, VQGAN, DiffV2IR, ThermalGen, GT.
Rows (top to bottom): Freiburg Day (Ground), Freiburg Night (Ground), M³FD (Ground), MSRS (Ground), AVIID (Aerial), NII-CU (Aerial), Bosonplus-day (Satellite-Aerial), Bosonplus-night (Satellite-Aerial).

Style Disentanglement Analysis

t-SNE visualizations of DINOv2 features demonstrate that ThermalGen's style embeddings effectively capture distinct RGB-T mapping relationships across different conditions. The generated thermal images cluster closely with their corresponding ground truth distributions, confirming that the style-disentangled mechanism successfully encodes variations in viewpoints, sensor types, day-night conditions, and lighting environments.
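The visualization pipeline can be sketched as follows, assuming scikit-learn's t-SNE; the random vectors below stand in for real DINOv2 features of generated and ground-truth thermal images, and the feature dimension (384) is only illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-ins for DINOv2 features (the paper extracts real embeddings)
feat_gen = rng.standard_normal((60, 384))  # "generated" images
feat_gt  = rng.standard_normal((60, 384))  # "ground truth" images
feats = np.vstack([feat_gen, feat_gt])

# Project to 2D for plotting; points from matching conditions should cluster together
emb = TSNE(n_components=2, perplexity=15, init="pca", random_state=0).fit_transform(feats)
```

Coloring the 2D points by dataset, sensor, or day/night condition then reveals whether the style embeddings separate the different RGB-T mapping regimes.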

t-SNE panels: Day vs. Night, Sensor Variation, Viewpoint Variation, LLVIP Train/Test Distribution.

Thermal Map Generation Across CFG Scales

Our model generates thermal maps from satellite imagery. By leveraging the style-disentangled generative framework, the CFG scale can be tuned to modulate the style and appearance of the synthesized thermal maps.
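A minimal sketch of the guidance rule, assuming the standard CFG extrapolation applied to predicted velocities (variable names are illustrative): the conditional velocity uses the dataset-specific style embedding, the unconditional one uses the learned unconditional embedding.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    velocity toward the conditional one by the guidance scale."""
    return v_uncond + scale * (v_cond - v_uncond)

v_cond = np.array([1.0, 2.0])    # velocity predicted with the style embedding
v_uncond = np.array([0.0, 0.5])  # velocity predicted with the unconditional embedding
guided = cfg_velocity(v_cond, v_uncond, 4.0)  # larger scale = stronger style pull
```

With scale = 1.0 the guided velocity equals the conditional prediction (no extra guidance), which is why the leftmost generated panel is labeled "no CFG"; increasing the scale amplifies the style signal.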

Panels: Satellite Map, followed by thermal maps generated with CFG = 1.0 (no guidance), CFG = 2.0, CFG = 4.0, CFG = 8.0, and CFG = 16.0.

BibTeX


@inproceedings{xiao2025thermalgen,
  title={ThermalGen: Style-Disentangled Flow-Based Generative Models for {RGB}-to-Thermal Image Translation},
  author={Jiuhong Xiao and Roshan Nayak and Ning Zhang and Daniel Toertei and Giuseppe Loianno},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=o0JSYq1TQ4}
}