AVSR-Diff Project Page

Abstract

Diffusion models deliver striking perceptual quality for video super-resolution (VSR), yet they are locked to a single fixed upsampling factor. Coordinate-based arbitrary-scale methods are flexible but wash out fine detail at large scales, and naively combining the two lets the per-frame stochasticity of diffusion erupt into severe temporal flickering. We present AVSR-Diff, a decoupled framework that separates what to generate from the resolution at which it is rendered: all diffusion sampling runs once in a fixed low-resolution latent space (scale-agnostic), while a continuous video decoder renders the result at any target scale — keeping the generative cost and memory footprint essentially constant no matter how large the output.
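As a rough illustration of this decoupling, the sketch below counts elements per frame. It is not the authors' code: the 180×320 LR frame, the 4 latent channels, and the assumption that the latent has the same spatial size as the LR frame are hypothetical placeholders. The only point is that the latent count, where diffusion sampling runs, does not depend on the target scale, while the rendered output grows with the square of the scale.

```python
def latent_elements(lr_h, lr_w, latent_channels=4):
    """Elements in one latent frame.

    Hypothetical layout: the latent is assumed to share the LR frame's
    spatial size. Note that the target scale is not an argument at all.
    """
    return latent_channels * lr_h * lr_w


def output_elements(lr_h, lr_w, scale):
    """Elements in one rendered RGB frame at the requested scale."""
    out_h = round(lr_h * scale)
    out_w = round(lr_w * scale)
    return 3 * out_h * out_w


if __name__ == "__main__":
    lr_h, lr_w = 180, 320  # hypothetical LR frame size
    for scale in (2.0, 3.25, 4.0, 8.0):
        print(
            f"scale {scale}: latent={latent_elements(lr_h, lr_w)}, "
            f"output={output_elements(lr_h, lr_w, scale)}"
        )
```

Under these assumptions the latent column stays the same on every row, while the output column grows 16-fold between 2× and 8×.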

Two designs make this possible. Our Temporally-Gated Feature Recurrence (TGFR) module aligns and adaptively gates recurrent features within a ControlNet, supplying the strictly aligned, flicker-free latent priors that sensitive continuous decoding demands. A Scale-Aware Fourier Refinement (SAFR) module then modulates frequency content inside the decoder to synthesize the high-frequency detail that each target scale calls for. Together they produce sharp, temporally stable video at any continuous scale. AVSR-Diff sets a new state of the art among arbitrary-scale methods and even surpasses recent fixed-scale generative models at their own native 4× scale.
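The gating idea can be sketched in a few lines. This is a generic gated recurrence under stated assumptions, not the TGFR module itself: the real gate is learned inside the ControlNet, whereas the hand-crafted `agreement_gate` below, its `sharpness` parameter, and the scalar features are all hypothetical. The previous frame's features are assumed to be already aligned (warped) to the current frame.

```python
import math


def agreement_gate(prev_aligned, current, sharpness=4.0):
    """Hypothetical hand-crafted gate in (0, 1].

    Returns 1 when the aligned previous feature matches the current one
    and decays toward 0 as they disagree. A learned gate would replace this.
    """
    return math.exp(-sharpness * abs(prev_aligned - current))


def gated_recurrence(prev_aligned, current, sharpness=4.0):
    """Blend aligned previous-frame features into the current features.

    Where the two agree, the recurrent feature is reused (temporal
    stability); where they disagree (e.g. occlusion or misalignment),
    the current feature dominates.
    """
    fused = []
    for p, c in zip(prev_aligned, current):
        g = agreement_gate(p, c, sharpness)
        fused.append(g * p + (1.0 - g) * c)
    return fused


if __name__ == "__main__":
    prev_aligned = [1.0, 0.0, 0.0]
    current = [1.0, 0.25, 10.0]
    print(gated_recurrence(prev_aligned, current))
```

The design choice illustrated here is that recurrence is applied selectively: a badly aligned previous feature is suppressed rather than propagated.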

Proposed Framework

AVSR-Diff is a decoupled framework built upon a pre-trained single-image super-resolution LDM. A trainable ControlNet guides the frozen denoising U-Net for scale-agnostic latent denoising, where the TGFR module aligns and gates recurrent features across adjacent frames to suppress flickering. The denoised latent sequence is then rendered by a Continuous Video Decoder that uses SAFR to modulate high-frequency details based on the target scale, followed by a coordinate-based INR for continuous querying.
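Continuous querying is what allows non-integer scales such as 3.25×. The one-dimensional sketch below shows only the coordinate logic: it maps each output pixel center back to a continuous position in the source grid and reads a value there. Plain linear interpolation stands in for the learned coordinate-based INR, and the function names are hypothetical, so this is an illustration of the querying pattern rather than the decoder described above.

```python
import math


def query_coords(size_in, scale):
    """Continuous source-space positions of the output pixel centers."""
    size_out = round(size_in * scale)
    # Use the realized ratio so the output grid stays centered on the input.
    step = size_in / size_out
    return [(i + 0.5) * step - 0.5 for i in range(size_out)]


def sample_linear(row, x):
    """Read a value at continuous position x (clamped to the valid range).

    Stand-in for a learned INR, which would predict the value from the
    coordinate and nearby latent features instead of interpolating.
    """
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(math.floor(x))
    x1 = min(x0 + 1, len(row) - 1)
    w = x - x0
    return (1.0 - w) * row[x0] + w * row[x1]


def render_row(row, scale):
    """Render one row of features at an arbitrary (possibly non-integer) scale."""
    return [sample_linear(row, x) for x in query_coords(len(row), scale)]


if __name__ == "__main__":
    features = [0.0, 1.0, 2.0, 3.0]
    for scale in (2.0, 3.25):
        out = render_row(features, scale)
        print(scale, len(out), [round(v, 3) for v in out])
```

The same source features serve every scale; only the set of queried coordinates changes, which extends to two dimensions by querying (x, y) pairs.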

Figure 1. Overview of the proposed AVSR-Diff: scale-agnostic latent denoising with the Temporally-Gated Feature Recurrence (TGFR) module, and a continuous video decoder with the Scale-Aware Fourier Refinement (SAFR) module.
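To make "scale-aware Fourier modulation" concrete, here is a toy one-dimensional sketch. It is an assumption-laden illustration, not SAFR: the actual module operates on decoder features with learned modulation, whereas the gain rule below (high frequencies boosted in proportion to log2 of the scale), the `strength` parameter, and all function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def scale_aware_gain(k, n, scale, strength=0.1):
    """Hypothetical gain: 1 at DC, rising with frequency and with log2(scale)."""
    # Bins k and n - k are the same frequency, so the gain is symmetric
    # and a real input stays real after modulation.
    freq = min(k, n - k) / (n / 2)
    return 1.0 + strength * math.log2(scale) * freq


def modulate(signal, scale, strength=0.1):
    """Reweight the signal's frequency content as a function of target scale."""
    n = len(signal)
    spectrum = dft(signal)
    shaped = [spectrum[k] * scale_aware_gain(k, n, scale, strength) for k in range(n)]
    return idft(shaped)


if __name__ == "__main__":
    detail = [1.0, -1.0, 1.0, -1.0]  # pure highest-frequency pattern
    for scale in (1.0, 2.0, 4.0):
        print(scale, [round(v, 3) for v in modulate(detail, scale)])
```

In this toy rule a constant signal is left untouched at every scale, and only the high-frequency content is reweighted as the scale changes.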

Quantitative Results

Comparison with state-of-the-art methods on REDS4 and Vid4. AVSR-Diff achieves the best perceptual quality and temporal consistency among generative methods, and outperforms fixed-scale generative models even at their native 4× scale. LPIPS, DISTS, and tOF are scaled by 10²; tLPIPS by 10³. The best and second-best results among all methods are highlighted in red and blue, respectively.
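When comparing these numbers with other papers, it can help to undo the display scaling. The helper below is a small sketch that reads "scaled by" as "multiplied by" (an assumption about the convention) and divides a table entry by the stated factor; the function and dictionary names are hypothetical.

```python
# Display factors as stated in the text; PSNR and SSIM are shown unscaled.
DISPLAY_FACTORS = {"LPIPS": 1e2, "DISTS": 1e2, "tOF": 1e2, "tLPIPS": 1e3}


def to_raw(metric, table_value):
    """Convert a displayed table value back to the raw metric value."""
    return table_value / DISPLAY_FACTORS.get(metric, 1.0)


if __name__ == "__main__":
    # AVSR-Diff entries on REDS4 at 4x, copied from the table.
    for metric, value in [("LPIPS", 9.54), ("tLPIPS", 4.20), ("PSNR", 28.75)]:
        print(f"{metric}: table={value}, raw={to_raw(metric, value):.4f}")
```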

Method	REDS4						Vid4
Method	LPIPS↓	DISTS↓	PSNR↑	SSIM↑	tLPIPS↓	tOF↓	LPIPS↓	DISTS↓	PSNR↑	SSIM↑	tLPIPS↓	tOF↓
Fixed-scale Regression-based VSR
BasicVSR++	13.49	6.99	32.32	0.9057	9.19	18.16	19.09	12.26	27.72	0.8409	15.26	4.09
RVRT	13.32	6.91	32.70	0.9106	8.98	18.08	18.81	12.09	27.90	0.8456	15.15	4.06
Arbitrary-scale Regression-based VSR
ST-AVSR	38.74	16.27	27.13	0.7711	24.61	23.65	41.40	18.79	24.51	0.6863	34.47	6.72
SAVSR	24.30	10.15	29.41	0.8417	8.54	19.76	24.73	12.98	27.12	0.8170	11.83	4.65
Fixed-scale Generative VSR
RealBasicVSR	13.40	5.99	27.04	0.7777	6.42	34.40	21.31	12.29	24.45	0.6943	36.87	7.51
Upscale-A-Video	40.99	16.37	24.62	0.6495	23.68	106.60	40.90	21.38	21.88	0.5335	30.20	23.47
MGLD-VSR	14.53	6.23	26.25	0.7408	16.36	39.62	24.74	13.97	23.51	0.6391	32.69	30.26
StableVSR	9.74	4.51	27.97	0.7951	5.40	17.20	18.45	10.77	24.46	0.6989	25.26	6.09
STAR	29.48	12.17	23.08	0.6726	32.98	64.53	42.74	22.10	18.71	0.3977	46.11	22.84
Arbitrary-scale Generative VSR
VEnhancer	34.69	14.91	22.90	0.6413	24.95	95.51	41.59	18.16	20.58	0.5430	29.66	14.35
AVSR-Diff (Ours)	9.54	4.42	28.75	0.8204	4.20	16.79	16.79	9.92	25.12	0.7241	9.19	5.78

Table 1. Quantitative comparison for 4× VSR on REDS4 and Vid4.
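Since color highlighting does not survive in plain text, the best and second-best entries can be recomputed directly from the numbers. The sketch below does this for the REDS4 half of Table 1, with values copied from the table; the arrow direction of each metric decides whether lower or higher is better. The names `REDS4`, `METRICS`, and `top_two` are hypothetical.

```python
# (metric name, higher_is_better), in the column order of Table 1.
METRICS = [
    ("LPIPS", False), ("DISTS", False), ("PSNR", True),
    ("SSIM", True), ("tLPIPS", False), ("tOF", False),
]

# REDS4 columns of Table 1 (4x), copied verbatim.
REDS4 = {
    "BasicVSR++": (13.49, 6.99, 32.32, 0.9057, 9.19, 18.16),
    "RVRT": (13.32, 6.91, 32.70, 0.9106, 8.98, 18.08),
    "ST-AVSR": (38.74, 16.27, 27.13, 0.7711, 24.61, 23.65),
    "SAVSR": (24.30, 10.15, 29.41, 0.8417, 8.54, 19.76),
    "RealBasicVSR": (13.40, 5.99, 27.04, 0.7777, 6.42, 34.40),
    "Upscale-A-Video": (40.99, 16.37, 24.62, 0.6495, 23.68, 106.60),
    "MGLD-VSR": (14.53, 6.23, 26.25, 0.7408, 16.36, 39.62),
    "StableVSR": (9.74, 4.51, 27.97, 0.7951, 5.40, 17.20),
    "STAR": (29.48, 12.17, 23.08, 0.6726, 32.98, 64.53),
    "VEnhancer": (34.69, 14.91, 22.90, 0.6413, 24.95, 95.51),
    "AVSR-Diff (Ours)": (9.54, 4.42, 28.75, 0.8204, 4.20, 16.79),
}


def top_two(metric_index, higher_is_better):
    """Return the (best, second-best) method names for one metric column."""
    ranked = sorted(
        REDS4,
        key=lambda method: REDS4[method][metric_index],
        reverse=higher_is_better,
    )
    return ranked[0], ranked[1]


if __name__ == "__main__":
    for index, (name, higher_is_better) in enumerate(METRICS):
        best, second = top_two(index, higher_is_better)
        print(f"{name}: best={best}, second={second}")
```

On these rows, the fidelity metrics (PSNR, SSIM) are led by the regression-based RVRT and BasicVSR++, while the perceptual and temporal metrics are led by AVSR-Diff with StableVSR second.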

Method	2×				3.25×				8×
Method	LPIPS↓	DISTS↓	PSNR↑	tOF↓	LPIPS↓	DISTS↓	PSNR↑	tOF↓	LPIPS↓	DISTS↓	PSNR↑	tOF↓
Arbitrary-scale Regression-based VSR
VideoINR	12.26	5.49	24.87	64.41	21.96	9.11	24.24	35.42	45.31	21.31	23.31	110.73
MoTIF	8.39	4.08	32.36	42.46	21.38	8.20	23.02	22.03	43.63	20.72	25.36	43.26
ST-AVSR	17.42	6.43	32.51	14.04	32.55	13.26	27.03	21.57	59.82	27.28	24.00	43.48
SAVSR	6.66	2.95	35.25	7.82	19.51	7.78	27.13	18.68	43.02	19.53	25.50	46.12
BF-STVSR	6.43	3.15	34.74	11.17	19.85	7.89	25.55	20.13	41.02	20.19	25.38	41.85
V³VSR	6.01	2.67	36.13	7.95	18.17	7.42	26.23	20.01	44.25	19.69	25.89	44.51
Fixed-scale Generative VSR + Bicubic
RealBasicVSR	8.26	4.88	31.16	22.98	13.03	5.78	24.47	30.34	36.00	16.18	24.14	71.04
Upscale-A-Video	31.27	15.03	26.32	57.35	40.78	19.94	23.47	89.66	45.46	20.11	21.70	187.20
MGLD-VSR	8.70	4.90	30.86	22.96	13.94	6.15	24.30	32.49	37.61	15.86	23.49	210.24
StableVSR	5.51	1.99	33.49	8.01	14.38	6.00	24.91	20.85	34.96	15.59	23.72	44.09
STAR	26.37	12.21	25.08	45.52	25.44	10.87	21.18	67.16	52.26	19.16	18.42	153.58
Arbitrary-scale Generative VSR
VEnhancer	23.39	10.00	23.27	77.78	30.71	12.95	22.53	83.04	43.87	19.67	21.73	129.42
AVSR-Diff (Ours)	3.84	1.72	35.47	7.37	8.17	3.21	26.50	15.23	29.43	14.09	24.95	39.13

Table 2. Quantitative comparison for arbitrary-scale VSR (2×, 3.25×, 8×) on REDS4.

Figure 4. Peak GPU memory vs. target scale. AVSR-Diff maintains a constant footprint, while the footprint of tiled video diffusion grows rapidly with scale.

Qualitative Results

Figure 5. Visual comparisons across various upscaling factors on REDS4.

BibTeX

@inproceedings{youk2026avsr,
  author    = {Youk, Geunhyuk and Do, Jeonghyeok and Kim, Dayeon and Oh, Jihyong and Kim, Munchurl},
  title     = {AVSR-Diff: Scale-Agnostic Diffusion Priors for Temporally Consistent Arbitrary-Scale Video Super-Resolution},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026},
}