AVSR-Diff: Scale-Agnostic Diffusion Priors for
Temporally Consistent Arbitrary-Scale Video Super-Resolution

¹Korea Advanced Institute of Science and Technology (KAIST),
²Chung-Ang University
Co-corresponding authors
ECCV 2026

Abstract

Diffusion models deliver striking perceptual quality for video super-resolution (VSR), yet they are locked to a single fixed upsampling factor. Coordinate-based arbitrary-scale methods are flexible but wash out fine detail at large scales, and naively combining the two lets the per-frame stochasticity of diffusion erupt into severe temporal flickering. We present AVSR-Diff, a decoupled framework that separates what to generate from the resolution at which it is rendered: all diffusion sampling runs once in a fixed low-resolution latent space (scale-agnostic), while a continuous video decoder renders the result at any target scale — keeping the generative cost and memory footprint essentially constant no matter how large the output.

Two designs make this possible. Our Temporally-Gated Feature Recurrence (TGFR) module aligns and adaptively gates recurrent features within a ControlNet, supplying the strictly aligned, flicker-free latent priors that sensitive continuous decoding demands. A Scale-Aware Fourier Refinement (SAFR) module then modulates frequency content inside the decoder to synthesize exactly the high-frequency detail each target scale calls for. Together they produce sharp, temporally stable video at any continuous scale — and AVSR-Diff not only sets a new state of the art among arbitrary-scale methods but even surpasses recent fixed-scale generative models on their own native 4× resolution.

Proposed Framework

AVSR-Diff is a decoupled framework built upon a pre-trained single-image super-resolution LDM. A trainable ControlNet guides the frozen denoising U-Net for scale-agnostic latent denoising, where the TGFR module aligns and gates recurrent features across adjacent frames to suppress flickering. The denoised latent sequence is then rendered by a Continuous Video Decoder that uses SAFR to modulate high-frequency details based on the target scale, followed by a coordinate-based INR for continuous querying.
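
The final rendering step, continuous querying at an arbitrary scale, can be illustrated with a short sketch. The Python below is not the AVSR-Diff decoder: plain bilinear interpolation stands in for the learned coordinate-based INR and for SAFR, and the names `query_coords`, `bilinear`, and `render` are hypothetical. It only shows that one fixed-size grid can be sampled at 2× or 3.25× by changing the query coordinates, with no change to the grid itself.

```python
from typing import List

Grid = List[List[float]]


def query_coords(n_in: int, scale: float) -> List[float]:
    """Centers of the output samples, expressed in input-grid coordinates."""
    n_out = max(1, round(n_in * scale))
    # Output sample i spans [i, i + 1) / scale in input units; use its center.
    return [(i + 0.5) / scale - 0.5 for i in range(n_out)]


def bilinear(grid: Grid, y: float, x: float) -> float:
    """Bilinear lookup with border clamping (a stand-in for a learned INR)."""
    h, w = len(grid), len(grid[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def render(grid: Grid, scale: float) -> Grid:
    """Sample the same grid at any (possibly non-integer) scale."""
    ys = query_coords(len(grid), scale)
    xs = query_coords(len(grid[0]), scale)
    return [[bilinear(grid, y, x) for x in xs] for y in ys]


# A toy 4x4 "feature map"; real inputs would be decoded latent features.
features = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
for s in (2.0, 3.25):
    out = render(features, s)
    print(f"scale {s}: {len(out)} x {len(out[0])} samples")
```

With the 4×4 toy grid, the two calls produce 8×8 and 13×13 outputs from the same input, which is the property the decoupled design relies on.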

Figure 1. Overview of the proposed AVSR-Diff: scale-agnostic latent denoising with the Temporally-Gated Feature Recurrence (TGFR) module, and a continuous video decoder with the Scale-Aware Fourier Refinement (SAFR) module.
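
As a rough intuition for what gating a recurrent feature means, the sketch below blends the current frame's feature with the previous frame's already-aligned feature. Everything in it is an assumption made for illustration: the hand-written exponential gate, the `sharpness` parameter, and the function names are hypothetical, and the alignment step is taken as given. The page describes the TGFR gating only as adaptive, so this is not the module itself.

```python
import math
from typing import List


def gate(cur: float, prev_aligned: float, sharpness: float = 4.0) -> float:
    """Hypothetical gate in (0, 1]: 1 when features agree, near 0 when they diverge."""
    return math.exp(-sharpness * abs(cur - prev_aligned))


def gated_recurrence(
    cur: List[float], prev_aligned: List[float], sharpness: float = 4.0
) -> List[float]:
    """Blend per element: reuse the past where it agrees, trust the present otherwise."""
    out = []
    for c, p in zip(cur, prev_aligned):
        g = gate(c, p, sharpness)
        out.append(g * p + (1.0 - g) * c)
    return out


# Three toy feature values: identical, slightly different, and badly mismatched
# (the last case mimics an occlusion or a failed alignment).
current = [0.50, 0.52, 2.00]
previous_aligned = [0.50, 0.50, 0.00]
print(gated_recurrence(current, previous_aligned))
```

Small frame-to-frame differences are pulled toward the previous value, which damps flicker, while a large mismatch leaves the current feature almost untouched so that stale content is not propagated.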

Quantitative Results

Comparison with state-of-the-art methods on REDS4 and Vid4. AVSR-Diff achieves the best perceptual quality and temporal consistency among generative methods, and outperforms fixed-scale generative models even on their native 4× resolution. LPIPS, DISTS, and tOF are scaled by 10²; tLPIPS by 10³. The best and second-best results among all methods are highlighted in red and blue, respectively.

Method            | REDS4                                           | Vid4
                  | LPIPS↓  DISTS↓  PSNR↑   SSIM↑   tLPIPS↓ tOF↓    | LPIPS↓  DISTS↓  PSNR↑   SSIM↑   tLPIPS↓ tOF↓
Fixed-scale Regression-based VSR
BasicVSR++        | 13.49   6.99    32.32   0.9057  9.19    18.16   | 19.09   12.26   27.72   0.8409  15.26   4.09
RVRT              | 13.32   6.91    32.70   0.9106  8.98    18.08   | 18.81   12.09   27.90   0.8456  15.15   4.06
Arbitrary-scale Regression-based VSR
ST-AVSR           | 38.74   16.27   27.13   0.7711  24.61   23.65   | 41.40   18.79   24.51   0.6863  34.47   6.72
SAVSR             | 24.30   10.15   29.41   0.8417  8.54    19.76   | 24.73   12.98   27.12   0.8170  11.83   4.65
Fixed-scale Generative VSR
RealBasicVSR      | 13.40   5.99    27.04   0.7777  6.42    34.40   | 21.31   12.29   24.45   0.6943  36.87   7.51
Upscale-A-Video   | 40.99   16.37   24.62   0.6495  23.68   106.60  | 40.90   21.38   21.88   0.5335  30.20   23.47
MGLD-VSR          | 14.53   6.23    26.25   0.7408  16.36   39.62   | 24.74   13.97   23.51   0.6391  32.69   30.26
StableVSR         | 9.74    4.51    27.97   0.7951  5.40    17.20   | 18.45   10.77   24.46   0.6989  25.26   6.09
STAR              | 29.48   12.17   23.08   0.6726  32.98   64.53   | 42.74   22.10   18.71   0.3977  46.11   22.84
Arbitrary-scale Generative VSR
VEnhancer         | 34.69   14.91   22.90   0.6413  24.95   95.51   | 41.59   18.16   20.58   0.5430  29.66   14.35
AVSR-Diff (Ours)  | 9.54    4.42    28.75   0.8204  4.20    16.79   | 16.79   9.92    25.12   0.7241  9.19    5.78

Table 1. Quantitative comparison for 4× VSR on REDS4 and Vid4.
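
Because the red and blue highlighting does not survive in plain text, the best and second-best entries can be recomputed from the numbers. The sketch below does this for three REDS4 columns copied from Table 1 (LPIPS, PSNR, tLPIPS); the dictionary layout and the helper name `top_two` are illustrative choices, not part of the paper.

```python
from typing import Dict, List, Tuple

# REDS4 4x values copied from Table 1, in the order (LPIPS, PSNR, tLPIPS).
REDS4: Dict[str, Tuple[float, float, float]] = {
    "BasicVSR++": (13.49, 32.32, 9.19),
    "RVRT": (13.32, 32.70, 8.98),
    "ST-AVSR": (38.74, 27.13, 24.61),
    "SAVSR": (24.30, 29.41, 8.54),
    "RealBasicVSR": (13.40, 27.04, 6.42),
    "Upscale-A-Video": (40.99, 24.62, 23.68),
    "MGLD-VSR": (14.53, 26.25, 16.36),
    "StableVSR": (9.74, 27.97, 5.40),
    "STAR": (29.48, 23.08, 32.98),
    "VEnhancer": (34.69, 22.90, 24.95),
    "AVSR-Diff (Ours)": (9.54, 28.75, 4.20),
}

# (column name, higher_is_better): arrows in the table header give the direction.
COLUMNS = [("LPIPS", False), ("PSNR", True), ("tLPIPS", False)]


def top_two(column: int, higher_is_better: bool) -> List[str]:
    """Return the best and second-best method names for one column."""
    ranked = sorted(REDS4, key=lambda m: REDS4[m][column], reverse=higher_is_better)
    return ranked[:2]


for index, (name, higher_is_better) in enumerate(COLUMNS):
    best, second = top_two(index, higher_is_better)
    print(f"{name}: best = {best}, second-best = {second}")
```

On these columns the ranking puts AVSR-Diff first and StableVSR second for LPIPS and tLPIPS, while the regression-based RVRT and BasicVSR++ lead on PSNR.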

Method            | 2×                              | 3.25×                           | 8×
                  | LPIPS↓  DISTS↓  PSNR↑   tOF↓    | LPIPS↓  DISTS↓  PSNR↑   tOF↓    | LPIPS↓  DISTS↓  PSNR↑   tOF↓
Arbitrary-scale Regression-based VSR
VideoINR          | 12.26   5.49    24.87   64.41   | 21.96   9.11    24.24   35.42   | 45.31   21.31   23.31   110.73
MoTIF             | 8.39    4.08    32.36   42.46   | 21.38   8.20    23.02   22.03   | 43.63   20.72   25.36   43.26
ST-AVSR           | 17.42   6.43    32.51   14.04   | 32.55   13.26   27.03   21.57   | 59.82   27.28   24.00   43.48
SAVSR             | 6.66    2.95    35.25   7.82    | 19.51   7.78    27.13   18.68   | 43.02   19.53   25.50   46.12
BF-STVSR          | 6.43    3.15    34.74   11.17   | 19.85   7.89    25.55   20.13   | 41.02   20.19   25.38   41.85
V³VSR             | 6.01    2.67    36.13   7.95    | 18.17   7.42    26.23   20.01   | 44.25   19.69   25.89   44.51
Fixed-scale Generative VSR + Bicubic
RealBasicVSR      | 8.26    4.88    31.16   22.98   | 13.03   5.78    24.47   30.34   | 36.00   16.18   24.14   71.04
Upscale-A-Video   | 31.27   15.03   26.32   57.35   | 40.78   19.94   23.47   89.66   | 45.46   20.11   21.70   187.20
MGLD-VSR          | 8.70    4.90    30.86   22.96   | 13.94   6.15    24.30   32.49   | 37.61   15.86   23.49   210.24
StableVSR         | 5.51    1.99    33.49   8.01    | 14.38   6.00    24.91   20.85   | 34.96   15.59   23.72   44.09
STAR              | 26.37   12.21   25.08   45.52   | 25.44   10.87   21.18   67.16   | 52.26   19.16   18.42   153.58
Arbitrary-scale Generative VSR
VEnhancer         | 23.39   10.00   23.27   77.78   | 30.71   12.95   22.53   83.04   | 43.87   19.67   21.73   129.42
AVSR-Diff (Ours)  | 3.84    1.72    35.47   7.37    | 8.17    3.21    26.50   15.23   | 29.43   14.09   24.95   39.13

Table 2. Quantitative comparison for arbitrary-scale VSR (2×, 3.25×, 8×) on REDS4.

Figure 4. Peak GPU memory vs. target scale. AVSR-Diff maintains a constant footprint while tiled video diffusion grows rapidly.
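
One way to see why a fixed-resolution generative stage matters is to count pixels. The sketch below is simple arithmetic, not a model of the memory measured in Figure 4: it assumes an illustrative 180×320 low-resolution frame and uses the hypothetical helpers `output_pixels` and `growth`. It shows that the rendered frame grows with the square of the scale, whereas the low-resolution input does not change with the target scale.

```python
LR_H, LR_W = 180, 320  # illustrative low-resolution frame size (an assumption)


def output_pixels(h: int, w: int, scale: float) -> int:
    """Number of pixels in one frame rendered at the given scale."""
    return round(h * scale) * round(w * scale)


def growth(scale: float) -> float:
    """Output pixel count relative to the low-resolution input."""
    return output_pixels(LR_H, LR_W, scale) / (LR_H * LR_W)


for s in (2, 3.25, 4, 8):
    print(f"{s}x -> {growth(s):.2f} times the input pixels")
```

Any stage that has to run at the output resolution therefore handles 64 times more pixels at 8× than at the input size, while a stage that runs at a fixed low resolution handles the same amount at every scale.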

Qualitative Results

Figure 5. Visual comparisons across various upscaling factors on REDS4.

BibTeX

@inproceedings{youk2026avsr,
  author    = {Youk, Geunhyuk and Do, Jeonghyeok and Kim, Dayeon and Oh, Jihyong and Kim, Munchurl},
  title     = {AVSR-Diff: Scale-Agnostic Diffusion Priors for Temporally Consistent Arbitrary-Scale Video Super-Resolution},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026},
}