OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation

Agus Gunawan1*, Samuel Teodoro1*, Yun Chen1, Soo Ye Kim2, Jihyong Oh3†, Munchurl Kim1†
1KAIST, 2Adobe Research, 3Chung-Ang University
*Equal contribution, †Co-corresponding authors

ICLR 2026

OmniText is a training-free generalist capable of tackling diverse text-image manipulation tasks.

Demo: OmniText on OmniText-Bench

Select an application and a sample, then compare the Input + Mask view with the OmniText output.

Left: Input + Target Mask. Right: Edited Result.

Abstract

Recent advances in diffusion-based text synthesis have shown impressive performance in inserting and editing text in images via inpainting. However, despite their potential, text inpainting methods suffer from three key limitations that hinder their applicability to broader Text-Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of the rendered text, and (iii) a tendency to generate duplicated letters.

To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of the cross- and self-attention mechanisms to enable text removal and to provide control over both text style and content. Our findings reveal that text removal can be achieved by inverting self-attention, which mitigates the model's tendency to focus on surrounding text and thus reduces text hallucination. Additionally, we redistribute cross-attention: increasing the attention probability assigned to certain text tokens further reduces hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization.

Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target texts with masks, and style references, and covers applications such as text removal, rescaling, repositioning, and text insertion and editing with various styles.

Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable to specialist methods.

Key Ideas

1) Observations on the Text Generation Model

  • During text generation: Self-attention influences the style of the rendered text, while cross-attention controls content alignment
  • During text removal: Strong self-attention responses to surrounding text in early sampling steps cause text-like hallucinations, and redistributing cross-attention helps reduce them (see the sketch below)
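To make these observations concrete, below is a minimal PyTorch sketch of the two attention manipulations, assuming attention maps are exposed as (heads, queries, keys) probability tensors. The function names and the text_key_mask, strength, alpha, and boost_token parameters are illustrative assumptions, not the paper's exact implementation.

    import torch

    def invert_self_attention(attn_probs: torch.Tensor,
                              text_key_mask: torch.Tensor,
                              strength: float = 1.0) -> torch.Tensor:
        # attn_probs:    (heads, queries, keys) softmaxed self-attention map
        # text_key_mask: (keys,) bool, True where a key position covers nearby text
        # Damp the attention mass flowing to text keys, then renormalize each
        # query's distribution so it still sums to 1.
        weight = torch.ones(text_key_mask.shape, dtype=attn_probs.dtype)
        weight[text_key_mask] = 1.0 - strength
        damped = attn_probs * weight
        return damped / damped.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    def redistribute_cross_attention(attn_probs: torch.Tensor,
                                     boost_token: int,
                                     alpha: float = 0.5) -> torch.Tensor:
        # Move a fraction `alpha` of every query's probability mass onto one
        # chosen text token, lowering the chance of hallucinated glyphs.
        out = attn_probs * (1.0 - alpha)
        out[..., boost_token] = out[..., boost_token] + alpha
        return out

    # Toy usage on random maps (8 heads, 64 spatial positions, 77 text tokens):
    self_attn = torch.rand(8, 64, 64).softmax(dim=-1)
    cross_attn = torch.rand(8, 64, 77).softmax(dim=-1)
    text_mask = torch.zeros(64, dtype=torch.bool)
    text_mask[:16] = True  # pretend the first 16 positions cover surrounding text
    self_attn = invert_self_attention(self_attn, text_mask, strength=1.0)
    cross_attn = redistribute_cross_attention(cross_attn, boost_token=0, alpha=0.5)

In practice these edits would be applied inside the denoiser's attention layers during the early sampling steps, where the hallucination tendency is strongest.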

2) OmniText Framework

  • Text Removal: Self-attention inversion and cross-attention reassignment to suppress the backbone's text generation ability
  • Controllable Inpainting: A latent optimization framework guided by a cross-attention content loss to control content and a self-attention style loss to control style (a minimal optimization loop is sketched below)
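As a rough illustration of the controllable-inpainting component, here is a minimal latent-optimization loop in PyTorch. The callables attn_forward, content_loss, and style_loss are placeholders for a denoiser forward pass that exposes its attention maps and for the paper's two losses; all names and hyperparameters here are assumptions for the sketch.

    import torch

    def optimize_latent(latent, attn_forward, content_loss, style_loss,
                        lambda_content=1.0, lambda_style=1.0,
                        n_iters=5, lr=5e-2):
        # attn_forward: callable mapping a latent to (cross_attn, self_attn);
        # it must be differentiable with respect to the latent.
        # content_loss / style_loss: callables returning scalar losses.
        latent = latent.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([latent], lr=lr)
        for _ in range(n_iters):
            opt.zero_grad()
            cross_attn, self_attn = attn_forward(latent)
            loss = (lambda_content * content_loss(cross_attn)
                    + lambda_style * style_loss(self_attn))
            loss.backward()
            opt.step()
        return latent.detach()

Conceptually, the content loss would push the cross-attention of each target text token to concentrate inside the target mask (improving rendering accuracy), while the style loss would match self-attention statistics to those of the style reference.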

Benchmark Dataset: OmniText-Bench

Download
A mockup-based evaluation dataset consisting of 150 sets of input images, target texts with masks, reference images, and ground-truth images, covering five distinct applications.
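For illustration only, here is a loading sketch under an assumed directory layout; the actual folder structure and file names of the released benchmark may differ, so treat every path below as hypothetical.

    from pathlib import Path
    from PIL import Image

    # Hypothetical folder names for the five applications.
    APPLICATIONS = ["removal", "rescaling", "repositioning",
                    "style_insertion", "style_editing"]

    def load_sample(root, app, idx):
        # One OmniText-Bench set: input image, target mask, style reference,
        # ground truth, and the target text string (all file names assumed).
        d = Path(root) / app / f"{idx:03d}"
        sample = {
            "input": Image.open(d / "input.png"),
            "mask": Image.open(d / "mask.png"),
            "gt": Image.open(d / "gt.png"),
            "text": (d / "target_text.txt").read_text().strip(),
        }
        ref = d / "reference.png"
        if ref.exists():  # style references apply only to style-based tasks
            sample["reference"] = Image.open(ref)
        return sample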

Results

Quantitative

Text removal (SCUT-EnsText) and text editing (ScenePair) benchmarks

Qualitative

Standard applications (removal and editing) and additional applications (rescaling, repositioning, and style-based insertion and editing)

BibTeX

@inproceedings{gunawan2026omnitext,
    title={OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation},
    author={Gunawan, Agus and Teodoro, Samuel and Chen, Yun and Kim, Soo Ye and Oh, Jihyong and Kim, Munchurl},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=zF7GyVXVw6}
}