Recent advances in diffusion-based text synthesis have
demonstrated strong performance in inserting and editing
text within images via inpainting. However, despite the
potential of text inpainting methods,
three key limitations hinder their applicability to
broader Text Image Manipulation (TIM) tasks: (i)
the inability to remove text, (ii)
the lack of control over the style of rendered text, and
(iii) a tendency to generate duplicated letters.
To address these challenges, we propose OmniText, a
training-free generalist framework
capable of performing a wide range of TIM tasks.
Specifically, we investigate two key properties of cross- and
self-attention mechanisms to enable text removal and to provide
control over both text styles and content. Our findings reveal
that text removal can be achieved by applying
self-attention inversion, which mitigates the model's
tendency to focus on surrounding text, thus reducing text
hallucinations. Additionally, we
redistribute the cross-attention maps, since increasing the
probability assigned to certain text tokens further reduces text hallucination.
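To make the two attention manipulations concrete, the sketch below shows one plausible PyTorch-style reading: self-attention toward text-occupied key positions is inverted and renormalized, and cross-attention mass is shifted toward selected prompt tokens. The function names, tensor shapes, and the exact inversion and reweighting formulas are illustrative assumptions, not the paper's definitions.

```python
import torch


def invert_self_attention(attn_probs: torch.Tensor, text_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical self-attention inversion for text removal.

    attn_probs: (batch, heads, queries, keys) softmax-normalized self-attention.
    text_mask:  (keys,) binary mask marking key positions covered by surrounding text.

    Flips the attention each query pays to text key positions, so high attention
    on neighbouring glyphs becomes low, then renormalizes every row.
    """
    inverted = attn_probs.clone()
    text_keys = text_mask.bool()
    inverted[..., text_keys] = (
        attn_probs[..., text_keys].amax(dim=-1, keepdim=True) - attn_probs[..., text_keys]
    )
    return inverted / inverted.sum(dim=-1, keepdim=True).clamp_min(1e-8)


def redistribute_cross_attention(attn_probs: torch.Tensor, boost_token_ids, scale: float = 2.0) -> torch.Tensor:
    """Hypothetical cross-attention redistribution: upweight selected prompt
    tokens and renormalize, shifting probability mass away from hallucinated text.

    attn_probs: (batch, heads, queries, tokens) softmax-normalized cross-attention.
    """
    boosted = attn_probs.clone()
    boosted[..., boost_token_ids] = boosted[..., boost_token_ids] * scale
    return boosted / boosted.sum(dim=-1, keepdim=True)
```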
For controllable inpainting, we introduce
novel loss functions in a latent optimization framework:
a cross-attention content loss to improve text rendering
accuracy and a self-attention style loss to facilitate
style customization.
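As a rough illustration of the latent optimization, the following sketch performs one gradient step on the latent using a cross-attention content loss (concentrating the target characters' attention inside the text mask) and a self-attention style loss (matching attention statistics to a style reference). The get_attention hook, the Gram-style statistic, and all hyperparameters are assumptions made for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def latent_optimization_step(latent, get_attention, char_token_ids, text_mask,
                             ref_self_attn, lambda_style=1.0, lr=0.05):
    """One hypothetical latent-optimization step.

    get_attention(latent) is assumed to run the denoiser and return differentiable
    (cross_attn, self_attn) maps of shapes (queries, tokens) and (queries, keys);
    the loss forms below are illustrative, not the paper's definitions.
    """
    latent = latent.detach().requires_grad_(True)
    cross_attn, self_attn = get_attention(latent)

    # Content loss: concentrate the target characters' cross-attention inside the mask.
    char_attn = cross_attn[:, char_token_ids].sum(dim=-1)          # (queries,)
    content_loss = 1.0 - (char_attn * text_mask).sum() / char_attn.sum().clamp_min(1e-8)

    # Style loss: match second-order self-attention statistics to the style reference.
    style_loss = F.mse_loss(self_attn.T @ self_attn, ref_self_attn.T @ ref_self_attn)

    loss = content_loss + lambda_style * style_loss
    loss.backward()
    with torch.no_grad():
        latent = latent - lr * latent.grad
    return latent.detach()
```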
Furthermore, we present OmniText-Bench, a
benchmark dataset for evaluating diverse TIM tasks. It
includes input images, target text with masks, and style
references, covering applications such as text removal,
rescaling, repositioning, and text insertion and editing in
various styles.
Our OmniText framework is
the first generalist method capable of performing diverse TIM
tasks. It achieves state-of-the-art performance across multiple
tasks and metrics relative to other text inpainting methods and
performs comparably to specialist methods.