InstantEdit

Text-guided image editing has emerged as a powerful tool for creative expression, with diffusion models leading the way in generating high-quality results. However, its computational demands are high because of the lengthy sampling process.

Current endeavors to reduce the number of sampling steps face two challenges:

Inaccurate inversion trajectory with few steps. The traditional DDIM inversion approach becomes inaccurate when only a few steps are used, resulting in deteriorated image quality and inconsistent editing results.
Insufficient editability. The traditional DDIM inversion and its corresponding sampling method lack editability in the few-step setting, so the edited image only loosely follows the user prompt.

We address these two challenges with methods built around the two vital steps of image editing: inversion and regeneration.

Inversion with RectifiedFlow Model

One key insight of our work is that the linearized sampling trajectory of the RectifiedFlow model can be used to reduce the inversion error. We therefore propose PerRFI, a simple first-order approximation of the inversion process that suits our RectifiedFlow backbone.

Intuitive visualization of the linearized sampling trajectory of the RectifiedFlow model compared with traditional DDIM inversion.
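To illustrate why a linearized trajectory helps a first-order inversion, here is a minimal toy sketch. It is not the PerRFI implementation: the scalar "latent", the two hand-written velocity fields, and the convention that t=0 is noise and t=1 is the image are all assumptions made for illustration. The sketch inverts with a few Euler steps, regenerates with the same number of steps, and measures how far the result lands from the starting point.

```python
def time_grid(start, end, num_steps):
    """Evenly spaced times from start to end, inclusive."""
    return [start + (end - start) * k / num_steps for k in range(num_steps + 1)]


def euler_integrate(x, velocity, start, end, num_steps):
    """First-order (Euler) integration of dx/dt = velocity(x, t)."""
    ts = time_grid(start, end, num_steps)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * velocity(x, t_cur)
    return x


def round_trip_error(image, velocity, num_steps):
    """Invert image -> noise (t: 1 -> 0), regenerate (t: 0 -> 1), compare."""
    noise = euler_integrate(image, velocity, 1.0, 0.0, num_steps)
    recon = euler_integrate(noise, velocity, 0.0, 1.0, num_steps)
    return abs(recon - image)


def straight_velocity(x, t):
    # Hypothetical straight trajectory: the velocity never changes along the path.
    return 3.0


def curved_velocity(x, t):
    # Hypothetical curved trajectory: the velocity depends on the current state.
    return -2.0 * x


straight_err = round_trip_error(0.8, straight_velocity, num_steps=4)
curved_err = round_trip_error(0.8, curved_velocity, num_steps=4)
print(f"straight field round-trip error: {straight_err:.6f}")
print(f"curved field round-trip error:   {curved_err:.6f}")
```

With the straight field, each inversion step is undone by the matching generation step, so four steps suffice. With the curved field, the velocity seen on the way back differs from the one seen on the way out, and the mismatch accumulates.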

Consistent and Editable Regeneration

Our regeneration pipeline has two key components:

Inversion Latent Injection (ILI). During inversion, we store all the intermediate inverted latents from PerRFI and reuse them to calibrate each regeneration step. The intuition is that inverted latents at earlier time steps (closer to the clean image) accumulate less error during inversion. After every denoising step, we anchor back to the stored latent to prevent error accumulation. We further find that ILI provides better editability than counterpart methods such as the pipeline from DDPM-noise inversion.

Visualization of the inversion and regeneration process with ILI.
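The store-and-anchor idea behind ILI can be sketched on a toy problem. The calibration rule below is only one possible reading of "anchoring back to the stored latent", not the authors' exact formulation: at each step it measures how far a source-prompt step from the stored latent misses the next stored latent and adds that residual to the edited step. The scalar latents and the velocity functions `src_velocity` and `tgt_velocity` are hypothetical stand-ins for model predictions under the source and target prompts.

```python
def invert_and_store(image, velocity, num_steps):
    """Euler inversion from t=1 (image) to t=0 (noise), keeping every latent."""
    dt = 1.0 / num_steps
    x = image
    latents = [x]
    for k in range(num_steps):
        t = 1.0 - k * dt
        x = x - dt * velocity(x, t)
        latents.append(x)
    return latents[::-1]  # index i now holds the latent at time i * dt


def regenerate(stored, src_velocity, tgt_velocity, use_injection):
    """Euler regeneration from stored[0], optionally calibrated per step."""
    num_steps = len(stored) - 1
    dt = 1.0 / num_steps
    x = stored[0]
    for i in range(num_steps):
        t = i * dt
        x_next = x + dt * tgt_velocity(x, t)
        if use_injection:
            # Residual between the stored latent and a source-prompt step
            # taken from the previous stored latent (assumed calibration rule).
            src_pred = stored[i] + dt * src_velocity(stored[i], t)
            x_next += stored[i + 1] - src_pred
        x = x_next
    return x


def src_velocity(x, t):
    return -2.0 * x          # toy "source prompt" prediction


def tgt_velocity(x, t):
    return -2.0 * x + 0.5    # toy "target prompt" prediction


image = 0.8
stored = invert_and_store(image, src_velocity, num_steps=4)
recon_plain = regenerate(stored, src_velocity, src_velocity, use_injection=False)
recon_ili = regenerate(stored, src_velocity, src_velocity, use_injection=True)
edited = regenerate(stored, src_velocity, tgt_velocity, use_injection=True)
print(abs(recon_plain - image), abs(recon_ili - image), edited)
```

In this toy setting the calibrated path reproduces the input when the prompt is unchanged, while a changed prompt still moves the result away from the input.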

Disentangled Prompt Guidance (DPG). Following TurboEdit, we decompose the sampling formulation into two terms, cross-prompt and cross-trajectory. The cross-prompt term captures the difference between the predictions for the generation trajectory under the new and original prompts. The cross-trajectory term is the difference between the predictions from the new trajectory and the original one under the same prompt. We observe that scaling the cross-prompt term can induce undesirable changes in the generated image because the guidance signals from the original prompt and the target prompt interfere. We therefore propose DPG, which disentangles the two signals by scaling only the component of the target signal that is orthogonal to the source signal.
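The orthogonal scaling at the heart of DPG can be illustrated with plain vector projection. This is a sketch under assumptions: the signals are short lists rather than latent tensors, the helper names are hypothetical, and how the result is folded back into the cross-prompt term is not shown because the text above does not specify it.

```python
def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def split_against(target, source, eps=1e-12):
    """Split target into parts parallel and orthogonal to source."""
    coef = dot(target, source) / (dot(source, source) + eps)
    parallel = [coef * s for s in source]
    orthogonal = [t - p for t, p in zip(target, parallel)]
    return parallel, orthogonal


def scale_orthogonal(target, source, scale):
    """Amplify only the part of target that source does not already explain."""
    parallel, orthogonal = split_against(target, source)
    return [p + scale * o for p, o in zip(parallel, orthogonal)]


source_signal = [1.0, 0.0, 0.0]   # toy prediction under the source prompt
target_signal = [0.6, 0.8, 0.0]   # toy prediction under the target prompt
guided = scale_orthogonal(target_signal, source_signal, scale=2.0)
print(guided)
```

The component shared with the source signal keeps its magnitude, so raising the scale strengthens only the direction introduced by the target prompt.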

Auxiliary Discovery: ControlNet for Image Editing

We also find that ControlNet can be very helpful in the inversion-regeneration pipeline for image editing. We directly use a pretrained Canny-conditioned ControlNet as a plug-and-play component in both our inversion and regeneration processes. With edge information inserted, image inversion becomes more accurate, which reduces the loss of structural information. Another advantage is that users can easily control structural rigidity by adjusting the ControlNet conditioning scale, which most existing ControlNet pipelines support.

The effect of applying ControlNet.
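As a rough picture of what the conditioning scale does, common ControlNet implementations multiply the control branch's residual features by the scale before adding them to the backbone features. The sketch below shows only that blending step on toy lists; the function name and the feature values are hypothetical, and no real model or pipeline is involved.

```python
def apply_control(backbone_features, control_residuals, conditioning_scale):
    """Add scaled ControlNet residuals to the backbone features."""
    return [
        feature + conditioning_scale * residual
        for feature, residual in zip(backbone_features, control_residuals)
    ]


backbone = [0.2, -0.4, 1.0]   # toy backbone activations
residual = [0.5, 0.5, -1.0]   # toy edge-conditioned residuals

free_edit = apply_control(backbone, residual, conditioning_scale=0.0)
rigid_edit = apply_control(backbone, residual, conditioning_scale=1.0)
print(free_edit, rigid_edit)
```

A scale of 0 leaves the backbone untouched, so structure is free to change, while larger values pull the features more strongly toward the edge map.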

InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow
