In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
We use a Vision-Language Model (VLM) to annotate the visual transitions, and then prompt GroundingDINO and SAM2 with the generated textual descriptions to extract segmentation masks for the edited regions.
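For concreteness, below is a minimal sketch of this annotation loop. The callables `vlm_describe`, `extract_phrases`, `ground`, and `segment` are hypothetical stand-ins for the VLM, a phrase extractor, GroundingDINO, and SAM2; the real prompts and model interfaces differ.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One step of an interleaved multimodal session: edit text, frame, masks."""
    edit_text: str
    frame: object             # next video frame (e.g., an RGB array)
    masks: list = field(default_factory=list)


def annotate_video(frames, vlm_describe, extract_phrases, ground, segment):
    """Turn sampled video frames into an interleaved image-text session.

    All four callables are hypothetical stand-ins for the real models:
      vlm_describe(prev, nxt) -> str        # VLM caption of the visual transition
      extract_phrases(text) -> list[str]    # object phrases mentioned in the caption
      ground(frame, phrases) -> list        # GroundingDINO: boxes for each phrase
      segment(frame, boxes) -> list         # SAM2: masks for the edited regions
    """
    session = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        text = vlm_describe(prev, nxt)                 # 1) describe what changed
        boxes = ground(nxt, extract_phrases(text))     # 2) locate the edited objects
        masks = segment(nxt, boxes)                    # 3) mask the edited regions
        session.append(Turn(edit_text=text, frame=nxt, masks=masks))
    return session
```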
Examples of interleaved image-text session data are shown below.
We apply a Diffusion Transformer (DiT) framework to learn from the multimodal interleaved context via three proxy tasks: Next Image Prediction (NIP), Current Segmentation Prediction (CSP), and Next Segmentation Prediction (NSP). Losses are computed only on noised tokens.
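For intuition, here is a minimal training-step sketch of the three-task objective, assuming a generic `model(context, noised, task, t)` callable, per-task clean latent tokens in `targets`, and a simple linear noising schedule; the actual parameterization and noise schedule used in the paper are not shown here.

```python
import torch
import torch.nn.functional as F


def multitask_diffusion_loss(model, context_tokens, targets, timesteps):
    """Sketch of the three-task objective (NIP / CSP / NSP).

    targets: dict mapping task name -> clean latent tokens, e.g.
             {"nip": next_image, "csp": current_seg, "nsp": next_seg}
    context_tokens: clean multimodal context (text + earlier images), attended
                    to with a block-causal mask inside `model`.
    Only the noised target tokens contribute to the loss; the clean context
    tokens serve purely as conditioning.
    """
    losses = {}
    for task, x0 in targets.items():                     # x0: (B, N, D) clean tokens
        noise = torch.randn_like(x0)
        t = timesteps[task]                              # per-sample timesteps, shape (B,)
        alpha = t.view(-1, 1, 1)                         # simple linear schedule for the sketch
        x_t = (1.0 - alpha) * x0 + alpha * noise         # noised target tokens

        # The model sees the clean context plus the noised target block and is
        # asked to predict the noise (epsilon parameterization assumed here).
        pred = model(context=context_tokens, noised=x_t, task=task, t=t)
        losses[task] = F.mse_loss(pred, noise)           # loss on noised tokens only

    return sum(losses.values()), losses
```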
Qualitative comparison between our method (w/ SFT on OmniEdit) and recent baselines (HQ-Edit, UltraEdit, OmniGen, and GPT-4o (GPT Image 1)) on the proposed MSE-Bench.
Zero-shot qualitative results for multi-concept composition (without fine-tuning on downstream task-specific datasets).
Chain-of-Editing (CoE): first generate the segmentation map of the current image (i.e., grounding) and/or of the next image (i.e., layout planning), and then generate the image.
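An illustrative sketch of the CoE inference order, assuming a generic `sample(context, task)` wrapper around the diffusion sampler; the model's real sampling interface is not shown.

```python
def chain_of_editing(sample, context, edit_prompt, do_grounding=True, do_layout=True):
    """Illustrative Chain-of-Editing (CoE) inference order.

    `sample(context, task)` is a hypothetical wrapper that generates one block
    (an image or a segmentation map) conditioned on the multimodal context.
    """
    context = context + [edit_prompt]

    if do_grounding:
        # 1) Segment the current image first (grounding): where should the edit go?
        cur_seg = sample(context, task="csp")
        context = context + [cur_seg]

    if do_layout:
        # 2) Predict the segmentation of the *next* image (layout planning).
        next_seg = sample(context, task="nsp")
        context = context + [next_seg]

    # 3) Finally generate the edited image, conditioned on the planned layout.
    return sample(context, task="nip")
```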
Qualitative comparison between w/o CoE and w/ CoE.
In this setting, the user first provides an editing prompt to localize the RoE. Drag operations are then applied to perform geometric transformations of the RoE (e.g., object displacement, scaling, and rotation). The resulting transformed segmentation map is incorporated into the context, allowing the model to generate a target image that adheres to the specified edits.
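As a rough illustration, a drag operation on the RoE mask can be expressed as an affine transform. The sketch below uses OpenCV and assumes the transformed mask is then appended to the context together with the editing prompt; the model call itself is not shown.

```python
import cv2
import numpy as np


def drag_transform_mask(mask, dx=0.0, dy=0.0, scale=1.0, angle_deg=0.0):
    """Apply a drag-style geometric transform (shift / scale / rotate) to a
    binary RoE mask. The transformed mask is then added to the model context
    as an extra segmentation condition (not shown here)."""
    h, w = mask.shape[:2]
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return mask.copy()
    center = (float(xs.mean()), float(ys.mean()))          # rotate/scale about the RoE centroid

    M = cv2.getRotationMatrix2D(center, angle_deg, scale)  # 2x3 affine matrix
    M[0, 2] += dx                                          # add the drag displacement
    M[1, 2] += dy
    return cv2.warpAffine(mask.astype(np.uint8), M, (w, h), flags=cv2.INTER_NEAREST)
```

For example, `drag_transform_mask(mask, dx=40, dy=0, scale=1.2)` shifts the region 40 pixels to the right and enlarges it by 20% before the result is placed into the context.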
@article{qu2025vincie,
title={VINCIE: Unlocking In-context Image Editing from Video},
author={Qu, Leigang and Cheng, Feng and Yang, Ziyan and Zhao, Qi and Lin, Shanchuan and Shi, Yichun and Li, Yicong and Wang, Wenjie and Chua, Tat-Seng and Jiang, Lu},
journal={arXiv preprint arXiv:2506.10941},
year={2025}
}