VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Baolu Li1*  Yiming Zhang1*  Qinghe Wang1,2* † Liqian Ma3✉ Xiaoyu Shi2 Xintao Wang2 Pengfei Wan2
Zhenfei Yin4 Yunzhi Zhuge1 Huchuan Lu1 Xu Jia1✉
1Dalian University of Technology 2Kling Team, Kuaishou Technology 3ZMO AI Inc. 4Oxford University
* Equal Contribution   † Project Leader   ✉ Corresponding Author
Paper Code
Showcases
  Reference Video Generated Video 1 Generated Video 2 Generated Video 3
Butterfly
Angel Wings
Artistic Clay
Baby Me
Anime Couple
Venom
Blaze
Disintegration
Flow into Minecraft
Freezing
Invisible
Jellycat
Medusa
Poke
Soul Jump
Thunder God
Crush
Earth Fly Away
Garden Bloom
Judge

Generalization to Out-of-Domain Data
  Reference Video Generated Video 1 Generated Video 2 Generated Video 3
Boxing Punch
The Flash
Tiger Snuggle
Fire Breathe
Jelly Drift
Floral Eyes
Magic Hair
Shark
Burst Into Tears

Comparisons with Baseline Methods
  Crumble Dissolve Harley Squish
Ours
Omni-Effects
VFXCreator

How Does It Work?

VFXMaster is a unified, reference-based framework for cinematic visual effect (VFX) generation that reproduces the intricate dynamics and transformations of a reference video onto a user-provided image. It not only achieves outstanding performance on in-domain effects but also generalizes strongly to out-of-domain effects.

Architecture of VFXMaster.
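The in-context conditioning can be pictured as follows: the reference effect video and the target content are encoded into latents and flattened into a single token sequence, so the diffusion transformer can learn the effect from the reference segment while denoising the target segment. Below is a minimal sketch under assumptions; the tensor shapes, the `patchify` helper, and the exact token layout are illustrative and not the released implementation.

```python
import torch

# Hypothetical latent shapes: B clips, T frames, C channels, H x W latent grid (assumption).
B, T, C, H, W = 1, 16, 16, 30, 45

def patchify(latent: torch.Tensor) -> torch.Tensor:
    """Flatten a video latent (B, T, C, H, W) into a token sequence (B, T*H*W, C)."""
    b, t, c, h, w = latent.shape
    return latent.permute(0, 1, 3, 4, 2).reshape(b, t * h * w, c)

# Reference example: an effect clip that demonstrates the desired dynamics in context.
ref_video_latent = torch.randn(B, T, C, H, W)
# Target: noisy latents to be denoised, conditioned on the user-provided image
# (the image conditioning itself is omitted here for brevity).
tgt_noisy_latent = torch.randn(B, T, C, H, W)

ref_tokens = patchify(ref_video_latent)   # (B, N_ref, C)
tgt_tokens = patchify(tgt_noisy_latent)   # (B, N_tgt, C)

# In-context conditioning: one concatenated sequence lets self-attention in the
# video DiT transfer effect dynamics from the reference segment to the target segment.
tokens = torch.cat([ref_tokens, tgt_tokens], dim=1)  # (B, N_ref + N_tgt, C)
print(tokens.shape)  # torch.Size([1, 43200, 16])
```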
Abstract

Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on a one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, limiting both scalability and creative flexibility. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content, and it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example, together with an in-context attention mask that precisely decouples and injects the essential effect attributes, allowing a single unified model to master effect imitation without information leakage. We further propose an efficient one-shot effect adaptation mechanism that rapidly boosts generalization to difficult unseen effects from a single user-provided video. Extensive experiments demonstrate that our method faithfully imitates a wide range of effect categories and generalizes well to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.
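One plausible construction of the in-context attention mask is a block-wise boolean mask over the concatenated sequence: target tokens may attend to the reference video to absorb its effect dynamics, while the reference branch is shielded so that unrelated content does not leak across streams. The sketch below is an assumption about how such decoupling could be implemented; the actual token grouping and masking rule in VFXMaster may differ.

```python
import torch

def build_in_context_mask(n_ref_img: int, n_ref_vid: int, n_tgt: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed) over a sequence laid out as
    [reference image tokens | reference video tokens | target tokens].

    Assumed decoupling rule (illustrative, not the paper's exact design):
      * each group attends to itself, keeping every stream internally coherent;
      * target tokens attend to the reference *video* tokens to inherit the effect,
        but not to the reference image tokens, so the reference subject's
        appearance does not leak into the target;
      * reference tokens never attend to the target, so the conditioning branch
        is unaffected by the sample being generated.
    """
    n = n_ref_img + n_ref_vid + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)

    ref_img = slice(0, n_ref_img)
    ref_vid = slice(n_ref_img, n_ref_img + n_ref_vid)
    tgt = slice(n_ref_img + n_ref_vid, n)

    for group in (ref_img, ref_vid, tgt):      # self-attention within each group
        mask[group, group] = True
    mask[ref_vid, ref_img] = True              # reference video sees its own first frame
    mask[tgt, ref_vid] = True                  # target absorbs effect dynamics only
    return mask

# Tiny example; the resulting mask can be passed as `attn_mask` to
# torch.nn.functional.scaled_dot_product_attention inside each transformer block.
mask = build_in_context_mask(n_ref_img=4, n_ref_vid=8, n_tgt=8)
print(mask.shape)  # torch.Size([20, 20])
```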

BibTeX