VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Baolu Li1*  Yiming Zhang1*  Qinghe Wang1,2* † Liqian Ma3✉ Xiaoyu Shi2 Xintao Wang2 Pengfei Wan2
Zhenfei Yin4 Yunzhi Zhuge1 Huchuan Lu1 Xu Jia1✉
1Dalian University of Technology 2Kling Team, Kuaishou Technology 3ZMO AI Inc. 4Oxford University
* Equal Contribution   † Project Leader   ✉ Corresponding Author
Paper Code
Showcases
  Reference Video Generated Video 1 Generated Video 2 Generated Video 3
Butterfly
Angel Wings
Artistic Clay
Baby Me
Anime Couple
Venom
Blaze
Disintegration
Flow into Minecraft
Freezing
Invisible
Jellycat
Medusa
Poke
Soul Jump
Thunder God
Crush
Earth Fly Away
Garden Bloom
Judge

Generalization to Out-of-Domain Data
  Reference Video Generated Video 1 Generated Video 2 Generated Video 3
Boxing Punch
The Flash
Tiger Snuggle
Fire Breathe
Jelly Drift
Floral Eyes
Magic Hair
Shark
Burst Into Tears

Comparisons with Baseline Methods
  Crumble Dissolve Harley Squish
Ours
Omni-Effects
VFXCreator

How Does It Work?

VFXMaster is a unified, reference-based framework for cinematic visual effect (VFX) generation that reproduces the intricate dynamics and transformations of a reference video onto a user-provided image. It not only achieves outstanding performance on in-domain effects but also generalizes strongly to out-of-domain effects.

Architecture of VFXMaster.
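The in-context conditioning can be pictured as follows: the reference effect video and the target content are encoded into latents and flattened into a single token sequence, so the diffusion transformer can learn the effect from the reference segment while denoising the target segment. Below is a minimal sketch under assumptions; the tensor shapes, the `patchify` helper, and the exact token layout are illustrative and not the released implementation.

```python
import torch

# Hypothetical latent shapes: B clips, T frames, C channels, H x W latent grid (assumption).
B, T, C, H, W = 1, 16, 16, 30, 45

def patchify(latent: torch.Tensor) -> torch.Tensor:
    """Flatten a video latent (B, T, C, H, W) into a token sequence (B, T*H*W, C)."""
    b, t, c, h, w = latent.shape
    return latent.permute(0, 1, 3, 4, 2).reshape(b, t * h * w, c)

# Reference example: an effect clip that demonstrates the desired dynamics in context.
ref_video_latent = torch.randn(B, T, C, H, W)
# Target: noisy latents to be denoised, conditioned on the user-provided image
# (the image conditioning itself is omitted here for brevity).
tgt_noisy_latent = torch.randn(B, T, C, H, W)

ref_tokens = patchify(ref_video_latent)   # (B, N_ref, C)
tgt_tokens = patchify(tgt_noisy_latent)   # (B, N_tgt, C)

# In-context conditioning: one concatenated sequence lets self-attention in the
# video DiT transfer effect dynamics from the reference segment to the target segment.
tokens = torch.cat([ref_tokens, tgt_tokens], dim=1)  # (B, N_ref + N_tgt, C)
print(tokens.shape)  # torch.Size([1, 43200, 16])
```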
Abstract

Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on a one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, limiting both scalability and creative flexibility. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content, and it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example, together with an in-context attention mask that precisely decouples and injects the essential effect attributes, allowing a single unified model to master effect imitation without information leakage. We further propose an efficient one-shot effect adaptation mechanism that rapidly boosts generalization to difficult unseen effects from a single user-provided video. Extensive experiments demonstrate that our method faithfully imitates a wide range of effect categories and generalizes well to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.
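One plausible construction of the in-context attention mask is a block-wise boolean mask over the concatenated sequence: target tokens may attend to the reference video to absorb its effect dynamics, while the reference branch is shielded so that unrelated content does not leak across streams. The sketch below is an assumption about how such decoupling could be implemented; the actual token grouping and masking rule in VFXMaster may differ.

```python
import torch

def build_in_context_mask(n_ref_img: int, n_ref_vid: int, n_tgt: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed) over a sequence laid out as
    [reference image tokens | reference video tokens | target tokens].

    Assumed decoupling rule (illustrative, not the paper's exact design):
      * each group attends to itself, keeping every stream internally coherent;
      * target tokens attend to the reference *video* tokens to inherit the effect,
        but not to the reference image tokens, so the reference subject's
        appearance does not leak into the target;
      * reference tokens never attend to the target, so the conditioning branch
        is unaffected by the sample being generated.
    """
    n = n_ref_img + n_ref_vid + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)

    ref_img = slice(0, n_ref_img)
    ref_vid = slice(n_ref_img, n_ref_img + n_ref_vid)
    tgt = slice(n_ref_img + n_ref_vid, n)

    for group in (ref_img, ref_vid, tgt):      # self-attention within each group
        mask[group, group] = True
    mask[ref_vid, ref_img] = True              # reference video sees its own first frame
    mask[tgt, ref_vid] = True                  # target absorbs effect dynamics only
    return mask

# Tiny example; the resulting mask can be passed as `attn_mask` to
# torch.nn.functional.scaled_dot_product_attention inside each transformer block.
mask = build_in_context_mask(n_ref_img=4, n_ref_vid=8, n_tgt=8)
print(mask.shape)  # torch.Size([20, 20])
```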

BibTeX