ComfyUI Wan Video Guide: Text-to-Video & Image-to-Video Generation
Complete guide to generating AI videos with Wan 2.1 and Wan 2.2 models in ComfyUI — model downloads, T2V and I2V workflows, and VRAM options.
What is Wan?
Wan is an open-source video generation model family from Alibaba, licensed under Apache 2.0 (commercial use allowed). It covers text-to-video (T2V) and image-to-video (I2V) generation with two main releases:
| Version | Release | Key Feature |
|---|---|---|
| Wan 2.1 | Feb 2025 | Solid baseline, 14B and 1.3B parameter versions |
| Wan 2.2 | Mid 2025 | MoE architecture, film-level aesthetics, 5B hybrid model, first-last frame generation |
Hardware Requirements
| Model | VRAM | Notes |
|---|---|---|
| Wan 2.2 5B (Hybrid) | 8 GB+ | Best entry point — supports both T2V and I2V |
| Wan 2.1/2.2 14B FP8 | 12–16 GB | Good balance |
| Wan 2.1/2.2 14B FP16 | 16–24 GB | Best quality |
| Wan 2.1 GGUF Q4 | 8 GB+ | Quantized for lower VRAM |

If you're new to video generation, start with the Wan 2.2 5B model. It handles both text-to-video and image-to-video in a single model and works on 8 GB VRAM with ComfyUI's native offloading.
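The VRAM figures above roughly track raw weight size: parameter count times bytes per parameter, plus overhead for activations and the text encoder. A quick back-of-envelope sketch (the 4.5 bits/param figure for Q4 is an approximation that includes quantization scales; offloading is why the 5B FP16 model can run on 8 GB despite ~10 GB of weights):

```python
def weight_size_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate size of model weights in GB: params x bits / 8."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# Wan 14B at common precisions (GGUF Q4 uses ~4.5 bits/param incl. scales)
for label, bits in [("FP16", 16), ("FP8", 8), ("GGUF Q4", 4.5)]:
    print(f"14B {label}: ~{weight_size_gb(14, bits):.1f} GB weights")
print(f"5B FP16: ~{weight_size_gb(5, 16):.1f} GB weights")
```

Actual peak VRAM is higher than the weight size alone, since latents and the UMT5 text encoder also occupy memory during sampling.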
Wan 2.2: Recommended Starting Point
Wan 2.2 introduces a MoE (Mixture of Experts) architecture with separate high-noise and low-noise expert models for better quality. The 5B hybrid model is ideal for beginners — it handles both T2V and I2V in a single model.
Wan 2.2 5B Setup (Easiest)
Models (place in corresponding folders):
| File | Location | Download |
|---|---|---|
| wan2.2_ti2v_5B_fp16.safetensors | models/diffusion_models/ | HuggingFace |
| wan2.2_vae.safetensors | models/vae/ | HuggingFace |
| umt5_xxl_fp8_e4m3fn_scaled.safetensors | models/text_encoders/ | HuggingFace |
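One way to confirm all three files landed in the right folders before launching is a small check script. This is a sketch; `COMFY_ROOT` is an assumed install path you should adjust to your own:

```python
from pathlib import Path

COMFY_ROOT = Path("ComfyUI")  # assumed install location; adjust to yours

# Expected folder -> filename mapping from the table above
EXPECTED = {
    "models/diffusion_models": "wan2.2_ti2v_5B_fp16.safetensors",
    "models/vae": "wan2.2_vae.safetensors",
    "models/text_encoders": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
}

def check_models(root: Path) -> list[str]:
    """Return relative paths of missing model files; empty if all are in place."""
    return [f"{d}/{f}" for d, f in EXPECTED.items() if not (root / d / f).exists()]

for missing in check_models(COMFY_ROOT):
    print("missing:", missing)
```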
Workflow: Update ComfyUI to the latest version, then go to Workflows → Browse Templates → Video and select "Wan2.2 5B video generation".
Steps:
- Load the diffusion model, text encoder, and VAE in the corresponding nodes
- Write a video description in the CLIP Text Encoder node
- (Optional) Load an image for I2V mode — enable the Load Image node with `Ctrl+B`
- Adjust frame count via the `length` parameter
- Click Run (`Ctrl+Enter`)
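The same run can also be queued programmatically through ComfyUI's HTTP API (`POST /prompt` on the default port 8188). A minimal sketch, assuming a workflow JSON exported from ComfyUI in API format — the filename here is a placeholder:

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # default ComfyUI server address

def build_prompt_request(workflow: dict) -> urllib.request.Request:
    """Wrap an API-format workflow dict in a POST request to /prompt."""
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=data,
        headers={"Content-Type": "application/json"},
    )

def queue_workflow(workflow: dict) -> dict:
    """Submit the workflow to a running ComfyUI instance; returns the queue response."""
    with urllib.request.urlopen(build_prompt_request(workflow)) as resp:
        return json.load(resp)  # includes the queued prompt_id

# workflow = json.load(open("wan22_5b_t2v_api.json"))  # your exported workflow
# print(queue_workflow(workflow))
```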
Wan 2.2 14B T2V Setup
For higher quality text-to-video, the 14B version uses two diffusion models (high-noise and low-noise experts):
| File | Location | Download |
|---|---|---|
| wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors | models/diffusion_models/ | HuggingFace |
| wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors | models/diffusion_models/ | HuggingFace |
| wan_2.1_vae.safetensors | models/vae/ | HuggingFace |
| umt5_xxl_fp8_e4m3fn_scaled.safetensors | models/text_encoders/ | Same as above |
Wan 2.2 14B I2V Setup
For image-to-video, download the I2V-specific diffusion models:
| File | Download |
|---|---|
| wan2.2_i2v_high_noise_14B_fp16.safetensors | HuggingFace |
| wan2.2_i2v_low_noise_14B_fp16.safetensors | HuggingFace |
Wan 2.2 First-Last Frame Video
A unique mode that generates a video transitioning from a start frame to an end frame. Uses the same I2V models — load two images as first and last frames, and ComfyUI interpolates the motion between them.
Wan 2.1 Setup (Alternative)
Wan 2.1 remains a solid option with broader community tooling (Kijai wrapper, GGUF versions).
ComfyUI Native T2V
| File | Location | Download |
|---|---|---|
| wan2.1_t2v_14B_fp8_e4m3fn.safetensors | models/diffusion_models/ | HuggingFace |
| umt5_xxl_fp8_e4m3fn_scaled.safetensors | models/text_encoders/ | HuggingFace |
| wan_2.1_vae.safetensors | models/vae/ | HuggingFace |
For I2V, also download:
- I2V diffusion model: 480p or 720p
- CLIP Vision: clip_vision_h.safetensors (place in
models/clip_vision/)
GGUF Version (Low VRAM)
Requires the ComfyUI-GGUF plugin.
| File | Download |
|---|---|
| T2V GGUF models | city96/Wan2.1-T2V-14B-gguf |
| I2V GGUF models | city96/Wan2.1-I2V-14B-720P-gguf |
T2V and I2V use separate diffusion models. Make sure you download the correct one for your workflow — they are not interchangeable.
Saving Videos as MP4
ComfyUI's default output is .webp. To save as MP4, install the ComfyUI-VideoHelperSuite plugin and use the Video Combine node. All generated videos are saved to ComfyUI/output/.
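If you would rather not install a plugin, an already-saved animation can be converted with ffmpeg (assuming ffmpeg is on your PATH; note that only fairly recent ffmpeg builds decode animated WebP, older ones read just the first frame). A sketch that builds the conversion command, with example filenames:

```python
import subprocess

def webp_to_mp4_cmd(src: str, dst: str, fps: int = 16) -> list[str]:
    """Build an ffmpeg command converting an animation to H.264 MP4.
    yuv420p and even dimensions keep the output widely playable."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=trunc(iw/2)*2:trunc(ih/2)*2,fps={fps}",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        dst,
    ]

# subprocess.run(webp_to_mp4_cmd("ComfyUI/output/ComfyUI_00001_.webp", "out.mp4"), check=True)
```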
Common Issues and Fixes
Out of memory during video generation
- Use the Wan 2.2 5B model (works on 8 GB VRAM)
- Use FP8 or GGUF quantized models
- Reduce resolution (480p instead of 720p)
- Reduce frame count
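The last two fixes work because activation memory grows roughly with the latent volume: frames × (height/8) × (width/8) after VAE downscaling. A rough comparison (the ×8 spatial factor matches typical video VAEs; the channel count and per-element cost are assumptions, and Wan's VAE also compresses temporally, so treat this as scaling intuition only):

```python
def latent_elements(frames: int, height: int, width: int, ch: int = 16) -> int:
    """Approximate latent element count after 8x spatial VAE downscaling."""
    return frames * (height // 8) * (width // 8) * ch

e720 = latent_elements(81, 720, 1280)  # 81 frames at 720p
e480 = latent_elements(81, 480, 832)   # same length at 480p
print(f"720p is ~{e720 / e480:.1f}x the latent volume of 480p")
```

Halving the frame count halves the latent volume; dropping from 720p to 480p cuts it by more than half, which is why these are the first knobs to turn on an OOM.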
Video has visual drift or inconsistent motion
- Wan 2.2's MoE architecture significantly reduces drift compared to 2.1
- Write more specific motion descriptions in your prompt
- Try first-last frame mode for controlled transitions
T2V model specified but I2V model needed (or vice versa)
- T2V and I2V use separate diffusion models — make sure you download the correct one
- I2V workflows also require a CLIP Vision model that T2V does not
Models don't appear in node dropdown
- Verify files are in the correct folder (`diffusion_models/`, not `checkpoints/`)
- Restart ComfyUI after adding new model files
Related Guides
- HunyuanVideo Guide — Tencent's video generation model
- FramePack Guide — Low-VRAM video generation with FramePack
- Text to Image — Basic ComfyUI image generation