Wan 2.2 + Qwen Image: Alibaba AI Content Creation Pipeline

This is a practical recap of the latest updates to Wan 2.2 and the Qwen Image models, and how they can be combined into a single, local workflow for content creation. The goal is to build a consistent pipeline: generate an image, edit it with precision, and animate it into a coherent video—all with local models inside ComfyUI.

I’ll walk through the exact models, files, and workflow I used, including a post-apocalyptic scene built with Qwen Image, refined with Qwen Image Edit, and animated with Wan 2.2.

Models and Files Used

The pipeline combines four key components: a base image model, an image editing model, a control module for guidance, and a video model for animation. Everything runs locally.

Wan 2.2 Animate

Wan 2.2 Animate handles the animation stage. It takes a still image and a motion reference, extracts pose data, and produces a coherent animated clip that follows the motion reference while preserving the visual identity of the source image.

Qwen Image (FP8)

Qwen Image generates the initial image. I used the FP8 variant for speed and consistency in ComfyUI. It works well with ControlNet for pose and structure guidance.

Qwen Image Edit 2509

Qwen Image Edit 2509 is built for macro-level edits while preserving untouched regions. It works reliably for object insertion and targeted adjustments when paired with a strong base image.

InstantX ControlNet Union

The InstantX ControlNet Union model provides multiple ControlNet types in a single file. For this workflow, I used depth guidance to preserve background structure while shaping the character pose.

Model Overview

| Component | Role | Notes |
| --- | --- | --- |
| Qwen Image (FP8) | Base image generation | Works well with ControlNet. Good default for character and object creation. |
| InstantX ControlNet Union | Structural guidance | Single file covering most ControlNet types; depth guidance used here. |
| Qwen Image Edit 2509 | Precision image editing | Preserves original details; strong at object insertion and structural edits. |
| Wan 2.2 Animate | Image-to-video animation | Motion guided by a reference video, with pose extraction and face crops. |

Setting Up in ComfyUI

The entire workflow runs inside ComfyUI. Templates and model folders make setup straightforward, with only minimal tweaks needed for quality improvements.

Download and Place Files

  • Download Qwen Image (FP8) and place it in your ComfyUI models directory.
  • Download the InstantX ControlNet Union model and drop the files into the ComfyUI ControlNet folder.
  • Download Qwen Image Edit 2509 and place it alongside Qwen Image in the appropriate models directory.
  • Download Wan 2.2 Animate and ensure ComfyUI recognizes it under the video model section.
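
Once the downloads above are in place, it's worth confirming that every file landed where ComfyUI expects it. Here is a minimal sketch, assuming a default ComfyUI folder layout; the file names are placeholders, so substitute the exact names of your downloads:

```python
# Minimal sketch: confirm the model files sit in the expected ComfyUI folders.
# Folder names follow a default ComfyUI install; the file names below are
# placeholders (assumptions), not the official download names.
from pathlib import Path

MODELS = Path("ComfyUI/models")
expected = {
    "diffusion_models": ["qwen_image_fp8.safetensors", "wan2.2_animate.safetensors"],
    "controlnet": ["instantx_controlnet_union.safetensors"],
}

for folder, files in expected.items():
    for name in files:
        path = MODELS / folder / name
        status = "ok" if path.exists() else "MISSING"
        print(f"[{status}] {path}")
```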

Load the Workflow Template

  • Open ComfyUI and go to Browse Templates.
  • Select the Image tab.
  • Load the template named “Qwen Image with InstantX Union ControlNet.”
  • Keep the default prompt and structure. I only applied small quality tweaks described below.

Quality Tweaks

I added a second sampler stage to increase clarity and resolution. The pipeline benefits from a two-pass approach that refines latent details.

  • First pass: generate at 1280x720.
  • Second pass: upsample and resample to 1920x1080 for sharper textures and cleaner geometry.

This simple two-stage setup noticeably improves detail without overcomplicating the graph.
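
In code form, the two-pass idea reduces to a full-denoise pass, a latent upscale, and a partial-denoise refinement. A minimal sketch, with the sampler passed in as a callable since the real work happens in ComfyUI's KSampler and Latent Upscale nodes:

```python
# Sketch of two-pass refinement. sample_fn stands in for a KSampler call;
# in ComfyUI the same flow is KSampler -> Latent Upscale -> KSampler.
import torch
import torch.nn.functional as F

def two_pass(sample_fn, latent_720p: torch.Tensor) -> torch.Tensor:
    # Pass 1: full denoise at the base resolution fixes composition and pose.
    latent = sample_fn(latent_720p, denoise=1.0)
    # Upscale in latent space; 1.5x takes a 720p latent to 1080p scale.
    latent = F.interpolate(latent, scale_factor=1.5, mode="bilinear")
    # Pass 2: partial denoise sharpens texture without repainting the frame.
    return sample_fn(latent, denoise=0.5)
```

The key design choice is the lower denoise value on the second pass: high enough to add detail, low enough to keep the composition established by the first pass.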

Generating the Base Image

The initial image sets the tone for everything that follows. I kept the template prompt and used a reference strictly for control guidance.

Prompt and Guidance

  • Prompt: a post-apocalyptic style description from the template. I kept it unchanged.
  • ControlNet: I used a depth map instead of DW Pose or OpenPose. Depth guidance helps preserve background shapes and environmental structure while locking in the character pose.

I supplied a reference image for the pose. It was used exclusively as a ControlNet reference, not for texture or style transfer.
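
If you want to prepare the depth map outside ComfyUI, any standard depth estimator works. Here is a sketch using the transformers depth-estimation pipeline; the model choice is just an example, and ComfyUI's own depth preprocessor nodes do the same job:

```python
# Derive a depth control image from the pose reference. The estimator model
# is an example choice; any monocular depth model produces a usable map.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
reference = Image.open("pose_reference.png")  # placeholder file name
depth_map = depth_estimator(reference)["depth"]  # PIL image; lighter = closer
depth_map.save("depth_control.png")
```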

Sampling and Resolution

  • First sampler pass generates the base frame.
  • Second sampler upscales the latent representation and runs an additional refinement pass.
  • Output resolution: 1920x1080.

The result is sharper, with clearer textures and defined edges. Two-pass sampling is a reliable baseline for image workflows in ComfyUI.

Editing the Image with Qwen Image Edit 2509

After generating the scene, I made a single macro edit to support the story: an object insertion that fits the context of the image.

Macro Edits That Preserve Details

I prompted Qwen Image Edit to add a baseball bat to the character’s hand. The model added the object while preserving original details, including:

  • Robotic arms with gear components
  • Hoodie texture and folds
  • Torn clothing and surface wear

This edit respected the source image’s identity and style while introducing a new element.

Notes on Model Precision

I used the FP8 version to maintain visual consistency. Heavier quantization can introduce detail loss in areas that should remain unchanged. For object insertion and structural edits, FP8 has been reliable.

I avoided over-editing. One well-targeted change keeps the character coherent and minimizes unintended artifacts.
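
For reference, the same single-instruction edit can be scripted outside ComfyUI. This is a hedged sketch assuming the diffusers-style Qwen-Image-Edit interface; the class name, repository id, and arguments should be checked against your installed diffusers version:

```python
# Hedged sketch of a scripted edit. Pipeline class, repo id, and arguments
# are assumptions based on the diffusers-style interface; verify locally.
import torch
from diffusers import QwenImageEditPipeline
from PIL import Image

pipe = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
).to("cuda")

source = Image.open("base_frame.png")  # placeholder for the generated image
result = pipe(
    image=source,
    prompt="Add a baseball bat to the character's hand.",  # one targeted change
    num_inference_steps=30,
).images[0]
result.save("edited_frame.png")
```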

Animating with Wan 2.2 Animate

With the edited image finalized, the next phase is animation. Wan 2.2 Animate translates the still frame into a moving shot guided by a motion reference.

Motion Reference Setup

I used a stock clip of a woman running that matched the direction and pace I wanted. It was the same source I used earlier for pose guidance, and reusing it keeps the motion naturally aligned with the still image.

The scene concept: the character runs across a collapsed street.

Pose Extraction and Settings

Wan 2.2 Animate extracted:

  • DW Pose for full-body motion
  • Face crops for stable identity

I disabled reference background and character mask outputs. The goal was to keep the generated background from the Qwen Image output, not the one from the motion source. I also skipped extended video generation for background and mask to save memory.

These settings help preserve the source image while ensuring the animation follows the motion faithfully.
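
Collected in one place, the extraction setup looks like this. The key names are illustrative, mirroring the node toggles described above rather than an actual Wan or ComfyUI config schema:

```python
# Illustrative settings bundle for the Wan 2.2 Animate stage. Keys mirror
# the node toggles discussed above; this is not a real config schema.
wan_animate_settings = {
    "source_image": "edited_frame.png",      # the Qwen Image Edit output
    "motion_reference": "running_clip.mp4",  # same clip used for pose guidance
    "extract_dw_pose": True,                 # full-body motion
    "extract_face_crops": True,              # stable identity
    "reference_background": False,           # keep the generated environment
    "character_mask": False,
    "extended_generation": False,            # skipped to save memory
}

for key, value in wan_animate_settings.items():
    print(f"{key:>22}: {value}")
```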

Prompt and Generation

For text conditioning, I used a minimal line: “The character is running across the street.”

Wan 2.2 Animate loaded the video, extracted pose data, handled face crops, and generated the animation. The final output was an 18-second clip with left-to-right motion that matched the original frame’s direction and composition while staying true to the source image’s style.

Putting It All Together: Local Model Workflow

This section summarizes the full process from blank graph to finished video, using only local models.

Step-by-Step Summary

  1. Set up ComfyUI and models

    • Place Qwen Image (FP8), Qwen Image Edit 2509, InstantX ControlNet Union, and Wan 2.2 Animate in their respective folders.
  2. Load the Qwen Image + InstantX ControlNet template

    • Browse Templates > Image tab > “Qwen Image with InstantX Union ControlNet.”
  3. Prepare ControlNet guidance

    • Select depth guidance to preserve environmental structure.
    • Provide a reference image for pose; use it only for ControlNet.
  4. Keep the default prompt

    • Use the included post-apocalyptic prompt for consistency.
  5. Generate the base image

    • Run the first sampler pass to produce the initial 720p image.
  6. Refine and upscale

    • Add a second sampler stage.
    • Upsample latents and run a refinement pass to 1920x1080.
  7. Edit with Qwen Image Edit 2509

    • Issue a concise instruction: add a baseball bat to the character’s hand.
    • Keep edits minimal to avoid drift.
  8. Prepare Wan 2.2 Animate

    • Import the edited image as the source frame.
    • Load the motion reference video (running sequence).
  9. Configure motion extraction

    • Enable DW Pose and face crops.
    • Disable background and mask outputs to preserve the generated environment.
    • Skip extended generation for these to save memory.
  10. Prompt Wan 2.2

    • Use a short line: “The character is running across the street.”
  11. Generate the animation

    • Process pose, face crops, and final frames.
    • Export the 18-second video.
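
With the steps above wired up, the whole run can be made repeatable by exporting the graph in API format (ComfyUI's Save (API Format) option) and queueing it over the local HTTP endpoint. A minimal sketch, assuming a default install on port 8188; the JSON file name is a placeholder for your own export:

```python
# Queue an exported workflow through ComfyUI's local HTTP API. Assumes a
# default install on port 8188; the file name is a placeholder.
import json
import urllib.request

with open("wan_qwen_pipeline_api.json") as f:
    workflow = json.load(f)

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # response includes the queued prompt_id
```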

Resource and Performance Notes

  • Disabling background and mask outputs reduces memory usage significantly.
  • Two-pass sampling in the image stage improves clarity without a heavy compute load.
  • Keeping the motion reference consistent with earlier pose guidance helps align direction and pacing.

Extending the Pipeline

This setup forms a base pipeline for local AI content creation. It can be expanded with authoring tools and alternate animation methods.

Add a Language Model for Writing

Insert a language model at the start to generate scene briefs, shot lists, and character beats. This creates a clear prompt foundation for image and animation stages.

  • Generate scene descriptions and character notes.
  • Draft short scripts or dialogue.
  • Create batch prompts for multi-shot sequences.
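
As a sketch of the idea, a local LLM server can draft these briefs programmatically. This assumes an Ollama-style endpoint on its default port and a model named "llama3"; swap in whatever local setup you actually run:

```python
# Draft a scene brief with a local LLM ahead of the image stage. Endpoint
# and model name are assumptions (Ollama-style defaults); adjust to taste.
import json
import urllib.request

def scene_brief(concept: str) -> str:
    payload = {
        "model": "llama3",
        "prompt": f"Write a one-paragraph visual scene brief for: {concept}",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(scene_brief("post-apocalyptic street, lone runner at dawn"))
```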

Alternative Animation Option

You can also use Wan VACE Fun from Alibaba to animate using only a first and last frame, both of which can be generated in Qwen Image. ControlNet can still guide the motion, much as it does with Wan Animate.

Choose the method based on the movement you want. For a simple running sequence, a direct motion reference works well.

Monthly Wrap-Up Approach

A monthly review of new local models and how they connect in real workflows helps avoid tool overload. The goal is to identify what was updated, where it fits, and how to chain it with existing pieces for a repeatable, end-to-end process.

Practical Notes on Control and Consistency

A clean pipeline depends on stable guidance and minimal prompt drift. Here are a few practical checkpoints that kept results consistent from still image to animation.

ControlNet Choice Matters

  • Depth guidance preserves the environment while shaping pose.
  • Use the same reference for both pose guidance and motion extraction to maintain direction and rhythm.
  • Keep control inputs limited to what you need; extra signals can cause conflicts.

Keep Prompts Minimal and Consistent

  • For the image stage, stick with a focused prompt that matches your scene intent.
  • For the animation stage, avoid verbose prompts; the reference video already dictates motion.
  • Consistency between stages reduces style and identity shifts.

Edit Sparingly

  • Target a single, meaningful edit with Qwen Image Edit (e.g., object insertion).
  • FP8 models are better for preserving textures and small details.
  • Avoid repetitive edits; each pass introduces some risk of drift.

Troubleshooting Tips

Here are quick fixes for common issues without altering the core pipeline.

Motion Doesn’t Align

  • Re-check the direction of motion in the reference video.
  • Confirm the pose reference and motion reference match in orientation.
  • Verify that background and mask outputs are disabled if you want to keep your generated environment.

Detail Loss After Editing

  • Switch to FP8 for the editing model if you used a heavier-quantized variant.
  • Reduce edit complexity; focus on one object or attribute per pass.
  • Re-run the second sampler after editing to restore clarity.

Over-Processing or Artifacts

  • Limit the number of ControlNet inputs to avoid conflicting signals.
  • Keep prompts concise; over-specification can introduce unwanted features.
  • Use two sampling passes instead of piling on additional filter nodes.

Why This Pipeline Works

By combining Qwen Image for generation, Qwen Image Edit for precision edits, and Wan 2.2 for motion, the process stays coherent from concept to final video. ControlNet ties the pose and structure together, while a consistent prompt and motion source prevent identity drift.

  • One consistent prompt guides the visual tone.
  • One consistent motion source guides the physical action.
  • Two-pass sampling preserves detail and resolution.
  • Local models enable iteration without dependency issues.

Example Configuration Snapshot

This is a concise set of parameters that kept outputs consistent and clean in ComfyUI. Adjust to fit your hardware and scene needs.

  • Qwen Image (FP8)

    • Resolution: 1280x720 (first pass), up to 1920x1080 (second pass)
    • Samplers: two-pass refinement
    • ControlNet: Depth (InstantX Union)
  • Qwen Image Edit 2509

    • Edit type: object insertion
    • Instruction format: short, specific phrasing
    • Model precision: FP8
  • Wan 2.2 Animate

    • Inputs: edited image, motion reference video
    • Extraction: DW Pose + face crops
    • Background/mask: disabled for reference video
    • Prompt: a single line tied to action

Checklist for a Clean Run

Use this quick checklist before you generate the final video.

  • Models installed in correct folders
  • Template loaded: Qwen Image with InstantX Union ControlNet
  • Depth guidance selected
  • Two-pass sampling enabled
  • Edit instruction finalized (short and specific)
  • Motion reference direction matches your image composition
  • Background and mask outputs disabled in Wan if keeping the generated environment
  • Minimal prompt for Wan 2.2

Key Takeaways

  • A complete local pipeline is achievable with four components: Qwen Image, InstantX ControlNet Union, Qwen Image Edit 2509, and Wan 2.2 Animate.
  • Depth-based ControlNet guidance preserves environmental structure while shaping pose.
  • Two-pass sampling improves clarity and resolution without complexity.
  • Qwen Image Edit 2509 maintains visual consistency for macro edits when used with FP8 precision.
  • Wan 2.2 Animate turns a single image into a coherent motion clip using pose extraction and face crops from a simple reference video.
  • Keep prompts minimal. Let the motion reference and ControlNet do the structural work.
  • Disable reference backgrounds and masks in Wan 2.2 if you need to preserve your generated environment.
  • For different movement styles, consider Wan VACE Fun with first/last frame animation, or continue with motion references guided by ControlNet.

Final Thoughts

This September brought updates that make local content creation smoother and more practical. Qwen Image and Qwen Image Edit handle stills and edits with control, while Wan 2.2 Animate produces stable motion guided by a simple reference. The result is a clear, repeatable pipeline you can adapt to your own scenes and characters without swapping tools mid-project.
