Multi-Person InfiniteTalk Step-by-Step Setup

This guide walks through running Multi-Person InfiniteTalk in Wan2GP, step by step. The goal is to animate multiple speakers from a single reference image while keeping lip sync accurate and mapping each voice to the correct subject in the frame.

I assume you already have Wan2GP running. For this walkthrough, I ran it on RunPod (A4B). I’ll cover the setup, the short list of gotchas, and the exact configuration that produced clean multi-speaker results at 480p.

The flow follows the same order as my working process:

  • Prepare the correct LoRA by triggering a quick image-to-video generation.
  • Set up InfiniteTalk for multiple speakers with automatic audio separation.
  • Clean the audio (remove background music), then split longer audio to fit memory limits.
  • Create speaker-specific bounding boxes with the Video Mask Creator.
  • Configure generation parameters and LoRA usage.
  • Check the auto-separated audio order and correct it if needed.
  • Generate in segments, enhance, and stitch as needed.

What is Multi-Person InfiniteTalk in Wan2GP?

Multi-Person InfiniteTalk animates two (or more) subjects in a single image so they speak in turn based on provided audio. It:

  • Separates voices into distinct tracks.
  • Maps each voice to a given region (bounding box) in the image.
  • Produces synchronized mouth movements across the timeline.
  • Supports 480p output with a stable, reproducible setup on Wan 2.1.

Setup Overview

  • Platform: Wan2GP (Wan 2.1)
  • Run environment: RunPod (A4B)
  • Mode: InfiniteTalk (Multi-Speakers, 480p)
  • Base setup step: Wan 2.1 Image-to-Video 480p (Fusion EX-14B) for LoRA preload
  • LoRA used: Fusion EX-14B (Fusion X)
  • Input: One reference image with two subjects
  • Audio: Vocals-only duet (male + female), background music removed
  • Audio separation: Built-in (Audio to Speakers)
  • Speaker mapping: Bounding boxes via Video Mask Creator
  • Resolution: 480p (9:6)
  • Inference steps: 14 (multi-speaker)
  • Frame rate: 25 FPS
  • Guidance scale: 1
  • Audio guidance: 6 (or 7)
  • Shift scale: 3
  • Sampler: UniPC/Euler
  • Step skipping: TeaCache
  • Typical segment length: 11–17 seconds (split longer audio to avoid RAM issues)
  • Post-process: Venhancer (optional), then re-add instruments

Key Features of This Workflow

  • Auto speaker separation: Automatically splits two voices into separate tracks.
  • Manual speaker mapping: Explicit control over where each speaker “lives” in the image through bounding boxes.
  • Deterministic mapping: Ensures the correct voice maps to the correct subject by verifying and, if needed, manually reordering separated audio.
  • LoRA-controlled style: Uses Fusion EX-14B (Fusion X) for consistent results.
  • Reproducibility: Uses a fixed seed and clear generation settings.
  • Memory-safe strategy: Splits longer audio into shorter segments for stable runs.
  • Simple enhancement path: Improves clarity with Venhancer, then restores instrumentals.

How it works

  1. A single image with multiple subjects is used as the visual base.
  2. The audio is provided as a duet or two-person track.
  3. Wan2GP separates the audio into two tracks (or you supply two tracks yourself).
  4. You define bounding boxes for each speaker with the Video Mask Creator so the model knows who speaks where.
  5. The model generates frames at 480p with mouth movements aligned to the assigned audio.
  6. For longer audio, runs are split into manageable segments to avoid out-of-memory errors, then stitched afterward.
  7. Optional: Enhance clarity with Venhancer and reintroduce background instruments.

How to use Multi-Person InfiniteTalk (Step by Step)

1) Preload the LoRA via Image-to-Video

Before InfiniteTalk, make sure the Fusion EX-14B LoRA is present in your I2V LoRA directory.

  • Open Wan2GP.
  • Go to Wan 2.1 → Image-to-Video → 480p → Fusion EX-14B.
  • Kick off a short generation. This ensures the correct LoRA downloads into the I2V LoRA directory (shown as LoRAs I2V).
  • Once the LoRA downloads, you can abort the generation or let it finish. Confirm the LoRA file exists in the expected folder (a quick check is sketched below).
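
If you prefer to confirm the download from a terminal rather than the UI, a minimal check like the one below works. The folder path is an assumption about your install layout, so point it at wherever your Wan2GP setup stores I2V LoRAs.

    # Quick sanity check that the Fusion LoRA landed in the I2V LoRA folder.
    # NOTE: the path below is an assumption; adjust it to your install's LoRAs I2V directory.
    from pathlib import Path

    lora_dir = Path("Wan2GP/loras_i2v")      # adjust to your install
    matches = sorted(lora_dir.glob("*usion*"))  # catches Fusion / FusionX naming

    if matches:
        for f in matches:
            print(f"found: {f.name} ({f.stat().st_size / 1e6:.0f} MB)")
    else:
        print(f"no Fusion LoRA found in {lora_dir} - rerun the I2V preload step")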

You will use this LoRA later in InfiniteTalk.

2) Open InfiniteTalk (Multi-Speakers) and Set Basic Options

  • Open InfiniteTalk Multi-Speakers 480p (Wan 2.1).
  • Select Image-to-Video with Smooth Transitions.
  • Drag in your reference image (e.g., an image with two subjects).
  • In audio options, select Audio to Speakers to auto-separate the voices.
  • Drag in the vocals-only duet audio.

Notes on audio preparation:

  • I removed background music beforehand using Moises.ai. This reduces variables during generation and lets me confirm the vocals are clean.
  • If you want Wan2GP to remove background music, use the “video motion ignores background music” option. This is convenient but you won’t be able to confirm the removal quality in advance.
  • Summary strategy used here: Wan2GP handles voice separation (two speakers); background music removal is done externally.

3) Keep Audio Segments Short to Avoid RAM Limits

  • My audio was 27 seconds long. I’ve found Wan2GP can run out of RAM once a 480p multi-speaker clip reaches the high-20-second range.
  • Split longer audio into two parts at a natural gap; you’ll generate twice and stitch later.
  • For the first part, I targeted about 11 seconds; for the second, about 17 seconds. (A scripted splitting sketch follows this list.)
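
If you’d rather split from a script than from an audio editor, here is a minimal sketch using pydub (an external library, not part of Wan2GP). The filenames and the 11-second cut point are assumptions; pick a cut that falls in a natural pause in your own track.

    # Minimal sketch: split a ~27 s vocals-only track into two segments at a quiet gap.
    # Assumes pydub (and ffmpeg) are installed; filenames and the 11 s cut are examples.
    from pydub import AudioSegment

    vocals = AudioSegment.from_file("duet_vocals.wav")  # hypothetical input file
    cut_ms = 11_000  # cut at ~11 s, placed in a natural pause

    part1 = vocals[:cut_ms]   # ~11 s -> first generation run
    part2 = vocals[cut_ms:]   # ~16-17 s -> second generation run

    part1.export("duet_vocals_part1.wav", format="wav")
    part2.export("duet_vocals_part2.wav", format="wav")
    print(len(part1) / 1000, "s and", len(part2) / 1000, "s")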

4) Create Bounding Boxes for Each Speaker

Use Wan2GP’s Video Mask Creator to precisely define where each subject is located.

  • Open the Video Mask Creator.
  • Load your reference image.
  • Click on subject A repeatedly until you get a tight selection.
  • Click Add Mask, then Image Matting.
  • Scroll down to view the result. The mask preview itself isn’t needed; what you want is the bounding box (the coordinates showing where the subject sits).
  • Copy the bounding box coordinates.

Paste these coordinates into InfiniteTalk:

  • In InfiniteTalk, find Speaker Locations (the field for location entries).
  • Paste the bounding box for the first speaker slot. This is just a placeholder for now; you’re gathering both locations first.

Now capture subject B’s location:

  • Clearing the click selections sometimes fails to isolate the second subject: if you simply add another mask and run Image Matting again, both subjects may end up inside a single bounding box, which is wrong.
  • Workaround: Remove the image from the Mask Creator and re-upload it to start fresh.
  • Click on subject B until it isolates properly.
  • Add Mask → Image Matting.
  • Copy the bounding box for subject B.

Back in InfiniteTalk:

  • Add the second bounding box after a space in the speaker locations field.
  • Now you have two bounding boxes: the first for subject A, the second for subject B.
  • The separated audio tracks will map to these in order: audio track 1 → bounding box 1; audio track 2 → bounding box 2.

Important: The automatic audio separator is not deterministic about which voice becomes “track 1” or “track 2,” so verify the order before letting the full generation proceed (covered in step 6).

5) Configure Generation Settings (First Segment)

Set the following:

  • Prompt: “The dogs are singing.” (Adapt to your image content.)
  • Resolution: 480p (9:6).
  • Frames: 275 for ~11 seconds at 25 FPS. The rule is frames = 25 × seconds (see the helper after this list).
  • Inference steps: 14. Single-person runs can often work at 8–10, but multi-speaker benefits from 14.
  • Seed: Set a specific number for reproducibility.
  • Guidance: 1.
  • Audio guidance: 6 (7 can also work).
  • Shift scale: 3.
  • Sampler: UniPC/Euler.
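
If you don’t want to do the frame math by hand, a tiny helper like this (purely illustrative, not part of Wan2GP) computes the number to enter in the Frames field:

    # Frame count for a segment: frames = fps * seconds (25 FPS in this workflow).
    def frames_for(seconds: float, fps: int = 25) -> int:
        return round(seconds * fps)

    print(frames_for(11))  # 275 -> first segment
    print(frames_for(17))  # 425 -> second segment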

LoRA tab:

  • Select Fusion EX-14B (Fusion X) LoRA.
  • Strength: 1.0.

Step Skipping:

  • Select TeaCache.

Start the generation.

If you see an error about a missing module:

  • Error: “No module named 'pyannote'”
  • Fix: install it with pip install pyannote.audio
  • Restart the generation.
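
If you want to confirm the dependency is present before committing to another long run, a quick import check works. This assumes the missing module is pyannote.audio (the speaker-diarization package the error points at):

    # Quick check that the speaker-separation dependency is importable.
    try:
        import pyannote.audio  # installed via: pip install pyannote.audio
        print("pyannote.audio is available")
    except ImportError:
        print("missing - run: pip install pyannote.audio")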

6) Verify the Auto-Separated Audio Order

As the generation starts, the system will first run audio separation. This produces two temporary files, often named something like:

  • audio_temp_1
  • audio_temp_2

Open your file browser (e.g., Jupyter) and find these files in the outputs directory (a small script for locating them follows the checklist below). Download them and listen quickly:

  • Confirm which voice is in temp 1 and temp 2.
  • Make sure temp 1 matches the subject assigned to bounding box 1.
  • Make sure temp 2 matches the subject assigned to bounding box 2.
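
If you’re working over Jupyter or SSH, a listing like the following makes the newest separation outputs easy to grab. The outputs path and the audio_temp naming are taken from this particular run and may differ on your install.

    # List the most recent separated audio files so you can download and check them.
    # The outputs path and the "audio_temp" naming are assumptions from this run.
    from pathlib import Path

    out_dir = Path("Wan2GP/outputs")  # adjust to your install
    temps = sorted(out_dir.glob("*audio_temp*"),
                   key=lambda p: p.stat().st_mtime,
                   reverse=True)

    for f in temps[:2]:  # newest two: temp 1 and temp 2
        print(f.name)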

If the order is correct, proceed.

If the order is wrong:

  • Abort the generation.
  • In the multi-speaker section, select the option for two speakers with separate inputs.
  • Drag temp 1 into the first audio slot and temp 2 into the second audio slot manually.
  • This explicitly binds the first audio upload to bounding box 1 and the second upload to bounding box 2.
  • Start the generation again.

7) Generate the First Segment

  • Let it run. The 11-second segment took about 30 minutes at 480p with the settings above.
  • Inspect the result. Lip sync should be accurate across both subjects.
  • If the base output appears a bit soft, plan to enhance it after both segments are complete.

8) Generate the Second Segment

  • Trim the remaining audio so the second part covers the rest (e.g., ~17 seconds).
  • Update Frames to 425 for ~17 seconds at 25 FPS.
  • Keep the same settings, LoRA, and sampler.
  • Run again.

Verify the audio separation order again:

  • Because it’s a separate run, the temp files may flip.
  • Listen to the new temp 1 and temp 2.
  • If the assignment is wrong, abort and manually upload temp 1 to slot 1 and temp 2 to slot 2, then rerun.

This second segment took about 50 minutes. The output should again have strong lip sync.

9) Optional: Enhance and Re-add Instruments

  • If clarity is not where you want it, run the outputs through Venhancer.
  • If you removed background music initially, re-add the instrument track during editing to finalize the mix.
  • Stitch the two segments at the original audio cut point. (An ffmpeg-based sketch follows this list.)
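
One way to do the stitch and the instrument re-add outside a video editor is with ffmpeg. The sketch below assumes ffmpeg is on your PATH and uses placeholder filenames, so treat it as a starting point rather than the exact pipeline used here.

    # Sketch: concatenate the two generated segments, then mix the instrumental back in.
    # Assumes ffmpeg is installed; all filenames are placeholders.
    import subprocess
    from pathlib import Path

    # 1) Concatenate segment1.mp4 + segment2.mp4 (same codec/resolution, so stream copy works).
    Path("list.txt").write_text("file 'segment1.mp4'\nfile 'segment2.mp4'\n")
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "list.txt", "-c", "copy", "stitched.mp4"], check=True)

    # 2) Mix the instrumental track back over the generated vocal audio.
    subprocess.run(["ffmpeg", "-y", "-i", "stitched.mp4", "-i", "instruments.wav",
                    "-filter_complex", "[0:a][1:a]amix=inputs=2:duration=first[a]",
                    "-map", "0:v", "-map", "[a]",
                    "-c:v", "copy", "-c:a", "aac", "final.mp4"], check=True)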

Settings Reference

Core Video Settings

  • Resolution: 480p (9:6)
  • Frame Rate: 25 FPS
  • Frames per segment:
    • 11 seconds → 275 frames
    • 17 seconds → 425 frames

Sampler and Scales

  • Sampler: UniPC/Euler
  • Inference steps: 14 (recommended for multi-speaker)
  • Guidance: 1
  • Audio guidance: 6 (7 also reasonable)
  • Shift scale: 3
  • Seed: Fixed value for repeatable results

LoRA

  • Fusion EX-14B (Fusion X)
  • Strength: 1.0
  • Preload via Wan 2.1 Image-to-Video 480p generation

Step Skipping

  • TeaCache

Audio Separation and Mapping

  • Use “Audio to Speakers” initially.
  • Verify temp audio files (temp 1 and temp 2).
  • If order is wrong, switch to manual two-speaker uploads and map temp 1 to slot 1 and temp 2 to slot 2.

Troubleshooting and Gotchas

Audio Separation Order Is Inconsistent

  • Symptom: The male voice appears in temp 2 during one run and temp 1 in another.
  • Fix: Abort, switch to manual two-slot input, drag temp 1 to slot 1 and temp 2 to slot 2, and rerun.

Second Bounding Box Captures Both Subjects

  • Symptom: After adding a second mask, the bounding box covers both subjects.
  • Fix: Remove the image from the Mask Creator and re-upload it. Select only the second subject from a fresh start, then Add Mask → Image Matting and copy the bounding box.

Out-of-Memory with Longer Audio

  • Symptom: Crashes or failures when generating ~25–30 seconds at 480p multi-speaker.
  • Fix: Split the audio into two parts at a silent gap. Generate each segment separately. Stitch afterward.

Missing Python Module (pyannote)

  • Symptom: Error “No module named 'pyannote'”.
  • Fix: Install it with pip install pyannote.audio and rerun the generation.

Soft or Noisy Output

  • Symptom: Lip sync is accurate but the video looks soft.
  • Fix: Run Venhancer and re-add the instrument track in editing.

FAQs

Do I need to remove background music before separation?

No. Wan2GP can ignore background music during motion processing if you enable the option. I prefer removing it beforehand (e.g., with Moises.ai) so I can confirm the vocal stems are clean before generation.

How do I ensure each voice maps to the correct subject?

Verify the separated audio temp files after the split. If temp 1 aligns with the first bounding box voice and temp 2 with the second, continue. If not, abort and use the manual two-slot input so you can explicitly map temp 1 → slot 1 and temp 2 → slot 2.

Why split into two segments?

To avoid memory issues at 480p multi-speaker. Splitting a ~27-second track into ~11s and ~17s is a stable approach. Adjust based on your hardware.

Which LoRA should I use?

Fusion EX-14B (Fusion X) at strength 1.0. Preload it by running a short Wan 2.1 Image-to-Video 480p generation, then select it in InfiniteTalk under the LoRA tab.

What inference steps work best?

For multi-speaker, 14 steps provided better stability than the 8–10 range used for single-person runs.

Do I need a fixed seed?

A fixed seed helps with reproducibility across segments and retries.

What sampler should I select?

UniPC/Euler worked well in this setup.

How do I get precise speaker locations?

Use Video Mask Creator:

  • Load the image, click the subject until the selection is clean.
  • Add Mask → Image Matting.
  • Copy the bounding box and paste into the speaker locations field in InfiniteTalk.
  • Re-upload the image when isolating the second subject to avoid combined selections.

Can I run this at higher than 480p?

This walkthrough targets 480p based on the tools and VRAM available. Higher resolutions may further increase memory demand. If you try higher resolutions, expect to shorten segments or scale down steps.

How long do runs take?

On RunPod (A4B), ~11 seconds took about 30 minutes, and ~17 seconds took about 50 minutes at 480p with the settings listed.


How the pieces fit together

Audio flow

  • Start with a duet or two-speaker track.
  • Remove the instrumental (optional but recommended).
  • Feed the vocals into the multi-speaker split.
  • Verify temp 1 and temp 2.
  • Map to bounding boxes.

Visual flow

  • Use a single reference image.
  • Create one bounding box per subject.
  • Generate frames at 480p, 25 FPS.
  • For long content, split into segments, then stitch.

Post-processing

  • Enhance with Venhancer if needed.
  • Re-add instruments to the final output.

Step-by-Step Quick Reference

Segment 1

  1. Preload Fusion EX-14B via a quick Wan 2.1 I2V 480p generation.
  2. Open InfiniteTalk Multi-Speakers 480p.
  3. Enable Image-to-Video with Smooth Transitions.
  4. Load the reference image.
  5. Select Audio to Speakers; drag in vocals-only audio.
  6. Open Video Mask Creator; isolate subject A; Add Mask → Image Matting; copy bounding box → paste as first location.
  7. Re-upload image; isolate subject B; Add Mask → Image Matting; copy bounding box → paste as second location.
  8. Configure:
    • 480p (9:6), 275 frames at 25 FPS
    • Inference steps: 14
    • Guidance: 1
    • Audio guidance: 6
    • Shift scale: 3
    • Sampler: UniPC/Euler
    • LoRA: Fusion EX-14B (1.0)
    • Step skipping: TeaCache
    • Seed: fixed number
  9. Generate. If you hit the error “No module named 'pyannote',” install it with pip install pyannote.audio and rerun.
  10. Verify temp 1 and temp 2 voices match bounding boxes. If mismatched, abort, upload temp 1 → slot 1, temp 2 → slot 2, and regenerate.
  11. Save output.

Segment 2

  1. Trim audio to the remaining portion (~17 seconds).
  2. Update frames to 425 at 25 FPS.
  3. Keep the same settings, LoRA, and bounding boxes.
  4. Generate.
  5. Verify temp 1 and temp 2 again; fix mapping via manual upload if needed.
  6. Save output.

Finalization

  1. Enhance with Venhancer (optional).
  2. Re-add instruments.
  3. Stitch the two segments at the cut point.

Practical Notes

  • The Video Mask Creator’s bounding boxes are the key to accurate speaker placement. The mask preview is secondary; focus on the bounding box coordinates.
  • The auto-separation is helpful but not guaranteed to keep a stable voice order. Always verify the temp files before committing compute time to a full run.
  • If you plan for longer content, organize your timing and cut points up front so stitching is smooth.
  • Keep a record of your seed and settings, especially if you plan to iterate or fix sections later. (A simple log format is sketched below.)
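
A lightweight way to keep that record is a small JSON file per run. The fields below mirror the settings from this guide; the format itself is just a suggestion, not something Wan2GP reads.

    # Example run log so segments and retries stay reproducible (format is only a suggestion).
    import json

    run = {
        "segment": 1,
        "resolution": "480p (9:6)",
        "frames": 275,
        "fps": 25,
        "steps": 14,
        "guidance": 1,
        "audio_guidance": 6,
        "shift_scale": 3,
        "sampler": "UniPC/Euler",
        "lora": "Fusion EX-14B @ 1.0",
        "seed": 123456,  # use your actual fixed seed
    }

    with open("run_log_segment1.json", "w") as f:
        json.dump(run, f, indent=2)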

FAQs (Additional)

What if both voices mix into one track after separation?

Re-check your input audio. If background music is still present and loud, remove it before separation. Then retry the separation step. If the built-in split still struggles, separate the two vocal tracks externally and feed them manually to slot 1 and slot 2.

Do I need Smooth Transitions?

For multi-speaker InfiniteTalk from a single image, Smooth Transitions helps keep the animation stable through the speech segments.

Where do I find the separated temp audio files?

They are typically written to the outputs directory (e.g., via Jupyter). The names usually indicate order, such as audio_temp_1 and audio_temp_2.

Can I map more than two speakers?

This guide focuses on two speakers. The core approach (bounding boxes per subject and audio mapping) extends to more subjects, but each added speaker increases complexity and resource requirements.


Conclusion

Multi-Person InfiniteTalk in Wan2GP works reliably at 480p when you:

  • Preload the correct LoRA via a quick image-to-video run.
  • Use the Video Mask Creator for precise, per-subject bounding boxes.
  • Let the tool separate audio, then verify temp 1 and temp 2 before committing to the full run.
  • Split longer audio into two segments to avoid memory issues.
  • Use stable settings: 14 inference steps, guidance 1, audio guidance 6, shift scale 3, UniPC/Euler, Fusion EX-14B at 1.0, TeaCache, and a fixed seed.
  • Enhance and re-introduce instrumentals at the end if needed.

Follow the sequence above and you’ll get accurate lip sync for multiple speakers from a single image, with clean mapping between voices and subjects.
