MeiGen Infinite Talk Lip-Sync Video-to-Video

Introduction
I set out to build a longer singing video on an RTX 4070 with 8 GB of VRAM. My goal was simple: get a clip with active lighting, moving hands, and solid identity from start to finish, then lip‑sync it with Infinite Talk. I’ll walk through the process, compare results with my Wan 2.2 S2V workflow, and share the exact setup, timings, and trade‑offs I saw on a modest GPU.
The bottom line: a single long clip didn’t hold up, but a two‑part approach with a first/last frame workflow plus Infinite Talk produced a clean, stable result in a reasonable time budget.
Table Overview
Item | Wan 2.2 S2V (First/Last Frame) | Infinite Talk (Video-to-Video Lip Sync) |
---|---|---|
GPU | RTX 4070 (8 GB VRAM) | RTX 4070 (8 GB VRAM) |
FPS | 16 fps | 16 fps (final combined video) |
Clip Length(s) | 9 s (first), 7 s (second) | Combined 16 s (lip‑synced) |
Run Time | ~10+ min for 9 s clip; second clip faster | ~10+ min lip‑sync pass |
Total Time (Two Clips + Lip‑Sync) | — | Roughly 30 minutes end‑to‑end |
Motion | Strong background and hand movement | Keeps motion from source; adds lip sync |
Identity Consistency | High when refreshing reference every 5 s | Stable when fed from video‑to‑video; can drift with a single start image |
Expressions | More nuanced facial expressions | Less nuanced than Wan 2.2 S2V, but stable |
Control Over Lighting/Hand Motion | Limited within the base method | Inherits movement from source clips |
Long Single Clip Attempt | 1h10m with ghosting (not acceptable) | — |
Key Features
- Free video‑to‑video lip sync with Infinite Talk
- Works well with a two‑segment workflow to keep identity stable
- Keeps background and hand movement from the source clips
- Simple timing control via start time and duration for audio
- VRAM‑aware resolution: go as high as 8 GB allows without out‑of‑memory
- Reliable at 16 fps for a 16 s output on longer singing segments
Comparison: Wan 2.2 S2V vs. Infinite Talk
I first compared Wan 2.2 S2V with Infinite Talk to check motion, stability, and output quality.
- Head movement: Wan 2.2 S2V produced more head movement; Infinite Talk was steadier.
- Background: The base Wan 2.2 S2V run could look static; Infinite Talk’s final output kept the motion from my source video.
- Hands: Infinite Talk's output kept slightly more hand motion instead of a static pose; the Wan 2.2 S2V run could leave the hands looking stuck on the keyboard.
- Consistency: Infinite Talk can lose identity near the end if it only has a single start image. Wan 2.2 S2V stayed consistent when I refreshed a reference image every 5 seconds.
- Text fidelity: Wan 2.2 S2V held small on‑screen text better across frames, even if there were minor glitches.
Why the identity difference? With Wan 2.2 S2V, I fed a reference image on a schedule (every 5 seconds). That kept the character locked from start to finish. In Infinite Talk, starting from a single image over a long timeline can drift. Video‑to‑video input fixes that by providing a consistent reference throughout.
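To make that refresh schedule concrete, here's a minimal Python sketch. The 5 s interval and 16 fps are the values from my runs; the clip length is just an example:

```python
# Frame indices where the reference image gets refreshed in the Wan 2.2 S2V run.
# Assumes 16 fps and a refresh every 5 seconds; clip_seconds is an example value.
fps = 16
refresh_every_s = 5
clip_seconds = 16  # example clip length

refresh_frames = list(range(0, clip_seconds * fps, refresh_every_s * fps))
print(refresh_frames)  # [0, 80, 160, 240]
```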
Control remains limited with both approaches. Lighting changes and hand prompts don't always translate unless they're baked into the source. That's fine for short 5‑second shots, but longer clips get repetitive if nothing in the scene changes.
What is Infinite Talk?
Infinite Talk is a free video‑to‑video lip‑sync tool. You feed it a source video and an audio file, set the timing, and it syncs the mouth to the audio while preserving the motion and lighting from the source. It works best when the face is large enough in frame and the source video already has the movement you want.
How It Works
Here’s the flow I used:
- Create motion‑rich source clips with a first/last frame workflow in Wan 2.2 S2V.
- Keep the face large in frame to improve lip‑sync accuracy.
- Render shorter segments (9 s + 7 s) at 16 fps to avoid ghosting.
- Combine both clips in Infinite Talk, add the vocal track, and set the duration to 16 seconds.
- Render at the highest resolution that fits in 8 GB VRAM without out‑of‑memory errors.
This two‑stage approach keeps identity stable, maintains camera and hand motion, and produces a final lip‑synced video that looks coherent from start to finish.
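The timing math behind that flow is simple. Here's a small Python sketch of the numbers used in this article:

```python
# Frame budget for the two-segment plan at 16 fps (numbers from this article).
fps = 16
segment_a_s = 9   # first clip
segment_b_s = 7   # second clip

frames_a = segment_a_s * fps          # 144 frames
frames_b = segment_b_s * fps          # 112 frames
total_frames = frames_a + frames_b    # 256 frames
total_seconds = total_frames / fps    # 16.0 s, matching the lip-sync duration

print(frames_a, frames_b, total_frames, total_seconds)
```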
Attempting a Single Long Clip
I first tried a single long run using a first/last frame workflow. I built a last frame with a bit of pose change (arm lifted, pointing toward the audience). The output ghosted heavily and took 1 hour 10 minutes. That was not usable.
The fix was to move to shorter clips. Shorter segments held up better visually and completed in a reasonable time on 8 GB of VRAM.
Shorter Clips with the First/Last Frame Workflow
I switched to two segments and ran them through my workflow UI.
- Face crop: I cropped both images so the face was large. If the face is small, lip‑sync quality drops.
- First segment (9 seconds at 16 fps): Render time was a bit more than 10 minutes. The output had moving background lights and the right hand motion (including hitting the keyboard).
- Second segment (7 seconds at 16 fps): I used the last frame of the first clip as the start frame for the second. For the last frame of the second, I reused the earlier start frame. The lighting changes between the two clips made the transition very hard to notice. This second pass was faster than the first because it was shorter.
These two clips formed the motion base for the final lip‑synced video.
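Infinite Talk took both clips directly in my case, but if your setup wants a single motion-base video instead, you could pre-join the segments with ffmpeg's concat demuxer. A minimal sketch with hypothetical filenames:

```python
# Join two source clips into one motion base with ffmpeg's concat demuxer.
# Filenames are hypothetical; both clips should share resolution, fps, and codec.
import subprocess
from pathlib import Path

clips = ["segment_a_9s.mp4", "segment_b_7s.mp4"]  # hypothetical filenames
list_file = Path("clips.txt")
list_file.write_text("".join(f"file '{c}'\n" for c in clips))

subprocess.run(
    ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
     "-i", str(list_file), "-c", "copy", "motion_base_16s.mp4"],
    check=True,
)
```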
Lip‑Sync and Merge with Infinite Talk
With the two source clips ready, I moved to Infinite Talk.
- Workflow: I used my Infinite Talk setup with notes that list the required models. I kept the settings focused on timing and resolution.
- Audio: I loaded a test vocal track. I set a start time and a duration of 17 seconds, though the final video length came out to 16 seconds (256 frames at 16 fps).
- Resolution: I pushed the resolution as high as possible without running out of memory on 8 GB VRAM.
- Clip order: The first source clip went first. The second clip began at the last frame of the first, so motion continued without a visible jump.
- Prompt: I added a light “singing” indication. It wasn’t essential to the result.
The Infinite Talk run took a bit more than 10 minutes. Total time across both source clips and lip‑sync was roughly 30 minutes.
Final output: Facial expression subtlety was not as strong as the base Wan 2.2 S2V run, but the issues I cared about were fixed. The background no longer looked static, there was real hand motion, and the face stayed consistent since I fed Infinite Talk a video source instead of a single image.
How to Use
Follow this step‑by‑step process to replicate the result on a GPU with 8 GB VRAM.
Step 1: Prepare Your Assets
- Start frame: Crop so the face is large in the frame (see the crop sketch after the tips below).
- Last frame: Make a final frame with a slight pose or gesture change (e.g., arm lifted) to introduce visual interest.
- Audio: Have a clean vocal track ready and note the target duration (I used 16 seconds at 16 fps).
Tips:
- Keep the face centered and well‑lit.
- Avoid tiny faces; lip sync needs clear facial detail.
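If you're not cropping in an image editor, a minimal Pillow sketch does the same job. The crop box and filenames are hypothetical; pick a box that makes the face fill most of the frame:

```python
# Crop the start/last frames so the face fills more of the frame (Pillow).
# The box coordinates and filenames are hypothetical; adjust to your own images.
from PIL import Image

box = (300, 100, 1100, 900)  # (left, top, right, bottom) around the face

for name in ("start_frame.png", "last_frame.png"):
    img = Image.open(name)
    img.crop(box).save(name.replace(".png", "_cropped.png"))
```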
Step 2: Build Two Source Clips with a First/Last Frame Workflow
- Use your first/last frame workflow in your UI.
- Segment A: 9 seconds at 16 fps.
- Segment B: 7 seconds at 16 fps.
Settings and order:
- Segment A: Start with your start frame; end on the corresponding last frame.
- Segment B: Start from the last frame of Segment A. For its last frame, you can reuse the original start frame; if the background lighting changes across the two clips, that change helps hide the seam.
VRAM considerations:
- On 8 GB VRAM, expect a bit more than 10 minutes for a 9‑second pass at 16 fps; the shorter clip should be faster.
- If you hit out‑of‑memory, drop resolution first.
Step 3: Check Movement and Identity
- Verify background lights move.
- Confirm hand motion matches your intended action.
- Inspect the last frames for identity consistency; if it drifts, increase the frequency of reference refresh in your base method or shorten segments.
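Beyond eyeballing the clips, you can confirm the fps and duration on disk with ffprobe before moving on. A small sketch with hypothetical filenames:

```python
# Report fps, duration, and frame count for each source clip via ffprobe.
import json
import subprocess

for name in ("segment_a_9s.mp4", "segment_b_7s.mp4"):  # hypothetical filenames
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=r_frame_rate,nb_frames,duration",
         "-of", "json", name],
        capture_output=True, text=True, check=True,
    )
    stream = json.loads(out.stdout)["streams"][0]
    print(name, stream)  # expect r_frame_rate "16/1" and ~9 s / ~7 s durations
```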
Step 4: Lip‑Sync in Infinite Talk
- Load both source clips in order.
- Set the second clip to begin at the last frame of the first to keep continuity.
- Load your audio track.
- Timing:
- Start time: As needed for your audio.
- Duration: 16 seconds at 16 fps (adjust to your target length).
- Resolution: Push as high as 8 GB VRAM allows without memory errors.
- Optional: Add a minimal prompt (e.g., “singing”). It’s not critical.
Run Infinite Talk and wait for the pass to complete (a bit more than 10 minutes for my setup).
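Infinite Talk's start time and duration settings covered the timing for me, but if you'd rather hand it a pre-trimmed vocal track, ffmpeg can cut the window ahead of time. A minimal sketch; the offset and filenames are hypothetical:

```python
# Pre-trim the vocal track to the lip-sync window (start offset + 16 s).
import subprocess

start_s = 12      # hypothetical start offset into the full vocal track
duration_s = 16   # matches the 16 s / 16 fps target in this article

subprocess.run(
    ["ffmpeg", "-y", "-ss", str(start_s), "-t", str(duration_s),
     "-i", "vocals_full.wav", "vocals_16s.wav"],  # hypothetical filenames
    check=True,
)
```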
Step 5: Review and Export
- Check that the mouth matches the audio throughout.
- Ensure no identity drift.
- Confirm background and hand movement are preserved.
- Export at your chosen resolution.
If something looks off, rerun Infinite Talk with a slightly larger face crop or a modest resolution change to stay within VRAM limits.
How It Works (Under the Hood)
First/last frame source generation:
- I used a two‑image approach (start and end) to introduce motion and lighting variation over each clip.
- Cropping the face larger improved the lip‑sync accuracy later.
- For longer runs in Wan 2.2 S2V, refreshing a reference image every 5 seconds kept identity locked.
Lip‑sync pass:
- Infinite Talk syncs the mouth to the audio while preserving motion from the input video.
- Starting the second clip from the last frame of the first keeps motion continuous.
- Keeping the final duration exactly at 16 seconds (16 fps × 16 s) ensures the audio and video align tightly.
Performance notes:
- A single long first/last frame pass ran 1 hour 10 minutes and produced ghosting; shorter clips fixed that.
- On 8 GB VRAM, 10+ minutes per 9‑second clip and ~10 minutes for lip‑sync was acceptable.
FAQs
Is 8 GB of VRAM enough?
Yes, with careful settings. Keep the resolution within limits and split longer videos into shorter segments. Expect around 10 minutes for a 9‑second clip and about the same for the lip‑sync pass.
Why did a single long clip fail?
The long pass introduced ghosting and ran too long (1h10m). Breaking the project into a 9‑second clip and a 7‑second clip fixed ghosting and reduced time.
Why did Infinite Talk lose identity at the end when starting from a single image?
A single start image can drift over long durations. Feeding a video source instead of a single image keeps identity consistent across the timeline.
How do I hide transitions between clips?
Start the second clip from the last frame of the first. If background lighting changes across clips, the cut is hard to notice.
What frame rate and duration worked best?
I used 16 fps, with a final duration of 16 seconds. The first source clip was 9 seconds and the second was 7 seconds.
How important is face size in the frame?
Very important. Crop so the face is large and clear. Small faces reduce lip‑sync accuracy.
Can I control lighting and hand prompts directly?
Not in a granular way. Bake movement and lighting changes into your source clips. Infinite Talk will keep what’s already there.
Are expressions better in Wan 2.2 S2V or Infinite Talk?
Wan 2.2 S2V produced more nuanced expressions in my tests. Infinite Talk prioritized stability and motion retention while syncing the mouth to audio.
Conclusion
On an RTX 4070 with 8 GB VRAM, a two‑segment workflow plus Infinite Talk delivers a clean, stable singing video with real motion and dependable lip sync. The key is to avoid long single runs, keep the face large in frame, and stitch shorter clips so the second starts from the last frame of the first. With 16 fps and a 16‑second target, the full pipeline finished in about 30 minutes and produced a result with active lighting, visible hand movement, and consistent identity from start to finish.