InfiniteTalk Lip-Sync Video-to-Video

Learn how to turn an existing video into a lip-synced talking video using InfiniteTalk, an open-source audio-driven video generation model that runs entirely inside ComfyUI.
What is InfiniteTalk?
InfiniteTalk is an open-source model designed to take audio input and sync it perfectly with the lip movements of characters in a video.
It works by combining several pre-trained models and components within ComfyUI, producing high-quality talking videos based on the provided audio clip and video input.
This tool is especially useful for creating talking avatars, adding lip-sync effects to existing videos, and generating effects like laughing, whispering, or singing based on audio clips.
Step 1: Getting Started
Before you begin, prepare your setup by downloading the workflow and loading it into ComfyUI.
- Download the workflow and load it into ComfyUI.
- Install any missing nodes if ComfyUI asks for them. These are required for the workflow to run correctly.
Step 2: Download Model Files
You will need to download several model files for InfiniteTalk to work. Each model has a specific folder where it needs to be placed.
| Model Name | File Size | Destination Folder in ComfyUI |
|---|---|---|
| 14B Diffusion Model | ~16 GB | models/diffusion_models |
| InfiniteTalk Single Diffusion Model | ~2.6 GB | models/diffusion_models |
| VAE | ~0.25 GB | models/vae |
| CLIP Vision Model | ~1.2 GB | models/clip_vision |
| MelBandRoFormer Model | ~0.5 GB | models/diffusion_models |
| Text Encoder | ~11 GB | models/text_encoders |
Steps:
- Download the 14B Diffusion Model (~16 GB) and place it in the ComfyUI/models/diffusion_models folder.
- Download the InfiniteTalk Single Diffusion Model (~2.6 GB) and place it in the same folder.
- Download the VAE file (~0.25 GB) and place it in the ComfyUI/models/vae folder.
- Download the CLIP Vision Model (~1.2 GB) and place it in the ComfyUI/models/clip_vision folder.
- Download the MelBandRoFormer Model (~0.5 GB) and place it in the ComfyUI/models/diffusion_models folder.
- Download the Text Encoder (~11 GB) and place it in the ComfyUI/models/text_encoders folder.
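To double-check the layout before launching ComfyUI, here is a minimal sketch that verifies each folder contains the expected file. The file names below are placeholders, not official names; substitute the exact files you downloaded:

```python
# Minimal sketch: verify the downloaded files landed in the right ComfyUI folders.
# All file names are placeholders -- replace them with the files you downloaded.
from pathlib import Path

COMFYUI_ROOT = Path("ComfyUI")  # adjust to your install location

expected = {
    "models/diffusion_models": [
        "wan_14b_diffusion.safetensors",      # placeholder name
        "infinitetalk_single.safetensors",    # placeholder name
        "melbandroformer.safetensors",        # placeholder name
    ],
    "models/vae": ["wan_vae.safetensors"],                 # placeholder name
    "models/clip_vision": ["clip_vision.safetensors"],     # placeholder name
    "models/text_encoders": ["umt5_text_encoder.safetensors"],  # placeholder name
}

for folder, files in expected.items():
    for name in files:
        path = COMFYUI_ROOT / folder / name
        status = "OK" if path.exists() else "MISSING"
        print(f"{status:8} {path}")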
Step 3: Upload Audio and Video
After all models are placed in their respective folders:
- Upload the audio file you want to use for lip-syncing.
- Upload the video file you want to transform.
- Set the dimensions for the output video you want to generate.
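When picking output dimensions, note that many video diffusion pipelines expect the width and height to be multiples of a fixed block size. Assuming a multiple of 16 here (check your workflow's nodes for the exact constraint), a small helper like this can preserve the source aspect ratio:

```python
# Minimal sketch: scale a source resolution to a target width while keeping the
# aspect ratio and snapping both sides to multiples of 16 -- an assumption about
# the pipeline's constraint, not a documented InfiniteTalk requirement.
def fit_dimensions(src_w: int, src_h: int, target_w: int = 640, multiple: int = 16):
    scale = target_w / src_w
    w = max(round(src_w * scale / multiple) * multiple, multiple)
    h = max(round(src_h * scale / multiple) * multiple, multiple)
    return w, h

print(fit_dimensions(1920, 1080))  # -> (640, 352) for a 1080p source
```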
Step 4: Creating Effects
To create special effects like laughing or whispering, simply use audio clips that contain those sounds.
- For a laughing effect, upload an audio file with laughter.
- For a whisper effect, upload an audio clip with whispering.
This makes your generated video match the audio perfectly.
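Because the generated segment follows the audio, it helps to know a clip's duration before uploading it. Here is a minimal, standard-library-only sketch for WAV files (the file name is hypothetical):

```python
# Minimal sketch: report a WAV clip's duration before uploading it,
# since the length of the generated video follows the audio.
import wave

def wav_duration_seconds(path: str) -> float:
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

# Hypothetical file name -- use your own laughter/whisper clip:
print(f"{wav_duration_seconds('laughter.wav'):.2f} s of audio to lip-sync")
```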
Step 5: Writing a Prompt and Generating Video
- Write a simple text prompt describing the video generation task.
- Press Run to generate the video.
Note: The first time you run a video generation, the Wav2Vec download node will automatically fetch the required audio-encoder model. This is a one-time process, so be patient.
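If you would rather not wait on the first run, you can pre-download the audio encoder manually. Below is a hedged sketch using huggingface_hub; both the repo id and the target folder are assumptions, so use whatever the download node in your workflow actually references:

```python
# Minimal sketch: pre-fetch the Wav2Vec audio encoder so the first run
# does not stall on the download.
# pip install huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TencentGameMate/chinese-wav2vec2-base",  # assumed repo id
    local_dir="ComfyUI/models/wav2vec",               # assumed target folder
)
```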
Step 6: Understanding Generation Speed
- On an RTX 3090, generation takes about 33 seconds for each second of video.
- For example, it took 5 minutes to create a 9-second video.
Example Calculation: 9 seconds of video x 33 seconds per second of video = 297 seconds, or approximately 5 minutes.
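The same estimate as a minimal sketch (the 33 s-per-second figure comes from the RTX 3090 run above; other GPUs and settings will differ):

```python
# Minimal sketch: estimate total generation time from the measured throughput.
SECONDS_PER_VIDEO_SECOND = 33  # measured on an RTX 3090; hardware-dependent

def estimate_minutes(video_seconds: float) -> float:
    return video_seconds * SECONDS_PER_VIDEO_SECOND / 60

print(f"{estimate_minutes(9):.1f} minutes for a 9-second clip")  # ~5.0 minutes
```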
Step 7: Improving Generation Speed
You can make the process faster, but this may reduce the quality of the output video. Here are two ways to speed it up:
Method 1: Lower Frame Window Size & Motion Frame
- Reduce the frame window size and motion frame settings.
- This will make the video generation faster, but the quality will decrease.
Method 2: Use the LightX2V LoRA
- Download the LightX2V LoRA model.
- Select the LoRA node in ComfyUI and press Ctrl + B to enable it (Ctrl + B toggles bypass on a selected node).
- This will speed up generation, but again, the video quality will not be as high.
Key Features of InfiniteTalk
Lip-Sync Video Generation
Converts any video into a perfectly synced talking video using only audio input.
Custom Effects Support
Add effects like laughing, whispering, or singing by using specialized audio clips.
Run Entirely in ComfyUI
No external tools are needed once everything is set up.
Adjustable Video Quality and Speed
Fine-tune generation speed and video quality by adjusting settings like frame window size and motion frame.
First-Time Model Download Automation
The first generation automatically downloads the required Wav2Vec model.
Summary of Steps
| Step | Action |
|---|---|
| 1 | Download the workflow and load it into ComfyUI |
| 2 | Install any missing nodes |
| 3 | Download the required model files |
| 4 | Place the models in their correct folders |
| 5 | Upload your audio and video files |
| 6 | Set the desired video dimensions |
| 7 | Write a text prompt |
| 8 | Run the process and generate the video |
| 9 | Optional: Adjust settings to speed up generation |
Final Thoughts
Using InfiniteTalk inside ComfyUI, you can turn any existing video into a perfectly lip-synced talking video. By carefully following each step and adjusting your settings, you can create realistic outputs, add special effects like whispering or laughing, and even experiment with speed vs. quality trade-offs.
Now you can start experimenting and generating your own lip-sync videos directly from your computer.
Frequently Asked Questions (FAQs)
Q1. How do I create a whispering effect?
Upload an audio clip of whispering. The model will automatically generate a video where the character whispers.
Q2. Why is my first generation taking so long?
The first run automatically downloads the Wav2Vec model. This only happens once, so subsequent runs will be faster.
Q3. How can I make the generation faster without losing too much quality?
- Try slightly reducing the frame window size while keeping motion frame values balanced.
- If more speed is needed, use the LightX2V LoRA, but expect a drop in quality.
Q4. Where do I place each model file?
Refer to the table above for correct folder placement. Incorrect placement will cause errors in ComfyUI.
Q5. How much GPU memory do I need?
At least 24 GB of VRAM is recommended for smooth performance.
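If you are unsure what your GPU reports, here is a quick check using PyTorch, which any working ComfyUI install already includes:

```python
# Minimal sketch: report the name and VRAM of the first CUDA device.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device detected.")
```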