InfiniteTalk Lip-Sync Video-to-Video

Learn how to turn an existing video into a lip-synced talking video using InfiniteTalk, an open-source audio-driven video generation model that runs entirely inside ComfyUI.
What is InfiniteTalk?
InfiniteTalk is an open-source model designed to take audio input and sync it perfectly with the lip movements of characters in a video.
It works by combining several pre-trained models and components within ComfyUI, producing high-quality talking videos based on the provided audio clip and video input.
This tool is especially useful for creating talking avatars, adding lip-sync effects to existing videos, and generating effects like laughing, whispering, or singing based on audio clips.
Step 1: Getting Started
Before you begin, prepare your setup by downloading the workflow and loading it into ComfyUI.
- Download the workflow and load it into ComfyUI.
- Install any missing nodes if ComfyUI asks for them. These are required for the workflow to run correctly.
Step 2: Download Model Files
You will need to download several model files for InfiniteTalk to work. Each model has a specific folder where it needs to be placed.
| Model Name | File Size | Destination Folder in ComfyUI |
|---|---|---|
| 14B Diffusion Model | ~16 GB | models/diffusion_models |
| InfiniteTalk Single Diffusion Model | ~2.6 GB | models/diffusion_models |
| VAE | ~0.25 GB | models/vae |
| CLIP Vision Model | ~1.2 GB | models/clip_vision |
| MelBandRoFormer Model | ~0.5 GB | models/diffusion_models |
| Text Encoder | ~11 GB | models/text_encoders |
Steps:
- Download the 14B Diffusion Model (~16 GB) and place it in the ComfyUI/models/diffusion_models folder.
- Download the InfiniteTalk Single Diffusion Model (~2.6 GB) and place it in the same folder.
- Download the VAE file (~0.25 GB) and place it in the ComfyUI/models/vae folder.
- Download the CLIP Vision Model (~1.2 GB) and place it in the ComfyUI/models/clip_vision folder.
- Download the MelBandRoFormer Model (~0.5 GB) and place it in the ComfyUI/models/diffusion_models folder.
- Download the Text Encoder (~11 GB) and place it in the ComfyUI/models/text_encoders folder.
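To double-check the layout before launching ComfyUI, here is a minimal sketch that verifies each folder contains the expected file. The file names below are placeholders, not official names; substitute the exact files you downloaded:

```python
# Minimal sketch: verify the downloaded files landed in the right ComfyUI folders.
# All file names are placeholders -- replace them with the files you downloaded.
from pathlib import Path

COMFYUI_ROOT = Path("ComfyUI")  # adjust to your install location

expected = {
    "models/diffusion_models": [
        "wan_14b_diffusion.safetensors",      # placeholder name
        "infinitetalk_single.safetensors",    # placeholder name
        "melbandroformer.safetensors",        # placeholder name
    ],
    "models/vae": ["wan_vae.safetensors"],                 # placeholder name
    "models/clip_vision": ["clip_vision.safetensors"],     # placeholder name
    "models/text_encoders": ["umt5_text_encoder.safetensors"],  # placeholder name
}

for folder, files in expected.items():
    for name in files:
        path = COMFYUI_ROOT / folder / name
        status = "OK" if path.exists() else "MISSING"
        print(f"{status:8} {path}")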
Step 3: Upload Audio and Video
After all models are placed in their respective folders:
- Upload the audio file you want to use for lip-syncing.
- Upload the video file you want to transform.
- Set the dimensions for the output video you want to generate.
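When picking output dimensions, note that many video diffusion pipelines expect the width and height to be multiples of a fixed block size. Assuming a multiple of 16 here (check your workflow's nodes for the exact constraint), a small helper like this can preserve the source aspect ratio:

```python
# Minimal sketch: scale a source resolution to a target width while keeping the
# aspect ratio and snapping both sides to multiples of 16 -- an assumption about
# the pipeline's constraint, not a documented InfiniteTalk requirement.
def fit_dimensions(src_w: int, src_h: int, target_w: int = 640, multiple: int = 16):
    scale = target_w / src_w
    w = max(round(src_w * scale / multiple) * multiple, multiple)
    h = max(round(src_h * scale / multiple) * multiple, multiple)
    return w, h

print(fit_dimensions(1920, 1080))  # -> (640, 352) for a 1080p source
```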
Step 4: Creating Effects
To create special effects like laughing or whispering, simply use audio clips that contain those sounds.
- For a laughing effect, upload an audio file with laughter.
- For a whisper effect, upload an audio clip with whispering.
This makes your generated video match the audio perfectly.
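Because the generated segment follows the audio, it helps to know a clip's duration before uploading it. Here is a minimal, standard-library-only sketch for WAV files (the file name is hypothetical):

```python
# Minimal sketch: report a WAV clip's duration before uploading it,
# since the length of the generated video follows the audio.
import wave

def wav_duration_seconds(path: str) -> float:
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

# Hypothetical file name -- use your own laughter/whisper clip:
print(f"{wav_duration_seconds('laughter.wav'):.2f} s of audio to lip-sync")
```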
Step 5: Writing a Prompt and Generating Video
- Write a simple text prompt describing the video generation task.
- Press Run to generate the video.
Note: The first time you run a video generation, the Wav2Vec download node will automatically fetch the required audio-encoder model. This is a one-time process, so be patient.
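If you would rather not wait on the first run, you can pre-download the audio encoder manually. Below is a hedged sketch using huggingface_hub; both the repo id and the target folder are assumptions, so use whatever the download node in your workflow actually references:

```python
# Minimal sketch: pre-fetch the Wav2Vec audio encoder so the first run
# does not stall on the download.
# pip install huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TencentGameMate/chinese-wav2vec2-base",  # assumed repo id
    local_dir="ComfyUI/models/wav2vec",               # assumed target folder
)
```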
Step 6: Understanding Generation Speed
- On an RTX 3090, generation takes about 33 seconds for each second of video.
- For example, it took 5 minutes to create a 9-second video.
Example Calculation: 9 seconds of video x 33 seconds per second of video = 297 seconds, or approximately 5 minutes.
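The same estimate as a minimal sketch (the 33 s-per-second figure comes from the RTX 3090 run above; other GPUs and settings will differ):

```python
# Minimal sketch: estimate total generation time from the measured throughput.
SECONDS_PER_VIDEO_SECOND = 33  # measured on an RTX 3090; hardware-dependent

def estimate_minutes(video_seconds: float) -> float:
    return video_seconds * SECONDS_PER_VIDEO_SECOND / 60

print(f"{estimate_minutes(9):.1f} minutes for a 9-second clip")  # ~5.0 minutes
```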
Step 7: Improving Generation Speed
You can make the process faster, but this may reduce the quality of the output video. Here are two ways to speed it up:
Method 1: Lower Frame Window Size & Motion Frame
- Reduce the frame window size and motion frame settings.
- This will make the video generation faster, but the quality will decrease.
Method 2: Use the LightX2V LoRA
- Download the LightX2V LoRA model.
- Select the LoRA node in ComfyUI and press Ctrl + B to enable it (Ctrl + B toggles bypass on a selected node).
- This will speed up generation, but again, the video quality will not be as high.
Key Features of InfiniteTalk
Lip-Sync Video Generation
Converts any video into a perfectly synced talking video using only audio input.
Custom Effects Support
Add effects like laughing, whispering, or singing by using specialized audio clips.
Run Entirely in ComfyUI
No external tools are needed once everything is set up.
Adjustable Video Quality and Speed
Fine-tune generation speed and video quality by adjusting settings like frame window size and motion frame.
First-Time Model Download Automation
The first generation automatically downloads the required Wav2Vec model.
Summary of Steps
| Step | Action |
|---|---|
| 1 | Download the workflow and load it into ComfyUI |
| 2 | Install any missing nodes |
| 3 | Download the required model files |
| 4 | Place the models in their correct folders |
| 5 | Upload your audio and video files |
| 6 | Set the desired video dimensions |
| 7 | Write a text prompt |
| 8 | Run the process and generate the video |
| 9 | Optional: Adjust settings to speed up generation |
Final Thoughts
Using InfiniteTalk inside ComfyUI, you can turn any existing video into a perfectly lip-synced talking video. By carefully following each step and adjusting your settings, you can create realistic outputs, add special effects like whispering or laughing, and even experiment with speed vs. quality trade-offs.
Now you can start experimenting and generating your own lip-sync videos directly from your computer.
Frequently Asked Questions (FAQs)
Q1. How do I create a whispering effect?
Upload an audio clip of whispering. The model will automatically generate a video where the character whispers.
Q2. Why is my first generation taking so long?
The first run automatically downloads the Wav2Vec model. This only happens once, so subsequent runs will be faster.
Q3. How can I make the generation faster without losing too much quality?
- Try slightly reducing the frame window size while keeping motion frame values balanced.
- If more speed is needed, use the LightX2V LoRA, but expect a drop in quality.
Q4. Where do I place each model file?
Refer to the table above for correct folder placement. Incorrect placement will cause errors in ComfyUI.
Q5. How much GPU memory do I need?
At least 24 GB of VRAM is recommended for smooth performance.
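If you are unsure what your GPU reports, here is a quick check using PyTorch, which any working ComfyUI install already includes:

```python
# Minimal sketch: report the name and VRAM of the first CUDA device.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device detected.")
```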