How to make TTS voice in sync with the video

I've been trying to build a video dubbing pipeline. I've nailed most of the video and audio processing, but I'm stuck on voice-and-video synchronization.

When I generate narration in a new language and convert it to audio, it just doesn't match: a 7-minute Chinese voice becomes a 5-minute English voice. Even if I stretch the audio or speed the video up a bit, it still doesn't sync with the video.

I'm mainly processing story-type videos like recaps. I don't need lip syncing, but I do need to sync the video's scenes with my voice.

Here is the process I'm following:

take video → extract the audio → process the audio (remove everything except the main voice) → generate text from that voice with proper timestamps → calibrate on a demo story to get the TTS speed (words per second, average gap, etc.) → pass all that data to an LLM to translate it to English → convert the LLM response to voice using that TTS → merge the video with the new audio

I'm using Kokoro for TTS, Gemini as the LLM, and Whisper for audio-to-text conversion.

The main issue is in the last 2–3 steps: the LLM gives me a response, but after converting it to audio it doesn't match the video's length. Sometimes it's too short, sometimes too long, which makes it hard to sync with the video.

Does anyone have experience with similar work? What approach worked for you? Is there a way to fix this properly, a trick to keep a TTS voice in sync with the video, or a way to make the LLM produce a precise number of words? Anything worth trying?

In a Voice→ASR→Text→TTS pipeline, information related to “time” is lost at the ASR stage… (unless you find a way to explicitly preserve it…)
I suppose in the future, it will be easy to do with a single model

One method that works with most ASR models is to feed the audio into the pipeline after pre-segmenting it by time intervals and then concatenating the segments. Even if individual segments are out of sync, this approach helps prevent the entire sequence from drifting out of sync.
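
A minimal sketch of that idea (my own illustration, not the commenter's code), assuming pydub for slicing: cut the audio into fixed windows before ASR and keep each window's offset, so the transcript timestamps can be shifted back onto the original timeline.

from pydub import AudioSegment

def split_for_asr(path, window_sec=30.0):
    """Pre-segment audio into fixed windows and remember each offset."""
    audio = AudioSegment.from_file(path)
    window_ms = int(window_sec * 1000)
    chunks = []
    for offset_ms in range(0, len(audio), window_ms):
        chunks.append({
            "offset_sec": offset_ms / 1000.0,            # add this to ASR timestamps
            "audio": audio[offset_ms:offset_ms + window_ms],
        })
    return chunks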


How to make TTS voice sync with video for translated recap/story dubbing

You are running into a very common automatic-dubbing problem: the target-language narration is not naturally the same duration as the source-language narration. A 7-minute Chinese narration can easily become a 5-minute English narration, or the reverse. Stretching the whole audio file, speeding up the video, or asking an LLM for an exact word count will not reliably fix it.

The core fix is architectural:

Do not generate one full translated script, synthesize one full TTS file, and then stretch it to match the video.
Instead, split the source into timed narration segments, generate duration-aware English variants per segment, synthesize each segment separately, measure actual TTS duration, then place each fitted clip back onto the original timeline.

This is the difference between global duration matching and local scene/beat synchronization.


1. What is actually going wrong?

Your current pipeline is approximately:

video
→ extract audio
→ isolate main voice
→ ASR with timestamps
→ calibrate average TTS speed / words per second
→ send large transcript to LLM for English
→ generate TTS
→ stretch/compress/merge with video

The failure is happening because the last part treats dubbing as a whole-file duration problem.

But story/recap dubbing is a timeline problem.

A full English TTS file matching the full video length does not guarantee that:

  • the character name is spoken when the character appears,
  • the twist is explained when the twist is shown,
  • the pause happens where the original pause happened,
  • the scene transition lines up with the scene transition,
  • the important reveal word lands near the visual reveal.

So global stretching can make the end time correct while the middle is still wrong.


2. Why exact word count does not solve it

Asking the LLM for a precise word count is not enough.

Example:

"He ran."
"Unbelievable."
"She underestimated him."

These have very different spoken durations even though the word counts are small.

TTS duration depends on:

  • phonemes,
  • syllables,
  • punctuation,
  • pauses,
  • voice,
  • TTS model prosody,
  • names,
  • numbers,
  • sentence structure,
  • whether the model inserts expressive pauses.

The VideoDubber paper specifically argues that word/character-count control is not enough for dubbing because spoken duration varies across languages and tokens.

So the LLM should not be asked only:

Make this exactly 15 words.

It should be asked:

Make this a natural English narration line that can be spoken in about 4.2 seconds.
Return normal, compact, ultra_compact, and expanded versions.

Then your code should synthesize the audio and measure the real duration.

The measured WAV duration is the final truth.


3. Correct target: scene/beat sync, not lip sync

You said you do not need lip-sync. That makes the problem easier.

There are four different sync levels:

Sync level | Meaning | Needed for your case?
Global sync | final audio starts/ends with the video | Yes, but not enough
Segment sync | each narration line fits its local source slot | Yes
Beat sync | important words land near important visual events | Yes
Lip sync | phonemes match mouth shapes | No

Your target is:

segment sync + story beat sync

not full lip-sync.

For story recaps, a good dub is one where:

  • “he opened the door” happens near the door opening,
  • “she was the traitor” happens near the reveal,
  • “three years later” lands near the time-skip card,
  • the narration resets when the scene changes,
  • pauses still feel natural.

4. Recommended architecture

Use this pipeline instead:

video
→ extract audio
→ isolate speech for ASR
→ keep music/SFX bed separately
→ ASR + VAD + timestamps
→ clean into dubbing-friendly segments
→ optionally add scene/shot boundaries
→ LLM generates timed English variants per segment
→ TTS each segment separately
→ measure each TTS clip duration
→ choose / rewrite / speed-adjust / pad
→ place each clip at the original segment timestamp
→ mix with music/SFX bed
→ export final video

The key difference:

Wrong:
one translated script → one TTS file → global stretch

Right:
many timestamped segments → many fitted TTS clips → timeline overlay

5. Build around a segment table

Use one central data structure for the whole pipeline.

Example:

[
  {
    "id": 42,
    "start": 183.20,
    "end": 187.60,
    "target_duration": 4.40,
    "source_text": "<source text>",
    "scene_note": "<the woman reveals the letter>",
    "importance": "high",
    "can_start_early_ms": 100,
    "can_end_late_ms": 250,
    "english_candidates": {
      "normal": "",
      "compact": "",
      "ultra_compact": "",
      "expanded": ""
    },
    "chosen_text": "",
    "tts_duration": null,
    "fit_action": ""
  }
]

This makes debugging much easier.

Instead of only knowing:

The final English audio is 2 minutes too short.

you can know:

segment 42 is 38% too long
segment 57 is too short but can be padded
segment 91 overlaps the next scene
segment 103 needs a shorter rewrite

That is the difference between guessing and engineering.


6. How to create good segments

Raw Whisper segments are not always good dubbing segments. You need to convert them into TTS-friendly, scene-aware narration blocks.

Use three types of boundaries.

Speech boundaries

These come from ASR/VAD:

  • speech starts,
  • speech ends,
  • pause begins,
  • pause ends.

Semantic boundaries

These are story boundaries:

  • one plot point,
  • one reveal,
  • one joke,
  • one explanation,
  • one transition.

Visual boundaries

These come from the video:

  • scene cut,
  • character appears,
  • object appears,
  • title card appears,
  • fight starts,
  • flashback begins.

For story/recap videos, visual boundaries matter. If the source says “then he opened the door” after the door is already open, the dub feels late even if the audio duration is technically correct.
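
If you want visual boundaries automatically, scene cuts can be detected and used to tighten segment ends. A rough sketch, assuming PySceneDetect (the snap_to_scene_cuts helper and its tolerance are my own illustration, not a standard API):

from scenedetect import detect, ContentDetector

def scene_cut_times(video_path):
    """Return scene-cut start times in seconds."""
    scenes = detect(video_path, ContentDetector())
    return [start.get_seconds() for start, _end in scenes]

def snap_to_scene_cuts(segments, cuts, tolerance=0.4):
    """If a narration segment ends very close to a cut, snap its end to the cut
    so the dubbed line does not spill into the next scene."""
    for seg in segments:
        for cut in cuts:
            if abs(seg["end"] - cut) <= tolerance:
                seg["end"] = cut
    return segments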


7. Segment length rules for Kokoro

Kokoro is fast and practical, but its segment size matters.

The Kokoro voice notes mention:

  • weakness on very short utterances, especially under about 10–20 tokens,
  • rushing on long utterances, especially over 400 tokens,
  • possible mitigation by bundling short utterances, chunking long utterances, or adjusting speed.

For recap/story narration, start with:

Segment duration | Recommendation
Under 1 sec | Usually too short; merge if possible
1–2 sec | Use only for punchy dramatic beats
3–8 sec | Best range for most narration
8–12 sec | Good for slower explanation
12–15 sec | Usable but monitor
15+ sec | Usually split

A good practical default:

Most segments: 3–8 seconds
Rare short beats: 1.5–3 seconds
Rare long explanations: 8–12 seconds

Avoid both extremes:

  • too many tiny clips → choppy, unstable TTS,
  • huge paragraphs → hard to fit, rushed, bad timing.
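
A rough way to enforce these ranges in code is a merging pass over the raw ASR segments. A sketch with illustrative thresholds (splitting over-long segments also needs word-level timestamps, which is one reason WhisperX-style alignment helps):

def merge_short_segments(segments, min_dur=1.0, max_gap=0.3):
    """Fold segments shorter than min_dur into the previous segment
    when the pause between them is small."""
    merged = []
    for seg in segments:
        if merged:
            prev = merged[-1]
            gap = seg["start"] - prev["end"]
            too_short = (seg["end"] - seg["start"]) < min_dur
            if too_short and gap <= max_gap:
                prev["end"] = seg["end"]
                prev["source_text"] += " " + seg["source_text"]
                continue
        merged.append(dict(seg))
    return merged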

8. The most useful trick: generate multiple timing variants

Do not ask Gemini for one translation.

Ask for several spoken variants per segment:

{
  "id": 42,
  "normal": "At that moment, he realized she had been lying to him all along.",
  "compact": "Then he realized she had been lying.",
  "ultra_compact": "She had lied all along.",
  "expanded": "At that moment, he finally realized she had been lying to him from the beginning.",
  "must_keep": ["she lied", "he realizes it now"],
  "can_drop": ["from the beginning", "emotional emphasis"]
}

Then synthesize candidates and measure real duration.

Example:

Candidate | TTS duration | Target slot | Decision
normal | 5.8s | 3.9s | Too long
compact | 4.2s | 3.9s | Good with slight speedup
ultra_compact | 2.6s | 3.9s | Too short unless pause works
expanded | 6.5s | 3.9s | Reject

This matches the direction of length-aware dubbing research.
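
In code, the measure-and-choose step can be as simple as the sketch below. synthesize_to_wav() is a placeholder for your Kokoro wrapper; soundfile is only used to read the real duration of each generated file.

import soundfile as sf

def pick_best_candidate(candidates, target, out_dir, synthesize_to_wav):
    """Synthesize each variant, measure it, keep the one closest to the slot."""
    best = None
    for name, text in candidates.items():
        if not text:
            continue
        path = f"{out_dir}/{name}.wav"
        synthesize_to_wav(text, path)             # your Kokoro call goes here
        duration = float(sf.info(path).duration)  # measured, never estimated
        ratio = duration / target
        if best is None or abs(ratio - 1.0) < abs(best["ratio"] - 1.0):
            best = {"name": name, "text": text, "path": path,
                    "duration": duration, "ratio": ratio}
    return best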


9. Better Gemini prompt

Use Gemini as a timed adaptation model, not only a translator.

You are adapting narration for an English TTS dub of a story recap video.

The goal is not literal translation.
The goal is natural English narration that fits the original video timing.

Rules:
1. Preserve the segment id.
2. Do not merge segments.
3. Do not move plot information across segment boundaries.
4. Preserve character names and plot-critical facts.
5. Use short spoken English.
6. Avoid long clauses.
7. If the time slot is short, compress less important detail.
8. If the time slot is long, use natural pacing but do not add new facts.
9. Output valid JSON only.

For each segment, return:
{
  "id": number,
  "normal": "natural spoken English",
  "compact": "shorter version",
  "ultra_compact": "shortest acceptable version",
  "expanded": "slightly fuller version if TTS is too short",
  "must_keep": ["critical facts"],
  "can_drop": ["details that may be removed if compact"]
}

Duration guide:
- Under 2 sec: 2–6 words
- 2–4 sec: 5–12 words
- 4–7 sec: 10–22 words
- 7–10 sec: 18–35 words
- 10+ sec: one or two short sentences

Input:
[
  {
    "id": 42,
    "start": 183.20,
    "end": 187.60,
    "duration_sec": 4.40,
    "source_text": "<source text>",
    "context_before": "<previous context>",
    "context_after": "<next context>",
    "visual_note": "<the woman reveals the letter>",
    "importance": "high"
  }
]

The word guide is only a rough hint. The actual TTS duration is the final authority.
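
A sketch of sending that prompt in batches, assuming the google-generativeai Python SDK (model name and batching are placeholders; adapt to whichever Gemini client you actually use):

import json
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

def adapt_segments(prompt_header, segment_batch):
    """Send one batch of timed segments and parse the JSON list of variants."""
    prompt = prompt_header + "\n\nInput:\n" + json.dumps(segment_batch, ensure_ascii=False)
    response = model.generate_content(
        prompt,
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)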


10. Duration fitting policy

For every generated TTS clip:

target = source_end - source_start
actual = generated_tts_duration
ratio = actual / target

Use this decision table:

Ratio | Meaning | Best action
0.90–1.05 | Good fit | Keep; pad tiny gap if needed
0.75–0.90 | Short | Add silence or try expanded version
0.60–0.75 | Very short | Use expanded version or add deliberate pause
1.05–1.15 | Slightly long | Small speedup or small tempo correction
1.15–1.30 | Long | Try compact version first
1.30–1.60 | Too long | Rewrite/compress
>1.60 | Bad fit | Re-segment or mark for manual review

Important rule:

Rewrite text first.
Use TTS speed second.
Use small audio tempo correction third.
Pad silence when short.
Manual-review extreme failures.

Do not do:

literal translation
→ huge time-stretch
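
The decision table and priority order above translate into a small function. A sketch (the thresholds are the ones from the table; tune them per project):

def fit_action(ratio):
    """Map measured/target duration ratio to a fitting action."""
    if 0.90 <= ratio <= 1.05:
        return "keep"            # pad a tiny gap if needed
    if 0.75 <= ratio < 0.90:
        return "pad_or_expand"   # add silence or try the expanded variant
    if 0.60 <= ratio < 0.75:
        return "expand"          # expanded variant or a deliberate pause
    if 1.05 < ratio <= 1.15:
        return "speed_up_small"  # small atempo correction
    if 1.15 < ratio <= 1.30:
        return "try_compact"     # rewrite before touching speed
    if 1.30 < ratio <= 1.60:
        return "rewrite"
    return "manual_review"       # below 0.60 or above 1.60: re-segment or review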

11. When to stretch audio

Use audio stretching only for small mismatches.

FFmpeg’s atempo filter changes audio tempo. It can be useful, but for dubbing the real limit is not the technical maximum. The real limit is naturalness.

Practical limits:

Adjustment | Usually okay?
0.95×–1.05× | Safe
0.90×–1.10× | Usually fine
0.85×–1.20× | Sometimes acceptable
Below 0.85× | Often sounds dragged
Above 1.20× | Often sounds rushed
Above 1.30× | Rewrite instead

Example:

ffmpeg -i <input.wav> -filter:a "atempo=1.10" <output.wav>

If your clip is 6.5 seconds and the target is 3.8 seconds, do not speed it up by 1.71×. Rewrite it.
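
A small wrapper that clamps the correction to the range above before calling FFmpeg (a sketch; the clamp values are the practical limits from the table, not anything FFmpeg enforces):

import subprocess

def tempo_correct(in_path, out_path, factor, lo=0.85, hi=1.20):
    """Apply a small atempo correction, refusing anything outside lo..hi."""
    factor = max(lo, min(hi, factor))
    subprocess.run(
        ["ffmpeg", "-y", "-i", in_path, "-filter:a", f"atempo={factor:.3f}", out_path],
        check=True,
    )
    return factor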


12. How to handle long vs short TTS

If TTS is too long

Example:

target: 4.0s
generated: 5.7s
ratio: 1.43

Bad fix:

speed up to 1.43×

Better fix:

  • ask Gemini for a compact version,
  • remove adjectives,
  • remove repeated context,
  • preserve only the plot-critical beat.

Example:

Too long:
At that moment, he finally understood that the woman standing in front of him had secretly planned everything from the start.

Better:
Then he realized she planned everything.

If TTS is too short

Example:

target: 5.0s
generated: 3.4s
ratio: 0.68

Bad fix:

slow it down heavily

Better fixes:

  • try expanded variant,
  • add silence after the line,
  • add silence before a reveal,
  • add a natural pause between clauses.

Example:

Short:
She was the traitor.

Expanded:
Only then did he realize she was the traitor.

For story videos, silence is often natural. A small pause before a reveal can improve drama.
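
A sketch of padding a short clip to its slot, assuming pydub (lead_ratio is my own knob deciding how much of the missing time goes before the line, which is where a dramatic pause usually sits):

from pydub import AudioSegment

def pad_clip(path, target_sec, lead_ratio=0.6):
    """Pad a short TTS clip with silence before and after it."""
    clip = AudioSegment.from_wav(path)
    missing_ms = int(target_sec * 1000) - len(clip)
    if missing_ms <= 0:
        return clip
    lead = AudioSegment.silent(duration=int(missing_ms * lead_ratio))
    tail = AudioSegment.silent(duration=missing_ms - len(lead))
    return lead + clip + tail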


13. Timeline assembly

Do not concatenate TTS clips.

Create a silent audio track with the same duration as the video, then overlay each fitted clip at its original timestamp.

Conceptually:

final_audio = silence(video_duration)

for segment in segments:
    clip = generated_tts(segment["chosen_text"])
    clip = fit_to_segment(clip, segment)
    final_audio.overlay(clip, position=segment["start"])

This preserves the original timeline.
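
The same idea in runnable form, assuming pydub and a fitted_clip_path field written during the fitting step (both names are my own, not part of your current pipeline):

from pydub import AudioSegment

def assemble_timeline(segments, video_duration_sec, out_path):
    """Overlay each fitted clip at its original start time on a silent bed."""
    final_audio = AudioSegment.silent(duration=int(video_duration_sec * 1000))
    for seg in segments:
        clip = AudioSegment.from_wav(seg["fitted_clip_path"])
        final_audio = final_audio.overlay(clip, position=int(seg["start"] * 1000))
    final_audio.export(out_path, format="wav")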

If two clips overlap, do not blindly mix them. Overlap means something needs fixing:

  • translation too verbose,
  • segment boundary wrong,
  • TTS speed too slow,
  • adjacent segments should be merged,
  • current segment should be rewritten,
  • visual beat needs manual review.

14. Chinese → English adaptation tips

Chinese-to-English dubbing has special issues.

English often needs explicit connectors

Chinese narration can be compact and context-heavy. English often adds words like:

but then
because of this
at that moment
meanwhile
three years later

These improve clarity but add duration.

Prompt the LLM:

Use connectors only when needed.
Prefer short spoken English.
Avoid long clauses.
Do not add explanation that is not necessary for this scene.

Use short glossary terms

Chinese story videos often contain relationship terms, fantasy titles, cultivation terms, names, or recurring labels. These can become too long in English.

Chinese term | Long English | Better timed version
师兄 | his senior martial brother | senior brother
魔尊 | the supreme demon lord | Demon Lord
灵根 | spiritual root | spirit root
三年后 | after three years had passed | three years later

Use a glossary:

{
  "师兄": "senior brother",
  "魔尊": "Demon Lord",
  "灵根": "spirit root",
  "三年后": "three years later"
}

This improves consistency and reduces duration.

Compress phrasing, not plot facts

Bad compression removes story meaning. Good compression removes extra phrasing.

Example:

Literal:
He never expected that everything that had happened until now was actually only one small part of her carefully prepared plan.

Timed:
It was all part of her plan.

The second version is better if the scene slot is short.


15. Timestamping options

Whisper is fine as a baseline, but for dubbing you must evaluate timestamp quality, not only transcript quality.

Check:

  • segment start accuracy,
  • segment end accuracy,
  • pause detection,
  • word timing near important beats.

For Chinese-heavy content, Qwen3-ASR + Qwen3-ForcedAligner is worth testing. The Qwen3-ASR model card describes Qwen3-ForcedAligner as supporting timestamp prediction for arbitrary units within up to 5 minutes of speech in 11 languages.

Recommended test:

Take 10 representative videos.
Run Whisper / faster-whisper.
Run WhisperX.
Run Qwen3-ASR + Qwen3-ForcedAligner.
Compare 30–50 important scene boundaries manually.
Choose based on timestamp quality, not only text accuracy.
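
For the baseline, faster-whisper already exposes segment and word timestamps that can seed the segment table. A sketch (model size, device, and file name are placeholders):

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("speech_only.wav",
                                  word_timestamps=True,
                                  vad_filter=True)

rows = []
for seg in segments:
    rows.append({
        "start": round(seg.start, 2),
        "end": round(seg.end, 2),
        "source_text": seg.text.strip(),
        "words": [(w.word, w.start, w.end) for w in seg.words],
    })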

16. TTS model thoughts

Keep Kokoro first

Kokoro-82M is a good baseline because it is fast, small, and practical. That matters because your sync strategy may generate multiple TTS candidates per segment.

If a 7-minute video has 180 segments and you generate 3 variants each, that can be hundreds of TTS generations. A fast TTS model is useful.

Use Kokoro like this:

good:
Kokoro per segment
→ measure duration
→ choose / rewrite / adjust / pad

bad:
one huge Kokoro output
→ global stretch

Test other TTS models only after timing works

A better voice will not fix bad segment logic.

After the sync loop works, compare:

Chatterbox is interesting because its model card includes pacing-related controls such as cfg and exaggeration. VoxCPM2 is interesting for higher-quality multilingual TTS and voice design. But do not switch models before fixing the segment-level timing pipeline.


17. Projects and papers worth studying

Practical projects

  • Auto-Synced-Translated-Dubs
    Good reference for subtitle-timing-based translation and dubbed audio.

  • SoniTranslate
    Larger synchronized video translation/dubbing project with useful real-world issues.

  • VideoLingo
    Useful for subtitle segmentation, translation, alignment, and dubbing workflow.

  • Bluez-Dubbing
    Useful for modular dubbing, source separation, VAD-based alignment, and sync strategy ideas.

  • Subdub
    Useful for subtitle-to-dub CLI workflow.


18. Recommended stack for your case

Baseline stack

Whisper / faster-whisper
→ Gemini timed adaptation
→ Kokoro
→ per-segment duration fitting
→ timeline overlay

This is the first thing I would build.

Chinese-focused timestamp stack

Qwen3-ASR-1.7B
→ Qwen3-ForcedAligner-0.6B
→ Gemini timed adaptation
→ Kokoro
→ per-segment duration fitting

Use this if Whisper timestamps are weak on Chinese/Cantonese/dialect-heavy content.

Better TTS experiment

Whisper or Qwen3-ASR
→ Gemini timed adaptation
→ Chatterbox or VoxCPM2
→ same duration fitting logic

Only test this after the Kokoro timing loop works.


19. Concrete implementation order

Step 1: Stop full-video TTS

Change:

one translated script → one TTS file

to:

many timestamped segments → many TTS clips

Step 2: Add candidate variants

For each segment, generate:

normal
compact
ultra_compact
expanded

Step 3: Measure real audio duration

After TTS:

duration = actual WAV duration

Do not trust estimated words-per-second.
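
A minimal sketch of that measurement, writing the result back into the segment table (soundfile is an assumption; any WAV reader works):

import soundfile as sf

def record_tts_duration(segment, wav_path):
    """Store measured duration and ratio, the two numbers fitting relies on."""
    segment["tts_duration"] = float(sf.info(wav_path).duration)
    segment["ratio"] = segment["tts_duration"] / segment["target_duration"]
    return segment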

Step 4: Add fitting logic

Use:

rewrite if too long
pad if too short
small speed correction if close
manual review if extreme

Step 5: Overlay clips by timestamp

Do not concatenate. Place each clip at the original segment start.

Step 6: Generate a QA report

Example:

{
  "video_duration": 420.0,
  "segments_total": 184,
  "segments_good": 132,
  "segments_padded": 31,
  "segments_speed_corrected": 14,
  "segments_rewritten": 7,
  "overlaps": 2,
  "manual_review": [44, 91]
}

Flag segments where:

duration ratio > 1.25
duration ratio < 0.70
speed correction > 1.20x
overlap > 150 ms
segment contains visual reveal
segment has title/name/date/number

This lets you review only problematic segments instead of repeatedly watching the whole video.
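
A sketch of the flagging pass over the segment table (the speed and overlap_ms fields are assumed to have been recorded during fitting and assembly; importance == "high" is a rough stand-in for the reveal/name/date checks):

def needs_review(seg):
    ratio = seg.get("ratio", 1.0)
    return (ratio > 1.25
            or ratio < 0.70
            or seg.get("speed", 1.0) > 1.20
            or seg.get("overlap_ms", 0) > 150
            or seg.get("importance") == "high")

manual_review = [seg["id"] for seg in segments if needs_review(seg)]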


20. What I would not do

Do not rely on exact word counts

Word count is too weak. Use duration budgets plus measured TTS duration.

Do not globally stretch the final audio

It fixes the ending, not the middle.

Do not globally change video speed

This may be acceptable for lectures/tutorials, but for story/recap edits it usually damages pacing.

Do not over-split into tiny clips

Kokoro can be weaker on very short utterances. Merge tiny fragments when they belong to the same scene.

Do not use aggressive tempo correction

Rewrite the text instead.


21. The actual trick

The practical trick is:

Generate multiple English versions per timestamped segment, synthesize them, measure the real audio duration, and choose or rewrite until each segment fits.

Not:

calculate one average words-per-second

Not:

stretch the final audio

Not:

ask the LLM for exactly 22 words

The real loop is:

target segment duration
→ LLM generates variants
→ TTS generates audio
→ code measures duration
→ choose / rewrite / speed / pad
→ place on timeline

That is the method I would build.


Short summary

  • Your issue is mostly pipeline design, not simply Kokoro/Gemini/Whisper.
  • Use per-segment timed adaptation, not full-script translation.
  • Use a central segment table with start, end, duration, source_text, visual_note, and candidate English lines.
  • Ask Gemini for normal, compact, ultra_compact, and expanded variants.
  • Generate Kokoro TTS per segment and measure actual audio duration.
  • If too long, rewrite shorter. If slightly long, speed up a little. If too short, pad or use expanded text.
  • Overlay clips at original timestamps instead of concatenating them.
  • Keep Kokoro for now; improve the architecture first.
  • Test WhisperX or Qwen3-ASR + Qwen3-ForcedAligner if timestamps are weak.
  • Test Chatterbox/VoxCPM2 only after segment-level sync works.