In a Voice→ASR→Text→TTS pipeline, information related to “time” is lost at the ASR stage… (unless you find a way to explicitly preserve it…)
I suppose in the future, it will be easy to do with a single model…
One method that works with most ASR models is to pre-segment the audio by time interval, run each segment through the pipeline, and concatenate the results. Even if an individual segment drifts slightly, this approach keeps the entire sequence from drifting out of sync.
How to make TTS voice sync with video for translated recap/story dubbing
You are running into a very common automatic-dubbing problem: the target-language narration is not naturally the same duration as the source-language narration. A 7-minute Chinese narration can easily become a 5-minute English narration, or the reverse. Stretching the whole audio file, speeding up the video, or asking an LLM for an exact word count will not reliably fix it.
The core fix is architectural:
Do not generate one full translated script, synthesize one full TTS file, and then stretch it to match the video.
Instead, split the source into timed narration segments, generate duration-aware English variants per segment, synthesize each segment separately, measure actual TTS duration, then place each fitted clip back onto the original timeline.
This is the difference between global duration matching and local scene/beat synchronization.
1. What is actually going wrong?
Your current pipeline is approximately:
video
→ extract audio
→ isolate main voice
→ ASR with timestamps
→ calibrate average TTS speed / words per second
→ send large transcript to LLM for English
→ generate TTS
→ stretch/compress/merge with video
The failure is happening because the last part treats dubbing as a whole-file duration problem.
But story/recap dubbing is a timeline problem.
A full English TTS file matching the full video length does not guarantee that:
- the character name is spoken when the character appears,
- the twist is explained when the twist is shown,
- the pause happens where the original pause happened,
- the scene transition lines up with the scene transition,
- the important reveal word lands near the visual reveal.
So global stretching can make the end time correct while the middle is still wrong.
2. Why exact word count does not solve it
Asking the LLM for a precise word count is not enough.
Example:
"He ran."
"Unbelievable."
"She underestimated him."
These have very different spoken durations even though each is only a few words long.
TTS duration depends on:
- phonemes,
- syllables,
- punctuation,
- pauses,
- voice,
- TTS model prosody,
- names,
- numbers,
- sentence structure,
- whether the model inserts expressive pauses.
The VideoDubber paper specifically argues that word/character-count control is not enough for dubbing because spoken duration varies across languages and tokens.
So the LLM should not be asked only:
Make this exactly 15 words.
It should be asked:
Make this a natural English narration line that can be spoken in about 4.2 seconds.
Return normal, compact, ultra_compact, and expanded versions.
Then your code should synthesize the audio and measure the real duration.
The measured WAV duration is the final truth.
3. Correct target: scene/beat sync, not lip sync
You said you do not need lip-sync. That makes the problem easier.
There are four different sync levels:
| Sync level | Meaning | Needed for your case? |
| --- | --- | --- |
| Global sync | final audio starts/ends with the video | Yes, but not enough |
| Segment sync | each narration line fits its local source slot | Yes |
| Beat sync | important words land near important visual events | Yes |
| Lip sync | phonemes match mouth shapes | No |
Your target is:
segment sync + story beat sync
not full lip-sync.
For story recaps, a good dub is one where:
- “he opened the door” happens near the door opening,
- “she was the traitor” happens near the reveal,
- “three years later” lands near the time-skip card,
- the narration resets when the scene changes,
- pauses still feel natural.
4. Recommended architecture
Use this pipeline instead:
video
→ extract audio
→ isolate speech for ASR
→ keep music/SFX bed separately
→ ASR + VAD + timestamps
→ clean into dubbing-friendly segments
→ optionally add scene/shot boundaries
→ LLM generates timed English variants per segment
→ TTS each segment separately
→ measure each TTS clip duration
→ choose / rewrite / speed-adjust / pad
→ place each clip at the original segment timestamp
→ mix with music/SFX bed
→ export final video
The key difference:
Wrong:
one translated script → one TTS file → global stretch
Right:
many timestamped segments → many fitted TTS clips → timeline overlay
5. Build around a segment table
Use one central data structure for the whole pipeline.
Example:
[
  {
    "id": 42,
    "start": 183.20,
    "end": 187.60,
    "target_duration": 4.40,
    "source_text": "<source text>",
    "scene_note": "<the woman reveals the letter>",
    "importance": "high",
    "can_start_early_ms": 100,
    "can_end_late_ms": 250,
    "english_candidates": {
      "normal": "",
      "compact": "",
      "ultra_compact": "",
      "expanded": ""
    },
    "chosen_text": "",
    "tts_duration": null,
    "fit_action": ""
  }
]
This makes debugging much easier.
Instead of only knowing:
The final English audio is 2 minutes too short.
you can know:
segment 42 is 38% too long
segment 57 is too short but can be padded
segment 91 overlaps the next scene
segment 103 needs a shorter rewrite
That is the difference between guessing and engineering.
6. How to create good segments
Raw Whisper segments are not always good dubbing segments. You need to convert them into TTS-friendly, scene-aware narration blocks.
Use three types of boundaries.
Speech boundaries
These come from ASR/VAD:
- speech starts,
- speech ends,
- pause begins,
- pause ends.
Semantic boundaries
These are story boundaries:
- one plot point,
- one reveal,
- one joke,
- one explanation,
- one transition.
Visual boundaries
These come from the video:
- scene cut,
- character appears,
- object appears,
- title card appears,
- fight starts,
- flashback begins.
For story/recap videos, visual boundaries matter. If the source says “then he opened the door” after the door is already open, the dub feels late even if the audio duration is technically correct.
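A minimal sketch of this conversion, assuming Whisper-style segment dicts and a sorted list of scene-cut timestamps in seconds (the function name, thresholds, and fields are illustrative, not a fixed API):

```python
# Sketch: merge raw ASR segments into dubbing-friendly blocks.
# Assumes Whisper-style segments: {"start": float, "end": float, "text": str}.
# scene_cuts: sorted scene-change timestamps in seconds.

def build_dub_segments(asr_segments, scene_cuts, max_dur=8.0, min_gap=0.6):
    def crosses_cut(prev_end, next_start):
        # True if a visual boundary falls inside the gap between segments.
        return any(prev_end <= cut <= next_start for cut in scene_cuts)

    blocks = []
    for seg in asr_segments:
        if blocks:
            prev = blocks[-1]
            gap = seg["start"] - prev["end"]
            merged_dur = seg["end"] - prev["start"]
            # Merge unless there is a real pause, a scene cut, or the
            # merged block would leave the 3-8 s comfort range.
            if (gap < min_gap and merged_dur <= max_dur
                    and not crosses_cut(prev["end"], seg["start"])):
                prev["end"] = seg["end"]
                prev["text"] += " " + seg["text"].strip()
                continue
        blocks.append({"start": seg["start"], "end": seg["end"],
                       "text": seg["text"].strip()})
    return blocks
```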
7. Segment length rules for Kokoro
Kokoro is fast and practical, but its segment size matters.
The Kokoro voice notes mention:
- weakness on very short utterances, especially under about 10–20 tokens,
- rushing on long utterances, especially over 400 tokens,
- possible mitigation by bundling short utterances, chunking long utterances, or adjusting speed.
For recap/story narration, start with:
| Segment duration | Recommendation |
| --- | --- |
| Under 1 sec | Usually too short; merge if possible |
| 1–2 sec | Use only for punchy dramatic beats |
| 3–8 sec | Best range for most narration |
| 8–12 sec | Good for slower explanation |
| 12–15 sec | Usable but monitor |
| 15+ sec | Usually split |
A good practical default:
Most segments: 3–8 seconds
Rare short beats: 1.5–3 seconds
Rare long explanations: 8–12 seconds
Avoid both extremes:
- too many tiny clips → choppy, unstable TTS,
- huge paragraphs → hard to fit, rushed, bad timing.
8. The most useful trick: generate multiple timing variants
Do not ask Gemini for one translation.
Ask for several spoken variants per segment:
{
  "id": 42,
  "normal": "At that moment, he realized she had been lying to him all along.",
  "compact": "Then he realized she had been lying.",
  "ultra_compact": "She had lied all along.",
  "expanded": "At that moment, he finally realized she had been lying to him from the beginning.",
  "must_keep": ["she lied", "he realizes it now"],
  "can_drop": ["from the beginning", "emotional emphasis"]
}
Then synthesize candidates and measure real duration.
Example:
| Candidate |
TTS duration |
Target slot |
Decision |
| normal |
5.8s |
3.9s |
Too long |
| compact |
4.2s |
3.9s |
Good with slight speedup |
| ultra_compact |
2.6s |
3.9s |
Too short unless pause works |
| expanded |
6.5s |
3.9s |
Reject |
This matches the direction of length-aware dubbing research such as the VideoDubber paper mentioned above.
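A minimal sketch of that measure-and-choose loop. `synthesize` is a hypothetical placeholder for your TTS call (Kokoro or anything else) that returns `(samples, sample_rate)`:

```python
# Sketch: synthesize every variant, measure the real duration, and pick
# the candidate whose ratio to the target is closest to 1.0.

def pick_candidate(candidates, target_sec):
    results = {}
    for name, text in candidates.items():
        if not text:
            continue
        samples, sample_rate = synthesize(text)   # hypothetical TTS helper
        duration = len(samples) / sample_rate     # measured, not estimated
        results[name] = (text, duration, duration / target_sec)
    # Closest ratio to 1.0 wins; the fitting policy below handles the rest.
    name, (text, duration, ratio) = min(
        results.items(), key=lambda kv: abs(kv[1][2] - 1.0))
    return name, text, duration, ratio
```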
9. Better Gemini prompt
Use Gemini as a timed adaptation model, not only a translator.
You are adapting narration for an English TTS dub of a story recap video.
The goal is not literal translation.
The goal is natural English narration that fits the original video timing.
Rules:
1. Preserve the segment id.
2. Do not merge segments.
3. Do not move plot information across segment boundaries.
4. Preserve character names and plot-critical facts.
5. Use short spoken English.
6. Avoid long clauses.
7. If the time slot is short, compress less important detail.
8. If the time slot is long, use natural pacing but do not add new facts.
9. Output valid JSON only.
For each segment, return:
{
  "id": number,
  "normal": "natural spoken English",
  "compact": "shorter version",
  "ultra_compact": "shortest acceptable version",
  "expanded": "slightly fuller version if TTS is too short",
  "must_keep": ["critical facts"],
  "can_drop": ["details that may be removed if compact"]
}
Duration guide:
- Under 2 sec: 2–6 words
- 2–4 sec: 5–12 words
- 4–7 sec: 10–22 words
- 7–10 sec: 18–35 words
- 10+ sec: one or two short sentences
Input:
[
  {
    "id": 42,
    "start": 183.20,
    "end": 187.60,
    "duration_sec": 4.40,
    "source_text": "<source text>",
    "context_before": "<previous context>",
    "context_after": "<next context>",
    "visual_note": "<the woman reveals the letter>",
    "importance": "high"
  }
]
The word guide is only a rough hint. The actual TTS duration is the final authority.
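One possible wiring for this step, with `call_gemini` as a placeholder for whatever Gemini client you use (the fenced-JSON fallback is a common but illustrative workaround):

```python
import json

# PROMPT_HEADER holds the rules and duration guide from above, verbatim.
PROMPT_HEADER = "You are adapting narration for an English TTS dub ..."

def request_variants(segments, call_gemini):
    # call_gemini(prompt: str) -> str is a placeholder for your LLM client.
    payload = json.dumps(segments, ensure_ascii=False, indent=2)
    reply = call_gemini(PROMPT_HEADER + "\n\nInput:\n" + payload)
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # Models sometimes wrap JSON in a code fence; strip and retry.
        cleaned = reply.strip().strip("`")
        cleaned = cleaned[cleaned.find("["):cleaned.rfind("]") + 1]
        return json.loads(cleaned)
```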
10. Duration fitting policy
For every generated TTS clip:
target = source_end - source_start
actual = generated_tts_duration
ratio = actual / target
Use this decision table:
| Ratio | Meaning | Best action |
| --- | --- | --- |
| 0.90–1.05 | Good fit | Keep; pad tiny gap if needed |
| 0.75–0.90 | Short | Add silence or try expanded version |
| 0.60–0.75 | Very short | Use expanded version or add deliberate pause |
| 1.05–1.15 | Slightly long | Small speedup or small tempo correction |
| 1.15–1.30 | Long | Try compact version first |
| 1.30–1.60 | Too long | Rewrite/compress |
| >1.60 | Bad fit | Re-segment or mark for manual review |
Important rule:
Rewrite text first.
Use TTS speed second.
Use small audio tempo correction third.
Pad silence when short.
Send extreme failures to manual review.
Do not do:
literal translation
→ huge time-stretch
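The decision table translates directly into code. A sketch (the action names are illustrative; ratios below 0.60 fall outside the table, so they are routed to manual review here):

```python
def fit_action(actual_sec, target_sec):
    """Map the measured/target ratio to the decision table above."""
    r = actual_sec / target_sec
    if r > 1.60:
        return "resegment_or_manual_review"
    if r > 1.30:
        return "rewrite_shorter"
    if r > 1.15:
        return "try_compact_variant"
    if r > 1.05:
        return "small_speedup"
    if r >= 0.90:
        return "keep_pad_tiny_gap"
    if r >= 0.75:
        return "pad_silence_or_expanded"
    if r >= 0.60:
        return "expanded_or_deliberate_pause"
    return "resegment_or_manual_review"  # below the table: treat as bad fit
```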
11. When to stretch audio
Use audio stretching only for small mismatches.
FFmpeg’s atempo filter changes audio tempo. It can be useful, but for dubbing the real limit is not the technical maximum. The real limit is naturalness.
Practical limits:
| Adjustment | Usually okay? |
| --- | --- |
| 0.95×–1.05× | Safe |
| 0.90×–1.10× | Usually fine |
| 0.85×–1.20× | Sometimes acceptable |
| Below 0.85× | Often sounds dragged |
| Above 1.20× | Often sounds rushed |
| Above 1.30× | Rewrite instead |
Example:
ffmpeg -i <input.wav> -filter:a "atempo=1.10" <output.wav>
If your clip is 6.5 seconds and the target is 3.8 seconds, do not speed it up by 1.71×. Rewrite it.
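A small wrapper that computes the factor and refuses to leave the natural range (subprocess sketch; the limits are the ones from the table above):

```python
import subprocess

def tempo_correct(in_wav, out_wav, actual_sec, target_sec, lo=0.85, hi=1.20):
    factor = actual_sec / target_sec   # >1 speeds up, <1 slows down
    if not lo <= factor <= hi:
        raise ValueError(f"atempo {factor:.2f} outside natural range; rewrite instead")
    subprocess.run(
        ["ffmpeg", "-y", "-i", in_wav,
         "-filter:a", f"atempo={factor:.3f}", out_wav],
        check=True)
```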
12. How to handle long vs short TTS
If TTS is too long
Example:
target: 4.0s
generated: 5.7s
ratio: 1.43
Bad fix:
speed up to 1.43×
Better fix:
- ask Gemini for a compact version,
- remove adjectives,
- remove repeated context,
- preserve only the plot-critical beat.
Example:
Too long:
At that moment, he finally understood that the woman standing in front of him had secretly planned everything from the start.
Better:
Then he realized she planned everything.
If TTS is too short
Example:
target: 5.0s
generated: 3.4s
ratio: 0.68
Bad fix:
slow it down heavily
Better fixes:
- try expanded variant,
- add silence after the line,
- add silence before a reveal,
- add a natural pause between clauses.
Example:
Short:
She was the traitor.
Expanded:
Only then did he realize she was the traitor.
For story videos, silence is often natural. A small pause before a reveal can improve drama.
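A sketch of the padding option, assuming pydub and that a deliberate pause before the line is sometimes wanted (`lead_ms` is an illustrative parameter):

```python
from pydub import AudioSegment

def pad_clip(clip, target_sec, lead_ms=0):
    # Pad a too-short clip with silence. lead_ms goes before the line
    # (e.g. a dramatic pause before a reveal); the rest goes after it.
    deficit_ms = int(target_sec * 1000) - len(clip)  # pydub lengths are in ms
    if deficit_ms <= 0:
        return clip
    lead = min(lead_ms, deficit_ms)
    return (AudioSegment.silent(duration=lead)
            + clip
            + AudioSegment.silent(duration=deficit_ms - lead))
```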
13. Timeline assembly
Do not concatenate TTS clips.
Create a silent audio track with the same duration as the video, then overlay each fitted clip at its original timestamp.
Conceptually:
final_audio = silence(video_duration)
for segment in segments:
    clip = generated_tts(segment["chosen_text"])
    clip = fit_to_segment(clip, segment)
    final_audio.overlay(clip, position=segment["start"])
This preserves the original timeline.
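A runnable version of that loop, assuming pydub and that each segment already has a fitted clip on disk under an illustrative `fitted_wav` field:

```python
from pydub import AudioSegment

def assemble_timeline(segments, video_duration_sec, out_path):
    # Silent base track spanning the whole video.
    timeline = AudioSegment.silent(duration=int(video_duration_sec * 1000))
    for seg in segments:
        clip = AudioSegment.from_wav(seg["fitted_wav"])
        # overlay() returns a new segment; position is in milliseconds.
        timeline = timeline.overlay(clip, position=int(seg["start"] * 1000))
    timeline.export(out_path, format="wav")
```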
If two clips overlap, do not blindly mix them. Overlap means something needs fixing:
- translation too verbose,
- segment boundary wrong,
- TTS speed too slow,
- adjacent segments should be merged,
- current segment should be rewritten,
- visual beat needs manual review.
14. Chinese → English adaptation tips
Chinese-to-English dubbing has special issues.
English often needs explicit connectors
Chinese narration can be compact and context-heavy. English often adds words like:
but then
because of this
at that moment
meanwhile
three years later
These improve clarity but add duration.
Prompt the LLM:
Use connectors only when needed.
Prefer short spoken English.
Avoid long clauses.
Do not add explanation that is not necessary for this scene.
Use short glossary terms
Chinese story videos often contain relationship terms, fantasy titles, cultivation terms, names, or recurring labels. These can become too long in English.
| Chinese term | Long English | Better timed version |
| --- | --- | --- |
| 师兄 | his senior martial brother | senior brother |
| 魔尊 | the supreme demon lord | Demon Lord |
| 灵根 | spiritual root | spirit root |
| 三年后 | after three years had passed | three years later |
Use a glossary:
{
  "师兄": "senior brother",
  "魔尊": "Demon Lord",
  "灵根": "spirit root",
  "三年后": "three years later"
}
This improves consistency and reduces duration.
Compress phrasing, not plot facts
Bad compression removes story meaning. Good compression removes extra phrasing.
Example:
Literal:
He never expected that everything that had happened until now was actually only one small part of her carefully prepared plan.
Timed:
It was all part of her plan.
The second version is better if the scene slot is short.
15. Timestamping options
Whisper is fine as a baseline, but for dubbing you must evaluate timestamp quality, not only transcript quality.
Check:
- segment start accuracy,
- segment end accuracy,
- pause detection,
- word timing near important beats.
For Chinese-heavy content, Qwen3-ASR + Qwen3-ForcedAligner is worth testing. The Qwen3-ASR model card describes Qwen3-ForcedAligner as supporting timestamp prediction for arbitrary units within up to 5 minutes of speech in 11 languages.
Recommended test:
1. Take 10 representative videos.
2. Run Whisper / faster-whisper.
3. Run WhisperX.
4. Run Qwen3-ASR + Qwen3-ForcedAligner.
5. Compare 30–50 important scene boundaries manually.
6. Choose based on timestamp quality, not only text accuracy.
16. TTS model thoughts
Keep Kokoro first
Kokoro-82M is a good baseline because it is fast, small, and practical. That matters because your sync strategy may generate multiple TTS candidates per segment.
If a 7-minute video has 180 segments and you generate 3 variants each, that is already 540 TTS generations. A fast TTS model is useful.
Use Kokoro like this:
good:
Kokoro per segment
→ measure duration
→ choose / rewrite / adjust / pad
bad:
one huge Kokoro output
→ global stretch
Test other TTS models only after timing works
A better voice will not fix bad segment logic.
After the sync loop works, compare alternatives. Chatterbox is interesting because its model card includes pacing-related controls such as cfg and exaggeration. VoxCPM2 is interesting for higher-quality multilingual TTS and voice design. But do not switch models before fixing the segment-level timing pipeline.
17. Projects and papers worth studying
Practical projects
- Auto-Synced-Translated-Dubs: good reference for subtitle-timing-based translation and dubbed audio.
- SoniTranslate: larger synchronized video translation/dubbing project with useful real-world issues.
- VideoLingo: useful for subtitle segmentation, translation, alignment, and dubbing workflow.
- Bluez-Dubbing: useful for modular dubbing, source separation, VAD-based alignment, and sync strategy ideas.
- Subdub: useful for subtitle-to-dub CLI workflow.
Papers / research
- VideoDubber: argues that word/character-count control is not enough for dubbing (see section 2).
18. Recommended stack for your case
Baseline stack
Whisper / faster-whisper
→ Gemini timed adaptation
→ Kokoro
→ per-segment duration fitting
→ timeline overlay
This is the first thing I would build.
Chinese-focused timestamp stack
Qwen3-ASR-1.7B
→ Qwen3-ForcedAligner-0.6B
→ Gemini timed adaptation
→ Kokoro
→ per-segment duration fitting
Use this if Whisper timestamps are weak on Chinese/Cantonese/dialect-heavy content.
Better TTS experiment
Whisper or Qwen3-ASR
→ Gemini timed adaptation
→ Chatterbox or VoxCPM2
→ same duration fitting logic
Only test this after the Kokoro timing loop works.
19. Concrete implementation order
Step 1: Stop full-video TTS
Change:
one translated script → one TTS file
to:
many timestamped segments → many TTS clips
Step 2: Add candidate variants
For each segment, generate:
normal
compact
ultra_compact
expanded
Step 3: Measure real audio duration
After TTS:
duration = actual WAV duration
Do not trust estimated words-per-second.
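Measuring it needs nothing beyond the standard library:

```python
import wave

def wav_duration_sec(path):
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())
```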
Step 4: Add fitting logic
Use:
rewrite if too long
pad if too short
small speed correction if close
manual review if extreme
Step 5: Overlay clips by timestamp
Do not concatenate. Place each clip at the original segment start.
Step 6: Generate a QA report
Example:
{
  "video_duration": 420.0,
  "segments_total": 184,
  "segments_good": 132,
  "segments_padded": 31,
  "segments_speed_corrected": 14,
  "segments_rewritten": 7,
  "overlaps": 2,
  "manual_review": [44, 91]
}
Flag segments where:
duration ratio > 1.25
duration ratio < 0.70
speed correction > 1.20x
overlap > 150 ms
segment contains visual reveal
segment has title/name/date/number
This lets you review only problematic segments instead of repeatedly watching the whole video.
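A sketch of the flagging pass. Field names follow the segment table from section 5; `speed_factor` and `overlap_ms` are assumed bookkeeping fields your fitting step would record:

```python
def qa_flags(seg):
    """Return the reasons a segment needs human review."""
    flags = []
    ratio = seg["tts_duration"] / seg["target_duration"]
    if ratio > 1.25:
        flags.append("too_long")
    if ratio < 0.70:
        flags.append("too_short")
    if seg.get("speed_factor", 1.0) > 1.20:   # assumed bookkeeping field
        flags.append("heavy_speed_correction")
    if seg.get("overlap_ms", 0) > 150:        # assumed bookkeeping field
        flags.append("overlaps_next_segment")
    if seg.get("importance") == "high":       # visual reveal / key beat
        flags.append("check_visual_beat")
    if any(ch.isdigit() for ch in seg.get("chosen_text", "")):
        flags.append("contains_number_or_date")
    return flags
```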
20. What I would not do
Do not rely on exact word counts
Word count is too weak. Use duration budgets plus measured TTS duration.
Do not globally stretch the final audio
It fixes the ending, not the middle.
Do not globally change video speed
This may be acceptable for lectures/tutorials, but for story/recap edits it usually damages pacing.
Do not over-split into tiny clips
Kokoro can be weaker on very short utterances. Merge tiny fragments when they belong to the same scene.
Do not use aggressive tempo correction
Rewrite the text instead.
21. The actual trick
The practical trick is:
Generate multiple English versions per timestamped segment, synthesize them, measure the real audio duration, and choose or rewrite until each segment fits.
Not:
calculate one average words-per-second
Not:
stretch the final audio
Not:
ask the LLM for exactly 22 words
The real loop is:
target segment duration
→ LLM generates variants
→ TTS generates audio
→ code measures duration
→ choose / rewrite / speed / pad
→ place on timeline
That is the method I would build.
Short summary
- Your issue is mostly pipeline design, not simply Kokoro/Gemini/Whisper.
- Use per-segment timed adaptation, not full-script translation.
- Use a central segment table with start, end, duration, source_text, visual_note, and candidate English lines.
- Ask Gemini for normal, compact, ultra_compact, and expanded variants.
- Generate Kokoro TTS per segment and measure actual audio duration.
- If too long, rewrite shorter. If slightly long, speed up a little. If too short, pad or use expanded text.
- Overlay clips at original timestamps instead of concatenating them.
- Keep Kokoro for now; improve the architecture first.
- Test WhisperX or Qwen3-ASR + Qwen3-ForcedAligner if timestamps are weak.
- Test Chatterbox/VoxCPM2 only after segment-level sync works.