Overview
Dubbing generates a new audio track for your video in the target language, using a voice that sounds like the original speaker. This goes beyond subtitles — viewers hear the content in their language with the creator’s own voice characteristics.
Supported dubbing languages
Voice cloning is supported for 10 languages:
| Flag | Language | Code |
|---|---|---|
| 🇺🇸 | English | eng |
| 🇨🇳 | Chinese | zho |
| 🇰🇷 | Korean | kor |
| 🇮🇹 | Italian | ita |
| 🇪🇸 | Spanish | spa |
| 🇧🇷 | Portuguese | por |
| 🇩🇪 | German | deu |
| 🇫🇷 | French | fra |
| 🇯🇵 | Japanese | jpn |
| 🇷🇺 | Russian | rus |
For the full capabilities matrix across transcription, translation, and dubbing, see Supported Languages.
How voice cloning works
The dubbing pipeline processes each segment through four stages:
- Analyze — Extracts voice characteristics from the original speaker’s audio
- Enroll — Creates a voice profile with the TTS provider (DashScope Qwen3-TTS-VC)
- Synthesize — Generates speech in the target language using the cloned voice
- Calibrate — Adjusts audio duration to match the original segment timing (up to 7 retries, 10% tolerance)
Each segment is output as a 24 kHz WAV file, and the segments are then merged into complete audio tracks.
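The Calibrate stage can be pictured as a bounded retry loop. The Python sketch below is illustrative only and is not the service's implementation. The `synthesize` callable, the speaking-rate adjustment, and the reading of the 10% tolerance as a symmetric band around the original segment duration are all assumptions. Only the limit of 7 retries and the 10% figure come from the stage description above.

```python
def within_tolerance(actual_s: float, target_s: float, tolerance: float = 0.10) -> bool:
    """True if the synthesized duration is within +/- tolerance of the target."""
    return abs(actual_s - target_s) <= tolerance * target_s


def calibrate(synthesize, target_s: float, max_retries: int = 7, tolerance: float = 0.10):
    """Re-synthesize until the clip fits the target duration or retries run out.

    `synthesize` is a hypothetical callable: it takes a speaking-rate
    multiplier and returns the resulting clip duration in seconds.
    `target_s` must be positive. Returns (speed, duration_s, retries_used).
    """
    speed = 1.0
    duration = synthesize(speed)  # first attempt, not counted as a retry
    retries = 0
    while not within_tolerance(duration, target_s, tolerance) and retries < max_retries:
        # Assumed adjustment: speak faster if the clip ran long, slower if short.
        speed *= duration / target_s
        duration = synthesize(speed)
        retries += 1
    return speed, duration, retries
```

With a stand-in such as `lambda speed: 6.0 / speed` and a 5-second target, the loop settles after one retry. A synthesizer that never gets within tolerance stops after the seventh retry, and the caller has to decide what to do with the out-of-tolerance clip.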
Starting a dubbing job
1. From Add Languages, select your target languages
2. Enable the Dubbing toggle alongside (or instead of) captions
3. Click Start
Dubbing jobs run after translation completes — the translated text is what gets synthesized into speech.
Reviewing dubbed audio
Once the dubbing job completes, open the video workspace and select the dubbed language track. The video player plays the dubbed audio in sync with the video so you can review it before publishing.
Downloading dubbed audio
Click the download icon on any completed language card to access:
- Merged Audio (MP3) — Dubbed speech mixed with the instrumental background track
- Dubbed Audio (MP3) — Dubbed speech only, no background audio
- Instrumental (WAV) — Background audio only (extracted from original)
- Audio Segments (ZIP) — Individual WAV files for each segment
See Exporting for details on all download formats.
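After downloading the Audio Segments (ZIP), you may want to check the segments locally. The sketch below uses only the Python standard library to list each WAV file's sample rate and duration. The archive layout and file names are assumptions, since this page does not describe them, so the code simply inspects every `.wav` entry it finds.

```python
import io
import wave
import zipfile


def summarize_segments(zip_source):
    """Return (name, sample_rate_hz, duration_s) for every WAV file in the archive.

    `zip_source` may be a path or a binary file-like object.
    """
    rows = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".wav"):
                continue  # skip anything that is not a WAV segment
            # Read the entry into memory so the wave module gets a seekable stream.
            with wave.open(io.BytesIO(archive.read(name)), "rb") as clip:
                rate = clip.getframerate()
                rows.append((name, rate, clip.getnframes() / rate))
    return rows
```

Calling `summarize_segments("segments.zip")` (a hypothetical file name) returns one tuple per segment. Based on the pipeline description, the sample rate would be expected to be 24000.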
Tips for best results
- Single clear speaker — Videos with one speaker produce the best cloning results
- Minimal background noise — Heavy music or ambient noise degrades voice clone quality
- Sufficient reference audio — Videos under 30 seconds may not provide enough audio for accurate cloning
- Edit dubbed captions — If synthesis sounds unnatural, try editing the translated caption text to be shorter or simpler — the dubbed audio will regenerate
Videos detected as containing primarily music content (no speech) will fail dubbing with a music_detected error. This is a permanent error — the video is not suitable for voice dubbing. Credits are automatically refunded.
Credit cost
Dubbing is charged at 250 credits per 1,000 characters of source text, per language. See Credit Costs for a detailed breakdown.
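As a rough planning aid, the stated rate can be turned into a small estimator. This Python sketch assumes the charge scales linearly with character count and applies no rounding or minimum charge. Neither rounding nor minimums are specified on this page, so treat the result as an estimate and check Credit Costs for the actual billing rules.

```python
CREDITS_PER_1000_CHARS = 250  # rate stated in this section


def estimate_dubbing_credits(source_chars: int, num_languages: int) -> float:
    """Estimate dubbing credits for a source text dubbed into several languages.

    Assumes linear scaling and no rounding (both are assumptions).
    """
    if source_chars < 0 or num_languages < 0:
        raise ValueError("source_chars and num_languages must be non-negative")
    return source_chars * CREDITS_PER_1000_CHARS / 1000 * num_languages
```

For example, a 4,000-character source text dubbed into 3 languages comes to 3,000 credits under these assumptions.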