Audio Quality Tips

Getting accurate transcriptions

Do:

Use clean audio with minimal background noise
Record with a good microphone (lapel or directional)
Speak clearly at a moderate pace
Select the correct source language (or use auto-detect for common languages)

Avoid:

Heavy background music during speech
Multiple speakers talking simultaneously
Echo-heavy rooms or outdoor environments with wind
Very fast speech or heavy accents, unless you plan to post-edit the captions

Speaker diarization

Neolli automatically identifies different speakers in your video. For best results:

Speakers should have distinct voices
Minimize crosstalk (speakers talking over each other)
Longer speaking turns produce more reliable speaker identification

If speaker labels are wrong, you can reassign them in the caption editor using the T shortcut.

Getting natural-sounding dubs

Voice profile requirements:

At least 30 seconds of clear speech from the target speaker
Minimal background music or effects during speech
Single-speaker videos produce significantly better results than multi-speaker videos
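
The 30-second requirement above can be checked before you submit a video. The following is a minimal sketch, assuming you already have speaker-labeled segments with start and end times in seconds (for example from an exported transcript). The SpeechSegment type and both function names are hypothetical and are not part of any Neolli API.

```python
from dataclasses import dataclass

# Threshold taken from the voice profile requirements above.
MIN_REFERENCE_SECONDS = 30.0


@dataclass
class SpeechSegment:
    """Hypothetical speaker-labeled segment; not a Neolli data type."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def speech_seconds_by_speaker(segments):
    """Sum the spoken duration of each speaker across all segments."""
    totals = {}
    for seg in segments:
        duration = max(0.0, seg.end - seg.start)  # ignore inverted segments
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + duration
    return totals


def speakers_below_minimum(segments, minimum=MIN_REFERENCE_SECONDS):
    """Return speakers whose total speech is shorter than the minimum."""
    totals = speech_seconds_by_speaker(segments)
    return sorted(name for name, total in totals.items() if total < minimum)


segments = [
    SpeechSegment("A", 0.0, 20.0),
    SpeechSegment("B", 20.0, 25.0),
    SpeechSegment("A", 25.0, 40.0),
]
print(speakers_below_minimum(segments))
```

This only measures how much speech each speaker has. It does not tell you whether that speech is clear or free of background music.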

What degrades voice cloning:

Heavy background music blended with speech
Echo or reverb in the recording
Very short videos (under 30 seconds) — insufficient reference audio
Multiple speakers in the same video — produces a blended voice

Improving results:

Edit translated captions to be shorter or simpler — shorter phrases synthesize more naturally
The dubbing system calibrates audio duration automatically to match segment timing
If a specific segment sounds unnatural, edit that caption text and regenerate

Fixing caption sync

If captions appear early or late relative to the audio:

Open the caption editor
Select the affected segment
Snap the start time or the end time to the playhead using the editor's snap shortcuts
Use Alt + ← / → to slide the entire segment timing

For systemic offset (all captions off by the same amount), select multiple segments and adjust them together.
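
If you work with an exported caption file instead of the editor, the same uniform correction can be applied in a few lines. This is only a sketch under stated assumptions: the Caption type and shift_captions function are hypothetical, times are in seconds, and nothing here reflects how the editor itself is implemented.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Caption:
    """Hypothetical caption segment; not a Neolli data type."""
    start: float  # seconds
    end: float    # seconds
    text: str


def shift_captions(captions, offset_seconds):
    """Return new captions with every segment moved by the same offset.

    A negative offset moves captions earlier. Times are clamped so that
    no segment starts before 0 or ends before it starts.
    """
    shifted = []
    for cap in captions:
        new_start = max(0.0, cap.start + offset_seconds)
        new_end = max(new_start, cap.end + offset_seconds)
        shifted.append(replace(cap, start=new_start, end=new_end))
    return shifted


captions = [Caption(1.0, 2.5, "Hello"), Caption(3.0, 4.0, "world")]
# Captions appear half a second late, so move them 0.5 s earlier.
for cap in shift_captions(captions, -0.5):
    print(cap.start, cap.end, cap.text)
```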

Hold timing shortcuts to accelerate — the editor ramps up to 10× speed the longer you hold the key.
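
To give a feel for this kind of hold-to-accelerate behavior, here is an illustrative sketch. Only the 10× ceiling comes from the text above. The linear ramp, the 2-second ramp duration, and the nudge_speed name are assumptions made for the example, and the editor's actual curve may differ.

```python
def nudge_speed(held_seconds, ramp_seconds=2.0, max_multiplier=10.0):
    """Return a speed multiplier for a key held for `held_seconds`.

    Assumed behavior: start at 1x, rise linearly, and cap at
    `max_multiplier` once the key has been held for `ramp_seconds`.
    """
    if held_seconds <= 0:
        return 1.0
    progress = min(held_seconds / ramp_seconds, 1.0)
    return 1.0 + (max_multiplier - 1.0) * progress


for held in (0.0, 1.0, 5.0):
    print(held, nudge_speed(held))
```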

Videos with primarily music content (no speech) will fail dubbing with a music_detected error. This is expected — voice dubbing requires spoken content to clone and re-synthesize. Credits are automatically refunded.
