How Descript enables multilingual video dubbing at scale

Descript redesigned its video dubbing pipeline using OpenAI models to optimize simultaneously for semantic fidelity and duration adherence, significantly improving the natural pacing of dubbed speech.

Summary

Descript, an AI-native video editor, has significantly improved its multilingual video dubbing by redesigning its translation pipeline to address a critical weakness: poor duration adherence, which often made translated speech sound unnatural.

Previously, translations optimized for meaning alone often broke timing constraints, because the same content takes different amounts of speech in different languages (German, for example, is often 'longer' than English). This forced users into tedious manual adjustments. Descript's new approach uses OpenAI reasoning models, leveraging their improved consistency on tasks like syllable counting, to optimize for semantic fidelity and duration adherence simultaneously during generation rather than correcting timing afterward; a minimal sketch of the idea follows.
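The article does not show Descript's actual implementation, so the sketch below only illustrates the general technique under stated assumptions: it prompts a reasoning model to translate a segment within a syllable budget derived from the segment's duration and an assumed per-language speaking rate, then sanity-checks the output with a rough vowel-group syllable counter. The model name, the RATE values, the ±15% tolerance, and all helper names are hypothetical, not taken from the source.

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed average speaking rates (syllables/second); hypothetical values,
# since real rates vary by language, speaker, and register.
RATE = {"German": 5.5, "Spanish": 6.0}

def estimate_syllables(text: str) -> int:
    """Very rough estimate: count vowel groups in each word (minimum 1 per word)."""
    words = re.findall(r"[^\W\d_]+", text)
    return sum(max(1, len(re.findall(r"[aeiouyäöü]+", w.lower()))) for w in words)

def duration_constrained_translate(text: str, duration_s: float, language: str) -> str:
    """Ask the model for a translation that fits the source segment's timing."""
    budget = round(duration_s * RATE[language])
    resp = client.chat.completions.create(
        model="o4-mini",  # hypothetical choice; the article doesn't name the model
        messages=[{
            "role": "user",
            "content": (
                f"Translate the following English text into {language}. "
                f"It must be spoken in about {duration_s:.1f} seconds, so aim for "
                f"roughly {budget} syllables while preserving the meaning.\n\n{text}"
            ),
        }],
    )
    candidate = resp.choices[0].message.content.strip()
    # Flag candidates that miss the budget so a retry (with explicit feedback)
    # or a downstream TTS re-pacing step can handle them.
    if abs(estimate_syllables(candidate) - budget) > 0.15 * budget:
        print(f"warning: {estimate_syllables(candidate)} syllables vs budget {budget}")
    return candidate

print(duration_constrained_translate(
    "Thanks for watching, and see you next week.", 2.5, "German"))
```

The key design point, per the article, is that the timing constraint is part of the generation step itself rather than a post-hoc correction; the syllable check above merely catches the occasional miss.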

The results showed a 15% increase in translated video exports and a 13 to 43 percentage point improvement in duration adherence. Listening tests confirmed the gain: the share of segments falling within a natural pacing window rose from 40-60% to 73-83%. Descript is now building batch processing capabilities to enable large-scale localization, with future improvements focused on making the pipeline more multimodal so it better preserves nonverbal speech characteristics such as tone and emphasis.
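The article does not define how duration adherence or the pacing window were measured. One plausible way to operationalize the metric, assuming "adherence" means a dubbed segment's duration lands within a tolerance band around the source segment's (the ±10% default below is an assumption, not Descript's threshold):

```python
def within_pacing_window(src_s: float, dub_s: float, tol: float = 0.10) -> bool:
    """True if the dubbed duration is within ±tol of the source duration."""
    return abs(dub_s - src_s) <= tol * src_s

def adherence_rate(segments: list[tuple[float, float]], tol: float = 0.10) -> float:
    """Share of (source, dubbed) duration pairs that fall inside the window."""
    hits = sum(within_pacing_window(s, d, tol) for s, d in segments)
    return hits / len(segments)

# Example: 3 of 4 segments fit a ±10% window -> 0.75 adherence.
print(adherence_rate([(4.0, 4.2), (3.0, 3.1), (5.0, 6.0), (2.5, 2.4)]))
```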

(Source: OpenAI)