SSML (Speech Synthesis Markup Language)
An XML-based markup language for fine-grained control of text-to-speech (TTS) output: pauses, emphasis, pitch, rate, and phoneme pronunciation.
Full definition
Speech Synthesis Markup Language (SSML) is an XML-based W3C standard for controlling how a text-to-speech (TTS) engine renders text as audio. SSML lets developers and content creators specify pauses (<break>), emphasis (<emphasis>), pitch and rate (<prosody>), exact phoneme pronunciation (<phoneme>), and a switch to a named voice (<voice>). Major TTS providers, including ElevenLabs, Murf, PlayHT, Amazon Polly, and Google Cloud TTS, each support a subset of SSML rather than the full specification, and the supported tags differ from provider to provider. Tools like Murf abstract SSML behind visual controls (pause sliders, emphasis buttons); tools like ElevenLabs and PlayHT accept SSML tags directly in their APIs for developer use.
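Because each provider accepts only part of SSML, it can help to check a document against a provider's tag list before sending it. The sketch below is one way to do that with Python's standard library. The `SUPPORTED_TAGS` allow-list and the `unsupported_tags` helper are hypothetical names made up for illustration, not taken from any provider's documentation, so the real list should come from the provider's own reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list for illustration only.
# The real set of supported tags must come from the provider's documentation.
SUPPORTED_TAGS = {"speak", "break", "phoneme"}


def unsupported_tags(ssml: str, supported: set[str] = SUPPORTED_TAGS) -> list[str]:
    """Return the sorted names of SSML tags in `ssml` that are not in `supported`.

    Assumes the document declares no XML namespace; with an xmlns attribute,
    ElementTree reports tags as '{namespace}name' and they would not match.
    """
    root = ET.fromstring(ssml)  # raises ParseError if the SSML is not well-formed XML
    found = {element.tag for element in root.iter()}  # includes the root <speak>
    return sorted(found - supported)


sample = (
    "<speak>"
    "Hello.<break time='500ms'/>"
    "<prosody rate='slow' pitch='+2st'>This part is slower and higher.</prosody>"
    "<emphasis level='strong'>Thanks</emphasis> for listening."
    "</speak>"
)

print(unsupported_tags(sample))  # ['emphasis', 'prosody']
```

A check like this only reports tag names. Whether a provider honors a particular attribute value, such as a given `pitch` setting, still has to be confirmed against its documentation.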
Examples
- Adding <break time='500ms'/> to force a 500 ms pause before a key phrase.
- Wrapping a product name in <emphasis level='strong'> ... </emphasis> to stress it in a generated voiceover.
- Specifying <phoneme alphabet='ipa' ph='dɪˈskrɪpt'>Descript</phoneme> to override the default pronunciation.
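The three examples above can be combined into a single SSML document. The following is a minimal sketch that assembles one with Python's standard `xml.etree.ElementTree` module, so attribute quoting and escaping are handled by the library instead of string concatenation. The `build_ssml` function, the sentence text, and the product name "Acme Notes" are placeholders invented for illustration. Sending the result to a TTS API is left out because request formats differ by provider.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Build a small SSML document using <break>, <emphasis>, and <phoneme>."""
    speak = ET.Element("speak")
    speak.text = "Welcome back."

    # Force a 500 ms pause before the key phrase.
    pause = ET.SubElement(speak, "break", {"time": "500ms"})
    pause.tail = " Today we are looking at "

    # Stress the (placeholder) product name.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = "Acme Notes"
    emphasis.tail = ", recorded in "

    # Override the default pronunciation with an IPA transcription.
    phoneme = ET.SubElement(
        speak, "phoneme", {"alphabet": "ipa", "ph": "dɪˈskrɪpt"}
    )
    phoneme.text = "Descript"
    phoneme.tail = "."

    # encoding="unicode" returns a str and keeps the IPA characters readable.
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml()
print(ssml)
```

In ElementTree, text that follows a child element is stored on that child's `tail`, which is why the sentence fragments are attached to the preceding tag and not to `speak` itself.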