Google released Gemini 3.1 Flash TTS on April 15, its latest text-to-speech model and the newest addition to the Gemini 3.1 family. It’s built for developers, enterprises, and Workspace users who want more control over how AI-generated speech sounds.

Google isn’t the only company pushing hard on voice AI right now. ElevenLabs, OpenAI, and Inworld have all shipped major updates in recent months, and the leaderboard rankings have been shifting. Gemini 3.1 Flash TTS enters that race with a strong benchmark score, new speech controls, and wide language support.

Here are a few things you should know about the model:

1. It Ranks Number Two on the Global AI Voice Leaderboard

The Artificial Analysis TTS leaderboard, which ranks models based on blind human listening tests across thousands of comparisons, currently places Gemini 3.1 Flash TTS at number two globally with an Elo score of 1,211. Inworld TTS 1.5 Max holds the top spot at 1,215, and ElevenLabs Eleven v3 sits third at 1,179.
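
To put those gaps in perspective, here is a rough sketch that converts the quoted Elo scores into expected head-to-head preference rates. It assumes the standard logistic Elo formula with a 400-point scale; the article does not say exactly how Artificial Analysis fits its scores, so treat the output as an approximation rather than the leaderboard's own numbers. The function name `elo_expected_score` is made up for this example.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of blind comparisons that A wins against B.

    Assumes the standard logistic Elo model with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Elo scores as quoted in the article.
ratings = {
    "Inworld TTS 1.5 Max": 1215,
    "Gemini 3.1 Flash TTS": 1211,
    "ElevenLabs Eleven v3": 1179,
}

gemini = ratings["Gemini 3.1 Flash TTS"]
vs_inworld = elo_expected_score(gemini, ratings["Inworld TTS 1.5 Max"])
vs_eleven = elo_expected_score(gemini, ratings["ElevenLabs Eleven v3"])

print(f"Gemini vs Inworld:    {vs_inworld:.3f}")  # 0.494
print(f"Gemini vs ElevenLabs: {vs_eleven:.3f}")   # 0.546
```

Under that assumption, the four-point gap to first place is close to a coin flip, while the 32-point lead over third place works out to listeners preferring Gemini in roughly 55 of every 100 comparisons.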

Artificial Analysis also placed Gemini 3.1 Flash TTS in its "most attractive quadrant," a designation for models that balance high speech quality with low generation cost.

2. Audio Tags Give You Over 200 Ways to Control How It Speaks

Gemini 3.1 Flash TTS introduces audio tags, a system that lets you embed natural language commands directly into your text to steer vocal delivery. There are over 200 tags available, covering pacing, tone, accent, and expression. Tags such as [whispers], [happy], or [cautious] go in square brackets at the exact point in the script where the shift should happen. Two tags should never sit directly next to each other; put some text between them.

The tags are written in English only, but they work alongside text in other languages, so a French script can carry English emotional instructions embedded within it.
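
The placement rules can be checked before a script is sent to the model. The sketch below is a plain Python linter, not part of any Google SDK: it finds bracketed tags and flags any pair with nothing but whitespace between them. The helper name `find_adjacent_tags` and the sample scripts are made up for this example, and it assumes that any non-whitespace text between two tags is enough to separate them.

```python
import re

# Anything inside a single pair of square brackets is treated as an audio tag.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_adjacent_tags(script: str) -> list[tuple[str, str]]:
    """Return pairs of consecutive tags that have no text between them."""
    problems = []
    matches = list(TAG_PATTERN.finditer(script))
    for first, second in zip(matches, matches[1:]):
        between = script[first.end():second.start()]
        if not between.strip():  # only whitespace (or nothing) separates the tags
            problems.append((first.group(0), second.group(0)))
    return problems


good = "[whispers] I think someone is here. [cautious] Stay close to me."
bad = "[happy] [whispers] We won the game!"
# English tags embedded in a French script, as the article describes.
french = "[happy] Bonjour tout le monde ! [whispers] C'est un secret."

print(find_adjacent_tags(good))    # []
print(find_adjacent_tags(bad))     # [('[happy]', '[whispers]')]
print(find_adjacent_tags(french))  # []
```

A check like this only covers tag placement. It does not verify that a tag is one of the 200-plus the model recognizes, since the article does not list them.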

3. Two Speakers in One Go, No Workarounds Needed

Most voice tools require a separate API call for each speaker, which often makes dialogue sound choppy and unnatural. Gemini 3.1 Flash TTS handles multiple speakers natively, keeping a more natural conversational flow across turns.

Google AI Studio supports this with a director-style interface where you can define the scene, cast characters using individual audio profiles, and set notes for each speaker's pace, tone, and accent. When everything sounds right, the whole setup exports as ready-to-use API code.
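
As a rough illustration of what a single multi-speaker request might look like, the sketch below assembles one request body for a two-person dialogue instead of one per speaker. The JSON field names are modeled on the request shape Google has published for earlier Gemini TTS models; whether Gemini 3.1 Flash TTS keeps exactly this shape is an assumption, as are the voice names. The speaker names, style note, and helper functions are hypothetical. The code only builds the payload and sends nothing, because the article gives no endpoint or model ID.

```python
import json


def build_transcript(turns: list[tuple[str, str]]) -> str:
    """Join (speaker, line) turns into a single 'Speaker: line' transcript."""
    return "\n".join(f"{speaker}: {line}" for speaker, line in turns)


def build_request(turns: list[tuple[str, str]], voices: dict[str, str],
                  style_note: str = "") -> dict:
    """Build one request body that covers every speaker in the dialogue.

    Field names are an assumption based on earlier Gemini TTS request shapes.
    """
    missing = {speaker for speaker, _ in turns} - set(voices)
    if missing:
        raise ValueError(f"No voice assigned for: {sorted(missing)}")

    prompt = build_transcript(turns)
    if style_note:
        prompt = f"{style_note}\n\n{prompt}"

    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {
                    "speakerVoiceConfigs": [
                        {
                            "speaker": speaker,
                            "voiceConfig": {
                                "prebuiltVoiceConfig": {"voiceName": voice}
                            },
                        }
                        for speaker, voice in voices.items()
                    ]
                }
            },
        },
    }


# Hypothetical two-speaker scene with audio tags placed inline.
turns = [
    ("Maya", "[happy] Did you hear the launch went well?"),
    ("Leo", "[cautious] I did, but let's wait for the numbers."),
]
voices = {"Maya": "Kore", "Leo": "Puck"}  # assumed voice names

payload = build_request(turns, voices, style_note="A relaxed office conversation.")
print(json.dumps(payload, indent=2))
```

The point of the design is that both speakers travel in one transcript with one voice mapping, which is what lets the model keep turn-to-turn flow. The code that Google AI Studio exports should be treated as the authoritative version of this request.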

4. It Covers 70-Plus Languages With Regional Accent Options

The model supports more than 70 languages with regional accent options built in. In English alone, you can choose from American Valley, Southern, Transatlantic, and several British accents, including Brixton and RP.

For Google Workspace users, the Google Vids integration brings 30 new conversational voice options across 24 languages. Sixteen of those languages are new, including Arabic, Hindi, Bengali, Turkish, Vietnamese, and Ukrainian; they join the previously supported English, Spanish, Portuguese, Japanese, Korean, French, Italian, and German.

5. Every Clip It Generates Has an Invisible Watermark

All audio from Gemini 3.1 Flash TTS is watermarked using SynthID. The watermark is woven directly into the audio so that it is imperceptible to listeners but detectable when checked. Its purpose is to identify AI-generated content and help prevent misinformation.

The watermark doesn’t degrade the audio quality on the listener's end.

6. It Is Live and Open to Test Today

Gemini 3.1 Flash TTS is available in preview through the Gemini API and Google AI Studio for developers, on Vertex AI for enterprise teams, and inside Google Vids for Workspace users, all rolling out from April 15.

Google AI Studio has a dedicated audio playground for testing the controls. It covers voice style instructions, the audio tag prompting framework, and directing expression and pacing across use cases including accessibility, audiobooks, and enterprise applications. There is no waitlist; access is open now.
