Say Hello to ‘Her’: Real-Time AI Voice Agents with 500ms Latency, Now Open Source

Agent Issue
5 min readAug 17, 2024

Voice Mode is hands down one of the coolest features in ChatGPT, right?

OpenAI didn’t just slap some voices together — they went all out, working with top-notch casting and directing pros.

They sifted through over 400 voice submissions before finally landing on the 5 voices you hear in ChatGPT today.

Just imagine all the people involved — voice actors, talent agencies, casting directors, industry advisors. It’s wild!

There was even some drama along the way, like with the voice of Sky, which had to be paused because it sounded a little too much like Scarlett Johansson.

Now, here’s where it gets really exciting…

Imagine having that kind of speech-to-speech magic, but with less than 500ms latency, running locally on your own GPU.

Sounds like a dream, right?

Well, it’s just open-source nowadays.

Today, I’m going to walk you through how you can set up your very own modular speech-to-speech pipeline, which is built with a series of consecutive parts:

  1. Voice Activity Detection (VAD): Pipeline is using Silero VAD v5. It’s the gatekeeper that decides when someone is actually speaking.
  2. Speech to Text (STT): This is where Whisper comes in. Whisper models are known for their accuracy, and you can even use distilled versions for faster…

--

--

Agent Issue
Agent Issue

Written by Agent Issue

Your front-row seat to the future of Agents.

Responses (12)