Build Your Own AI Confessional: How to Add a Voice to the LLM | HackerNoon

While OpenAI is delaying the release of the advanced Voice Modes for ChatGPT, I want to share how we built our LLM voice application and integrated it into an interactive booth.

Talk to the AI in the jungle

At the end of February, Bali hosted the Lampu festival, arranged according to the principles of the famous Burning Man. According to its tradition, participants create their own installations and art objects.

My friends from Camp 19:19 and I, inspired by the idea of Catholic confessionals and the capabilities of the current LLMs, came up with the idea of building our own AI confessional, where anyone could talk to an artificial intelligence.

Here’s how we envisioned it at the very beginning:

  • When the user enters a booth, we determine that we need to start a new session.
  • The user asks a question, and the AI listens and answers. We wanted to create a trusting and private environment where everyone could openly discuss their thoughts and experiences.
  • When the user leaves the room, the system ends the session and forgets all the conversation details. This is necessary to keep all dialogs private.

Proof of Concept

To test the concept and start experimenting with a prompt for the LLM, I created a naive implementation in one evening:

  • Listen to a microphone.
  • Recognize user speech using the Speech-to-Text (STT) model.
  • Generate a response via LLM.
  • Synthesize a voice response using the Text-to-Speech (TTS) model.
  • Play back the response to the user.

To implement this demo, I relied entirely on the cloud models from OpenAI: Whisper, GPT-4, and TTS. Thanks to the excellent speech_recognition library, I built the demo in just a few dozen lines of code.

import os
import asyncio
from dotenv import load_dotenv
from io import BytesIO
from openai import AsyncOpenAI
from soundfile import SoundFile
import sounddevice as sd
import speech_recognition as sr


load_dotenv()

aiclient = AsyncOpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)

SYSTEM_PROMPT = """
  You are a helpful assistant.
"""

async def listen_mic(recognizer: sr.Recognizer, microphone: sr.Microphone):
    audio_data = recognizer.listen(microphone)
    wav_data = BytesIO(audio_data.get_wav_data())
    wav_data.name = "SpeechRecognition_audio.wav"
    return wav_data


async def say(text: str):
    res = await aiclient.audio.speech.create(
        model="tts-1",
        voice="alloy",
        response_format="opus",
        input=text
    )
    buffer = BytesIO()
    for chunk in res.iter_bytes(chunk_size=4096):
        buffer.write(chunk)
    buffer.seek(0)
    with SoundFile(buffer, 'r') as sound_file:
        data = sound_file.read(dtype='int16')
        sd.play(data, sound_file.samplerate)
        sd.wait()


async def respond(text: str, history):
    history.append({"role": "user", "content": text})
    completion = await aiclient.chat.completions.create(
        model="gpt-4",
        temperature=0.5,
        messages=history,
    )
    response = completion.choices[0].message.content
    await say(response)
    history.append({"role": "assistant", "content": response})


async def main() -> None:
    m = sr.Microphone()
    r = sr.Recognizer()
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    with m as source:
        r.adjust_for_ambient_noise(source)
        while True:
            wav_data = await listen_mic(r, source)
            transcript = await aiclient.audio.transcriptions.create(
                model="whisper-1",
                temperature=0.5,
                file=wav_data,
                response_format="verbose_json",
            )
            # Skip empty transcripts (silence or unrecognized audio).
            if not transcript.text:
                continue
            await respond(transcript.text, messages)

if __name__ == '__main__':
    asyncio.run(main())

The problems we had to solve immediately became apparent after the first tests of this demo:

  • Response delay. In a naive implementation, the delay between the user's question and the response is 7-8 seconds or longer. This is not good, but there are obviously many ways to optimize the response time.
  • Ambient noise. We discovered that in noisy environments, we cannot rely on the microphone to detect when a user has started and finished speaking automatically. Recognizing the start and end of a phrase (endpointing) is a non-trivial task. Couple this with the noisy environment of a music festival, and it’s clear that a conceptually different approach is needed.
  • Mimic live conversation. We wanted to give the user the ability to interrupt the AI. To achieve this, we would have to keep the microphone on. But in this case, we would have to separate the user’s voice not only from the background sounds but also from the AI’s voice.
  • Feedback. Because of the response delay, it sometimes seemed to us that the system was frozen. We realized that we needed to show the user that their request was still being processed.

For each of these problems, we had a choice: look for an engineering solution or a product solution.

Thinking Through the UX of the Booth

Before we even got to code, we had to decide how the user would interact with the booth:

  • How to detect a new user in the booth so that we can reset the past dialog history.
  • How to recognize the beginning and end of a user’s speech, and what to do if they want to interrupt the AI.
  • How to give feedback while the user is waiting for a delayed response from the AI.

To detect a new user in the booth, we considered several options: door opening sensors, floor weight sensors, distance sensors, and a camera + YOLO model. The distance sensor behind the back seemed to us the most reliable, as it excluded accidental triggers, such as when the door is not closed tightly enough, and did not require complicated installation, unlike the weight sensor.
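
Raw distance readings tend to jitter, so the "accidental triggers" mentioned above are usually handled with two thresholds and a short run of consistent readings. The following is a minimal sketch of that idea, not the code we ran in the booth: the class name, the thresholds in centimeters, and the number of readings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PresenceDetector:
    """Turns noisy distance readings into stable 'entered' / 'left' events."""

    enter_cm: float = 60.0    # closer than this suggests someone is inside (assumed value)
    leave_cm: float = 90.0    # farther than this suggests the booth is empty (assumed value)
    stable_readings: int = 5  # consecutive readings required before switching state
    occupied: bool = False
    _streak: int = 0

    def update(self, distance_cm: float) -> Optional[str]:
        """Feed one reading; return 'entered', 'left', or None if nothing changed."""
        if self.occupied:
            contradicts = distance_cm > self.leave_cm
        else:
            contradicts = distance_cm < self.enter_cm
        # Count only uninterrupted runs of readings that contradict the current state.
        self._streak = self._streak + 1 if contradicts else 0
        if self._streak < self.stable_readings:
            return None
        self.occupied = not self.occupied
        self._streak = 0
        return "entered" if self.occupied else "left"
```

The gap between the two thresholds keeps the state from flapping when a reading hovers around a single cutoff, and the streak counter ignores one-off spikes.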

To avoid the challenge of recognizing the beginning and end of the user’s speech, we decided to add a big red button to control the microphone. This solution also allowed the user to interrupt the AI at any moment.

We had many different ideas about implementing feedback on processing a request. We decided on an option with a screen that shows what the system is doing: listening to the microphone, processing a question, or answering.

We also considered a rather smart option with an old landline phone. The session would start when the user picked up the phone, and the system would listen to the user until he hung up. However, we decided it is more authentic when the user is “answered” by the booth rather than by a voice from the phone.

During installation and at the festival

In the end, the final user flow came out like this:

  • A user walks into a booth. A distance sensor behind their back triggers, and we greet them.
  • The user presses a red button to start a dialog. We listen to the microphone while the button is pressed. When the user releases the button, we begin processing the request and indicate it on the screen.
  • If the user wants to ask a new question while the AI is answering, they can press the button again, and the AI will immediately stop answering.
  • When the user leaves the booth, the distance sensor triggers again, and we clear the dialog history.
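
The flow above can be read as a small state machine. Here is a hedged sketch of it in Python: the state names follow the five states mentioned later in the article, while the event names, the `Booth` class, and the choice to allow interruption during the thinking state as well are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"            # booth is empty
    WAITING = "waiting"      # user is inside, button not pressed
    LISTENING = "listening"  # button is held, microphone is recording
    THINKING = "thinking"    # request is being processed
    SPEAKING = "speaking"    # answer is being played back


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "user_entered"): State.WAITING,
    (State.WAITING, "button_pressed"): State.LISTENING,
    (State.LISTENING, "button_released"): State.THINKING,
    (State.THINKING, "answer_started"): State.SPEAKING,
    (State.SPEAKING, "answer_finished"): State.WAITING,
    # Pressing the button while the AI is busy interrupts it (assumed for THINKING).
    (State.THINKING, "button_pressed"): State.LISTENING,
    (State.SPEAKING, "button_pressed"): State.LISTENING,
}


class Booth:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.history = []  # dialog messages of the current session

    def handle(self, event: str) -> State:
        """Apply one event; unknown or out-of-place events leave the state unchanged."""
        if event == "user_left":
            # Leaving always ends the session and forgets the conversation.
            self.history.clear()
            self.state = State.IDLE
        else:
            self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

Keeping the transitions in one table makes it easy to see that a stray event, such as a button release in an empty booth, is simply ignored.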

Architecture

An Arduino monitors the state of the distance sensor and the red button. It sends all changes to our backend via an HTTP API, which allows the system to determine whether the user has entered or left the booth and whether it is necessary to start listening to the microphone or start generating a response.
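
To make the idea concrete, here is a minimal sketch of such an endpoint using only the Python standard library. It is not our actual backend: the `/event` route, the JSON payload shape, the event names, and the port are all assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical event names the microcontroller could send.
KNOWN_EVENTS = {"user_entered", "user_left", "button_pressed", "button_released"}


def parse_event(body: bytes):
    """Return the event name from a JSON body like {"event": "..."}, or None if invalid."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return None
    event = payload.get("event") if isinstance(payload, dict) else None
    if isinstance(event, str) and event in KNOWN_EVENTS:
        return event
    return None


class EventHandler(BaseHTTPRequestHandler):
    # Replace with the real callback, e.g. a state-machine method.
    on_event = staticmethod(print)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = parse_event(self.rfile.read(length))
        if self.path != "/event" or event is None:
            self.send_response(400)
        else:
            self.on_event(event)
            self.send_response(204)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Block forever, accepting events from any device on the local network."""
    HTTPServer(("0.0.0.0", port), EventHandler).serve_forever()
```

Validating events against a fixed set keeps a misbehaving sketch on the microcontroller from pushing the backend into an unknown state.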

The web UI is just a web page opened in a browser that continuously receives the system’s current state from the backend and displays it to the user.

The backend controls the microphone, interacts with all necessary AI models, and voices the LLM responses. It contains the app’s core logic.

Hardware

How to code a sketch for Arduino, properly connect the distance sensor and the button, and assemble it all in the booth is a topic for a separate article. Let’s briefly review what we got without going into technical details.

We used an Arduino-compatible board, more precisely an ESP32 with a built-in Wi-Fi module. The microcontroller was connected to the same Wi-Fi network as the laptop that ran the backend.

Complete list of hardware we used:

Backend

The main components of the pipeline are Speech-To-Text (STT), LLM, and Text-To-Speech (TTS). For each task, many different models are available both locally and via the cloud.

Since we didn’t have a powerful GPU on hand, we decided to opt for cloud-based versions of the models. The weakness of this approach is the need for a good internet connection. Nevertheless, the interaction speed after all optimizations was acceptable, even with the mobile Internet we had at the festival.

Now, let’s take a closer look at each component of the pipeline.

Speech Recognition

Many modern devices have long supported speech recognition. For example, the Apple Speech API is available for iOS and macOS, and the Web Speech API is available in browsers.

Unfortunately, their quality is noticeably inferior to that of Whisper or Deepgram, and they cannot automatically detect the language.

To reduce processing time, the best option is to recognize speech in real time as the user speaks. Here are some projects with examples of how to implement it: whisper_streaming and whisper.cpp.

With our laptop, the speed of speech recognition using this approach turned out to be far from real-time. After several experiments, we decided on the cloud-based Whisper model from OpenAI.

LLM and Prompt Engineering

The result of the Speech To Text model from the previous step is the text we send to the LLM with the dialog history.

When choosing an LLM, we compared GPT-3.5, GPT-4, and Claude. It turned out that the key factor was not so much the specific model as its configuration. Ultimately, we settled on GPT-4, whose answers we liked more than the others.

Prompt customization for LLMs has become a separate art form, and there are many guides on the Internet on how to tune a model to your needs.

We had to experiment extensively with the prompt and temperature settings to make the model respond engagingly, concisely, and humorously.

Text-To-Speech

We voice the response received from the LLM using the Text-To-Speech model and play it back to the user. This step was the primary source of delays in our demo.

LLMs take quite a long time to respond. However, they can generate the response in streaming mode, token by token. We can use this feature to reduce the waiting time by voicing individual phrases as they are received, without waiting for a complete response from the LLM.

Voicing individual sentences

  • Make a query to the LLM.
  • Accumulate the response in a buffer token by token until it contains a complete sentence of a minimum length. The minimum length parameter is significant because it affects both the intonation of the voicing and the initial delay.
  • Send the accumulated sentence to the TTS model, and play the result to the user. At this step, it is necessary to ensure that there is no race condition in the playback order.
  • Repeat the previous two steps until the end of the LLM response.
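
The buffering step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the project's code: `SentenceBuffer`, the regular expression for sentence endings, and the default minimum length of 40 characters are all hypothetical.

```python
import re

# A sentence ends with . ! or ? (optionally followed by a closing quote or bracket)
# and then whitespace. Requiring whitespace avoids splitting numbers such as "3.5".
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s')


class SentenceBuffer:
    def __init__(self, min_length: int = 40):
        self.min_length = min_length  # assumed value; trades latency against intonation
        self.pending = ""             # text not yet handed to TTS
        self.text = ""                # full response so far

    def push(self, delta: str) -> None:
        """Append one streamed token."""
        self.pending += delta
        self.text += delta

    def pop_sentence(self, flush: bool = False):
        """Return the next chunk to voice, or None if nothing is ready yet.

        With flush=True (end of stream), return whatever is left in the buffer.
        """
        for match in SENTENCE_END.finditer(self.pending):
            end = match.end()
            if end >= self.min_length:
                chunk, self.pending = self.pending[:end], self.pending[end:]
                return chunk.strip()
        if flush and self.pending.strip():
            chunk, self.pending = self.pending, ""
            return chunk.strip()
        return None
```

Short sentences such as "Hi." are merged with the following one until the chunk reaches the minimum length, which gives the TTS model enough context for natural intonation.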

We use the time while the user listens to the initial fragment to hide the delay in processing the remaining parts of the response from the LLM. Thanks to this approach, the response delay occurs only at the beginning and is ~3 seconds.

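  async generateResponse(history) {
    // Request a streamed completion so that tokens arrive as they are generated.
    const completion = await this.ai.completion(history);
    
    const chunks = new DialogChunks();
    for await (const chunk of completion) {
      const delta = chunk.choices[0]?.delta?.content;
      if (delta) {
        chunks.push(delta);
        // Voice each sentence as soon as it is complete. The call is not awaited,
        // so the stream keeps being consumed during TTS and playback.
        if (chunks.hasCompleteSentence()) {
          const sentence = chunks.popSentence();
          this.voice.ttsAndPlay(sentence);
        }
      }
    }
    // Voice whatever is left in the buffer after the stream ends.
    const sentence = chunks.popSentence();
    if (sentence) {
      this.voice.say(sentence);
    }
    // Return the full text of the response.
    return chunks.text;
  }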
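
Because the snippet above does not await `ttsAndPlay`, several TTS requests can be in flight at once, and a short sentence may finish synthesizing before a longer one that came earlier. One common way to avoid the resulting race is to queue the TTS tasks in arrival order and play them one by one. The following Python sketch shows that pattern with `asyncio`; the function name and the `synthesize` and `play` callables are hypothetical stand-ins, not the project's API.

```python
import asyncio


async def speak_in_order(sentences, synthesize, play):
    """Synthesize sentences concurrently, but play them strictly in arrival order.

    sentences:  async iterator of strings
    synthesize: async callable, text -> audio
    play:       async callable that returns once playback has finished
    """
    queue: asyncio.Queue = asyncio.Queue()

    async def producer():
        try:
            async for sentence in sentences:
                # Start TTS immediately; the queue remembers the original order.
                queue.put_nowait(asyncio.create_task(synthesize(sentence)))
        finally:
            queue.put_nowait(None)  # end-of-stream marker, sent even on errors

    producer_task = asyncio.create_task(producer())
    try:
        while (tts_task := await queue.get()) is not None:
            # Waiting on the oldest task first preserves the playback order.
            await play(await tts_task)
        await producer_task  # surface any error raised while reading sentences
    finally:
        # On interruption (cancellation) or error, stop everything still in flight.
        producer_task.cancel()
        while not queue.empty():
            pending = queue.get_nowait()
            if pending is not None:
                pending.cancel()
```

Cancelling the task that runs `speak_in_order` is then enough to implement the "press the button to interrupt" behavior: the `finally` block discards the sentences that have not been played yet.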

Final Touches

Even with all our optimizations, a 3-4 second delay is still significant. We decided to add visual feedback so that the user would not feel that the system had hung. We looked at several approaches:

  • LED indicators. We needed to display five states: idle, waiting, listening, thinking, and speaking. But we couldn’t figure out how to do it with LEDs in a way that was easy to understand.
  • Filler words, such as “Let me think,” “Hmm,” and so on, that mimic real-life speech. We rejected this option because fillers often did not match the tone of the model’s responses.
  • A screen in the booth that displays the different states with animations.

We settled on the last option with a simple web page that polls the backend and shows animations according to the current state.
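
On the backend side, polling needs little more than an endpoint that returns the current state. Below is a minimal sketch with the Python standard library; the `/state` route, the `StateStore` class, and the JSON shape are assumptions for illustration, not the project's actual API.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler


class StateStore:
    """Thread-safe holder for the state shown on the booth screen."""

    def __init__(self, initial: str = "idle"):
        self._lock = threading.Lock()
        self._state = initial

    def set(self, state: str) -> None:
        with self._lock:
            self._state = state

    def to_json(self) -> bytes:
        with self._lock:
            return json.dumps({"state": self._state}).encode()


STORE = StateStore()


class StateHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/state":
            self.send_error(404)
            return
        body = STORE.to_json()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Prevent the browser from showing a stale, cached state.
        self.send_header("Cache-Control", "no-store")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

The page can then request `/state` a few times per second and switch animations whenever the returned value changes.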

The Results

Our AI confession room ran for four days and attracted hundreds of attendees. We spent just around $50 on OpenAI APIs. In return, we received substantial positive feedback and valuable impressions.

This small experiment showed that it is possible to add an intuitive and efficient voice interface to an LLM even with limited resources and challenging external conditions.

By the way, the backend sources are available on GitHub.