USMAN’S INSIGHTS
AI ARCHITECT
The Modality Trap: Why Your Agent Needs to Know WHEN to Speak



© 2026 Muhammad Usman Akbar. All rights reserved.


Give It a Voice

What You Will Learn

In this chapter, you will enable text-to-speech and discover why the hard part is not making the agent speak, but knowing when it should.

By the end, you should be able to enable a TTS provider, compare the four auto modes (always, inbound, tagged, off), and configure tagged mode where the agent decides whether to respond with voice or text.

James scrolled through the heartbeat log from Chapter 8. His agent had checked three items on the morning checklist, sent a summary to WhatsApp, and gone quiet with HEARTBEAT_OK. Everything worked. But the message sat there as text, and he was thinking about his old operations team.

"When I managed the warehouse crew," he said, "half of them never read the group chat messages. They listened to voice notes while driving forklifts."

Emma looked up. "Two config lines and one plugin enable. Sixty seconds."

"Sixty seconds? I spent three hours on heartbeats yesterday."

"Voice is easier than scheduling. The hard part is knowing when NOT to speak." She stood up. "Enable it. Make it say something. When I get back, tell me what annoyed you."


You are doing exactly what James is doing. Enable voice, hear every reply as audio, and discover for yourself what annoyed Emma's previous students enough to switch modes.

Your agent acts on its own schedule now, checking tasks every 30 minutes and delivering messages on cron. It does not wait for you. But every response is text. Now it learns to speak. By the end of this chapter, your agent will send voice notes on WhatsApp, and you will understand why letting the agent choose when to speak produces better results than forcing every reply into audio.

Three Providers, One Interface

OpenClaw bundles four text-to-speech providers behind a unified interface. All produce Opus-encoded OGG audio (48kHz, 64kbps), the exact format WhatsApp uses for push-to-talk voice messages. Your agent's replies appear as playable voice notes, not file attachments.

| Provider | Key Required | Quality | Cost | Voices |
|---|---|---|---|---|
| Microsoft Edge | No | Good | Free | 300+ neural |
| OpenAI TTS | Yes (OPENAI_API_KEY) | Excellent | ~$15/M chars | 6 |
| ElevenLabs | Yes (ELEVENLABS_API_KEY) | Premium | Free tier available | Thousands |
| MiniMax | Yes (MINIMAX_API_KEY) | Good | Free tier available | Multiple |

Microsoft Edge TTS uses the same backend as the Edge browser's "Read Aloud" feature. No API key. No signup. No cost. Start here.

One important limit: replies longer than 1,500 characters are either auto-summarized before synthesis or skipped entirely. If your agent writes long replies and you hear silence, the text exceeded this limit. You can adjust it with /tts limit 3000 on WhatsApp or check the current setting with /tts status.
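The summarize-or-skip behavior can be sketched as a simple gate. This is a hypothetical illustration, assuming the behavior described above; gate_for_tts and its arguments are not OpenClaw's actual API:

```python
# Hypothetical sketch of the TTS length gate described above; the names
# and exact behavior are illustrative, not OpenClaw internals.
def gate_for_tts(text: str, limit: int = 1500, summarize=None):
    """Return the text to synthesize, or None to skip voice entirely."""
    if len(text) <= limit:
        return text                 # under the limit: synthesize as-is
    if summarize is not None:
        return summarize(text)      # over the limit: summarize first
    return None                     # no summarizer configured: stay silent
```

In this sketch, raising the cap with /tts limit 3000 corresponds to passing limit=3000.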

Enable Voice Output

Three commands:

bash
openclaw config set messages.tts.auto always
openclaw config set messages.tts.provider microsoft
openclaw config set plugins.entries.microsoft.enabled true

Restart the gateway:

bash
openclaw gateway restart

Before sending a test message, verify the setup on WhatsApp:

text
/tts status

You should see: State: enabled, Provider: microsoft (configured). If the provider shows (not configured), the plugin did not load. Run openclaw gateway restart again and recheck.

Now send a message on WhatsApp. Your agent's reply arrives as a playable voice note.

If No Voice Note Arrives

With always mode, every reply must go through TTS conversion. If the TTS pipeline is not ready (provider not configured, gateway still starting up, WhatsApp reconnecting), replies are silently dropped. Check /tts status on WhatsApp first. If it shows the provider as configured but replies still do not arrive, check the gateway log at ~/.openclaw/logs/gateway.log for errors.

James stared at the voice note playing on his screen. "That's it? After the crash loop, the auth cache, the tool profiles, three hours on heartbeat configs... voice is three commands and a restart?"

"Sixty seconds," Emma said from the doorway. "I told you."

The Activation Dance

Every OpenClaw capability follows the same four steps:

  1. Bundled plugin exists (check: openclaw plugins list)
  2. Disabled by default (security: nothing auto-activates)
  3. Enable: openclaw config set plugins.entries.<id>.enabled true
  4. Configure the feature-specific settings

You first saw this pattern in Module 9.1, Chapter 2 (installation), and again in Chapter 6 (skills). The speech plugin follows the same dance. By Module 9.1, Chapter 13, you will write a plugin that other people activate through this same pattern.

Verify the plugin loaded:

bash
openclaw plugins list --verbose

Look for microsoft in the list with status loaded. If it shows disabled, the config entry was not picked up. Restart the gateway and check again.

Four TTS Modes

The messages.tts.auto setting controls who decides when the agent speaks:

| Mode | Who Decides | Behavior |
|---|---|---|
| off | Nobody | Text only (default) |
| always | Config | Every reply becomes a voice note |
| inbound | Customer | Voice reply only when the customer sends a voice message |
| tagged | The agent | TTS fires only when the model includes [[tts]] in the reply |
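The decision table above condenses into a few lines of logic. This is an illustrative sketch, not a real OpenClaw function; should_speak and its parameters are assumptions:

```python
# Hypothetical sketch of the four-mode decision table above.
def should_speak(mode: str, inbound_was_voice: bool, reply: str) -> bool:
    """Decide whether a reply should be synthesized to a voice note."""
    if mode == "off":
        return False                  # text only (default)
    if mode == "always":
        return True                   # every reply becomes audio
    if mode == "inbound":
        return inbound_was_voice      # match the customer's modality
    if mode == "tagged":
        return "[[tts]]" in reply     # the model opts in per reply
    raise ValueError(f"unknown tts mode: {mode}")
```

Note who holds the decision in each branch: the config, the customer, or the model itself.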

Why always Gets Annoying Fast

With always mode, every single reply is audio. A one-word confirmation ("Done.") becomes a voice note. A list of five items becomes a voice note. A booking confirmation with a reference number the customer needs to copy becomes a voice note. The customer cannot copy text from audio.

always mode proves the pipeline. It is not a production setting.

Why inbound Is the Smart Production Default

In inbound mode, the agent matches the customer's modality. If the customer sends text, the agent replies with text. If the customer sends a voice note, the agent replies with a voice note. No SOUL.md configuration needed. The gateway handles it automatically.

Switch to Inbound Mode

bash
openclaw config set messages.tts.auto inbound

No gateway restart needed for this change. The gateway applies messages.tts config dynamically.

Send a text message. You get text back. Send a voice note. You get a voice note back. The agent adapts to whatever the customer prefers.

About Tagged Mode

A fourth mode, tagged, lets the agent decide when to speak by including [[tts]] tags in its replies. In theory, this is the most flexible option: voice for descriptions, text for confirmations. In practice, the [[tts]] tags often appear as literal text in the chat instead of triggering synthesis. Until this is resolved, inbound is the reliable production choice.
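If you experiment with tagged mode anyway, one workaround is to strip the marker yourself before the text reaches the chat. A minimal middleware-style sketch; extract_tts_tag is hypothetical, and OpenClaw's gateway may handle this differently:

```python
# Hypothetical workaround for [[tts]] markers leaking into chat as
# literal text: detect the tag, then remove it before sending.
def extract_tts_tag(reply: str) -> tuple[str, bool]:
    """Return (clean_text, wants_voice) for an agent reply."""
    wants_voice = "[[tts]]" in reply
    clean = reply.replace("[[tts]]", "").strip()
    return clean, wants_voice
```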

Upgrading to OpenAI TTS

Microsoft Edge proves the pipeline. For production voice quality, switch to OpenAI with one config change:

bash
openclaw config set messages.tts.provider openai

The OpenAI provider supports an instructions field for voice character:

json
{
  "messages": {
    "tts": {
      "auto": "tagged",
      "provider": "openai",
      "providers": {
        "openai": {
          "model": "gpt-4o-mini-tts",
          "voice": "coral",
          "instructions": "Speak in a warm, professional tone"
        }
      }
    }
  }
}

At roughly $0.015 per 1,000 characters, a typical message costs less than a tenth of a cent to voice.
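A quick back-of-envelope check of that claim, assuming the ~$0.015 per 1,000 characters rate quoted above:

```python
# Assumed rate from the paragraph above: ~$0.015 per 1,000 characters.
RATE_PER_CHAR = 0.015 / 1000

def tts_cost_usd(text: str) -> float:
    """Estimated voice-synthesis cost for one reply, in US dollars."""
    return len(text) * RATE_PER_CHAR
```

A 60-character confirmation works out to about $0.0009, comfortably under a tenth of a cent.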

The Modality Design Principle

Voice and text are not interchangeable. Each has strengths:

| Voice Works Best For | Text Works Best For |
|---|---|
| Descriptions, summaries | Reference numbers, links, code |
| Emotional, persuasive content | Lists the customer needs to copy |
| Hands-busy users (driving) | Search-friendly content |
| Long-form explanations | Short confirmations |

The inbound mode handles this automatically by matching the customer's modality: if they send a voice note, reply with a voice note. If they type, reply with text. This is the safe production default.

tagged mode goes further. The agent evaluates its own response and decides whether voice or text serves the content better. The agent becomes the UX designer.
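The kind of rules a SOUL.md file encodes can be expressed as a heuristic. The sketch below is purely illustrative (prefers_voice is not an OpenClaw API), and in tagged mode the judgment is made by the model, not by a regex:

```python
import re

# Illustrative modality heuristic, mirroring the table above: text for
# anything the customer may need to copy, voice for longer descriptions.
def prefers_voice(reply: str) -> bool:
    has_copyable = bool(re.search(r"https?://|\d{4,}|`", reply))
    if has_copyable:
        return False       # reference numbers, links, code -> text
    if len(reply) < 80:
        return False       # short confirmations -> text
    return True            # descriptions and explanations -> voice
```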

Try With AI

Exercise 1: Hear Your Agent Speak

Enable Microsoft Edge TTS with always mode if you have not already:

bash
openclaw config set messages.tts.auto always
openclaw config set messages.tts.provider microsoft
openclaw config set plugins.entries.microsoft.enabled true
openclaw gateway restart

Send any message on WhatsApp. Your agent's reply should arrive as a playable voice note.

What you are learning: The TTS pipeline converts text to Opus-encoded OGG and delivers it as a WhatsApp push-to-talk message. The channel adapter handles codec selection automatically. You configured three settings; the platform handled encoding, formatting, and delivery.

Exercise 2: Experience the Annoyance

With always mode still active, send these three messages in sequence:

text
1. Tell me about the benefits of AI agents for small businesses
2. OK
3. What is 2 + 2?

All three replies come as voice notes. Message 1 makes sense as audio. Messages 2 and 3 do not.

What you are learning: Blanket voice output degrades the user experience for short, functional replies. The right question is not "should the agent speak?" but "when should it speak?"

Exercise 3: Let the Agent Decide

Switch to tagged mode and add voice instructions to SOUL.md:

bash
openclaw config set messages.tts.auto tagged

Add to your SOUL.md:

text
## Voice Output Rules

Use [[tts]] at the end of your reply when giving descriptions, explanations, or detailed answers.
Use text only for confirmations, short answers, and anything containing numbers or links the user might need to copy.

Send the same three messages again. Does the agent choose voice for the description and text for the short answers?

What you are learning: tagged mode delegates modality decisions to the agent. The quality of those decisions depends on the instructions in SOUL.md and the capability of the underlying model. You are designing the agent's communication style, not just its knowledge.


What You Should Remember

Start Free, Upgrade Later

Microsoft Edge TTS is free, requires no API key, and produces good quality. Use it to prove the pipeline works. Upgrade to OpenAI TTS or ElevenLabs when voice quality matters for production.

Four Modes

  • always: every reply is audio (annoying for short confirmations).
  • inbound: match the customer's modality (safe production default).
  • tagged: the agent decides per message using [[tts]] markers (best UX, requires SOUL.md instructions).
  • off: text only.

The Agent as UX Designer

In tagged mode, the agent evaluates each response and chooses voice or text. Descriptions and explanations get voice. Confirmations, reference numbers, and links get text. The quality of this decision depends on the instructions in SOUL.md and the capability of the model.

The 1,500-Character Limit

Replies longer than 1,500 characters are auto-summarized or skipped. If you hear silence after a long reply, the text exceeded the limit. Adjust with /tts limit.


When Emma came back, James had his phone playing a voice note. The agent was describing a property listing in a warm, measured tone. The previous three messages in the chat were text: "Done.", "Confirmed.", and a booking reference number.

"You switched to tagged mode," Emma said. It was not a question.

"I lasted about four messages on always mode before I wanted to throw the phone." He held it up. "Every reply was audio. Even 'OK.' That is not useful."

"So what did you change?"

"Added rules to SOUL.md. Descriptions get voice. Confirmations get text. The agent picks." He paused. "It is basically the same thing I did with the warehouse crew. Voice notes for updates, text messages for part numbers."

Emma nodded slowly. "The agent is the UX designer now. Not the config file." She glanced at the caution block earlier in the chapter. "Tagged mode on secondary agents is where I am least confident. The caution in the docs is real."

James looked at the WhatsApp thread. Voice for descriptions, text for confirmations. One agent handling both. "What happens when the workload splits? Right now one agent does everything. Customer questions and my internal operations go through the same queue."

"Same problem as one receptionist handling walk-ins and phone calls at the same time," Emma said. "Module 9.1, Chapter 10. You add a second agent."