USMAN’S INSIGHTS
AI ARCHITECT
The Modality Trap: Why Your Agent Needs to Know WHEN to Speak



© 2026 Muhammad Usman Akbar. All rights reserved.


Give It a Voice

What You Will Learn

In this chapter, you will enable text-to-speech and discover why the hard part is not making the agent speak, but knowing when it should.

By the end, you should be able to enable a TTS provider, compare the four auto modes (always, inbound, tagged, off), and configure tagged mode where the agent decides whether to respond with voice or text.

James scrolled through the heartbeat log from Chapter 8. His agent had checked three items on the morning checklist, sent a summary to WhatsApp, and gone quiet with HEARTBEAT_OK. Everything worked. But the message sat there as text, and he was thinking about his old operations team.

"When I managed the warehouse crew," he said, "half of them never read the group chat messages. They listened to voice notes while driving forklifts."

Emma looked up. "Two config lines and one plugin enable. Sixty seconds."

"Sixty seconds? I spent three hours on heartbeats yesterday."

"Voice is easier than scheduling. The hard part is knowing when NOT to speak." She stood up. "Enable it. Make it say something. When I get back, tell me what annoyed you."


You are doing exactly what James is doing. Enable voice, hear every reply as audio, and discover for yourself what annoyed Emma's previous students enough to switch modes.

Your agent acts on its own schedule now, checking tasks every 30 minutes and delivering messages on cron. It does not wait for you. But every response is text. Now it learns to speak. By the end of this chapter, your agent will send voice notes on WhatsApp, and you will understand why letting the agent choose when to speak produces better results than forcing every reply into audio.

Three Providers, One Interface

OpenClaw bundles four text-to-speech providers behind a unified interface. All produce Opus-encoded OGG audio (48kHz, 64kbps), the exact format WhatsApp uses for push-to-talk voice messages. Your agent's replies appear as playable voice notes, not file attachments.

| Provider | Key Required | Quality | Cost | Voices |
|---|---|---|---|---|
| Microsoft Edge | No | Good | Free | 300+ neural |
| OpenAI TTS | Yes (OPENAI_API_KEY) | Excellent | ~$15/M chars | 6 |
| ElevenLabs | Yes (ELEVENLABS_API_KEY) | Premium | Free tier available | Thousands |
| MiniMax | Yes (MINIMAX_API_KEY) | Good | Free tier available | Multiple |

Microsoft Edge TTS uses the same backend as the Edge browser's "Read Aloud" feature. No API key. No signup. No cost. Start here.

One important limit: replies longer than 1,500 characters are either auto-summarized before synthesis or skipped entirely. If your agent writes long replies and you hear silence, the text exceeded this limit. You can adjust it with /tts limit 3000 on WhatsApp or check the current setting with /tts status.
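The summarize-or-skip behavior can be sketched as a simple gate. This is a hypothetical illustration, assuming the behavior described above; gate_for_tts and its arguments are not OpenClaw's actual API:

```python
# Hypothetical sketch of the TTS length gate described above; the names
# and exact behavior are illustrative, not OpenClaw internals.
def gate_for_tts(text: str, limit: int = 1500, summarize=None):
    """Return the text to synthesize, or None to skip voice entirely."""
    if len(text) <= limit:
        return text                 # under the limit: synthesize as-is
    if summarize is not None:
        return summarize(text)      # over the limit: summarize first
    return None                     # no summarizer configured: stay silent
```

In this sketch, raising the cap with /tts limit 3000 corresponds to passing limit=3000.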

Enable Voice Output

Three commands:

bash
openclaw config set messages.tts.auto always
openclaw config set messages.tts.provider microsoft
openclaw config set plugins.entries.microsoft.enabled true

Restart the gateway:

bash
openclaw gateway restart

Before sending a test message, verify the setup on WhatsApp:

text
/tts status

You should see: State: enabled, Provider: microsoft (configured). If the provider shows (not configured), the plugin did not load. Run openclaw gateway restart again and recheck.

Now send a message on WhatsApp. Your agent's reply arrives as a playable voice note.

If No Voice Note Arrives

With always mode, every reply must go through TTS conversion. If the TTS pipeline is not ready (provider not configured, gateway still starting up, WhatsApp reconnecting), replies are silently dropped. Check /tts status on WhatsApp first. If it shows the provider as configured but replies still do not arrive, check the gateway log at ~/.openclaw/logs/gateway.log for errors.

James stared at the voice note playing on his screen. "That's it? After the crash loop, the auth cache, the tool profiles, three hours on heartbeat configs... voice is three commands and a restart?"

"Sixty seconds," Emma said from the doorway. "I told you."

The Activation Dance

Every OpenClaw capability follows the same four steps:

  1. Bundled plugin exists (check: openclaw plugins list)
  2. Disabled by default (security: nothing auto-activates)
  3. Enable: openclaw config set plugins.entries.<id>.enabled true
  4. Configure the feature-specific settings

You first saw this pattern in Module 9.1, Chapter 2 (installation), and again in Chapter 6 (skills). The speech plugin follows the same dance. By Module 9.1, Chapter 13, you will write a plugin that other people activate through this same pattern.

Verify the plugin loaded:

bash
openclaw plugins list --verbose

Look for microsoft in the list with status loaded. If it shows disabled, the config entry was not picked up. Restart the gateway and check again.

Four TTS Modes

The messages.tts.auto setting controls who decides when the agent speaks:

| Mode | Who Decides | Behavior |
|---|---|---|
| off | Nobody | Text only (default) |
| always | Config | Every reply becomes a voice note |
| inbound | Customer | Voice reply only when the customer sends a voice message |
| tagged | The agent | TTS fires only when the model includes [[tts]] in the reply |
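The decision table above condenses into a few lines of logic. This is an illustrative sketch, not a real OpenClaw function; should_speak and its parameters are assumptions:

```python
# Hypothetical sketch of the four-mode decision table above.
def should_speak(mode: str, inbound_was_voice: bool, reply: str) -> bool:
    """Decide whether a reply should be synthesized to a voice note."""
    if mode == "off":
        return False                  # text only (default)
    if mode == "always":
        return True                   # every reply becomes audio
    if mode == "inbound":
        return inbound_was_voice      # match the customer's modality
    if mode == "tagged":
        return "[[tts]]" in reply     # the model opts in per reply
    raise ValueError(f"unknown tts mode: {mode}")
```

Note who holds the decision in each branch: the config, the customer, or the model itself.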

Why always Gets Annoying Fast

With always mode, every single reply is audio. A one-word confirmation ("Done.") becomes a voice note. A list of five items becomes a voice note. A booking confirmation with a reference number the customer needs to copy becomes a voice note. The customer cannot copy text from audio.

always mode proves the pipeline. It is not a production setting.

Why inbound Is the Smart Production Default

In inbound mode, the agent matches the customer's modality. If the customer sends text, the agent replies with text. If the customer sends a voice note, the agent replies with a voice note. No SOUL.md configuration needed. The gateway handles it automatically.

Switch to Inbound Mode

bash
openclaw config set messages.tts.auto inbound

No gateway restart needed for this change. The gateway applies messages.tts config dynamically.

Send a text message. You get text back. Send a voice note. You get a voice note back. The agent adapts to whatever the customer prefers.

About Tagged Mode

A fourth mode, tagged, lets the agent decide when to speak by including [[tts]] tags in its replies. In theory, this is the most flexible option: voice for descriptions, text for confirmations. In practice, the [[tts]] tags often appear as literal text in the chat instead of triggering synthesis. Until this is resolved, inbound is the reliable production choice.
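If you experiment with tagged mode anyway, one workaround is to strip the marker yourself before the text reaches the chat. A minimal middleware-style sketch; extract_tts_tag is hypothetical, and OpenClaw's gateway may handle this differently:

```python
# Hypothetical workaround for [[tts]] markers leaking into chat as
# literal text: detect the tag, then remove it before sending.
def extract_tts_tag(reply: str) -> tuple[str, bool]:
    """Return (clean_text, wants_voice) for an agent reply."""
    wants_voice = "[[tts]]" in reply
    clean = reply.replace("[[tts]]", "").strip()
    return clean, wants_voice
```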

Upgrading to OpenAI TTS

Microsoft Edge proves the pipeline. For production voice quality, switch to OpenAI with one config change:

bash
openclaw config set messages.tts.provider openai

The OpenAI provider supports an instructions field for voice character:

json
{
  "messages": {
    "tts": {
      "auto": "tagged",
      "provider": "openai",
      "providers": {
        "openai": {
          "model": "gpt-4o-mini-tts",
          "voice": "coral",
          "instructions": "Speak in a warm, professional tone"
        }
      }
    }
  }
}

At roughly $0.015 per 1,000 characters, a typical message costs less than a tenth of a cent to voice.
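A quick back-of-envelope check of that claim, assuming the ~$0.015 per 1,000 characters rate quoted above:

```python
# Assumed rate from the paragraph above: ~$0.015 per 1,000 characters.
RATE_PER_CHAR = 0.015 / 1000

def tts_cost_usd(text: str) -> float:
    """Estimated voice-synthesis cost for one reply, in US dollars."""
    return len(text) * RATE_PER_CHAR
```

A 60-character confirmation works out to about $0.0009, comfortably under a tenth of a cent.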

The Modality Design Principle

Voice and text are not interchangeable. Each has strengths:

| Voice Works Best For | Text Works Best For |
|---|---|
| Descriptions, summaries | Reference numbers, links, code |
| Emotional, persuasive content | Lists the customer needs to copy |
| Hands-busy users (driving) | Search-friendly content |
| Long-form explanations | Short confirmations |

The inbound mode handles this automatically by matching the customer's modality: if they send a voice note, reply with a voice note. If they type, reply with text. This is the safe production default.

tagged mode goes further. The agent evaluates its own response and decides whether voice or text serves the content better. The agent becomes the UX designer.
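The kind of rules a SOUL.md file encodes can be expressed as a heuristic. The sketch below is purely illustrative (prefers_voice is not an OpenClaw API), and in tagged mode the judgment is made by the model, not by a regex:

```python
import re

# Illustrative modality heuristic, mirroring the table above: text for
# anything the customer may need to copy, voice for longer descriptions.
def prefers_voice(reply: str) -> bool:
    has_copyable = bool(re.search(r"https?://|\d{4,}|`", reply))
    if has_copyable:
        return False       # reference numbers, links, code -> text
    if len(reply) < 80:
        return False       # short confirmations -> text
    return True            # descriptions and explanations -> voice
```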

Try With AI

Exercise 1: Hear Your Agent Speak

Enable Microsoft Edge TTS with always mode if you have not already:

bash
openclaw config set messages.tts.auto always
openclaw config set messages.tts.provider microsoft
openclaw config set plugins.entries.microsoft.enabled true
openclaw gateway restart

Send any message on WhatsApp. Your agent's reply should arrive as a playable voice note.

What you are learning: The TTS pipeline converts text to Opus-encoded OGG and delivers it as a WhatsApp push-to-talk message. The channel adapter handles codec selection automatically. You configured three settings; the platform handled encoding, formatting, and delivery.

Exercise 2: Experience the Annoyance

With always mode still active, send these three messages in sequence:

text
1. Tell me about the benefits of AI agents for small businesses
2. OK
3. What is 2 + 2?

All three replies come as voice notes. Message 1 makes sense as audio. Messages 2 and 3 do not.

What you are learning: Blanket voice output degrades the user experience for short, functional replies. The right question is not "should the agent speak?" but "when should it speak?"

Exercise 3: Let the Agent Decide

Switch to tagged mode and add voice instructions to SOUL.md:

bash
openclaw config set messages.tts.auto tagged

Add to your SOUL.md:

text
## Voice Output Rules

Use [[tts]] at the end of your reply when giving descriptions, explanations, or detailed answers.
Use text only for confirmations, short answers, and anything containing numbers or links the user might need to copy.

Send the same three messages again. Does the agent choose voice for the description and text for the short answers?

What you are learning: tagged mode delegates modality decisions to the agent. The quality of those decisions depends on the instructions in SOUL.md and the capability of the underlying model. You are designing the agent's communication style, not just its knowledge.


What You Should Remember

Start Free, Upgrade Later

Microsoft Edge TTS is free, requires no API key, and produces good quality. Use it to prove the pipeline works. Upgrade to OpenAI TTS or ElevenLabs when voice quality matters for production.

Four Modes

  • always: every reply is audio (annoying for short confirmations).
  • inbound: match the customer's modality (safe production default).
  • tagged: the agent decides per message using [[tts]] markers (best UX, requires SOUL.md instructions).
  • off: text only.

The Agent as UX Designer

In tagged mode, the agent evaluates each response and chooses voice or text. Descriptions and explanations get voice. Confirmations, reference numbers, and links get text. The quality of this decision depends on the instructions in SOUL.md and the capability of the model.

The 1,500-Character Limit

Replies longer than 1,500 characters are auto-summarized or skipped. If you hear silence after a long reply, the text exceeded the limit. Adjust with /tts limit.


When Emma came back, James had his phone playing a voice note. The agent was describing a property listing in a warm, measured tone. The previous three messages in the chat were text: "Done.", "Confirmed.", and a booking reference number.

"You switched to tagged mode," Emma said. It was not a question.

"I lasted about four messages on always mode before I wanted to throw the phone." He held it up. "Every reply was audio. Even 'OK.' That is not useful."

"So what did you change?"

"Added rules to SOUL.md. Descriptions get voice. Confirmations get text. The agent picks." He paused. "It is basically the same thing I did with the warehouse crew. Voice notes for updates, text messages for part numbers."

Emma nodded slowly. "The agent is the UX designer now. Not the config file." She glanced at the caution block earlier in the chapter. "Tagged mode on secondary agents is where I am least confident. The caution in the docs is real."

James looked at the WhatsApp thread. Voice for descriptions, text for confirmations. One agent handling both. "What happens when the workload splits? Right now one agent does everything. Customer questions and my internal operations go through the same queue."

"Same problem as one receptionist handling walk-ins and phone calls at the same time," Emma said. "Module 9.1, Chapter 10. You add a second agent."