Lux holds live, two-way voice conversations over WebRTC using OpenAI's Realtime API. When she responds, her voice comes from a custom TTS pipeline with four providers in priority order, from a GPU-accelerated cloud synthesizer down to the browser's built-in speech. All of it is configurable per mode.
Lux connects to OpenAI's Realtime API via WebRTC — the same technology that powers browser-based video calls. You speak, she listens, she responds with her voice. The exchange is live and bidirectional, with voice activity detection handling turn-taking automatically.
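For a sense of what that connection involves in the browser, here is a minimal sketch using the standard WebRTC APIs. It is not Lux's actual code. The function name connectRealtime, the sdpUrl and ephemeralToken parameters, and the "events" data channel label are all assumptions for illustration. In practice the SDP endpoint and a short-lived token would come from your own backend and the Realtime API documentation.

```typescript
// Sketch only: sdpUrl and ephemeralToken are hypothetical inputs that your
// own backend would supply. Runs in a browser, not in Node.
async function connectRealtime(
  sdpUrl: string,
  ephemeralToken: string,
): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Play whatever audio track the remote side sends back.
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;
  pc.ontrack = (event) => {
    audioEl.srcObject = event.streams[0] ?? null;
  };

  // Send the microphone upstream.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  for (const track of mic.getAudioTracks()) {
    pc.addTrack(track, mic);
  }

  // Data channel for JSON events; the label here is an assumption.
  // It must be created before the offer so it is included in the SDP.
  pc.createDataChannel("events");

  // Standard offer/answer exchange: send our SDP, apply the answer.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const response = await fetch(sdpUrl, {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${ephemeralToken}`,
      "Content-Type": "application/sdp",
    },
  });
  if (!response.ok) {
    throw new Error(`SDP exchange failed with status ${response.status}`);
  }

  await pc.setRemoteDescription({ type: "answer", sdp: await response.text() });
  return pc;
}
```

Keeping the endpoint and token as parameters means the long-lived API key never has to reach the browser.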
Supported Realtime voices: alloy · echo · shimmer · marin · cedar. Default: marin. A passive listening mode lets her hear the room without interrupting — she stays aware without responding unless addressed.
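The voice list, the default, and passive listening can be sketched as a small session-config builder. This is an illustration, not Lux's implementation. The helper names resolveVoice and buildSessionUpdate are hypothetical, and the payload field names (voice, turn_detection, create_response) follow one version of the Realtime API's session.update event, so treat them as assumptions and check them against the API version you target. How Lux actually implements passive listening is not stated here; disabling automatic responses is one plausible approach.

```typescript
// The five voices and the default come from the list above.
const REALTIME_VOICES = ["alloy", "echo", "shimmer", "marin", "cedar"] as const;
type RealtimeVoice = (typeof REALTIME_VOICES)[number];
const DEFAULT_VOICE: RealtimeVoice = "marin";

// Fall back to the default when the requested voice is missing or unsupported.
function resolveVoice(requested?: string): RealtimeVoice {
  const match = REALTIME_VOICES.find((voice) => voice === requested);
  return match ?? DEFAULT_VOICE;
}

// Assumed payload shape; field names may differ between API versions.
interface SessionUpdate {
  type: "session.update";
  session: {
    voice: RealtimeVoice;
    turn_detection: { type: "server_vad"; create_response: boolean };
  };
}

function buildSessionUpdate(requestedVoice?: string, passive = false): SessionUpdate {
  return {
    type: "session.update",
    session: {
      voice: resolveVoice(requestedVoice),
      // Passive mode: keep detecting speech, but do not auto-generate a reply.
      turn_detection: { type: "server_vad", create_response: !passive },
    },
  };
}
```

Calling buildSessionUpdate("cedar") would select cedar with automatic replies, while buildSessionUpdate(undefined, true) would keep the default voice and stay quiet until the application explicitly requests a response.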
Lux's text-to-speech isn't hardwired to one service. The TTS Router picks the best available provider — highest quality first, browser TTS as the last-resort fallback.
First priority in the TTS chain. Chatterbox Turbo runs on a RunPod GPU instance for fast, high-quality synthesis. It supports custom voice cloning via a voice_slug reference, and it is the closest thing to a natural-sounding custom voice at speed.
Second in the fallback chain. Coqui runs self-hosted: no per-call API cost, no external dependency. It supports both built-in speaker IDs and custom voice references. If Coqui is unavailable, the router falls back to ElevenLabs.
Third in the chain. ElevenLabs provides premium voice synthesis via API when the self-hosted options aren't available. It carries a per-call cost, but the output quality is exceptional for a fallback.
The Web Speech API is the final fallback: near-zero latency, zero cost, no server required. Quality is lower than the other providers, but Lux always has a voice, even when every external service is unreachable.
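The fallback order described above can be sketched as a simple loop: try each provider in priority order and return the first result that succeeds. This is a minimal illustration under stated assumptions, not the actual TTS Router. The TtsProvider interface, the orderChain and speak functions, and the string "audio handle" return value are all hypothetical.

```typescript
type ProviderName = "chatterbox" | "coqui" | "elevenlabs" | "browser";

// Hypothetical provider interface: synthesize resolves to some audio handle.
interface TtsProvider {
  name: ProviderName;
  synthesize(text: string): Promise<string>;
}

// Priority order taken from the chain above, highest quality first.
const PRIORITY: ProviderName[] = ["chatterbox", "coqui", "elevenlabs", "browser"];

// Sort a copy of the provider list into priority order.
function orderChain(providers: TtsProvider[]): TtsProvider[] {
  return [...providers].sort(
    (a, b) => PRIORITY.indexOf(a.name) - PRIORITY.indexOf(b.name),
  );
}

// Try each provider in turn; the first success wins.
async function speak(
  text: string,
  providers: TtsProvider[],
): Promise<{ provider: ProviderName; audio: string }> {
  const errors: string[] = [];
  for (const provider of orderChain(providers)) {
    try {
      const audio = await provider.synthesize(text);
      return { provider: provider.name, audio };
    } catch (err) {
      // Record the failure and move on to the next provider in the chain.
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${provider.name}: ${reason}`);
    }
  }
  throw new Error(`All TTS providers failed: ${errors.join("; ")}`);
}
```

Per-mode configuration could then be as simple as passing a different provider list for each mode, since the loop only tries the providers it is given.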
When voice input is active, the system prompt gets an additional injection: "The user is speaking to you via voice. Respond conversationally. Keep responses concise and flowing." Lux adapts her response style for spoken delivery.
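As a rough sketch, the injection amounts to appending one instruction to the base prompt when voice is on. The function name buildSystemPrompt and the blank-line separator are assumptions for illustration; only the instruction text itself comes from the description above.

```typescript
// Instruction text quoted from the description above.
const VOICE_MODE_INSTRUCTION =
  "The user is speaking to you via voice. Respond conversationally. Keep responses concise and flowing.";

// Hypothetical helper: append the voice instruction only when voice input is active.
function buildSystemPrompt(basePrompt: string, voiceInputActive: boolean): string {
  if (!voiceInputActive) {
    return basePrompt;
  }
  return `${basePrompt}\n\n${VOICE_MODE_INSTRUCTION}`;
}
```

Text-only sessions get the base prompt unchanged, so the spoken-delivery guidance never leaks into typed conversations.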
A separate phone companion interface (starkx-phone-companion.js) provides a focused voice interaction mode — designed for situations where the full chat UI isn't needed and voice is the primary interface.
We built this entire voice pipeline from scratch. We can build the right version for your product.