# TTS — Help

Generate speech from text in three languages (English, German, French) with two operating modes, batch synthesis, voice cloning, and studio-grade post-processing.

This page covers everything you can do from the `/tts` panel on omni-demo. For HTTP integration examples (Next.js / NestJS / Python), see [`ddx-cuda-live-tts/TTS_API_USAGE.md`](../../../ddx-cuda-live-tts/TTS_API_USAGE.md).

---

## Realtime vs HQ

The TTS panel has two synthesis modes, selected from the segmented control at the top.

**Realtime** streams audio over a WebSocket as it's generated. First sound plays in ~300 ms (`TTFB`); visemes arrive in sliding-window batches you can size with the **Viseme window** slider. Use Realtime for live demos, voice agents, or anywhere you need the mouth to start moving before the sentence finishes.

**HQ** posts the whole text in one HTTP request, waits for synthesis to complete, then plays a single decoded audio blob with word-level alignment. Latency is higher (full utterance time), but you get loudness normalization, container/bit-depth choices, head/tail silence, deterministic seeds, and click-to-seek word chips. Use HQ for podcast cuts, voice-over exports, or anything you'll save to disk.

| | Realtime | HQ |
|---|---|---|
| Transport | `WS /v1/speak` | `POST /v1/synthesize/hq` |
| TTFB | ~300 ms | full-utterance |
| Visemes | sliding window (slider) | terminal, word-aligned |
| Studio settings | hidden | visible |
| Word chips | no | yes (click to seek) |

Mode switching is blocked while audio is playing; press **Stop** first.

---

## Studio settings

When you switch to **HQ** mode, a collapsible **Studio settings** panel appears below the main controls. It's hidden in Realtime because those parameters don't apply to a streaming WS contract.

Open it to set loudness target, container format, bit depth, and head/tail silence. The values you choose ride along with every HQ request until you change them — they're not saved across page reloads (yet).

Example: for a -16 LUFS, 24-bit FLAC export with 250 ms head silence, open Studio settings and set Loudness `EBU R128`, Target LUFS `-16`, Container `flac`, Bits `24-bit`, Head silence `250`.
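For programmatic clients, the same export might map onto an HQ request body roughly as sketched below. The keys `normalize_loudness`, `target_lufs`, `bits_per_sample`, `head_silence_ms`, and `tail_silence_ms` are the names listed in the ETag section of this page; the `format` key and the `"ebu_r128"` value spelling are assumptions, so check `TTS_API_ENDPOINTS.md` for the actual schema.

```python
import json


def studio_payload(text: str) -> dict:
    """Build a hypothetical HQ request body for a -16 LUFS, 24-bit FLAC export."""
    return {
        "text": text,
        "normalize_loudness": "ebu_r128",  # assumed enum spelling
        "target_lufs": -16,
        "format": "flac",                  # assumed key name for the container
        "bits_per_sample": 24,
        "head_silence_ms": 250,
        "tail_silence_ms": 0,
    }


# Serialized body, ready to send as the JSON payload of an HQ request.
body = json.dumps(studio_payload("Welcome to Dudoxx Omni."))
```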

---

## Loudness normalization

Three options:

- **None** — raw engine output. Levels vary by voice and prompt.
- **Peak** — ffmpeg single-pass peak normalize to −1 dBFS. Fast, prevents clipping, ignores Target LUFS.
- **EBU R128** — single-pass `loudnorm` to your chosen target LUFS with `TP=-1.5`, `LRA=11`. Slower (a few hundred ms extra), but the level you target is the level you get.

Pick **EBU R128** for anything that will be mixed with music or other voice tracks; pick **Peak** for fast batch exports where you just need a consistent ceiling.

Target LUFS is only shown when EBU R128 is selected. Allowed values: `-23` (EBU R128 broadcast spec), `-16` (podcast loud), `-14` (common streaming-platform target, e.g. Spotify and YouTube).
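The EBU R128 option can be pictured as an ffmpeg `loudnorm` filter string. This is a minimal sketch, assuming the server passes the target as `I` alongside the fixed `TP` and `LRA` values quoted above; `loudnorm_filter` is a hypothetical helper, not part of the API.

```python
ALLOWED_TARGETS = (-23, -16, -14)  # the three Target LUFS values the panel offers


def loudnorm_filter(target_lufs: int) -> str:
    """Return an ffmpeg audio-filter string for single-pass loudnorm."""
    if target_lufs not in ALLOWED_TARGETS:
        raise ValueError(f"target_lufs must be one of {ALLOWED_TARGETS}")
    # I = integrated loudness target, TP = true-peak ceiling, LRA = loudness range
    return f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11"
```

The resulting string is what you would hand to ffmpeg's `-af` option if you wanted to reproduce the normalization offline.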

---

## Container format

Four output containers, all written from the same 24 kHz PCM master:

| Format | Bits | Use case |
|---|---|---|
| `mp3` | 16 only | smallest file; lossy; max compatibility |
| `wav` | 16 / 24 | lossless; large; archival |
| `flac` | 16 / 24 | lossless compressed; ~50% of WAV |
| `opus` | 16 only | lossy; smallest files at speech quality; ideal for streaming |

The Studio panel **Container** dropdown chooses one. If you pick `mp3` or `opus`, the bit-depth selector is constrained to 16 by the server.
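The constraint in the table can be mirrored client-side before a request is sent. The sketch below is an illustration of that rule only; `SUPPORTED_BITS` and `effective_bits` are hypothetical names, and the server remains the authority on what it accepts.

```python
# Bit depths per container, copied from the table above.
SUPPORTED_BITS = {
    "mp3": (16,),
    "wav": (16, 24),
    "flac": (16, 24),
    "opus": (16,),
}


def effective_bits(container: str, requested: int) -> int:
    """Return the requested bit depth if the container allows it, else 16."""
    allowed = SUPPORTED_BITS[container]  # KeyError for an unknown container
    return requested if requested in allowed else allowed[0]
```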

---

## Bit depth & sample rate

Bit depth sets the dynamic range available to each PCM sample:

- **16-bit** — ~96 dB dynamic range; CD-quality. Default. Use for speech, demos, web playback.
- **24-bit** — ~144 dB dynamic range; mastering grade. Use when you'll mix with other tracks, apply EQ, or hand off to a producer.

Sample rate is fixed at **24 kHz** for the engine and downsampled / upsampled by ffmpeg on the way out. The realtime WS contract negotiates `sample_rate` separately (16000 / 22050 / 24000 / 48000) — see [Realtime vs HQ](#realtime-vs-hq).
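The 96 dB and 144 dB figures above follow from the usual rule of thumb of about 6.02 dB per bit, i.e. `20 * log10(2 ** bits)`. A quick check (the function name is just for illustration):

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of linear PCM at the given bit depth."""
    return 20 * math.log10(2 ** bits)


for bits in (16, 24):
    print(bits, round(dynamic_range_db(bits), 1))
```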

---

## Head & tail silence

Two number inputs (`0`–`5000` ms, step `50`) pad pure silence at the start and end of the rendered audio:

- **Head silence** — useful before a voice-over so the editor has a frame of pre-roll, or to stop a hard click at playback start.
- **Tail silence** — appends quiet after the last phoneme so the utterance does not end abruptly. ~200 ms is usually enough; 1000 ms feels like a hand-off pause.

Both default to `0`. Server enforces the 5000 ms upper bound.
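To make the padding concrete, here is a minimal sketch of what head and tail silence mean at the sample level. It assumes mono 16-bit PCM at the 24 kHz master rate; the server's actual implementation is not documented here and may well differ (for example, by using ffmpeg).

```python
SAMPLE_RATE = 24_000      # Hz, the master rate quoted on this page
MAX_SILENCE_MS = 5_000    # upper bound enforced by the server
BYTES_PER_SAMPLE = 2      # assumption: mono 16-bit PCM


def pad_silence(pcm16: bytes, head_ms: int = 0, tail_ms: int = 0) -> bytes:
    """Prepend and append digital silence to a raw PCM-16 mono buffer."""
    for ms in (head_ms, tail_ms):
        if not 0 <= ms <= MAX_SILENCE_MS:
            raise ValueError("silence must be between 0 and 5000 ms")

    def silence(ms: int) -> bytes:
        samples = SAMPLE_RATE * ms // 1000
        return b"\x00" * (samples * BYTES_PER_SAMPLE)

    return silence(head_ms) + pcm16 + silence(tail_ms)
```

At 24 kHz, 250 ms of head silence is 6000 samples, or 12000 bytes of zeros in this layout.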

---

## SSML — prosody & phoneme

The **SSML lite** toggle (gear icon in Controls) lets you embed two tag families inline:

```xml
<prosody rate="slow" pitch="-2st">Please listen carefully.</prosody>
<phoneme alphabet="ipa" ph="ˈdjuːdɒks">Dudoxx</phoneme>
```

Supported `rate` values: `x-slow | slow | medium | fast | x-fast` (or any positive decimal, e.g. `0.85`).
Supported `pitch` values: `±Nst` (semitones, range `-12st`…`+12st`), `x-low | low | medium | high | x-high`.
Phoneme `alphabet` accepts `ipa` only; `ph` is the IPA string spoken in place of the child text.

When SSML lite is OFF, tags are stripped and the text inside is spoken verbatim — safe default for user-pasted content.
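The OFF behaviour can be approximated with a small regex that drops the two supported tag families and keeps the child text. This is a sketch for pre-cleaning text in your own client; `strip_ssml_lite` is a hypothetical helper and the server's stripping rules may be broader.

```python
import re

# Matches opening and closing <prosody ...> and <phoneme ...> tags only.
TAG_RE = re.compile(r"</?(?:prosody|phoneme)\b[^>]*>")


def strip_ssml_lite(text: str) -> str:
    """Remove prosody/phoneme tags, leaving the enclosed text untouched."""
    return TAG_RE.sub("", text)
```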

---

## Word alignment & visemes

Every HQ response carries an `alignment[]` of `{word, start, end}` triples plus a `visemes[]` array of `{viseme, start, duration}` frames. The panel uses both:

- **Word chips** render under the player; the chip whose `[start, end]` brackets `<audio>.currentTime` is highlighted. Click a chip to seek the player to that word's `start`.
- **VisemeFace** updates its mouth shape against the same `<audio>.currentTime` in a single `requestAnimationFrame` loop, so there is no buffering and playback starts immediately.
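The word-chip highlight boils down to an interval lookup. The sketch below shows one way to do it with a binary search, assuming `alignment[]` is sorted by `start`, that words do not overlap, and that times are in seconds (the unit is an assumption); the panel's own implementation lives in the frontend and is not reproduced here.

```python
from bisect import bisect_right
from typing import Optional


def active_word_index(alignment: list, current_time: float) -> Optional[int]:
    """Index of the word whose [start, end] contains current_time, else None."""
    starts = [w["start"] for w in alignment]
    i = bisect_right(starts, current_time) - 1  # last word starting at or before now
    if i >= 0 and current_time <= alignment[i]["end"]:
        return i
    return None  # before the first word, or in a gap between words


alignment = [
    {"word": "Talk", "start": 0.0, "end": 0.4},
    {"word": "soon", "start": 0.5, "end": 0.9},
]
```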

In Realtime mode visemes stream in sliding windows (default 2 s, range `0.5`–`5.0`). Lower window = tighter lip-sync, higher TTFB; higher window = more audio per batch, looser sync. Toggle **Emit visemes** off entirely if you only need audio (saves bandwidth + aligner CPU on the server).

---

## Batch synthesis

The **Batch** panel synthesizes up to 32 items in one HTTP round-trip via `POST /v1/synthesize/batch`.

Two input modes:

1. **Pipe** (default) — one item per line, `id|text`:
   ```
   intro|Welcome to Dudoxx Omni.
   pitch|We turn raw audio into structured records.
   close|Talk soon.
   ```
   If you omit the `id|`, a random id is generated.

2. **JSON** — paste an array of `{id, text}` objects:
   ```json
   [
     {"id": "intro", "text": "Welcome to Dudoxx Omni."},
     {"id": "pitch", "text": "We turn raw audio into structured records."}
   ]
   ```

Defaults at the top (voice, language, speed, loudness, target LUFS, bits, format) apply to every item — the per-request body is built server-side from your defaults plus each item's text.

Results render in input order: one `<audio>` + **Download** link per success, a red badge with `{error.code}: {error.message}` per failure. Blob URLs are revoked when you submit again or leave the page. Over 32 items shows a warning and disables Synthesize.
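If you prepare batches outside the panel, the Pipe format is easy to parse yourself. The following is a sketch under stated assumptions: it splits on the first `|`, generates a random id with `uuid4` when none is given, and rejects more than 32 items. `parse_pipe_input` is a hypothetical helper, and the panel may treat edge cases (such as a `|` inside id-less text) differently.

```python
import uuid

MAX_ITEMS = 32  # batch limit quoted above


def parse_pipe_input(raw: str) -> list:
    """Turn 'id|text' lines into a list of {id, text} dicts."""
    items = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item_id, sep, text = line.partition("|")
        if not sep:
            # No pipe on the line: the whole line is text, so invent an id.
            item_id, text = uuid.uuid4().hex, line
        items.append({"id": item_id.strip(), "text": text.strip()})
    if len(items) > MAX_ITEMS:
        raise ValueError(f"batch is limited to {MAX_ITEMS} items")
    return items
```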

---

## Voice cloning (style + tier)

Two cloning slots are wired through the request body:

- **`ref_audio_b64`** — the primary voice prompt (3–10 s, 16 kHz+ mono). The engine matches timbre + cadence.
- **`style_ref_audio_b64`** — optional secondary clip whose *style* (emotion, intensity, prosody pattern) is transferred onto the primary voice.

Two tuning knobs:

- **`clone_strength`** (`0.0`–`1.0`, default `0.7`) — how strictly the clone tracks the prompt. Lower = more engine personality; higher = closer mimicry.
- **`clone_steps`** (`8`–`32`, default `16`) — ICL steps. More steps = sharper match at the cost of TTFB.

The web UI uses the engine defaults; programmatic clients can pass both fields in the request body. See [`TTS_API_ENDPOINTS.md`](../../../ddx-cuda-live-tts/TTS_API_ENDPOINTS.md) for the field schemas.
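For programmatic clients, the cloning fields could be assembled along these lines. The four field names and their ranges come from this section; the helper name `clone_fields` and the choice to omit the style slot when unused are assumptions, so confirm the schema in `TTS_API_ENDPOINTS.md` before relying on it.

```python
import base64
from typing import Optional


def clone_fields(
    ref_audio: bytes,
    style_audio: Optional[bytes] = None,
    clone_strength: float = 0.7,
    clone_steps: int = 16,
) -> dict:
    """Build the cloning portion of a request body from raw audio bytes."""
    if not 0.0 <= clone_strength <= 1.0:
        raise ValueError("clone_strength must be within 0.0 to 1.0")
    if not 8 <= clone_steps <= 32:
        raise ValueError("clone_steps must be within 8 to 32")
    fields = {
        "ref_audio_b64": base64.b64encode(ref_audio).decode("ascii"),
        "clone_strength": clone_strength,
        "clone_steps": clone_steps,
    }
    if style_audio is not None:
        fields["style_ref_audio_b64"] = base64.b64encode(style_audio).decode("ascii")
    return fields
```

Merge the returned dict into the rest of your request body (text, voice, and so on).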

---

## Voice catalogue (31 voices, 3 languages)

The full live catalogue is served at `GET /v1/voices`. ddx-web fetches it on every TTS / Translator page load (`cache: 'no-store'`), so new voices appear automatically without a deploy.

**English** (10): `ddx_bella` *F*, `ddx_heart` *F*, `ddx_adam` *M* (DDX clones), plus `en_eleanor`, `en_charlotte`, `en_victoria` *F* and `en_william`, `en_george`, `en_arthur` *M* (LibriVox-sourced 2026-05-17).

**German** (11): `de_katharina` *F*, `de_frenz`, `de_hans`, `de_karlsson`, `de_hokuspokus` *M* (legacy), plus `de_alice_anna`, `de_alice_maria`, `de_alice_klara` *F* and `de_grimm_max`, `de_grimm_otto`, `de_grimm_kurt` *M* (LibriVox-sourced 2026-05-17).

**French** (10): `fr_sonia`, `fr_ezwa`, `fr_nadine` *F*, `fr_jean` *M* (legacy), plus `fr_camille`, `fr_juliette`, `fr_margot` *F* and `fr_jules`, `fr_louis`, `fr_henri` *M* (LibriVox-sourced 2026-05-17).

**Open-license voices** (the `*_alice_*` and `*_grimm_*` families, `en_william/george/arthur/eleanor/charlotte/victoria`, and `fr_jules/louis/henri/camille/juliette/margot`) are derived from **public-domain LibriVox recordings**, preprocessed with a fixed pipeline (high-pass 80 Hz → spectral denoise → loudness-normalize to -16 LUFS → 20 s trim → 24 kHz mono PCM-16). Each voice ships a `voices/<id>.<lang>.manifest.json` next to the WAV with the source URL, license, and processing parameters. License: **public domain** (LibriVox). For attribution in commercial use, the manifest's `license_url` field points to the source Internet Archive item.

The `VoiceSettingsPanel` side sheet groups voices by language and shows engine / gender / accent metadata. Click **Set default** to pin one to `sessionStorage['ddx-tts-voice-pref']`. The Translator board (`ddx-web/src/components/translator/`) and Batch panel both read the same catalogue.

---

## Spell-out tokens

The text normalizer expands numerals, units, and dates into their spoken form before synthesis. If a token MUST be spelled letter-by-letter (e.g. an acronym you don't want pronounced as a word, or a serial number), list it in the `spell_out` request field:

```jsonc
{
  "text": "Your case number is BFG-9000.",
  "spell_out": ["BFG-9000"]
}
```

The normalizer then emits `B F G dash nine zero zero zero` to the engine instead of the literal token. Multiple entries are matched case-insensitively in order.
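The expansion can be illustrated with a few lines of Python. This is a sketch of the behaviour described above for English only, not the normalizer's real code; the `dash` mapping and the helper names are assumptions made for the example.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
SYMBOLS = {"-": "dash"}  # assumed spoken form for the hyphen


def spell_token(token: str) -> str:
    """Spell a token character by character."""
    parts = []
    for ch in token:
        if ch in "0123456789":
            parts.append(DIGITS[int(ch)])
        elif ch.isalpha():
            parts.append(ch.upper())
        else:
            parts.append(SYMBOLS.get(ch, ch))
    return " ".join(parts)


def apply_spell_out(text: str, spell_out: list) -> str:
    """Replace each listed token, matched case-insensitively and in order."""
    for token in spell_out:
        pattern = re.compile(re.escape(token), flags=re.IGNORECASE)
        text = pattern.sub(lambda m: spell_token(m.group(0)), text)
    return text
```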

---

## Usage receipts

Every HQ response carries a `usage` block:

```json
{
  "characters": 142,
  "audio_seconds": 9.83,
  "engine": "qwen3-12hz-0.6b",
  "gpu_seconds": 2.4
}
```

The panel shows a **Usage** card next to the player after each synthesis. `gpu_seconds` is hidden when the backend doesn't report it (MLX does not — CUDA does). Use the receipt to:

- Predict cost (`characters` for prompt-billed plans, `audio_seconds` for runtime-billed).
- Detect a stuck engine (`gpu_seconds` >> `audio_seconds` means the model is thrashing).
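- Capacity-plan (`audio_seconds / wall_seconds` = throughput as a multiple of real time, the inverse of the usual real-time factor; `wall_seconds` is not in the receipt and has to be measured by the caller).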
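The last two checks are simple ratios over the receipt. In the sketch below, `wall_seconds` is assumed to be timed by the caller, and the factor of 5 used to interpret ">>" is an arbitrary illustrative threshold, not a documented one.

```python
def realtime_ratio(usage: dict, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    return usage["audio_seconds"] / wall_seconds


def looks_stuck(usage: dict, factor: float = 5.0) -> bool:
    """True when gpu_seconds dwarfs audio_seconds (threshold is an assumption)."""
    gpu = usage.get("gpu_seconds")  # absent on backends that do not report it
    return gpu is not None and gpu > factor * usage["audio_seconds"]


usage = {"characters": 142, "audio_seconds": 9.83,
         "engine": "qwen3-12hz-0.6b", "gpu_seconds": 2.4}
```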

---

## ETag cache (advanced)

The HQ and `/v1/render` endpoints honor HTTP cache validation. Server hashes `{text, voice, language, speed, seed, quality, normalize_loudness, target_lufs, bits_per_sample, lexicon_hash, ssml_lite, spell_out, head_silence_ms, tail_silence_ms, ref_audio_sha256, style_ref_audio_sha256}` into a strong ETag.

Subsequent requests sending `If-None-Match: "<etag>"` for the same parameters get a `304 Not Modified` with zero body — useful for:

- **Idempotent retries** — a flaky network won't double-bill.
- **Browser caching** — same prompt across two tabs reuses one synthesis.
- **CDN warm-up** — pre-fetch popular prompts; subsequent users hit the edge.

The web UI does not surface this directly; pass `If-None-Match` from your own client and you'll see the 304 in DevTools.
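To illustrate why identical parameters yield an identical validator, here is a sketch of deriving a strong ETag from a canonical serialization of the request parameters. The server's real hashing scheme is not documented on this page, so treat `request_etag` as a stand-in: in practice you would reuse the `ETag` header the server returns rather than compute your own.

```python
import hashlib
import json


def request_etag(params: dict) -> str:
    """Hash parameters in a key-order-independent way into a quoted ETag."""
    canonical = json.dumps(params, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f'"{digest}"'  # strong ETags are quoted strings


def conditional_headers(etag: str) -> dict:
    """Headers for a revalidation request; a match yields 304 with no body."""
    return {"If-None-Match": etag}
```

Sorting keys before hashing means `{"text": ..., "voice": ...}` and `{"voice": ..., "text": ...}` map to the same tag, while changing any single field (for example `seed`) produces a different one.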

---

## Reference

- Frontend component map: [`ddx-web/TTS_FRONTEND.md`](../../../ddx-web/TTS_FRONTEND.md)
- Backend wire contract: [`ddx-cuda-live-tts/TTS_API_ENDPOINTS.md`](../../../ddx-cuda-live-tts/TTS_API_ENDPOINTS.md)
- Integration recipes (Next.js / NestJS / Python): [`ddx-cuda-live-tts/TTS_API_USAGE.md`](../../../ddx-cuda-live-tts/TTS_API_USAGE.md)
- Frozen envelope schema: [`ddx-prd-specs/envelopes/schemas/tts-frame.schema.json`](../../../ddx-prd-specs/envelopes/schemas/tts-frame.schema.json)
