TCP protocol
The avatar accepts a single TCP connection on port 4500 (override with -AvatarPort=N on the game command line). A second concurrent connection is rejected — the backend is expected to drop and reconnect, not multiplex.
Frame format
Every frame on the wire has the same shape:
| Offset | Size | Field |
|---|---|---|
| 0 | 1 byte | Type |
| 1 | 4 bytes LE | Payload length N |
| 5 → 5 + N | N bytes | Payload |

| Type | Payload | Meaning |
|---|---|---|
| 0 | raw PCM | Audio chunk (48 kHz, signed 16-bit little-endian, mono). |
| 1 | UTF-8 JSON | Command: emotion, microexpression, gesture, gaze, config, listen, stop, reset. |
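
The layout above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own client code: the names `encode_frame`, `TYPE_AUDIO`, and `TYPE_COMMAND` are hypothetical, and the length field is assumed to be unsigned, which the text does not state.

```python
import struct

TYPE_AUDIO = 0    # raw PCM payload
TYPE_COMMAND = 1  # UTF-8 JSON payload

# "<BI": little-endian, 1-byte type, 4-byte length, no padding (5 bytes total).
# Treating the length as unsigned is an assumption.
HEADER = struct.Struct("<BI")


def encode_frame(frame_type: int, payload: bytes) -> bytes:
    """Return the 5-byte header followed by the payload for one frame."""
    if frame_type not in (TYPE_AUDIO, TYPE_COMMAND):
        raise ValueError(f"unknown frame type: {frame_type}")
    return HEADER.pack(frame_type, len(payload)) + payload
```

A backend would typically write the result with `sock.sendall(encode_frame(...))` on the one socket it holds open to the avatar port.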
There is no acknowledgement frame. The backend produces, the avatar consumes. Backpressure is provided by the audio ring buffer, not by handshake.
Audio frames (Type 0)
- Sample rate: 48000 Hz, exactly. Other rates are rejected at the lip-sync layer with a compliance warning.
- Sample format: signed 16-bit little-endian.
- Channels: 1 (mono).
- Encoding: raw PCM bytes — no WAV/MP3/Opus header.
For the chunk-cadence and silence semantics, see Lip-sync settings.
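
As a rough sketch of producing audio in this format, the following converts float samples to signed 16-bit little-endian mono PCM and wraps them in Type 0 frames. The helper names and the default of 480 samples per chunk are illustrative assumptions; the actual chunk cadence is covered in Lip-sync settings.

```python
import math
import struct

SAMPLE_RATE = 48_000  # Hz, the only rate the avatar accepts


def floats_to_pcm16le(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit little-endian mono PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def audio_frames(pcm: bytes, samples_per_chunk: int = 480):
    """Yield Type 0 frames carrying at most samples_per_chunk samples each.

    The chunk size is a hypothetical default, not a documented requirement.
    """
    step = samples_per_chunk * 2  # 2 bytes per 16-bit mono sample
    for start in range(0, len(pcm), step):
        chunk = pcm[start:start + step]
        yield struct.pack("<BI", 0, len(chunk)) + chunk


# Example input: 1000 samples of a 440 Hz tone at 48 kHz.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(1000)]
frames = list(audio_frames(floats_to_pcm16le(tone)))
```

No file header is written at any point: each payload is sample data only.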
Command frames (Type 1)
The payload is a single UTF-8 JSON object. Examples:

    { "type": "emotion", "name": "joy", "intensity": 0.8 }
    { "type": "microexpression", "name": "smirk" }
    { "type": "anim_gesture", "name": "greet" }
    { "type": "look_at", "target": "camera" }
    { "type": "config", "key": "lipsync_chunk_size", "value": 480 }
    { "type": "listen", "value": 1 }
    { "type": "stop", "target": "speaking" }
    { "type": "reset" }

Every command is dispatched onto the game thread before any animation state is read or written. Backends do not need to think about UE threading rules: produce JSON, send the frame, and the engine does the rest.
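
A command frame could be built from a dictionary along these lines. The function name `command_frame` is hypothetical, and the compact separators are a choice made here, not something the protocol requires.

```python
import json
import struct


def command_frame(command: dict) -> bytes:
    """Serialize one command as a Type 1 frame with a UTF-8 JSON payload."""
    payload = json.dumps(command, separators=(",", ":")).encode("utf-8")
    # The length field counts encoded bytes, not characters.
    return struct.pack("<BI", 1, len(payload)) + payload


frame = command_frame({"type": "emotion", "name": "joy", "intensity": 0.8})
```

Encoding before measuring matters once a payload contains non-ASCII text, because the byte count then differs from the character count.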
What the backend never sends
- Resampled audio in any format other than 48 kHz mono 16-bit LE.
- Multiplexed channels.
- Concurrent connections from the same host.
- Frames with Type outside {0, 1}. The TCP worker drops them without dispatching.
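
For testing a backend against a mock receiver, the drop rule for unknown types might be mimicked as below. This is an illustrative sketch, not the avatar's TCP worker: `split_frames` is a hypothetical name, and it assumes the length field of an unknown-type frame is still honored so the stream stays aligned.

```python
import struct

HEADER_SIZE = 5
KNOWN_TYPES = {0, 1}


def split_frames(buffer: bytearray):
    """Pop every complete frame from buffer and return (type, payload) pairs.

    Frames whose type is outside KNOWN_TYPES are consumed but not returned.
    Incomplete trailing bytes stay in buffer for the next call.
    """
    frames = []
    while len(buffer) >= HEADER_SIZE:
        frame_type, length = struct.unpack_from("<BI", buffer, 0)
        end = HEADER_SIZE + length
        if len(buffer) < end:
            break  # payload not fully received yet
        payload = bytes(buffer[HEADER_SIZE:end])
        del buffer[:end]
        if frame_type in KNOWN_TYPES:
            frames.append((frame_type, payload))
    return frames
```

Because TCP delivers a byte stream rather than messages, the caller appends whatever `recv` returns to the same `bytearray` and calls `split_frames` after each read.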
See also
- Lip-sync settings: the long-form integrator reference for the audio stream.
- Session lifecycle: when sessions start and end, and how listen/stop/reset interact.