TCP protocol

The avatar accepts a single TCP connection on port 4500 (override with -AvatarPort=N on the game command line). A second concurrent connection is rejected — the backend is expected to drop and reconnect, not multiplex.

Every frame on the wire has the same shape:

| Offset    | Size         | Field            |
| --------- | ------------ | ---------------- |
| 0         | 1 byte       | Type             |
| 1         | 4 bytes (LE) | Payload length N |
| 5 → 5 + N | N bytes      | Payload          |
| Type | Payload    | Meaning |
| ---- | ---------- | ------- |
| 0    | raw PCM    | Audio chunk (48 kHz, signed 16-bit little-endian, mono). |
| 1    | UTF-8 JSON | Command — emotion, microexpression, gesture, gaze, config, listen, stop, reset. |
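The framing can be sketched with Python's `struct` module; `encode_frame` is an illustrative helper name, not part of any shipped SDK:

```python
import struct

TYPE_AUDIO = 0    # raw PCM payload
TYPE_COMMAND = 1  # UTF-8 JSON payload

def encode_frame(frame_type: int, payload: bytes) -> bytes:
    """Build one wire frame: 1-byte type, 4-byte little-endian length, payload."""
    if frame_type not in (TYPE_AUDIO, TYPE_COMMAND):
        raise ValueError("frame type must be 0 (audio) or 1 (command)")
    return struct.pack("<BI", frame_type, len(payload)) + payload

# A command frame carrying an empty JSON object: 1 + 4 + 2 bytes on the wire.
frame = encode_frame(TYPE_COMMAND, b"{}")
assert frame == b"\x01\x02\x00\x00\x00{}"
```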

There is no acknowledgement frame. The backend produces, the avatar consumes. Backpressure is provided by the audio ring buffer, not by handshake.

The audio payload (Type 0) must match the following format exactly:

  • Sample rate: 48000 Hz, exactly. Other rates are rejected at the lip-sync layer with a compliance warning.
  • Sample format: signed 16-bit little-endian.
  • Channels: 1 (mono).
  • Encoding: raw PCM bytes — no WAV/MP3/Opus header.
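A conforming test signal can be generated with the standard library alone; `sine_chunk` and the 10 ms duration are illustrative choices, not protocol requirements:

```python
import math
import struct

SAMPLE_RATE = 48_000  # Hz, required exactly
CHUNK_MS = 10         # illustrative chunk duration

def sine_chunk(freq_hz: float = 440.0, amplitude: float = 0.3) -> bytes:
    """Return 10 ms of signed 16-bit little-endian mono PCM at 48 kHz."""
    n_samples = SAMPLE_RATE * CHUNK_MS // 1000  # 480 samples
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n_samples)
    )
    return b"".join(struct.pack("<h", s) for s in samples)

chunk = sine_chunk()
assert len(chunk) == 960  # 480 samples x 2 bytes, no header of any kind
```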

For the chunk-cadence and silence semantics, see Lip-sync settings.

A command payload (Type 1) is a single UTF-8 JSON object. Examples:

{ "type": "emotion", "name": "joy", "intensity": 0.8 }
{ "type": "microexpression", "name": "smirk" }
{ "type": "anim_gesture", "name": "greet" }
{ "type": "look_at", "target": "camera" }
{ "type": "config", "key": "lipsync_chunk_size", "value": 480 }
{ "type": "listen", "value": 1 }
{ "type": "stop", "target": "speaking" }
{ "type": "reset" }
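Sending one of these commands can be sketched with the standard library; `send_command` is an illustrative name, and the host/port in the usage comment assume the default port described above:

```python
import json
import socket
import struct

def send_command(sock: socket.socket, command: dict) -> None:
    """Serialize a command dict to UTF-8 JSON and send it as a type-1 frame."""
    payload = json.dumps(command).encode("utf-8")
    sock.sendall(struct.pack("<BI", 1, len(payload)) + payload)

# Usage (assumes the avatar is listening on the default port):
# with socket.create_connection(("127.0.0.1", 4500)) as sock:
#     send_command(sock, {"type": "emotion", "name": "joy", "intensity": 0.8})
```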

Every command is dispatched onto the game thread before any animation state is read or written. Backends do not need to think about UE threading rules — produce JSON, send the frame, the engine does the rest.

The following are not supported:

  • Resampled audio in any format other than 48 kHz mono 16-bit LE.
  • Multiplexed channels.
  • Concurrent connections from the same host.
  • Frames with Type outside {0, 1}. The TCP worker drops them without dispatching.
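The drop rule on the receiving side can be sketched as a parser over a byte buffer; `parse_frames` is an illustrative helper, not the engine's actual TCP worker:

```python
import struct

HEADER_SIZE = 5  # 1-byte type + 4-byte little-endian length

def parse_frames(buffer: bytes):
    """Yield (type, payload) for complete frames; drop types outside {0, 1}."""
    offset = 0
    while offset + HEADER_SIZE <= len(buffer):
        frame_type, length = struct.unpack_from("<BI", buffer, offset)
        if offset + HEADER_SIZE + length > len(buffer):
            break  # partial frame: wait for more bytes
        payload = buffer[offset + HEADER_SIZE : offset + HEADER_SIZE + length]
        offset += HEADER_SIZE + length
        if frame_type in (0, 1):
            yield frame_type, payload
        # frames with any other type are consumed but never dispatched

wire = (
    b"\x01\x02\x00\x00\x00{}"          # type 1, valid
    b"\x07\x01\x00\x00\x00X"            # type 7, dropped
    b"\x00\x02\x00\x00\x00\x01\x02"    # type 0, valid
)
assert list(parse_frames(wire)) == [(1, b"{}"), (0, b"\x01\x02")]
```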
See also:

  • Lip-sync settings — the long-form integrator reference for the audio stream.
  • Session lifecycle — when sessions start and end, and how listen / stop / reset interact.