Building an open world in the browser, part 23: Fifty avatars and a voice in the room

By Oleg Sidorkin, CTO and Co-Founder of Cinevva

New here? Use the series guide. It explains what a spike is and links all parts.

Part 22 put a sky over the world. This part puts people in it. A third-person open world wants 50-plus characters visible at any moment, and it wants to hear the ones standing next to you. Spike 45 is the rendering side: getting that many animated avatars onto the GPU without melting the main thread. Spike 46 is the audio side: peer-to-peer voice that pans and attenuates with position, tuned to sound like a normal video call rather than a tech demo.

One draw call for fifty dancers

Open Spike 45 in a new tab ↗ · View source

The default Three.js path gives every character its own SkinnedMesh, its own AnimationMixer, its own bone-matrix upload, and its own draw call. At 50 avatars on the local Mac that was about 13 ms of pure JavaScript overhead per frame before the GPU did a single thing. The de-risking question for the whole multiplayer track was whether one batched-skinning architecture could get that under control and scale linearly with character count.

The answer is a split between where the animation is computed and where it's drawn. Three character classes share one FBX template. The local player is a normal Avatar, a full skeleton clone with its own mixer, the standard Three.js path, because there's only ever one of it. Every remote peer is a VirtualSkeleton: also a full clone with its own mixer running the same clips, but every SkinnedMesh node is stripped immediately after cloning so only the bones survive. It never enters the scene. Each frame, after the mixer updates and the matrices settle, it packs (bone.matrixWorld × boneInverse) for all 100 bones into a slot of a shared Float32Array. The BatchSkinnedRenderer then owns one InstancedMesh per geometry piece, all reading from a single StorageBufferAttribute of bone matrices sized maxInstances × numBones × mat4, which is 60 × 100 × 64 = 384 KB. A MeshStandardNodeMaterial with custom positionNode and normalNode reads four bone influences per vertex straight out of that storage buffer. The result is one storage upload and one draw call per geometry piece for the entire crowd, regardless of how many people are in it. Skinning lives in the vertex shader, and the per-avatar JavaScript cost drops to running a mixer and copying 100 matrices.
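The per-frame packing step can be sketched without Three.js at all. This is a minimal illustration, assuming column-major 16-float matrices (the layout of Matrix4.elements); the function names boneBufferBytes, multiplyInto, and packSkeleton are hypothetical and not taken from the spike's source.

```typescript
// Column-major 4x4 matrix stored as 16 floats.
type Mat4 = ArrayLike<number>;

const FLOATS_PER_MAT4 = 16;
const BYTES_PER_MAT4 = FLOATS_PER_MAT4 * 4; // 64 bytes of float32 data

// Size of the shared bone buffer: maxInstances x numBones x mat4.
function boneBufferBytes(maxInstances: number, numBones: number): number {
  return maxInstances * numBones * BYTES_PER_MAT4;
}

// Writes (world x inverse) into `out` at `offset`, column-major.
// Element (row, col) lives at index col * 4 + row.
function multiplyInto(out: Float32Array, offset: number, world: Mat4, inverse: Mat4): void {
  for (let col = 0; col < 4; col++) {
    for (let row = 0; row < 4; row++) {
      let sum = 0;
      for (let k = 0; k < 4; k++) {
        sum += world[k * 4 + row] * inverse[col * 4 + k];
      }
      out[offset + col * 4 + row] = sum;
    }
  }
}

// Packs one avatar's skinning matrices into its slot of the shared array.
// Slot layout (hypothetical): slot * numBones * 16 floats, bones in order.
function packSkeleton(
  shared: Float32Array,
  slot: number,
  boneWorld: Mat4[],
  boneInverses: Mat4[],
): void {
  const numBones = boneWorld.length;
  const base = slot * numBones * FLOATS_PER_MAT4;
  for (let b = 0; b < numBones; b++) {
    multiplyInto(shared, base + b * FLOATS_PER_MAT4, boneWorld[b], boneInverses[b]);
  }
}
```

With the numbers from the text, boneBufferBytes(60, 100) comes to 384,000 bytes, which is the 384 KB quoted above.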

The HUD that measures this had to be rebuilt too. The old version flagged "over budget" when whole-frame GPU time crossed 3 ms, but the frame always includes the shadow map, the ground, and the local player's full skinned mesh, which together run 3 to 5 ms on real hardware no matter how many synthetic peers exist. The fix is a budget that calibrates itself: while no batched avatars are present, it captures the live GPU time as a baseline through a fast EMA, then freezes that baseline and grows the budget linearly at 0.06 ms per added avatar once synthetics appear. It reads PASS at idle on every machine and tightens proportionally as the crowd grows.
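A self-calibrating budget of this shape might look like the sketch below. Only the 0.06 ms per-avatar slope comes from the text; the EMA smoothing factor, the fixed headroom added on top of the baseline, and the choice to keep the baseline frozen even if the crowd later empties are assumptions made for illustration, and the class name is hypothetical.

```typescript
interface BudgetReading {
  baselineMs: number;
  budgetMs: number;
  pass: boolean;
}

class SelfCalibratingBudget {
  private baselineMs: number | null = null;
  private frozen = false;
  private readonly perAvatarMs: number;
  private readonly emaAlpha: number;
  private readonly headroomMs: number;

  constructor(perAvatarMs = 0.06, emaAlpha = 0.3, headroomMs = 0.5) {
    this.perAvatarMs = perAvatarMs; // slope from the text
    this.emaAlpha = emaAlpha;       // ASSUMED smoothing factor for the "fast EMA"
    this.headroomMs = headroomMs;   // ASSUMED slack so idle noise still reads PASS
  }

  update(gpuMs: number, batchedAvatars: number): BudgetReading {
    // The first batched avatar freezes the baseline (kept frozen afterwards: assumption).
    if (batchedAvatars > 0) this.frozen = true;

    if (this.baselineMs === null) {
      this.baselineMs = gpuMs; // seed with the first sample
    } else if (!this.frozen) {
      this.baselineMs += this.emaAlpha * (gpuMs - this.baselineMs);
    }

    const budgetMs = this.baselineMs + this.headroomMs + this.perAvatarMs * batchedAvatars;
    return { baselineMs: this.baselineMs, budgetMs, pass: gpuMs <= budgetMs };
  }
}
```

The point of the design is that the fixed per-frame cost (shadow map, ground, local player) is measured rather than guessed, so the only hard-coded number is the marginal cost per avatar.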

The bug was a face you couldn't see

First runs in Chrome showed silver shadows on the ground and no avatars at all, with a WGSL parse error, "cannot index type 'f32'", on a line trying to subscript object.nodeUniform2[i] where the uniform was declared as a scalar. The honest part of this story is that the first fix was wrong and worked anyway. The guess was that InstancedMesh's instance-matrix path was generating the bad code, and swapping in a StorageInstancedBufferAttribute made the error disappear in Chrome. But it disappeared because the new path emitted different shader code, not because it addressed the cause, which is the most dangerous kind of fix.

The real culprit was morph targets. The 3MIKE.fbx ships with blend-shape facial expressions, the cloned geometry inherits the morphAttributes, and Three.js's MorphNode.setup() declares morphTargetInfluences as a scalar float and then tries to .element(i) it inside a synthesized loop, which is exactly the scalar-subscript the compiler rejected. The fix is one line, clearing geometry.morphAttributes = {} on geometry that doesn't use morphs, so Three.js never injects the MorphNode at all. The accidental Chrome fix stayed in for a while and then bit back: on Safari, the storage-instanced path produced "Vertex buffer is not big enough" 256 times over, because Safari's WebGPU backend doesn't translate it cleanly. Reverting it was the right call, and the plain bone-matrix storage buffer, which is core WebGPU rather than a generated instance path, works fine everywhere. The lesson is the one worth carrying: when a fix works on one browser and you can't explain the mechanism, you've patched a symptom, so read the actual generated WGSL. A getCompilationInfo() shim added later in the spike turned Three.js's generic "module is not valid" into the real Tint error and paid for itself many times over.

A related framework-evasion trick sits next to it. Three.js detects the standard skinIndex and skinWeight attribute names and tries to inject its own SkinningNode, even on an InstancedMesh whose custom positionNode already does the skinning. Renaming those attributes to boneIndex and boneWeight hides them from the framework, and the custom TSL reads them under the new names.
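Both geometry workarounds, clearing the morph attributes and renaming the skin attributes, fit in one small preparation pass. The sketch below is written against a structural stand-in (GeometryLike) rather than the real Three.js class so it runs on its own; the method names mirror BufferGeometry's getAttribute, setAttribute, and deleteAttribute, while prepareForBatchedSkinning is a hypothetical helper name.

```typescript
// Minimal structural stand-in for the parts of a geometry this pass touches.
interface GeometryLike<A> {
  morphAttributes: Record<string, A[]>;
  getAttribute(name: string): A | undefined;
  setAttribute(name: string, attribute: A): unknown;
  deleteAttribute(name: string): unknown;
}

// Standard names the framework looks for, and the names the custom TSL reads.
const RENAMES: Array<[string, string]> = [
  ["skinIndex", "boneIndex"],
  ["skinWeight", "boneWeight"],
];

function prepareForBatchedSkinning<A>(geometry: GeometryLike<A>): void {
  // No morph attributes means no MorphNode is injected into the shader.
  geometry.morphAttributes = {};

  // Move the skin data to names the automatic skinning path does not detect.
  for (const [from, to] of RENAMES) {
    const attribute = geometry.getAttribute(from);
    if (attribute === undefined) continue;
    geometry.setAttribute(to, attribute);
    geometry.deleteAttribute(from);
  }
}
```

This only makes sense for geometry whose material supplies its own skinning math; a mesh on the standard path still needs the original attribute names.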

A relay that forgets you between words

The first version synced peers over BroadcastChannel, a same-browser stand-in with the real wire format and cadence, and the protocol comment promised the swap to real transport would be one line. Cashing that promise meant an AvatarRoomDO, a 74-line Cloudflare Durable Object that doesn't even decode the 36-byte binary frame. It forwards each message as-is to every other peer in the room, because the sender's id is embedded in the frame and each receiver filters its own echo client-side. The relay has zero awareness of identity. Hibernating WebSockets make an idle room free: the DO drops out of memory between messages and the runtime restores the tagged sockets on the next packet. At 10 events per second per peer that's 36,000 DO requests per peer-hour, about half a cent, with egress free on Cloudflare and roughly 6 to 10 times cheaper than the equivalent AWS WebSocket shape.
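The relay's whole job can be sketched in a few lines. This is an illustration, not the AvatarRoomDO source: SocketLike is a stand-in for a WebSocket, the socket list is passed in rather than fetched from the Durable Object runtime, and the position of the sender id inside the 36-byte frame (a little-endian uint32 at byte 0) is purely an assumption, since the text does not describe the frame layout.

```typescript
// Structural stand-in for a WebSocket; only send() is needed here.
interface SocketLike {
  send(data: ArrayBuffer): void;
}

// Server side: forward the frame untouched to every socket except the sender.
// The relay never looks inside the frame.
function relayFrame(sockets: SocketLike[], sender: SocketLike, frame: ArrayBuffer): number {
  let forwarded = 0;
  for (const socket of sockets) {
    if (socket === sender) continue;
    socket.send(frame);
    forwarded++;
  }
  return forwarded;
}

// ASSUMPTION: the sender id is a little-endian uint32 at byte 0 of the frame.
const SENDER_ID_OFFSET = 0;

// Client side: a receiver drops any frame that carries its own id.
function isOwnEcho(frame: ArrayBuffer, localId: number): boolean {
  return new DataView(frame).getUint32(SENDER_ID_OFFSET, true) === localId;
}

// Request volume from the text: events per second times 3600 seconds.
function requestsPerPeerHour(eventsPerSecond: number): number {
  return eventsPerSecond * 3600;
}
```

Keeping identity inside the frame is what lets the server stay this small: it has nothing to parse, validate, or remember between messages.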

The swap surfaced a state-machine bug worth keeping. A remote player kept walking after they'd stopped. The animation request guarded against this._state, the currently playing clip, instead of the last queued name, so when two network messages arrived in the same tick, a walk then an idle, the idle compared against a state that hadn't advanced yet and got silently dropped. The peer was stuck walking forever, because future idle packets were deduplicated upstream as unchanged. The fix is to always overwrite the pending name and let the transition helper short-circuit genuine same-state requests, which it already did. The class of bug is general: a dedup check against the wrong reference value quietly swallows the input that matters.
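The bug and the fix are easy to reproduce in miniature. Both classes below are hypothetical reductions of the state machine described above, with string clip names and a tick() that applies the pending request; they are not the spike's actual Avatar code.

```typescript
// Buggy shape: the guard compares against the clip that is playing right now.
class BuggyAnimator {
  state = "idle";
  private pending: string | null = null;

  request(name: string): void {
    if (name === this.state) return; // wrong reference value
    this.pending = name;
  }

  tick(): void {
    if (this.pending !== null) {
      this.state = this.pending;
      this.pending = null;
    }
  }
}

// Fixed shape: always overwrite the pending name, and let the transition
// step skip requests that really are a no-op.
class FixedAnimator {
  state = "idle";
  private pending: string | null = null;

  request(name: string): void {
    this.pending = name;
  }

  tick(): void {
    if (this.pending === null) return;
    const next = this.pending;
    this.pending = null;
    if (next === this.state) return; // genuine same-state request
    this.state = next;
  }
}
```

Feeding both a walk and then an idle before the next tick leaves the buggy version in "walk" and the fixed version in "idle", which is the stuck-walking symptom in two calls.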

Safari needed two more guards. It opens the WebSocket faster than Chrome, so the first inbound peer message could arrive before the batch renderer finished constructing, dereferencing null; dropping messages while the renderer is absent is safe because peers re-broadcast every 100 ms. And 'gpu' in navigator returned true while requestAdapter() returned null, so Three.js silently fell back to WebGL2, where the storage-buffer skinning chain has no valid translation and spewed errors. Checking for a real adapter and asserting the backend is actually WebGPU turns a degraded render into a clear loading-screen message. There was even a WGSL dialect gap: Three.js emits the modern two-argument @interpolate(flat, either) that WebKit's compiler hasn't shipped, patched by rewriting the shader source on the way into createShaderModule to drop the second argument. That is free because the flat-interpolated values here are the same on every vertex of a primitive, so it doesn't matter which vertex the compiler picks.
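The shader rewrite is a plain string substitution. A minimal sketch, assuming the only form that needs patching is the exact two-argument @interpolate(flat, either); the helper names and the ShaderDeviceLike stand-in for the GPU device are hypothetical, though createShaderModule and its code field are the real WebGPU names.

```typescript
// Rewrites @interpolate(flat, either) to the one-argument @interpolate(flat).
function dropFlatSamplingArgument(wgsl: string): string {
  return wgsl.replace(/@interpolate\(\s*flat\s*,\s*either\s*\)/g, "@interpolate(flat)");
}

// Structural stand-in for the one device method this patch wraps.
interface ShaderDeviceLike {
  createShaderModule(descriptor: { code: string }): unknown;
}

// Wraps createShaderModule so every shader is rewritten on the way in.
function patchCreateShaderModule(device: ShaderDeviceLike): void {
  const original = device.createShaderModule.bind(device);
  device.createShaderModule = (descriptor: { code: string }) =>
    original({ ...descriptor, code: dropFlatSamplingArgument(descriptor.code) });
}
```

Other interpolation attributes, such as @interpolate(linear, center), pass through untouched because the pattern only matches the flat form.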

Voice that pans with the room

Open Spike 46 in a new tab ↗ · View source

Spike 46 is proximity voice: peer-to-peer WebRTC with HRTF spatial audio, scoped explicitly to match Google Meet and Microsoft Teams quality in a quiet-to-moderately-noisy room. A VoiceRoomDO handles signaling as a JSON relay, sending each new peer a roster, announcing joins and leaves, routing SDP and ICE to one specific peer by socket tag, and broadcasting position updates that drive the spatial panners. It stamps the sender id on every message so peers can't spoof each other, and the audio itself never touches the DO. There is one RTCPeerConnection per remote peer, and the lexicographically smaller peer id always makes the offer, so both sides agree on who initiates without a full perfect-negotiation implementation.
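The role decision reduces to one string comparison that both peers evaluate independently. A sketch, with shouldMakeOffer as a hypothetical name and string peer ids assumed:

```typescript
// True when the local peer should create the offer for this pair.
// JavaScript compares strings lexicographically by UTF-16 code unit.
function shouldMakeOffer(localId: string, remoteId: string): boolean {
  return localId < remoteId;
}
```

For any two distinct ids exactly one side gets true, so there is never a glare case to resolve. Two peers with the same id would both get false, which is one reason ids need to be unique per connection.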

On the receive side each peer's audio runs through a PannerNode set to HRTF with inverse distance rolloff, and the AudioListener updates every frame from the local player's position and facing using forwardX = sin(facing), forwardZ = cos(facing), which matches the scene's atan2(wx, wz) facing convention. One Chrome quirk cost an hour: a MediaStream consumed only by Web Audio sometimes won't pull packets, so each stream also attaches to a hidden muted <audio> element to force the decoder to schedule. On the quality side, browsers default to roughly 32 kbps mono Opus, so the spike munges the fmtp line on every offer and answer to bump it to 128 kbps with in-band FEC enabled and DTX disabled, then calls setParameters with a high max bitrate so the sender's own cap doesn't hold the encoder below what the SDP advertises. FEC is the second-biggest audible win after the bitrate bump, recovering from packet loss without renegotiation.
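The fmtp rewrite might look like the sketch below. The parameter names maxaveragebitrate, useinbandfec, and usedtx are the standard Opus fmtp parameters; the function name mungeOpusFmtp and the choice to leave the SDP untouched when no Opus fmtp line exists are assumptions for illustration, not the spike's actual code.

```typescript
// Rewrites the Opus fmtp line to request a higher bitrate, FEC on, DTX off.
function mungeOpusFmtp(sdp: string, bitrate = 128000): string {
  // Find the payload type the offer assigned to Opus.
  const rtpmap = /^a=rtpmap:(\d+) opus\/48000/im.exec(sdp);
  if (!rtpmap) return sdp;
  const payloadType = rtpmap[1];

  const fmtpLine = new RegExp("^a=fmtp:" + payloadType + " (.*)$", "m");
  return sdp.replace(fmtpLine, (_line: string, params: string) => {
    // Parse the existing key=value pairs so nothing already set is lost.
    const settings: Record<string, string> = {};
    for (const pair of params.split(";")) {
      const trimmed = pair.trim();
      if (trimmed === "") continue;
      const eq = trimmed.indexOf("=");
      if (eq < 0) settings[trimmed] = "";
      else settings[trimmed.slice(0, eq)] = trimmed.slice(eq + 1);
    }

    settings["maxaveragebitrate"] = String(bitrate);
    settings["useinbandfec"] = "1";
    settings["usedtx"] = "0";

    const joined = Object.keys(settings)
      .map((key) => (settings[key] === "" ? key : key + "=" + settings[key]))
      .join(";");
    return "a=fmtp:" + payloadType + " " + joined;
  });
}
```

The same function would run on both the local description before it is set and the one sent to the remote peer, since the text applies the change to every offer and answer.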

Deleting your way to clean audio

The audio chain that shipped is much smaller than the one I started with, and shrinking it was the real lesson. The first version had a high-pass filter, a click limiter tuned to catch keyboard noise, a compressor, a noise gate, and a wet-dry crossfade, fronted by a floating panel with twelve-plus sliders. When the user reported audible keyboard clicks, the instinct was to tune the click limiter harder and drop the dry mix, a stack of bandaids. The structural answer was that once an ML denoiser is in the chain, the click limiter and the gate and most of the high-pass are all redundant, because RNNoise is trained on exactly keyboard and mouse and typing noise, and amplitude clipping is a strictly worse version of the same job. Production clients ship ML denoise, echo cancellation, automatic gain, and a soft compressor for level, and nothing else. So four stages came out, the slider panel came out, and the "choose your noise reduction" toggles came out, leaving one fixed pipeline.

Each surviving stage earns its place. Browser echo cancellation stays on because RNNoise doesn't do echo, and without it speaker-into-mic feedback is unbounded. Browser noise suppression goes off because stacking it on RNNoise produces artifacts on fricatives, so you pick one denoiser. Browser automatic gain stays on, because turning it off made the signal too quiet for the compressor to work with and Web Audio's DynamicsCompressorNode has no makeup-gain parameter to compensate; the browser's broad levelling and the spike's fast compressor operate on different timescales and coexist. RNNoise runs at 92 percent wet mixed with 8 percent dry, because it can over-suppress unvoiced consonants like s, sh, and f whose voice probability dips, and the small dry path preserves them at the cost of a little keystroke leak.
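Those choices boil down to three capture flags and one mix ratio. The constraint names below are the standard getUserMedia audio constraints; treating the wet/dry blend as a plain linear sum of the two signals is an assumption, and mixSample is a hypothetical helper that shows the ratio on a single sample rather than the real Web Audio graph.

```typescript
// Capture constraints matching the stage-by-stage reasoning above.
const micConstraints = {
  audio: {
    echoCancellation: true,  // RNNoise does not cancel echo
    noiseSuppression: false, // one denoiser only, and RNNoise is it
    autoGainControl: true,   // keeps the level high enough for the compressor
  },
};

const RNNOISE_WET = 0.92;
const RNNOISE_DRY = 0.08;

// ASSUMPTION: a linear blend of the denoised and raw signals.
function mixSample(denoised: number, raw: number): number {
  return RNNOISE_WET * denoised + RNNOISE_DRY * raw;
}
```

In a browser the constraints object would be passed to navigator.mediaDevices.getUserMedia, and the blend would be two GainNodes feeding one destination instead of per-sample arithmetic.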

Two features round it out. Push-to-talk doesn't flip track.enabled, because that discards everything still in the pipeline buffers and chops the last syllable on key release. Instead a GainNode near the tail ramps with setTargetAtTime, fast attack so the first syllable survives and slow release so the last consonant drains, with the track left permanently enabled. And a five-second broadcast delay, requested as a radio-style feature, runs a bypass leg and a DelayNode leg crossfaded together, with a dump button that snaps the delayed output to silence and counts down on the HUD before audio resumes. Bundling the denoiser was its own small saga: the published RNNoise worklet uses bare-specifier imports that no CDN resolves, so the fix was a local esbuild bundle producing one self-contained 1.9 MB file with the WASM base64-inlined, committed to the repo and referenced by a URL relative to the module so it resolves under the dev server, the VitePress build, and the custom domain alike. If the worklet ever fails to load, the chain still produces audio through a plain high-pass and compressor, and the HUD shows the failure in red.
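The push-to-talk envelope comes down to one setTargetAtTime call with a different time constant in each direction. In this sketch GainParamLike stands in for the gain AudioParam so the code runs outside a browser; the two time constants are illustrative guesses, not the spike's tuned values.

```typescript
// Structural stand-in for AudioParam; setTargetAtTime has this signature.
interface GainParamLike {
  setTargetAtTime(target: number, startTime: number, timeConstant: number): unknown;
}

// ASSUMED values: a short constant so the first syllable survives,
// a longer one so the last consonant drains out instead of being cut.
const ATTACK_TIME_CONSTANT_S = 0.005;
const RELEASE_TIME_CONSTANT_S = 0.08;

// Called on key down (talking = true) and key up (talking = false).
// The media track itself stays enabled the whole time.
function setTalking(gain: GainParamLike, now: number, talking: boolean): void {
  const target = talking ? 1 : 0;
  const timeConstant = talking ? ATTACK_TIME_CONSTANT_S : RELEASE_TIME_CONSTANT_S;
  gain.setTargetAtTime(target, now, timeConstant);
}
```

In the real graph, gain would be the gain param of the tail GainNode and now would be the AudioContext's currentTime.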

Technology referenced in this chapter

Batched GPU skinning for crowds. Remote avatars run a headless VirtualSkeleton (a full clone with the skinned meshes stripped, bones kept, its own mixer) that packs bone.matrixWorld × boneInverse for every bone into a shared StorageBufferAttribute. One InstancedMesh per geometry piece reads those matrices in a custom TSL positionNode/normalNode, so the entire crowd costs one storage upload and one draw call per piece, with per-avatar CPU work limited to a mixer update and a matrix copy. See GPU-driven LOD.

Reading the generated WGSL, not the symptom. A "cannot index type 'f32'" compile error traced to Three.js's MorphNode declaring morphTargetInfluences as a scalar and subscripting it, fixed by clearing morphAttributes on geometry that doesn't use morphs. A first fix that only changed which shader path was generated masked the cause and broke Safari later. Renaming skinIndex/skinWeight to boneIndex/boneWeight hides the attributes from Three.js's automatic SkinningNode injection so a custom skinning material owns the math.

Hibernating Durable Object relays. A pure-binary AvatarRoomDO forwards 36-byte frames to every other peer without decoding them, with sender identity embedded in the frame and self-echo filtered client-side. Hibernating WebSockets make an idle room free, and the shape costs about half a cent per peer-hour at 10 Hz, far below the equivalent managed-WebSocket pricing. A dedup guard that compared against the playing animation state rather than the last-queued one silently dropped stop messages and stuck remote players in a walk loop.

WebRTC proximity voice with HRTF. One RTCPeerConnection per peer with offer/answer roles decided by peer-id ordering, audio routed through an HRTF PannerNode with an AudioListener updated per frame from player facing, and Opus munged to 128 kbps with in-band FEC for resilience. A muted hidden <audio> element forces Chrome to pull packets from a Web-Audio-only stream.

Subtractive audio engineering. Matching Meet/Teams quality meant removing stages, not adding them: ML denoise plus echo cancellation plus automatic gain plus a soft compressor, with no gate and no click limiter, because an ML denoiser trained on keyboard noise makes amplitude clipping redundant. Push-to-talk ramps a tail GainNode with an asymmetric envelope instead of flipping the track so syllables aren't clipped, and the denoiser worklet ships as a single self-contained esbuild bundle to dodge bare-specifier import resolution.

Part 23 of 29.
Previous: Part 22 - Clouds you can light, and culling that has to be fed
Next: Part 24 - Saving a world, and wind you can see
Series guide: /blog/2026-02-25-open-world-browser-series-guide
