
Building an open world in the browser, part 21: A faster renderer that wasn't faster

By Oleg Sidorkin, CTO and Co-Founder of Cinevva

New here? Use the series guide. It explains what a spike is and links all parts.

Part 20 faked surface depth on a flat quad. This part is about a render-architecture decision, and it's the spike where the textbook answer turned out to be wrong for our hardware. The question: for cinematic-density alpha-tested grass under a third-person camera, do we need a visibility buffer before pushing to 200-player density? The going advice is an emphatic yes. We built it, measured it, and the answer was no.

The technique everyone recommends

Open Spike 40 in a new tab ↗ · View source

A visibility buffer splits rendering into two passes. Pass 1 rasterizes geometry and writes only triangle and instance IDs into a compact integer target plus depth, doing no shading at all. Pass 2 is a fullscreen pass that reads the IDs at each covered pixel, refetches that triangle's vertices, reconstructs the interpolated attributes, and shades each visible pixel exactly once. The pitch is perfect overdraw rejection: the depth test runs against fragments that did no shading work, so the expensive material only runs on what you actually see.
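The two passes can be modeled on the CPU to make the "shade once per pixel" property concrete. This is a toy sketch, not the spike's WebGPU code: the `Fragment` shape, the function names, and the `EMPTY` sentinel are all invented for illustration, and the "shading" is whatever callback you hand in.

```typescript
// Toy CPU model of a visibility buffer. Every name here is illustrative.
interface Fragment {
  pixel: number;      // flattened pixel index
  depth: number;      // smaller is closer
  instanceId: number;
  triId: number;
}

const EMPTY = 0xffffffff; // sentinel for "no geometry covered this pixel"

// Pass 1: depth-test fragments and keep only the winning IDs. No shading.
function visibilityPass(fragments: Fragment[], pixelCount: number): Uint32Array {
  const depth = new Float32Array(pixelCount).fill(Infinity);
  const ids = new Uint32Array(pixelCount * 2).fill(EMPTY); // (instanceId, triId) pairs
  for (const f of fragments) {
    if (f.depth < depth[f.pixel]) {
      depth[f.pixel] = f.depth;
      ids[f.pixel * 2] = f.instanceId;
      ids[f.pixel * 2 + 1] = f.triId;
    }
  }
  return ids;
}

// Pass 2: walk every pixel and run the expensive shading once per covered pixel.
function resolvePass(
  ids: Uint32Array,
  shade: (instanceId: number, triId: number) => number
): { color: Float32Array; shadeCalls: number } {
  const pixelCount = ids.length / 2;
  const color = new Float32Array(pixelCount);
  let shadeCalls = 0;
  for (let p = 0; p < pixelCount; p++) {
    const instanceId = ids[p * 2];
    if (instanceId === EMPTY) continue; // background pixel, nothing to shade
    color[p] = shade(instanceId, ids[p * 2 + 1]);
    shadeCalls++;
  }
  return { color, shadeCalls };
}
```

However many fragments pile onto a pixel in pass 1, `shadeCalls` never exceeds the number of covered pixels. That is the whole pitch, and also why the cost of the lookups inside `shade` matters so much.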

The spike runs two paths on one canvas and one device so the only variable is where the shading lives. The forward path is a normal MeshStandardNodeMaterial through three.js. The vis-buffer path is a raw two-pass WebGPU pipeline running outside three.js, reading three.js's grass texture straight off the backend, writing (instanceId, triId) into an RG32Uint target in pass 1 and resolving lighting in pass 2. Both share one ground-truth lighting setup.

Two implementation notes worth keeping. WebGPU still has no portable primitive_index builtin in fragment shaders, so the trick is to bake a per-vertex triangle ID onto un-indexed geometry and read it flat-interpolated, which costs 3× the vertex count but is negligible on a 12-vertex grass card. And sharing the canvas with three.js's renderer is mostly a non-event as long as you never reconfigure the context or touch the canvas dimensions, both of which three.js owns. Timing the forward path was the fiddly part, since three.js exposes no hook to inject GPU timestamp queries inside its render pass; the workaround brackets its work with two no-op timestamp passes submitted before and after, which the GPU runs in submission order.
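The triangle-ID bake can be sketched as a small CPU preprocessing step. This is a minimal sketch assuming the source mesh arrives as indexed positions; the function name and return shape are hypothetical, and the spike's actual geometry code may look different.

```typescript
// De-index a triangle list and attach a per-vertex triangle ID.
// Hypothetical helper for illustration, not the spike's code.
function bakeTriangleIds(
  positions: Float32Array,           // xyz per source vertex
  indices: Uint16Array | Uint32Array // three indices per triangle
): { positions: Float32Array; triIds: Uint32Array } {
  const vertexCount = indices.length; // one output vertex per index
  const outPositions = new Float32Array(vertexCount * 3);
  const triIds = new Uint32Array(vertexCount);
  for (let i = 0; i < vertexCount; i++) {
    const src = indices[i] * 3;
    outPositions[i * 3] = positions[src];
    outPositions[i * 3 + 1] = positions[src + 1];
    outPositions[i * 3 + 2] = positions[src + 2];
    // All three vertices of a triangle carry the same ID, so a
    // flat-interpolated read in the fragment stage sees that ID unchanged.
    triIds[i] = Math.floor(i / 3);
  }
  return { positions: outPositions, triIds };
}
```

Because no vertex is shared between triangles after de-indexing, it doesn't matter which vertex the flat interpolation samples from. On the shader side the attribute would be declared with WGSL's `@interpolate(flat)`.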

The numbers go the wrong way

On an M-series Mac at ~1080p, with cross-card grass blades over an 80 m field:

At 50,000 instances the vis-buffer path won, 4.13 ms against forward's 5.51 ms, which is 25% less time. At 100,000 it was break-even. At 200,000 instances forward won, 5.44 ms against the vis-buffer's 7.80 ms, which makes the vis-buffer path roughly 43% slower. The vis-buffer path gets relatively worse as density climbs, which is the exact opposite of the folklore that says it wins precisely when overdraw is heavy.

Why forward holds up

Apple Silicon GPUs are tile-based deferred renderers, and that changes the whole calculation. Forward shading on a TBDR has a hidden-surface-removal stage that runs before the fragment shader: the rasterizer collects every fragment mapping to a tile, sorts them by depth, and only the survivors (post-alpha-test) ever reach the fragment shader. So the forward path is already collecting most of the visibility buffer's "shade once per pixel" promise, for free, inside the hardware. As blades crowd the screen, more fragments get rejected at HSR before any shading fires, and forward's effective per-pixel cost stays roughly flat instead of growing with overdraw.

Pass 1 of the vis-buffer path gets that same TBDR benefit. The trouble is all in pass 2. Pass 2 reads each pixel's instance matrix out of a buffer that, at 200,000 instances, is 12.8 MB, far larger than any GPU cache. Screen-adjacent pixels usually belong to different grass instances (the scatter is a jittered grid, so neighboring blades have arbitrary instance IDs), so every wave hitting that buffer misses cache divergently. That incoherent random access costs about 4 ms per frame on its own. Forward dodges it entirely because the instance matrix arrives with the vertex through the per-instance attribute path, so by the time the fragment shader runs the transformed vertex data is already in tile-local registers, no megabyte-scale random read required.
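The 12.8 MB figure is consistent with one 4×4 matrix of 32-bit floats per instance. That layout is an assumption, since the post doesn't spell out the per-instance stride, but the arithmetic is easy to check:

```typescript
// Assumed layout: one 4x4 float32 matrix per instance (16 floats * 4 bytes).
const BYTES_PER_INSTANCE_MATRIX = 16 * 4;

function instanceBufferBytes(
  instanceCount: number,
  bytesPerInstance: number = BYTES_PER_INSTANCE_MATRIX
): number {
  return instanceCount * bytesPerInstance;
}

// 200,000 instances * 64 bytes = 12,800,000 bytes, i.e. 12.8 MB (decimal)
const bufferMegabytes = instanceBufferBytes(200000) / 1e6;
```

Under the same assumption the buffer is 3.2 MB at 50,000 instances and 6.4 MB at 100,000, so it grows linearly with density while cache capacity stays fixed.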

This is exactly the cost Nanite's material-classification pass exists to amortize: bin pixels by instance and dispatch sorted compute waves so each wave's reads are coherent. We don't have that. A back-of-envelope says sorting pixels by instance would drop that 4 ms to maybe 1.5 to 2 ms and push the crossover out to 400,000 to 500,000 instances. But that's stacking optimizations on an architecture that isn't winning here in the first place.
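To see why binning helps, here is a CPU sketch of the idea: a counting sort that orders pixels by instance ID, plus a crude coherence metric that counts how many distinct instances each wave of pixels touches. It is not Nanite's implementation and not something the spike contains. The function names are made up, and the default wave size of 32 is an assumption.

```typescript
// Counting sort: returns pixel indices ordered by the instance they show.
function binPixelsByInstance(pixelInstanceIds: Uint32Array, instanceCount: number): Uint32Array {
  const n = pixelInstanceIds.length;
  const offsets = new Uint32Array(instanceCount + 1);
  for (let p = 0; p < n; p++) offsets[pixelInstanceIds[p] + 1]++;
  for (let i = 0; i < instanceCount; i++) offsets[i + 1] += offsets[i]; // prefix sum
  const cursor = offsets.slice(0, instanceCount); // next free slot per instance
  const order = new Uint32Array(n);
  for (let p = 0; p < n; p++) {
    const id = pixelInstanceIds[p];
    order[cursor[id]++] = p;
  }
  return order;
}

// Mean number of distinct instances per wave when pixels are processed in
// `order`. Lower means each wave's instance-matrix reads are more coherent.
function meanDistinctInstancesPerWave(
  pixelInstanceIds: Uint32Array,
  order: Uint32Array,
  waveSize: number = 32
): number {
  let waves = 0;
  let distinctTotal = 0;
  for (let start = 0; start < order.length; start += waveSize) {
    const seen = new Set<number>();
    const end = Math.min(start + waveSize, order.length);
    for (let k = start; k < end; k++) seen.add(pixelInstanceIds[order[k]]);
    distinctTotal += seen.size;
    waves++;
  }
  return waves === 0 ? 0 : distinctTotal / waves;
}
```

Comparing the metric for the identity order against the binned order gives a rough feel for how much divergence a sort would remove. It says nothing about what the sort itself costs on the GPU, which is part of why the estimate above stays a back-of-envelope.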

The honest conclusion, and the audit that earned it

For alpha-tested cross-card foliage on Apple Silicon WebGPU, the forward path with three.js's TSL pipeline is already at or below vis-buffer cost, and the vis-buffer plumbing buys nothing visible until well past 200,000 instances and only if you also add a sorting or binning pass. The practical call for the production engine is to keep the forward plus LOD plus imposter stack from the earlier spikes and not invest in vis-buffer infrastructure until either we target discrete NVIDIA or AMD GPUs as the dominant deployment (where overdraw cost is more linear) or we move to a meshlet architecture where the vis-buffer is the natural output anyway.

Because that result is counterintuitive, the conclusion is only worth anything if the comparison is fair, so the spike got a full audit pass. Several real bugs surfaced and got fixed: a blade-scale slider that silently desynced the two paths, half the forward blades rendering pitch-dark from anti-parallel normals (fixed with the canonical upward-normal foliage trick), and the vis-buffer reading about 2× too bright from a hand-picked Lambert factor instead of the energy-conserving 1/π, a hardcoded ambient term, and missing tone mapping. The fix copied three.js's exact ACES filmic curve into WGSL and reads the light colors and intensities off the actual scene lights each frame. The remaining known gap, missing direct specular in pass 2, biases the comparison toward the vis-buffer, meaning forward is doing strictly more per-pixel work and still winning at high density. That makes the headline conclusion conservative, not optimistic. The one caveat that stands: this is all M-series-specific, and the crossover may well invert on a discrete card, so it's worth re-running before committing the stack for non-Apple targets.
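The Lambert part of that fix is easy to show in isolation. A minimal sketch of an energy-conserving diffuse term follows, written in TypeScript rather than the spike's WGSL, with made-up names. It does not reproduce the ACES curve, the ambient term, or three.js's light units.

```typescript
type Vec3 = [number, number, number];

function dot(a: Vec3, b: Vec3): number {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

function normalize(v: Vec3): Vec3 {
  const len = Math.sqrt(dot(v, v));
  return [v[0] / len, v[1] / len, v[2] / len];
}

// Energy-conserving Lambert: the diffuse BRDF is albedo / PI, multiplied by
// the incoming light and the clamped cosine between normal and light direction.
function lambertDiffuse(
  albedo: Vec3,
  normal: Vec3,
  lightDir: Vec3, // direction from the surface toward the light
  lightColor: Vec3,
  intensity: number
): Vec3 {
  const nDotL = Math.max(0, dot(normalize(normal), normalize(lightDir)));
  const k = (intensity * nDotL) / Math.PI;
  return [albedo[0] * lightColor[0] * k, albedo[1] * lightColor[1] * k, albedo[2] * lightColor[2] * k];
}
```

Swapping the 1/π for a hand-tuned constant changes overall brightness by a fixed ratio, which is the kind of mismatch that makes two otherwise identical paths look different even when the geometry matches.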

Technology referenced in this chapter

Visibility buffer rendering. Pass 1 rasterizes geometry and writes only triangle and instance IDs plus depth, doing no shading. Pass 2 is a fullscreen resolve that reads the IDs per covered pixel, refetches the source triangle, reconstructs perspective-correct barycentric attributes, and shades each visible pixel once. Since WebGPU lacks a portable fragment primitive_index, the triangle ID is baked as a flat-interpolated per-vertex attribute on un-indexed geometry.
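The barycentric reconstruction in pass 2 can be sketched on the CPU. This is a generic version of the standard technique, not the spike's shader: it assumes the three vertices have already been projected to screen space and that each clip-space w is available, and all the names are illustrative.

```typescript
type Vec2 = [number, number];
type Bary = [number, number, number];

// Twice the signed area of triangle (a, b, p).
function edge(a: Vec2, b: Vec2, p: Vec2): number {
  return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]);
}

// Screen-space barycentric weights of pixel p in triangle (v0, v1, v2).
function screenBarycentrics(p: Vec2, v0: Vec2, v1: Vec2, v2: Vec2): Bary {
  const area = edge(v0, v1, v2);
  return [edge(v1, v2, p) / area, edge(v2, v0, p) / area, edge(v0, v1, p) / area];
}

// Perspective correction: weight each term by 1/w, then renormalize.
function perspectiveCorrect(bary: Bary, w: Bary): Bary {
  const x = bary[0] / w[0];
  const y = bary[1] / w[1];
  const z = bary[2] / w[2];
  const sum = x + y + z;
  return [x / sum, y / sum, z / sum];
}

// Interpolate a scalar vertex attribute with the corrected weights.
function interpolate(bary: Bary, a0: number, a1: number, a2: number): number {
  return bary[0] * a0 + bary[1] * a1 + bary[2] * a2;
}
```

A forward pass gets this interpolation from the rasterizer at no visible cost. In a visibility-buffer resolve it becomes explicit shader arithmetic on top of refetching the three vertices.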

TBDR hidden-surface removal versus deferred resolve. On a tile-based deferred GPU (Apple Silicon), forward shading already rejects occluded fragments before the fragment shader runs, so it captures most of the visibility buffer's shade-once benefit for free, and its per-pixel cost stays roughly flat as overdraw grows. A vis-buffer resolve pass instead pays for incoherent random access into a large per-instance buffer (12.8 MB at 200k instances), which dominates at high density unless pixels are sorted or binned by instance first, the way Nanite's material classification does.

Sharing a canvas with three.js's WebGPURenderer. Raw WebGPU command buffers interleave correctly with three.js submissions on the shared queue as long as you never call context.configure() again or write canvas.width/height, both of which the renderer owns. Forward-path GPU timing, which three.js doesn't expose a hook for, can be bracketed by two no-op timestamp render passes submitted around its render call, since the GPU runs command buffers in submission order.

Validating a counterintuitive benchmark. A surprising performance result is only as trustworthy as the fairness of the comparison. Auditing both paths to identical scene content and shading (matched ACES tone mapping, energy-conserving Lambert, lights read from the same objects, identical blade scale) is what turned "vis-buffer is slower" from a likely measurement artifact into a defensible conclusion, with the one remaining asymmetry biased in the conservative direction.


Part 21 of 29.

Previous: Part 20 - Faking depth on a flat plane

Next: Part 22 - Clouds you can fly through, and culling that pays off

Series guide: /blog/2026-02-25-open-world-browser-series-guide