How modern engines beat foliage overdraw

By Oleg Sidorkin, CTO and Co-Founder of Cinevva

[Image: dense forest canopy with overlapping leaf cards stacked behind each other in front of a setting sun, suggesting heavy overdraw]

A forest is one of the worst things you can ask a GPU to render. Each leaf is a textured quad with an alpha mask cut out of it, and dozens of those quads stack on top of each other along every view ray. The rasterizer has no way to know in advance which fragments will pass the alpha test, so the early-Z optimization that saves opaque scenes loses most of its effect. The result is that a single screen pixel can run the full leaf shader 10 or 15 times before the frame settles. That's foliage overdraw, and it's routinely one of the most expensive parts of an open-world frame.

The good news is that it's a solved problem. Not in the sense that anyone fixed it with one trick, but in the sense that there's a stack of roughly nine techniques that, combined, take effective overdraw from 8-15x in a dense forest down to 1-2x. Every modern engine ships some version of this stack. Here's what's in it, why each piece exists, and the canonical references for each.

1. Why foliage overdraw is so painful

In a typical opaque scene, the GPU does early depth rejection: before the pixel shader even runs, the hardware checks the existing depth buffer and skips fragments that are already behind something. This is essentially free and it's what keeps the cost of dense geometry sane.

Alpha-tested foliage breaks this. The fragment shader has to actually run to evaluate the alpha texture and call discard (or clip) on the masked-out pixels. The hardware can't know whether a fragment will be killed until after the shader runs, so it can't commit that fragment's depth early: on most GPUs a shader that uses discard has its depth write deferred until after shading, and depending on the hardware and pipeline state the early depth test can be lost for that draw as well. On tile-based deferred GPUs (most mobile parts, Apple's M-series) discard also undermines the hardware's hidden-surface removal for the affected pixels. In the worst case the pixel shader runs once for every triangle that covers a pixel, and most of those evaluations end with a discard.

Stack 10 leaf quads in front of a single screen pixel and the leaf shader runs 10 times. Multiply by 4 million pixels and the cost is brutal. Matt Pettineo's writeup To Early-Z, or Not To Early-Z is a gentle tour of why this happens at the hardware level, and it's the right place to start if you've never thought about it before.

[Image: side view of a single screen pixel ray passing through twelve overlapping leaf cards, with each card highlighted to show stacked alpha-tested fragments]

2. The depth prepass for masked geometry

The biggest single win, and the technique every modern engine ships, is splitting the foliage draw into two passes. The first pass writes only depth, with a minimal shader that does the alpha test, discards the masked pixels, and writes nothing else. The second pass renders the full material with depth testing set to "equal" and depth writes off. Each visible pixel now shades the full BRDF exactly once, no matter how many leaf quads were stacked behind it.

This sounds like double work because you're touching the same triangles twice, but the prepass shader is so cheap (one texture sample, one discard, one depth write) that the savings on the main pass dwarf the extra cost. In a dense forest the main pass goes from running the full leaf shader 10-15 times per pixel to running it exactly once.

Unreal, Frostbite, Decima, and the modern Rage and Dunia branches all do this. It's also the key reason masked materials are still cheaper than translucent materials in any of these engines.
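To make the accounting concrete, here is a minimal CPU-side sketch of one pixel covered by a stack of leaf fragments. It is a simplified model, not any engine's implementation: the `Fragment` type, the 0.5 cutoff, and the function names are assumptions for illustration, and the naive pass is modeled as the worst case where every covering fragment runs the full shader.

```typescript
// One leaf-card fragment covering a single pixel.
interface Fragment {
  depth: number; // smaller = closer to the camera (standard Z, an assumption)
  alpha: number; // value sampled from the leaf's alpha mask
}

const ALPHA_CUTOFF = 0.5; // assumed mask threshold

// Naive masked pass, worst case: discard defers the depth update until after
// shading, so every covering fragment pays for the full leaf shader.
function naivePass(fragments: Fragment[]): { fullShaderRuns: number } {
  return { fullShaderRuns: fragments.length };
}

// Two-pass version: a cheap depth-only prepass, then a depth-equal color pass.
function prepassThenShade(fragments: Fragment[]): {
  cheapShaderRuns: number;
  fullShaderRuns: number;
} {
  // Pass 1: alpha test + depth write only (depth compare "less", writes on).
  let depthBuffer = Infinity;
  let cheapShaderRuns = 0;
  for (const f of fragments) {
    cheapShaderRuns++; // one texture sample, one discard test
    if (f.alpha < ALPHA_CUTOFF) continue; // discard: masked-out texel
    if (f.depth < depthBuffer) depthBuffer = f.depth;
  }

  // Pass 2: full material, depth compare "equal", depth writes off.
  // Only the fragment that owns the final depth value runs the full shader.
  let fullShaderRuns = 0;
  for (const f of fragments) {
    if (f.depth === depthBuffer) fullShaderRuns++;
  }
  return { cheapShaderRuns, fullShaderRuns };
}
```

With twelve stacked fragments the naive pass counts twelve full-shader runs, while the two-pass version counts twelve cheap runs and one full run. The model assumes no two fragments share exactly the same depth.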

3. Aggressive LODs and octahedral imposters

The second-biggest win is never drawing the leaves at all when you don't have to. Foliage assets ship with several LOD tiers. The closest is the full mesh with individual leaf cards. At medium range, the leaves collapse into denser composite cards (a clump of 30 leaves becomes 1 textured card with the same silhouette). Beyond a distance threshold, the entire tree becomes an imposter: a small piece of geometry textured with pre-rendered views of the tree from many angles.

The modern imposter format is the octahedral imposter: a camera-facing quad (or a handful of triangles) textured with an atlas of views captured from points on a sphere, laid out using octahedral mapping. At runtime, the shader picks the closest two or three pre-rendered views based on the camera direction and blends between them. The result is a few-triangle stand-in that looks 3D from any angle and can have proper shading, normal maps, and even wind animation. Ryan Brucks' Unreal implementation is the usual reference. Microsoft Flight Simulator's billion-tree forests are essentially octahedral imposters everywhere except near the camera.

The bigger structural win is that imposters collapse the layer count at distance. The 30-leaf composite card is a single masked quad instead of 30 quads. The whole-tree imposter is a few triangles, not thousands. Far-field overdraw collapses to almost nothing.
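The octahedral mapping behind the atlas lookup is short enough to sketch. This is the widely used unit-vector octahedral encoding, written here on the CPU; the function names, the full-sphere (rather than hemisphere) layout, and the nearest-frame lookup are assumptions for illustration. A real imposter shader would blend the neighbouring frames instead of picking just one.

```typescript
type Vec3 = [number, number, number];
type Vec2 = [number, number];

const signNotZero = (x: number): number => (x >= 0 ? 1 : -1);

// Map a non-zero direction to [0,1]^2 by projecting onto an octahedron
// and unfolding the lower half into the corners of the square.
function octEncode(dir: Vec3): Vec2 {
  const [x, y, z] = dir;
  const l1 = Math.abs(x) + Math.abs(y) + Math.abs(z);
  let u = x / l1;
  let v = y / l1;
  if (z < 0) {
    const fu = (1 - Math.abs(v)) * signNotZero(u);
    const fv = (1 - Math.abs(u)) * signNotZero(v);
    u = fu;
    v = fv;
  }
  return [u * 0.5 + 0.5, v * 0.5 + 0.5];
}

// Inverse mapping: atlas coordinate back to a unit direction.
function octDecode(uv: Vec2): Vec3 {
  let x = uv[0] * 2 - 1;
  let y = uv[1] * 2 - 1;
  const z = 1 - Math.abs(x) - Math.abs(y);
  const t = Math.max(-z, 0); // non-zero only for the folded lower hemisphere
  x += x >= 0 ? -t : t;
  y += y >= 0 ? -t : t;
  const len = Math.sqrt(x * x + y * y + z * z);
  return [x / len, y / len, z / len];
}

// Pick the atlas cell whose capture direction is closest to the view direction.
// `framesPerAxis` is the hypothetical atlas grid size (for example 9 for 9x9 views).
function nearestFrame(viewDir: Vec3, framesPerAxis: number): Vec2 {
  const [u, v] = octEncode(viewDir);
  const last = framesPerAxis - 1;
  return [Math.round(u * last), Math.round(v * last)];
}
```

Here `viewDir` is assumed to point from the tree's center toward the camera, in the tree's local space. Looking straight down the +Z axis lands in the middle cell of the atlas.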

4. Cluster and GPU-driven culling

Even with a depth prepass, the prepass itself has a cost: it still has to touch every triangle of every visible (or potentially visible) tree. Modern engines push that cost down with GPU-driven cluster culling, which throws away whole groups of triangles before the rasterizer ever sees them.

The pipeline looks like this: every mesh is pre-split into clusters of 64 or 128 triangles, each with a tight bounding box and a cone of normals. At render time, a compute shader walks the instance list, frustum-tests each instance, then frustum-tests every cluster of every visible instance, then Hi-Z occlusion-tests each surviving cluster against the previous frame's depth pyramid. Whole branches of trees that are hidden behind a hill or behind another tree get culled before any vertex shader runs. The output is a compact list of "draw these clusters" arguments fed straight into a single DrawIndirect call.

This is what makes a forest of 10,000 trees render in milliseconds instead of seconds. Ubisoft's Assassin's Creed Unity talk (Haar and Aaltonen, SIGGRAPH 2015) introduced this pipeline in production form (20-40% triangles culled, 30-80% shadow triangles culled, 10x more instances on screen than the previous generation), and Wihlidal's Frostbite talk (GDC 2016) took it further. UE5 Nanite is the visible end-state of this trajectory: cluster culling all the way down to the pixel.
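The two cluster tests can be sketched on the CPU as follows. This is a simplified stand-in for the compute pass, under several assumptions: clusters use bounding spheres rather than boxes, each cluster's screen footprint and nearest depth are taken as precomputed inputs, the Hi-Z pyramid is reduced to a single coarse level, and depth runs from 0 (near) to 1 (far). All type and function names are hypothetical.

```typescript
interface Plane {
  n: [number, number, number]; // inward-facing normal
  d: number; // dot(n, p) + d >= 0 means p is inside
}

interface Cluster {
  center: [number, number, number];
  radius: number;
  tileMin: [number, number]; // footprint in Hi-Z tiles (inclusive)
  tileMax: [number, number];
  nearestDepth: number; // closest depth of the bounds, 0 = near, 1 = far
}

// Conservative sphere-vs-frustum test: reject only if fully outside a plane.
function sphereInFrustum(c: Cluster, planes: Plane[]): boolean {
  return planes.every(
    (p) =>
      p.n[0] * c.center[0] + p.n[1] * c.center[1] + p.n[2] * c.center[2] + p.d >=
      -c.radius
  );
}

// hiZ[y][x] holds the farthest depth seen in that tile last frame.
// The cluster is occluded only if it lies behind that depth in every tile it touches.
function occluded(c: Cluster, hiZ: number[][]): boolean {
  for (let y = c.tileMin[1]; y <= c.tileMax[1]; y++) {
    for (let x = c.tileMin[0]; x <= c.tileMax[0]; x++) {
      if (c.nearestDepth <= hiZ[y][x]) return false; // might be visible here
    }
  }
  return true;
}

// Returns the compacted list of surviving cluster indices,
// the CPU analogue of the indirect-draw argument list.
function cullClusters(clusters: Cluster[], planes: Plane[], hiZ: number[][]): number[] {
  const visible: number[] = [];
  clusters.forEach((c, i) => {
    if (sphereInFrustum(c, planes) && !occluded(c, hiZ)) visible.push(i);
  });
  return visible;
}
```

On the GPU the same logic would run one thread per cluster and append survivors with an atomic counter; the sketch only shows the decision each thread makes.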

5. Front-to-back instance and cluster sorting

Once the prepass is doing its job, order starts to matter. The prepass writes depth, but only for fragments that pass the alpha test. If you draw the back of the forest first and the front last, every front fragment overwrites a back fragment, and the prepass shader has already run for the back. If you draw front-to-back, each successive draw fills more of the depth buffer with nearer values, and Hi-Z rejects more and more of the back fragments before the shader runs.

This is why almost every modern engine sorts foliage instances by distance to camera before issuing the prepass. The sort is cheap (a few hundred thousand instances on the GPU using a radix sort), and it turns the prepass itself into a self-pruning operation. Cluster-level culling sorts at the meshlet granularity for the same reason. The depth prepass and front-to-back ordering are the kind of pair where each is good and the combination is great.
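A minimal sketch of the ordering step, assuming instances carry a world-space position and the camera position is known. The comparison sort stands in for the GPU radix sort mentioned above, and `depthSortKey` shows one way a distance might be quantized into an integer key for such a sort; the 16-bit key width and all names are assumptions.

```typescript
interface Instance {
  id: number;
  position: [number, number, number];
}

// Squared distance is enough for ordering and avoids the square root.
function distanceSq(p: [number, number, number], c: [number, number, number]): number {
  const dx = p[0] - c[0];
  const dy = p[1] - c[1];
  const dz = p[2] - c[2];
  return dx * dx + dy * dy + dz * dz;
}

// Nearest instances first, so their depth lands in the buffer before
// anything behind them is rasterized. Returns a new array.
function sortFrontToBack(
  instances: Instance[],
  camera: [number, number, number]
): Instance[] {
  return instances
    .slice()
    .sort((a, b) => distanceSq(a.position, camera) - distanceSq(b.position, camera));
}

// Quantize a distance into a 16-bit key, the kind of value a radix sort would consume.
function depthSortKey(distance: number, maxDistance: number): number {
  const normalized = Math.min(Math.max(distance / maxDistance, 0), 1);
  return Math.min(0xffff, Math.floor(normalized * 0x10000));
}
```

Exact ordering is not required for correctness here. A coarse key is usually fine because the goal is only to get most near geometry drawn before most far geometry.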

6. Dithered LOD transitions and hashed alpha

The other big trap is fading. The naive way to transition between two LODs (or to fade an instance in or out as the camera approaches) is alpha blending. But blended geometry can't usefully write to the depth buffer, because a half-transparent fragment has no single correct depth. That kicks every fading tree into the slow translucent path and breaks the prepass. The solution is to keep the geometry in the masked path and do the fade inside the alpha test.

Two main techniques:

  • Dithered LOD transitions sample a 4x4 or 8x8 Bayer pattern (or a screen-space blue-noise texture) and use that as a per-pixel cutoff modifier. A tree at 50% blend has a checkerboard of pixels surviving; the missing pixels are filled in by the next LOD's complementary checkerboard. TAA resolves the checker into a smooth blend across two or three frames. Cheap, stable, blends with everything else in the engine.
  • Hashed alpha testing (Wyman & McGuire, I3D 2017) replaces the fixed 0.5 alpha threshold with a per-pixel hashed threshold in [0,1). Distant alpha geometry that would normally vanish entirely (because the mipmapped alpha drops below 0.5) keeps a stable scattering of surviving pixels. TAA again does the cleanup.

Both techniques keep foliage in the opaque/masked path where the depth prepass works, so you don't pay the full translucent rendering cost just to fade something in. Alpha to coverage is the MSAA-era cousin of the same idea: convert alpha into a sub-pixel coverage mask, get partial transparency without leaving the masked path. The catch is that A2C requires MSAA, which most modern deferred renderers no longer use.
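Both tests reduce to comparing a value against a per-pixel threshold. The sketch below uses the standard 4x4 Bayer matrix for the LOD fade and, for the hashed variant, a generic integer hash of the pixel coordinate. That hash is a stand-in: the Wyman and McGuire method hashes object-space coordinates and stabilizes the noise scale with derivatives, which this simplified version does not attempt. Function names are assumptions.

```typescript
// Standard 4x4 ordered-dither (Bayer) matrix.
const BAYER_4X4: number[][] = [
  [0, 8, 2, 10],
  [12, 4, 14, 6],
  [3, 11, 1, 9],
  [15, 7, 13, 5],
];

// Threshold in (0,1) for the given pixel, repeating every 4 pixels.
function bayerThreshold(px: number, py: number): number {
  return (BAYER_4X4[py & 3][px & 3] + 0.5) / 16;
}

// fade = 0 shows only the outgoing LOD, fade = 1 only the incoming one.
function keepIncomingLod(px: number, py: number, fade: number): boolean {
  return fade > bayerThreshold(px, py);
}

// The outgoing LOD keeps exactly the pixels the incoming LOD drops.
function keepOutgoingLod(px: number, py: number, fade: number): boolean {
  return !keepIncomingLod(px, py, fade);
}

// Generic 32-bit integer hash of a pixel coordinate, mapped to [0,1).
// Stand-in for the paper's hash, not a reproduction of it.
function hash2D(px: number, py: number): number {
  let h = Math.imul(px, 0x9e3779b1) ^ Math.imul(py, 0x85ebca6b);
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return (h >>> 0) / 4294967296;
}

// Hashed alpha test: the fixed 0.5 cutoff becomes a per-pixel random cutoff,
// so a surface with alpha 0.3 keeps roughly 30% of its pixels.
function hashedAlphaKeep(px: number, py: number, alpha: number): boolean {
  return alpha > hash2D(px, py);
}
```

At a fade of 0.5 the Bayer test keeps exactly half of each 4x4 block for the incoming LOD and the other half for the outgoing one, which is the complementary pattern that TAA then smooths over.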

7. Reducing shading cost on masked pixels

Even with a perfect prepass and perfect culling, you still have to shade every visible foliage pixel once. Engines reduce that cost too:

  • Cheaper BRDF. Foliage is matte and doesn't really need a full Cook-Torrance specular path. A wrapped-Lambertian diffuse plus a one-line specular approximation is plenty.
  • Lower-frequency normal maps. Leaves are noisy already. A 256x256 normal map looks the same as a 1024x1024 one at typical viewing distances and saves bandwidth.
  • No parallax, no anisotropy, no clearcoat. The PBR feature menu gets switched off for leaves.
  • Two-sided thin transmission instead of full subsurface scattering. Leaves transmit light from behind, but you can fake it with a single back-light dot product.
  • Half-resolution shading in some engines. Foliage gets shaded at 1/4 or 1/2 pixel rate and upscaled. Stochastic noise from TAA hides the resampling.
  • Skip detail textures and decals on masked materials by default.

Each of these is a small win individually. Combined, a stripped-down foliage shader can run 2-3x faster than the same material with the full PBR feature set.
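As an illustration of how small the lighting math can get, here is a sketch of a wrapped-Lambert diffuse term plus a single-dot-product back-light term. The wrap factor of 0.5, the transmission strength of 0.4, and the function names are assumed values for illustration, not taken from any particular engine. Vectors are assumed to be normalized.

```typescript
type Vec3 = [number, number, number];

const dot = (a: Vec3, b: Vec3): number => a[0] * b[0] + a[1] * b[1] + a[2] * b[2];

// Wrapped Lambert: light "wraps" past the terminator, which softens leaf shading.
// wrap = 0 gives plain Lambert.
function wrappedDiffuse(n: Vec3, l: Vec3, wrap: number): number {
  return Math.max(0, (dot(n, l) + wrap) / (1 + wrap));
}

// Thin-leaf transmission faked with one dot product:
// light arriving on the back face leaks through, scaled by `strength`.
function backTransmission(n: Vec3, l: Vec3, strength: number): number {
  return strength * Math.max(0, -dot(n, l));
}

// Scalar light response for one leaf pixel; multiply by albedo and light color.
function shadeLeaf(n: Vec3, l: Vec3, wrap: number = 0.5, strength: number = 0.4): number {
  return wrappedDiffuse(n, l, wrap) + backTransmission(n, l, strength);
}
```

A leaf lit head-on returns 1, a leaf lit from directly behind returns only the transmission strength, and a leaf lit edge-on still picks up some diffuse light from the wrap term.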

8. Visibility buffers and Nanite for masked materials

The cleanest answer to overdraw is to decouple shading from rasterization entirely. A visibility buffer rasterizes geometry into a thin buffer (just triangle ID and instance ID per pixel), then runs the full material as a deferred pass that reads the visibility buffer and shades each pixel exactly once. There is no overdraw at the shading stage, by construction. Burns and Hunt's 2013 paper introduced this; UE5 Nanite is the production-quality realization, and it has supported masked materials since the programmable rasterizer arrived in UE 5.1.

Nanite's twist is that it does this with cluster-level virtualized geometry, and the rasterizer itself runs on a software (compute) path for very small triangles. Masked materials in Nanite go through the "programmable raster" path: the alpha test runs during the visibility-buffer pass, but the material shading still happens once per visible pixel in the deferred resolve. The result is that very dense Nanite foliage with masked materials is far more practical than it used to be, because the overdraw at shading time is zero. There's still a tradeoff: it's often better to push high-poly opaque tree geometry through Nanite than to keep low-poly masked-card trees, because the masked path adds programmable-raster cost.
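The "thin buffer" idea can be sketched as a 32-bit pack and a resolve loop. The bit split below (7 bits of triangle index for 128-triangle clusters, 25 bits of cluster index) and the reserved empty value are assumptions chosen for illustration; real engines pick their own layouts, and Nanite's actual format is not reproduced here.

```typescript
const TRIANGLE_BITS = 7; // enough for 128 triangles per cluster (assumed)
const TRIANGLE_MASK = (1 << TRIANGLE_BITS) - 1;
const EMPTY = 0xffffffff; // reserved "no geometry" value (assumed)

// Pack a cluster index and a triangle index into one unsigned 32-bit value.
function packVisibility(clusterId: number, triangleId: number): number {
  return ((clusterId << TRIANGLE_BITS) | (triangleId & TRIANGLE_MASK)) >>> 0;
}

function unpackVisibility(packed: number): { clusterId: number; triangleId: number } {
  return {
    clusterId: packed >>> TRIANGLE_BITS,
    triangleId: packed & TRIANGLE_MASK,
  };
}

// Deferred resolve: the material runs once per covered pixel,
// no matter how many triangles were rasterized over it.
function resolve(
  visibility: Uint32Array,
  shade: (clusterId: number, triangleId: number, pixel: number) => void
): number {
  let shaded = 0;
  for (let pixel = 0; pixel < visibility.length; pixel++) {
    const packed = visibility[pixel];
    if (packed === EMPTY) continue; // sky or background
    const ids = unpackVisibility(packed);
    shade(ids.clusterId, ids.triangleId, pixel);
    shaded++;
  }
  return shaded;
}
```

The overdraw still exists during rasterization, where nearer triangles overwrite farther IDs, but each write is four bytes rather than a full material evaluation.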

9. Separate shadow representations for foliage

Shadow maps for alpha-tested foliage are among the most expensive shadow problems in an open-world frame. Every cascade needs its own depth-only pass, every one of those passes runs the alpha test, and the cost stacks fast across 4 cascades and dozens of light frusta. So most engines don't render foliage shadows the same way they render foliage color.

Common substitutions:

  • Mesh distance-field shadows (UE's Distance Field Shadows, custom engines). Each mesh has a precomputed signed distance field. A short cone trace through the SDF gives a soft shadow without ever touching the alpha-tested mesh. Particularly nice for trees because the SDF captures the canopy silhouette as a single solid blob and ignores the per-leaf detail.
  • Lower-resolution cascades for foliage. Foliage shadows go into a half-resolution slice and get up-sampled with edge-aware filtering. The eye doesn't notice the resolution drop because the shadows are already soft.
  • Alpha-to-coverage shadow maps with MSAA. On engines that still have MSAA in their shadow path, A2C gives smooth-edged foliage shadows without the full alpha test cost.
  • Capsule shadows for the trunk and big branches, distance-field for the canopy, full alpha test only at the closest cascade. Different representations for different distances, blended in the lighting pass.
  • WPO disable distance. Wind-driven World Position Offset is killed beyond a threshold so that cached shadow data stays valid across frames. UE's Virtual Shadow Maps lean heavily on this.
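To show why the distance-field option in the list above sidesteps the alpha test entirely, here is a sketch of a soft shadow traced through an SDF. It uses the common sphere-tracing penumbra approximation (take the minimum of k * distance / t along the ray) as a stand-in for a cone trace, and an analytic sphere as a stand-in for a baked canopy SDF. The constants and names are assumptions, and this is not UE's implementation.

```typescript
type Vec3 = [number, number, number];

// Stand-in for a baked mesh SDF: the canopy as one solid sphere
// of radius 2 centered 5 units above the origin.
function canopySdf(p: Vec3): number {
  const dx = p[0];
  const dy = p[1] - 5;
  const dz = p[2];
  return Math.sqrt(dx * dx + dy * dy + dz * dz) - 2;
}

// March from the shaded point toward the light. Returns 0 for full shadow,
// 1 for fully lit, and values in between for penumbra.
// `k` controls penumbra sharpness (larger = harder edge).
function softShadow(
  origin: Vec3,
  lightDir: Vec3,
  k: number = 8,
  maxDist: number = 50
): number {
  let visibility = 1;
  let t = 0.05; // small offset to avoid self-intersection
  for (let i = 0; i < 64 && t < maxDist; i++) {
    const p: Vec3 = [
      origin[0] + lightDir[0] * t,
      origin[1] + lightDir[1] * t,
      origin[2] + lightDir[2] * t,
    ];
    const d = canopySdf(p);
    if (d < 0.001) return 0; // the ray hit the canopy
    visibility = Math.min(visibility, (k * d) / t);
    t += d; // safe step: nothing is closer than d
  }
  return visibility;
}
```

With the light directly overhead, a point under the canopy center is fully shadowed, a point far to the side is fully lit, and a point just outside the canopy's footprint gets a partial value. No leaf texture is ever sampled.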

10. Wind, animation, and shadow caching

A subtle related problem: most foliage moves. Wind-driven vertex animation (World Position Offset in UE, equivalent in Frostbite and Decima) means the foliage geometry isn't stable from frame to frame, which breaks shadow caching and reprojection. Modern engines fight this in three ways:

  • Cap WPO at distance. Beyond a threshold, the wind animation amplitude smoothly goes to zero. The eye can't see the sway that far away anyway, and the shadow caches stay valid.
  • Bake the wind into the cluster bounds. Cluster bounding boxes are inflated by the maximum WPO offset so culling stays conservative without re-uploading per frame.
  • Per-instance phase offsets. Identical trees use a per-instance random seed to offset the wind phase, so a forest doesn't sway in lockstep without paying for unique animation per tree.

This is the kind of detail that doesn't show up in technique lists but is the difference between a 3 ms forest and a 9 ms forest in practice.
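The three measures above fit in a few lines. This sketch assumes a single sine-wave sway, a smoothstep fade between two hypothetical distances, and a golden-ratio sequence for per-instance phase; every name and constant is an assumption for illustration.

```typescript
// 1 below fadeStart, 0 beyond fadeEnd, smooth in between.
function windAmplitudeScale(distance: number, fadeStart: number, fadeEnd: number): number {
  const t = Math.min(1, Math.max(0, (distance - fadeStart) / (fadeEnd - fadeStart)));
  return 1 - t * t * (3 - 2 * t); // inverted smoothstep
}

// Cheap per-instance phase in [0, 2*pi) so identical trees don't sway in lockstep.
function instancePhase(instanceId: number): number {
  const GOLDEN_RATIO_FRACTION = 0.6180339887498949;
  const x = instanceId * GOLDEN_RATIO_FRACTION;
  return (x - Math.floor(x)) * 2 * Math.PI;
}

// Horizontal sway offset for one instance at a given time and camera distance.
function windOffset(
  timeSeconds: number,
  instanceId: number,
  distance: number,
  maxAmplitude: number,
  fadeStart: number = 50,
  fadeEnd: number = 100,
  frequency: number = 1.5
): number {
  const scale = windAmplitudeScale(distance, fadeStart, fadeEnd);
  return maxAmplitude * scale * Math.sin(timeSeconds * frequency + instancePhase(instanceId));
}

// Conservative culling bound: grow the radius once by the largest possible sway
// instead of recomputing bounds every frame.
function inflatedRadius(radius: number, maxAmplitude: number): number {
  return radius + maxAmplitude;
}
```

Beyond `fadeEnd` the offset is exactly zero, which is what lets cached shadow data for distant trees stay valid from frame to frame.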

11. The combined math

None of these tricks is a silver bullet. The interesting thing is what happens when you stack them:

  • A naive masked-foliage pass on a dense forest scene measures 8-15x effective overdraw. Every visible pixel runs the leaf shader 8 to 15 times.
  • Add a depth prepass and the main pass drops to ~1x overdraw, but the prepass itself still touches everything.
  • Add front-to-back sorting and the prepass starts pruning itself.
  • Add cluster-level GPU culling and the prepass touches only what could possibly be visible.
  • Add LOD chains and imposters and the count of visible quads drops by an order of magnitude beyond 30 m.
  • Add the visibility-buffer / Nanite path and the shading stage truly runs once per pixel even on dense overlap.
  • Add distance-field shadows and the shadow cost stops scaling with the alpha test.

The headline result, repeated across the talks linked above: effective overdraw collapses from 8-15x to 1-2x, and total foliage frame cost falls by 4-6x in a dense forest scene. That's the entire reason modern open-world games can render forests at 60+ fps on consumer hardware.

[Diagram: the foliage rendering pipeline as five stacked stages: GPU culling, sorted prepass, masked color pass with depth-equal, visibility-buffer shading, and separate distance-field shadow path]

12. What this means for the browser

Most of this stack maps cleanly onto WebGPU. We've already shipped GPU-driven culling, indirect dispatch, Hi-Z occlusion, and front-to-back sorted prepasses in the open-world browser engine. The depth prepass for masked geometry is straightforward: WebGPU supports depth-equal testing and discard in fragment shaders, with the same caveats about early-Z. Octahedral imposters port mechanically: the math is just sphere-to-octahedron unwrapping and atlas indexing.
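For reference, the two depth states of the masked prepass look roughly like this as WebGPU pipeline descriptor fragments. The field names (`depthStencil`, `format`, `depthWriteEnabled`, `depthCompare`, `targets`) are the ones the WebGPU API defines; the texture formats, the binding layout, and the shader itself are assumptions for illustration, and the objects are meant to be merged into a full render pipeline descriptor rather than used as-is.

```typescript
// Pass 1: depth-only prepass. The fragment stage exists only to discard
// masked-out texels, so it has no color targets.
const foliagePrepass = {
  depthStencil: {
    format: "depth32float",
    depthWriteEnabled: true,
    depthCompare: "less",
  },
  fragment: {
    targets: [] as { format: string }[],
  },
};

// Pass 2: full material. Depth is already final, so test for equality
// and leave depth writes off.
const foliageMainPass = {
  depthStencil: {
    format: "depth32float",
    depthWriteEnabled: false,
    depthCompare: "equal",
  },
  fragment: {
    targets: [{ format: "rgba16float" }],
  },
};

// Minimal WGSL for the prepass fragment stage (hypothetical bindings).
const prepassFragmentWgsl = `
@group(0) @binding(0) var leafSampler: sampler;
@group(0) @binding(1) var leafMask: texture_2d<f32>;

@fragment
fn main(@location(0) uv: vec2f) {
  if (textureSample(leafMask, leafSampler, uv).a < 0.5) {
    discard;
  }
}
`;
```

One practical caveat: the depth-equal test only works if both passes produce bit-identical positions. Sharing the same vertex transform code between the passes, and marking the position output with WGSL's `@invariant` attribute, is the usual way to guard against mismatches.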

The harder pieces are the modern ones. A visibility buffer in WebGPU means writing a 32-bit triangle ID into a render target and resolving materials in a fullscreen compute pass; the building blocks exist but the orchestration is involved. Mesh distance-field shadows want a 3D texture per asset and a short cone trace, both of which are in WebGPU's reach. Hashed alpha and dithered LODs are one shader function each.

The path forward for browser foliage is the same as for everything else in this stack: ship the cheap, robust pieces first (prepass, sorted instances, LODs, imposters, hashed alpha) and add the heavy machinery (visibility buffer, mesh-SDF shadows) on top. The browser hardware floor is finally high enough that there is no architectural reason a WebGPU forest shouldn't look and run like a console one. There are just engineering reasons, and engineering reasons are the kind we like.

Further reading across the whole stack

If you want one source that pulls all of this together, the SIGGRAPH "Advances in Real-Time Rendering in Games" archive (advances.realtimerendering.com) has the canonical foliage and GPU-driven-rendering talks going back to 2014. Adrian Courrèges' graphics studies include frame-by-frame breakdowns of GTA V and DOOM (2016) that show many of the passes discussed here in production order. For the alpha-test math specifically, Chris Wyman's research page has the hashed-alpha and stochastic-transparency papers with reference shaders. And the Real-Time Rendering, 4th edition chapters on transparency, sampling, and depth handling are still the textbook starting point.