How modern engines beat foliage overdraw
By Oleg Sidorkin, CTO and Co-Founder of Cinevva

A forest is one of the worst things you can ask a GPU to render. Each leaf is a textured quad with an alpha mask cut out of it. Dozens of those quads stack on top of each other along every view ray. The rasterizer has no way to know in advance which fragments will pass the alpha test, so the standard early-Z optimization that saves opaque scenes is largely defeated. The result is that a single screen pixel can run the full leaf shader 10 to 15 times before the frame settles. That's foliage overdraw, and it's routinely one of the most expensive parts of an open-world frame.
The good news is that it's a solved problem. Not in the sense that anyone fixed it with one trick, but in the sense that there's a stack of roughly nine techniques that, combined, take effective overdraw from 8-15x in a dense forest down to 1-2x. Every modern engine ships some version of this stack. Here's what's in it, why each piece exists, and the canonical references for each.
1. Why foliage overdraw is so painful
In a typical opaque scene, the GPU does early depth rejection: before the pixel shader even runs, the hardware checks the existing depth buffer and skips fragments that are already behind something. This is essentially free and it's what keeps the cost of dense geometry sane.
Alpha-tested foliage breaks this. The fragment shader has to actually run to evaluate the alpha texture and call discard (or clip) on the masked-out pixels. The hardware can't know whether a fragment will be killed until after the shader runs, so when a shader uses discard with depth writes enabled, the depth write has to wait until the shader finishes, and many GPUs fall back to a slower late-Z path for that draw call. On tile-based GPUs (mobile, M-series, some consoles) it can disable Hi-Z and depth compression for the whole frame. The pixel shader runs for nearly every triangle that covers a pixel, and most of those evaluations end with a discard.
Stack 10 leaf quads in front of a single screen pixel and the leaf shader runs 10 times. Multiply by 4 million pixels and the cost is brutal. Matt Pettineo's writeup To Early-Z, or Not To Early-Z is the gentlest tour of why this happens at the hardware level, and it's the right place to start if you've never thought about it before.

2. The depth prepass for masked geometry
The biggest single win, and the technique every modern engine ships, is splitting the foliage draw into two passes. The first pass writes only depth, with a minimal shader that does the alpha test, discards the masked pixels, and writes nothing else. The second pass renders the full material with depth testing set to "equal" and depth writes off. Each visible pixel now shades the full BRDF exactly once, no matter how many leaf quads were stacked behind it.
This sounds like double work because you're touching the same triangles twice, but the prepass shader is so cheap (one texture sample, one discard, one depth write) that the savings on the main pass dwarf the extra cost. In a dense forest the main pass goes from running the full leaf shader 10-15 times per pixel to running it exactly once.
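To make the accounting concrete, here is a minimal sketch that counts shader invocations for one screen pixel covered by a stack of leaf fragments. It is a CPU-side toy model, not engine code: the `LeafFragment` type and both function names are made up for illustration, and it assumes no early rejection at all in the prepass (the worst case).

```typescript
// One leaf fragment covering a single screen pixel (hypothetical type).
interface LeafFragment {
  depth: number;      // smaller = closer to the camera
  alphaPass: boolean; // true if the alpha mask keeps this fragment
}

// Naive single pass: the alpha test lives inside the full leaf shader,
// so the full shader runs for every fragment in the stack.
function naiveFullShaderRuns(stack: LeafFragment[]): number {
  return stack.length;
}

// Two-pass version. Returns how often each shader ran for this pixel.
function prepassShaderRuns(stack: LeafFragment[]): { cheap: number; full: number } {
  // Pass 1: depth only. The cheap shader samples alpha, discards masked
  // fragments, and lets the nearest surviving depth reach the depth buffer.
  let storedDepth = Infinity;
  let cheap = 0;
  for (const frag of stack) {
    cheap += 1;
    if (frag.alphaPass && frag.depth < storedDepth) storedDepth = frag.depth;
  }

  // Pass 2: depth test "equal", depth writes off. Only the fragment that
  // owns the stored depth passes the fixed-function test and gets shaded.
  let full = 0;
  for (const frag of stack) {
    if (frag.depth === storedDepth) full += 1;
  }
  return { cheap, full };
}

// Ten stacked leaf cards; the alpha mask keeps every third one.
const leafStack: LeafFragment[] = Array.from({ length: 10 }, (_, i) => ({
  depth: 0.1 + 0.05 * i,
  alphaPass: i % 3 === 1,
}));

const naiveRuns = naiveFullShaderRuns(leafStack);
const twoPassRuns = prepassShaderRuns(leafStack);
console.log(naiveRuns, twoPassRuns.cheap, twoPassRuns.full);
```

In this toy stack the naive path runs the expensive shader for all ten fragments, while the two-pass path runs the cheap shader ten times and the expensive one once.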
Unreal, Frostbite, Decima, and the modern Rage and Dunia branches all do this. It's also the key reason masked materials are still cheaper than translucent materials in any of these engines.
Deep dives:
- Pettineo, To Early-Z, or Not To Early-Z (the canonical writeup of how discard interacts with Hi-Z and early-Z).
- Wihlidal, Optimizing the Graphics Pipeline with Compute (GDC 2016, the Frostbite depth-prepass and prepass-driven culling architecture).
- Sanders, Between Tech and Art: The Vegetation of Horizon Zero Dawn (GDC 2018, with Decima's two-pass foliage rendering).
- Persson, A couple of notes about Z (still one of the clearest explanations of depth-equal testing and prepass economics).
3. Aggressive LODs and octahedral imposters
The second-biggest win is never drawing the leaves at all when you don't have to. Foliage assets ship with several LOD tiers. The closest is the full mesh with individual leaf cards. At medium range, the leaves collapse into denser composite cards (a clump of 30 leaves becomes 1 textured card with the same silhouette). Beyond a distance threshold, the entire tree becomes an imposter: a small piece of geometry textured with pre-rendered views of the tree from many angles.
The modern imposter format is the octahedral imposter: a camera-facing quad (or a very low-poly proxy mesh) textured with an atlas of views captured from points on a sphere or hemisphere, laid out using octahedral mapping. At runtime, the shader picks the closest two or three pre-rendered views based on the camera direction and blends between them. The result is a few-triangle stand-in that looks 3D from any angle and can have proper shading, normal maps, and even wind animation. Ryan Brucks' implementation, originally a community plugin and now part of Unreal, is the reference. Microsoft Flight Simulator's billion-tree forests are essentially octahedral imposters everywhere except close to the camera.
The bigger structural win is that imposters are opaque or near-opaque at distance. The 30-leaf composite card is a single masked quad instead of 30 quads. The whole-tree imposter is a few faces, not thousands. Far-field overdraw collapses to almost nothing.
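The atlas lookup is easier to see in code. The sketch below uses the standard octahedral unwrap to turn a unit direction into atlas coordinates, then picks a base frame and blend fractions. It assumes +z is "up", a full-sphere capture, and views stored on a `framesPerAxis` by `framesPerAxis` grid; `impostorFrame` and its return shape are illustrative names, not taken from any particular engine.

```typescript
type Vec3 = [number, number, number];

// Map a unit direction to [0,1]^2 with the standard octahedral unwrap.
function octEncode(dir: Vec3): [number, number] {
  const [x, y, z] = dir;
  const invL1 = 1 / (Math.abs(x) + Math.abs(y) + Math.abs(z));
  let u = x * invL1;
  let v = y * invL1;
  if (z < 0) {
    // Fold the lower hemisphere outward over the diamond's edges.
    const foldedU = (1 - Math.abs(v)) * (u >= 0 ? 1 : -1);
    const foldedV = (1 - Math.abs(u)) * (v >= 0 ? 1 : -1);
    u = foldedU;
    v = foldedV;
  }
  return [u * 0.5 + 0.5, v * 0.5 + 0.5];
}

// Find the grid cell that contains the view direction. The captured views
// sit at the cell corners; blendX/blendY say how far the direction is from
// the (col, row) corner toward the (col + 1, row + 1) corner.
function impostorFrame(
  dirToCamera: Vec3,
  framesPerAxis: number, // assumed >= 2
): { col: number; row: number; blendX: number; blendY: number } {
  const [u, v] = octEncode(dirToCamera);
  const gx = u * (framesPerAxis - 1);
  const gy = v * (framesPerAxis - 1);
  const col = Math.min(Math.floor(gx), framesPerAxis - 2);
  const row = Math.min(Math.floor(gy), framesPerAxis - 2);
  return { col, row, blendX: gx - col, blendY: gy - row };
}

// Camera directly above the tree, 9x9 views in the atlas.
console.log(impostorFrame([0, 0, 1], 9));
```

A real shader would usually blend the three nearest views with barycentric weights; the bilinear fractions here are a simplification.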
Deep dives:
- Brucks, Octahedral Impostors (the canonical reference, with the math and the UE implementation).
- Halen, Octahedral Impostors in Unreal Engine (the integrated UE workflow).
- Häggström, Real-Time Rendering of Vegetation (a clean thesis covering LOD chains, imposters, and the math behind them).
- Crytek, SpeedTree integration in CryEngine 3 (GPU Gems 3, still the best primer on tree LOD chains).
4. Cluster and GPU-driven culling
Even with a depth prepass, the prepass itself has a cost: it still has to touch every triangle of every visible (or potentially visible) tree. Modern engines push that cost down with GPU-driven cluster culling, which throws away whole groups of triangles before the rasterizer ever sees them.
The pipeline looks like this: every mesh is pre-split into clusters of 64 or 128 triangles with a tight bounding box and a cone of normals. At render time, a compute shader walks the instance list, frustum-tests each instance, then frustum-tests every cluster of every visible instance, then Hi-Z occlusion-tests each surviving cluster against the previous frame's depth pyramid. Whole branches of trees that are hidden behind a hill or behind another tree get culled before any vertex shader runs. The output is a compact list of "draw these clusters" arguments fed straight into a single DrawIndirect call.
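A CPU-side sketch of the per-cluster tests makes the order of operations clearer. It is deliberately simplified and rests on several assumptions: bounding spheres instead of boxes, a normal-cone test that ignores the cluster's spatial extent, and a caller-supplied `hiZFarthestDepth` callback standing in for a real depth-pyramid lookup. All type and function names are hypothetical.

```typescript
type Vec3 = [number, number, number];

const dot = (a: Vec3, b: Vec3): number => a[0] * b[0] + a[1] * b[1] + a[2] * b[2];

function directionTo(from: Vec3, to: Vec3): Vec3 {
  const d: Vec3 = [to[0] - from[0], to[1] - from[1], to[2] - from[2]];
  const len = Math.hypot(d[0], d[1], d[2]) || 1;
  return [d[0] / len, d[1] / len, d[2] / len];
}

// A point p is inside the plane when dot(normal, p) + d >= 0.
interface Plane { normal: Vec3; d: number }

interface Cluster {
  center: Vec3;
  radius: number;           // bounding sphere (simplification of the box)
  coneAxis: Vec3;           // average triangle normal, unit length
  coneSinHalfAngle: number; // sin of the normal cone's half-angle
}

// Returns the indices of clusters that survive all three tests. On the GPU
// this list would become the indirect draw arguments.
function cullClusters(
  clusters: Cluster[],
  cameraPos: Vec3,
  frustum: Plane[],
  viewDepth: (p: Vec3) => number,          // distance along the view axis
  hiZFarthestDepth: (c: Cluster) => number, // farthest occluder depth over the cluster's footprint
): number[] {
  const survivors: number[] = [];
  clusters.forEach((c, index) => {
    // 1. Frustum: sphere entirely outside any plane means invisible.
    const outside = frustum.some((pl) => dot(pl.normal, c.center) + pl.d < -c.radius);
    if (outside) return;

    // 2. Normal cone: if even the most camera-facing normal in the cone
    //    points away from the camera, every triangle is back-facing.
    const viewDir = directionTo(cameraPos, c.center);
    if (dot(c.coneAxis, viewDir) > c.coneSinHalfAngle) return;

    // 3. Hi-Z: the cluster's nearest point is behind everything already
    //    recorded in the depth pyramid for its screen footprint.
    const nearestDepth = viewDepth(c.center) - c.radius;
    if (nearestDepth > hiZFarthestDepth(c)) return;

    survivors.push(index);
  });
  return survivors;
}
```

The tests are ordered from cheapest to most expensive, so most clusters are rejected before the depth pyramid is ever sampled.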
This is what makes a forest of 10,000 trees render in milliseconds instead of seconds. Ubisoft's Assassin's Creed Unity talk introduced this pipeline in production form (20-40% triangles culled, 30-80% shadow triangles culled, 10x more instances on screen than the previous generation), and Wihlidal's Frostbite talk took it further. UE5 Nanite is the visible end-state of this trajectory: cluster culling all the way down to the pixel.
Deep dives:
- Haar and Aaltonen, GPU-Driven Rendering Pipelines (SIGGRAPH 2015, the foundational Assassin's Creed Unity talk).
- Wihlidal, Optimizing the Graphics Pipeline with Compute (GDC 2016, Frostbite's GPU-driven prepass).
- Karis, Stubbe, Wihlidal, A Deep Dive into Nanite Virtualized Geometry (SIGGRAPH 2021, cluster culling at meshlet granularity).
- Liktor, Geometry Rendering Pipeline Architecture at Activision (the Call of Duty version of cluster culling, 2021).
5. Front-to-back instance and cluster sorting
Once the prepass is doing its job, order starts to matter. The prepass writes depth, but only for fragments that pass the alpha test. If you draw the back of the forest first and the front last, every front fragment overwrites a back fragment, and the prepass shader still runs for the back. If you draw front-to-back, each successive draw fills more of the depth buffer with nearer depth values, and Hi-Z rejects more and more of the back fragments before the shader runs.
This is why almost every modern engine sorts foliage instances by distance to camera before issuing the prepass. The sort is cheap (a few hundred thousand instances on the GPU using a radix sort), and it turns the prepass itself into a self-pruning operation. Cluster-level culling sorts at the meshlet granularity for the same reason. The depth prepass and front-to-back ordering are the kind of pair where each is good and the combination is great.
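The effect of draw order can be sketched for a single pixel. The example below sorts instances by squared distance to the camera and counts prepass shader runs under two simplifying assumptions: the hardware rejects any fragment that lies behind the stored depth before its shader runs, and every fragment passes the alpha test. `FoliageInstance`, `sortFrontToBack`, and `prepassRunsForPixel` are illustrative names, and a plain CPU sort stands in for the GPU radix sort.

```typescript
type Vec3 = [number, number, number];

interface FoliageInstance { id: number; position: Vec3 }

function distanceSq(a: Vec3, b: Vec3): number {
  const dx = a[0] - b[0];
  const dy = a[1] - b[1];
  const dz = a[2] - b[2];
  return dx * dx + dy * dy + dz * dz;
}

// Nearest instance first. Squared distance is enough for ordering.
function sortFrontToBack(instances: FoliageInstance[], camera: Vec3): FoliageInstance[] {
  return [...instances].sort(
    (a, b) => distanceSq(a.position, camera) - distanceSq(b.position, camera),
  );
}

// Count prepass shader runs for one pixel covered by every instance.
function prepassRunsForPixel(depthsInDrawOrder: number[]): number {
  let storedDepth = Infinity;
  let runs = 0;
  for (const depth of depthsInDrawOrder) {
    if (depth >= storedDepth) continue; // rejected before the shader runs
    runs += 1;
    storedDepth = depth; // assumption: the alpha test passes
  }
  return runs;
}

const cameraPos: Vec3 = [0, 0, 0];
// Five trees along one view ray, submitted back-to-front.
const backToFront: FoliageInstance[] = [5, 4, 3, 2, 1].map((z, id) => ({
  id,
  position: [0, 0, z],
}));
const frontToBack = sortFrontToBack(backToFront, cameraPos);

const depthOf = (inst: FoliageInstance): number => Math.sqrt(distanceSq(inst.position, cameraPos));
const unsortedRuns = prepassRunsForPixel(backToFront.map(depthOf));
const sortedRuns = prepassRunsForPixel(frontToBack.map(depthOf));
console.log(unsortedRuns, sortedRuns);
```

In this worst-case ordering the unsorted prepass runs its shader for all five trees, while the sorted one runs it only for the nearest.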
Deep dives:
- Persson, Depth in-depth (the architectural notes on depth-buffer ordering, prepass economics, and Hi-Z behavior).
- Giesen, A trip through the graphics pipeline (the deep technical explanation of how Hi-Z reject rate depends on draw order).
- Wihlidal, Optimizing the Graphics Pipeline with Compute (GDC 2016, includes Frostbite's GPU-side instance sort).
6. Dithered LOD transitions and hashed alpha
The other big trap is fading. The naive way to transition between two LODs (or to fade an instance in or out as the camera approaches) is alpha blending. But blended geometry normally doesn't write depth, which kicks every fading tree into the slow translucent path and breaks the prepass. The solution is to keep the geometry in the masked path and do the fade inside the alpha test.
Two main techniques:
- Dithered LOD transitions sample a 4x4 or 8x8 Bayer pattern (or a screen-space blue-noise texture) and use that as a per-pixel cutoff modifier. A tree at 50% blend has a checkerboard of pixels surviving; the missing pixels are filled in by the next LOD's complementary checkerboard. TAA resolves the checker into a smooth blend across two or three frames. Cheap, stable, blends with everything else in the engine.
- Hashed alpha testing (Wyman & McGuire, I3D 2017) replaces the fixed 0.5 alpha threshold with a per-pixel hashed threshold in [0,1). Distant alpha geometry that would normally vanish entirely (because the mipmapped alpha drops below 0.5) keeps a stable scattering of surviving pixels. TAA again does the cleanup.
Both techniques keep foliage in the opaque/masked path where the depth prepass works, so you don't pay the full translucent rendering cost just to fade something in. Alpha to coverage is the MSAA-era cousin of the same idea: convert alpha into a sub-pixel coverage mask, get partial transparency without leaving the masked path. The catch is that A2C only really shines with MSAA, which most modern deferred renderers no longer use.
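A dithered LOD transition fits in a few lines. The sketch below uses the standard 4x4 Bayer matrix as the per-pixel threshold and shows the complementary test for the outgoing LOD, so that every pixel is covered by exactly one of the two. The function names and the `blend` convention (0 means only the outgoing LOD, 1 means only the incoming one) are assumptions made for illustration.

```typescript
// Standard 4x4 ordered-dither (Bayer) matrix, stored row by row.
const BAYER_4X4: number[] = [
   0,  8,  2, 10,
  12,  4, 14,  6,
   3, 11,  1,  9,
  15,  7, 13,  5,
];

// Per-pixel threshold in (0, 1), repeating every 4 pixels on each axis.
function ditherThreshold(px: number, py: number): number {
  return (BAYER_4X4[(py & 3) * 4 + (px & 3)] + 0.5) / 16;
}

// The incoming LOD keeps a pixel once the blend factor exceeds its threshold.
function incomingLodSurvives(blend: number, px: number, py: number): boolean {
  return blend > ditherThreshold(px, py);
}

// The outgoing LOD keeps exactly the pixels the incoming LOD discards.
function outgoingLodSurvives(blend: number, px: number, py: number): boolean {
  return !incomingLodSurvives(blend, px, py);
}

// Count surviving pixels of the incoming LOD in one 4x4 tile.
function incomingCoverage(blend: number): number {
  let kept = 0;
  for (let y = 0; y < 4; y++) {
    for (let x = 0; x < 4; x++) {
      if (incomingLodSurvives(blend, x, y)) kept += 1;
    }
  }
  return kept;
}

console.log(incomingCoverage(0.25), incomingCoverage(0.5), incomingCoverage(1));
```

Hashed alpha testing has the same shape: the Bayer lookup is replaced by a hash of the fragment's coordinates. The hash from the Wyman and McGuire paper is not reproduced here.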
Deep dives:
- Wyman and McGuire, Hashed Alpha Testing (I3D 2017, the canonical hashed-alpha paper).
- Castaño, Computing Alpha Mipmaps (The Witness blog, the right way to mip alpha-tested textures so distant trees don't vanish).
- Yuksel, Alpha Distribution for Alpha Testing (a more recent improvement on alpha mipmaps).
- NVIDIA, Anti-Aliased Alpha Testing (a survey of A2C, hashed alpha, and dithered alternatives).
7. Reducing shading cost on masked pixels
Even with a perfect prepass and perfect culling, you still have to shade every visible foliage pixel once. Engines reduce that cost too:
- Cheaper BRDF. Foliage is matte and doesn't really need a full Cook-Torrance specular path. A wrapped-Lambertian diffuse plus a one-line specular approximation is plenty.
- Lower-frequency normal maps. Leaves are noisy already. A 256x256 normal map looks the same as a 1024x1024 one at typical viewing distances and saves bandwidth.
- No parallax, no anisotropy, no clearcoat. The PBR feature menu gets switched off for leaves.
- Two-sided thin transmission instead of full subsurface scattering. Leaves transmit light from behind, but you can fake it with a single back-light dot product.
- Reduced-rate shading in some engines. Foliage gets shaded at 1/4 or 1/2 pixel rate and upscaled. TAA's jitter and accumulation hide the resampling.
- Skip detail textures and decals on masked materials by default.
Each of these is a small win individually. Combined, they can make a masked-foliage shader run 2-3x faster than the same material with the full PBR feature set switched on.
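The two lighting shortcuts from the list, wrapped-Lambertian diffuse and a single dot product for back-lighting, can be sketched as follows. This is one common formulation rather than any specific engine's shader; the parameter names and the default `wrap` and `transmission` values are arbitrary illustration values, and the specular term is left out.

```typescript
type Vec3 = [number, number, number];

const dot3 = (a: Vec3, b: Vec3): number => a[0] * b[0] + a[1] * b[1] + a[2] * b[2];

// Wrapped Lambert: lets light "wrap" past the terminator. wrap = 0 gives
// plain Lambert; larger values soften the falloff.
function wrappedDiffuse(nDotL: number, wrap: number): number {
  return Math.max(0, (nDotL + wrap) / (1 + wrap));
}

// Cheap thin-leaf transmission: light arriving on the back face leaks
// through, scaled by a strength factor.
function backTransmission(nDotL: number, strength: number): number {
  return strength * Math.max(0, -nDotL);
}

// lightDir points from the surface toward the light; both vectors unit length.
function shadeLeaf(
  normal: Vec3,
  lightDir: Vec3,
  albedo: Vec3,
  wrap = 0.5,         // illustration value
  transmission = 0.4, // illustration value
): Vec3 {
  const nDotL = dot3(normal, lightDir);
  const lit = wrappedDiffuse(nDotL, wrap) + backTransmission(nDotL, transmission);
  return [albedo[0] * lit, albedo[1] * lit, albedo[2] * lit];
}

// Leaf lit from directly behind: no diffuse, only transmitted light.
console.log(shadeLeaf([0, 0, 1], [0, 0, -1], [1, 1, 1]));
```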
Deep dives:
- Lagarde and de Rousiers, Moving Frostbite to Physically Based Rendering 3.0, sections on foliage and translucency (SIGGRAPH 2014, the canonical PBR adjustments for thin two-sided materials).
- Jimenez, Next Generation Character Rendering (the wrapped-Lambertian and back-translucency math, originally for skin but ported widely to leaves).
- Sanders, Between Tech and Art: The Vegetation of Horizon Zero Dawn (GDC 2018, with Decima's foliage shader simplifications).
8. Visibility buffers and Nanite for masked materials
The cleanest answer to overdraw is to decouple shading from rasterization entirely. A visibility buffer rasterizes geometry into a thin buffer (just triangle ID and instance ID per pixel), then runs the full material as a deferred pass that reads the visibility buffer and shades each pixel exactly once. There is no overdraw at the shading stage, by construction. Burns and Hunt's 2013 paper introduced this; UE5 Nanite is the production-quality realization, and it has handled masked foliage since the programmable rasterizer arrived in UE 5.1.
Nanite's twist is that it combines the visibility buffer with cluster-level virtualized geometry and a software rasterizer for very small triangles, which keeps rasterization overdraw bounded. Masked materials in Nanite need the "programmable raster" feature: the alpha test runs during the visibility-buffer pass, but the material shading still happens once per visible pixel in the deferred resolve. The result is that very dense Nanite foliage, once notorious for being slow on masked materials, is now far closer to opaque in cost, because the overdraw at shading time is zero. There's still a tradeoff: it's often better to push high-poly opaque tree geometry through Nanite than to keep low-poly masked-card trees, because the masked path adds programmable-raster cost.
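The core idea can be shown with a tiny software model: write a packed ID per pixel during rasterization, then shade each covered pixel once. The 20/12 bit split between instance and triangle ID is an arbitrary choice for this sketch (real engines pick their own layouts), and `VisFragment`, `rasterizeVisibility`, and `countMaterialRuns` are hypothetical names.

```typescript
// Hypothetical layout: upper 20 bits instance ID, lower 12 bits triangle ID.
const TRIANGLE_BITS = 12;
const TRIANGLE_MASK = (1 << TRIANGLE_BITS) - 1;
// Sentinel for "nothing drawn"; it reserves the highest ID pair.
const EMPTY_PIXEL = 0xffffffff;

function packVisibility(instanceId: number, triangleId: number): number {
  // ">>> 0" keeps the result an unsigned 32-bit value.
  return ((instanceId << TRIANGLE_BITS) | (triangleId & TRIANGLE_MASK)) >>> 0;
}

function unpackVisibility(packed: number): { instanceId: number; triangleId: number } {
  return { instanceId: packed >>> TRIANGLE_BITS, triangleId: packed & TRIANGLE_MASK };
}

interface VisFragment {
  pixel: number;      // flattened pixel index
  depth: number;      // smaller = closer
  instanceId: number;
  triangleId: number;
  alphaPass: boolean; // result of the alpha test for masked materials
}

// Visibility pass: depth-tested write of IDs only. The alpha test runs here,
// which is the role programmable raster plays for masked materials.
function rasterizeVisibility(fragments: VisFragment[], pixelCount: number): Uint32Array {
  const depthBuffer = new Float64Array(pixelCount).fill(Infinity);
  const visBuffer = new Uint32Array(pixelCount).fill(EMPTY_PIXEL);
  for (const f of fragments) {
    if (!f.alphaPass) continue;
    if (f.depth >= depthBuffer[f.pixel]) continue;
    depthBuffer[f.pixel] = f.depth;
    visBuffer[f.pixel] = packVisibility(f.instanceId, f.triangleId);
  }
  return visBuffer;
}

// Resolve pass: the full material runs once per covered pixel, no matter
// how many fragments were rasterized on top of it.
function countMaterialRuns(visBuffer: Uint32Array): number {
  let runs = 0;
  for (let i = 0; i < visBuffer.length; i++) {
    if (visBuffer[i] !== EMPTY_PIXEL) runs += 1;
  }
  return runs;
}
```

However many leaf fragments land on a pixel, the resolve pass sees one ID there and shades it once.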
Deep dives:
- Burns and Hunt, The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading (JCGT 2013, the original paper).
- Karis, Stubbe, Wihlidal, A Deep Dive into Nanite Virtualized Geometry (SIGGRAPH 2021, the production architecture).
- Epic, Nanite GPU Driven Materials (GDC 2024, with the masked-material and programmable-raster pipeline).
- Notes from "Nanite GPU Driven Materials" (a clean third-party walkthrough of the same talk).
9. Separate shadow representations for foliage
Shadow maps for alpha-tested foliage are one of the most expensive shadow problems in an open-world frame. Every cascade needs its own depth-only pass, every one of those passes runs the alpha test, and the cost stacks fast across 4 cascades and dozens of light frusta. So most engines don't render foliage shadows the same way they render foliage color.
Common substitutions:
- Mesh distance-field shadows (UE's distance field shadows, custom engines). Each mesh has a precomputed signed distance field. A short cone trace through the SDF gives a soft shadow without ever touching the alpha-tested mesh. Particularly nice for trees because the SDF captures the canopy silhouette as a single solid blob and ignores the per-leaf detail.
- Lower-resolution cascades for foliage. Foliage shadows go into a half-resolution slice and get up-sampled with edge-aware filtering. The eye doesn't notice the resolution drop because the shadows are already soft.
- Alpha-to-coverage shadow maps with MSAA. On engines that still have MSAA in their shadow path, A2C gives smooth-edged foliage shadows without the full alpha test cost.
- Capsule shadows for the trunk and big branches, distance-field for the canopy, full alpha test only at the closest cascade. Different representations for different distances, blended in the lighting pass.
- WPO disable distance. Wind-driven World Position Offset is killed beyond a threshold so that cached shadow data stays valid across frames. UE's Virtual Shadow Maps lean heavily on this.
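The distance-field option from the list above can be sketched with a sphere-traced soft shadow. This follows the widely used "closest miss" cone approximation rather than any particular engine's implementation. An analytic sphere stands in for the baked canopy SDF (a real engine would sample a 3D texture), and `softness`, `maxDist`, and the step count are arbitrary illustration values.

```typescript
type Vec3 = [number, number, number];

// Stand-in canopy: a solid sphere. Negative inside, positive outside.
function canopySdf(p: Vec3, center: Vec3, radius: number): number {
  return Math.hypot(p[0] - center[0], p[1] - center[1], p[2] - center[2]) - radius;
}

// March from the shaded point toward the light. Each step advances by the
// distance to the nearest surface; the narrowest "cone" that still clears
// the canopy determines how soft the shadow is. Returns 0 (shadowed) to 1 (lit).
function softShadow(
  start: Vec3,
  toLight: Vec3, // unit vector toward the light
  sdf: (p: Vec3) => number,
  softness = 8,
  maxDist = 50,
): number {
  let visibility = 1;
  let t = 0.05; // small offset to avoid self-shadowing
  for (let step = 0; step < 64 && t < maxDist; step++) {
    const p: Vec3 = [
      start[0] + toLight[0] * t,
      start[1] + toLight[1] * t,
      start[2] + toLight[2] * t,
    ];
    const dist = sdf(p);
    if (dist < 1e-3) return 0; // the ray entered the canopy
    visibility = Math.min(visibility, (softness * dist) / t);
    t += dist;
  }
  return visibility;
}

// Canopy of radius 2 floating 5 units above the ground, light straight up.
const canopy = (p: Vec3): number => canopySdf(p, [0, 5, 0], 2);
const lightUp: Vec3 = [0, 1, 0];

const underCanopy = softShadow([0, 0, 0], lightUp, canopy);
const nearEdge = softShadow([2.5, 0, 0], lightUp, canopy);
const farAway = softShadow([20, 0, 0], lightUp, canopy);
console.log(underCanopy, nearEdge, farAway);
```

The point under the canopy is fully shadowed, the distant point is fully lit, and the point just outside the silhouette lands in between, which is the penumbra.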
Deep dives:
- Epic, Distance Field Soft Shadows in Unreal Engine (the canonical UE reference for mesh DF shadows).
- Wright, Lumen: Real-Time Global Illumination in Unreal Engine 5 (SIGGRAPH 2022, with Lumen's mesh-SDF integration for foliage).
- Epic, Virtual Shadow Maps (the modern UE5 shadow architecture, with the foliage-specific WPO disable and caching rules).
- Persson, Practical Cascaded Shadow Maps (the still-canonical CSM reference with notes on alpha-tested casters).
10. Wind, animation, and shadow caching
A subtle related problem: most foliage moves. Wind-driven vertex animation (World Position Offset in UE, equivalent in Frostbite and Decima) means the foliage geometry isn't stable from frame to frame, which breaks shadow caching and reprojection. Modern engines fight this in three ways:
- Cap WPO at distance. Beyond a threshold, the wind animation amplitude smoothly goes to zero. The eye can't see the sway that far away anyway, and the shadow caches stay valid.
- Bake the wind into the cluster bounds. Cluster bounding boxes are inflated by the maximum WPO offset so culling stays conservative without re-uploading per frame.
- Per-instance phase offsets. Identical trees use a per-instance random seed to offset the wind phase, so a forest doesn't sway in lockstep without paying for unique animation per tree.
This is the kind of detail that doesn't show up in technique lists but is the difference between a 3 ms forest and a 9 ms forest in practice.
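The three wind tricks above combine into a few small functions. The sketch below is illustrative only: the fade distances, the integer hash used for the phase, and every function name are assumptions, and a single sine wave stands in for a real wind model.

```typescript
// Wind amplitude multiplier: 1 up close, smoothly falling to 0 between
// fadeStart and fadeEnd (smoothstep), and exactly 0 beyond fadeEnd.
function windAmplitude(distance: number, fadeStart: number, fadeEnd: number): number {
  const t = Math.min(1, Math.max(0, (distance - fadeStart) / (fadeEnd - fadeStart)));
  return 1 - t * t * (3 - 2 * t);
}

// Per-instance phase in [0, 2π) from an integer seed, so identical trees
// do not sway in lockstep. The hash constant is arbitrary.
function instancePhase(seed: number): number {
  const h = Math.imul(seed ^ (seed >>> 16), 0x45d9f3b) >>> 0;
  return (h / 4294967296) * 2 * Math.PI;
}

// Sway offset along one axis for a given instance at a given time.
function windOffset(
  time: number,
  distance: number,
  seed: number,
  maxOffset: number,
  fadeStart = 60, // illustration value, metres
  fadeEnd = 90,   // illustration value, metres
): number {
  const amplitude = maxOffset * windAmplitude(distance, fadeStart, fadeEnd);
  return amplitude * Math.sin(time + instancePhase(seed));
}

// Conservative culling bound: grow the radius once by the largest offset
// the wind can ever apply, instead of recomputing bounds every frame.
function inflatedRadius(radius: number, maxOffset: number): number {
  return radius + maxOffset;
}

console.log(windOffset(1.0, 10, 1, 0.3), windOffset(1.0, 10, 2, 0.3), windOffset(1.0, 120, 1, 0.3));
```

Beyond `fadeEnd` the offset is exactly zero, which is what lets cached shadow data for distant trees stay valid from frame to frame.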
Deep dives:
- Sanders, Between Tech and Art: The Vegetation of Horizon Zero Dawn (GDC 2018, with the Decima wind-animation pipeline and shadow caching).
- McAuley, Rendering the World of Far Cry 4 (GDC 2015, includes the wind-grid sampling for vegetation).
- Epic, Foliage and Virtual Shadow Maps (the UE5 community guidance on WPO disable distance and VSM caching).
11. The combined math
None of these tricks is a silver bullet. The interesting thing is what happens when you stack them:
- A naive masked-foliage pass on a dense forest scene measures 8-15x effective overdraw. Every visible pixel runs the leaf shader 8 to 15 times.
- Add a depth prepass and the main pass drops to ~1x overdraw, but the prepass itself still touches everything.
- Add front-to-back sorting and the prepass starts pruning itself.
- Add cluster-level GPU culling and the prepass touches only what could possibly be visible.
- Add LOD chains and imposters and the count of visible quads drops by an order of magnitude beyond 30 m.
- Add the visibility-buffer / Nanite path and the shading stage truly runs once per pixel even on dense overlap.
- Add distance-field shadows and the shadow cost stops scaling with the alpha test.
The headline result, repeated across the talks referenced above: effective overdraw collapses from 8-15x to 1-2x, and total foliage frame cost falls by 4-6x in a dense forest scene. That's a large part of why modern open-world games can render forests at 60+ fps on consumer hardware.

12. What this means for the browser
Most of this stack maps cleanly onto WebGPU. We've already shipped GPU-driven culling, indirect dispatch, Hi-Z occlusion, and front-to-back sorted prepasses in the open-world browser engine. The depth prepass for masked geometry is straightforward: WebGPU supports depth-equal testing and discard in fragment shaders, with the same caveats about early-Z. Octahedral imposters port mechanically: the math is just sphere-to-octahedron unwrapping and atlas indexing.
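As a rough illustration of how little configuration the two-pass setup needs, the sketch below shows the two depth states and a prepass fragment shader. `DepthStencilSketch` is a local stand-in that mirrors three field names from WebGPU's `GPUDepthStencilState` (`format`, `depthWriteEnabled`, `depthCompare`) so the snippet compiles without WebGPU type definitions. The shader and binding names are made up, and the WGSL string has not been run against a device here.

```typescript
// Local stand-in for the subset of GPUDepthStencilState used in this sketch.
interface DepthStencilSketch {
  format: "depth32float";
  depthWriteEnabled: boolean;
  depthCompare: "less" | "equal";
}

// Pass 1: depth only. Writes depth for fragments that survive the alpha test.
const prepassDepthState: DepthStencilSketch = {
  format: "depth32float",
  depthWriteEnabled: true,
  depthCompare: "less",
};

// Pass 2: full material. Only fragments matching the prepass depth are
// shaded, and depth is left untouched.
const mainPassDepthState: DepthStencilSketch = {
  format: "depth32float",
  depthWriteEnabled: false,
  depthCompare: "equal",
};

// Prepass fragment stage: one texture sample, one discard, no color output.
// The 0.5 cutoff is the conventional default; names are illustrative.
const prepassFragmentWgsl = /* wgsl */ `
@group(0) @binding(0) var leafSampler: sampler;
@group(0) @binding(1) var leafTexture: texture_2d<f32>;

@fragment
fn fs_prepass(@location(0) uv: vec2<f32>) {
  let alpha = textureSample(leafTexture, leafSampler, uv).a;
  if (alpha < 0.5) {
    discard;
  }
}
`;

console.log(prepassDepthState, mainPassDepthState, prepassFragmentWgsl.length);
```

In a real pipeline these objects would be passed as the `depthStencil` member of each render pipeline descriptor, with both passes using the same vertex transform so the depth values match bit for bit.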
The harder pieces are the modern ones. A visibility buffer in WebGPU means writing a 32-bit triangle ID into a render target and resolving materials in a fullscreen compute pass; the building blocks exist but the orchestration is involved. Mesh distance-field shadows want a 3D texture per asset and a short cone trace, both of which are in WebGPU's reach. Hashed alpha and dithered LODs are one shader function each.
The path forward for browser foliage is the same as for everything else in this stack: ship the cheap, robust pieces first (prepass, sorted instances, LODs, imposters, hashed alpha) and add the heavy machinery (visibility buffer, mesh-SDF shadows) on top. The browser hardware floor is finally high enough that there is no architectural reason a WebGPU forest shouldn't look and run like a console one. There are just engineering reasons, and engineering reasons are the kind we like.
Further reading across the whole stack
If you want one source that pulls all of this together, the SIGGRAPH "Advances in Real-Time Rendering in Games" archive (advances.realtimerendering.com) has the canonical foliage and GPU-driven-rendering talks going back to 2014. Adrian Courrèges' graphics studies include frame-by-frame breakdowns of shipped games such as GTA V and DOOM (2016) that show many of the passes discussed here in production order. For the alpha-test math specifically, Chris Wyman's research page has the hashed-alpha and stochastic-transparency papers with reference shaders. And the Real-Time Rendering, 4th edition chapters on transparency, sampling, and depth handling are still the textbook starting point.