Tiled deferred shading tricks

Last update: 26 May 2014.

This post covers some of my humble findings from implementing a tiled deferred renderer with MSAA support.
I will also update it occasionally.

Recap:
————————–
Deferred shading is a well-known rendering technique: we first render the scene into a ‘G-Buffer’ containing geometry and material data (e.g. position/depth, normals, surface glossiness, etc.) and then compute all the lighting and shading in screen space [1].

Pros:
– reduced shading overdraw: only one complex shader invocation per pixel (plus additional invocations per pixel inside each light’s influence area); you can achieve the same with a Z-prepass in forward rendering, but it will cost you 2x the draw calls.
– lighting is decoupled from materials/objects.
– a G-Buffer is required anyway for many advanced effects that are hard to achieve by other means (e.g. SSAO, SSR).

Cons:
– doesn’t handle semi-transparency; you have to draw transparent stuff forward-style.
– can be bandwidth-heavy and requires tight G-Buffer packing (Crysis 3 is a good example [2]); stencil culling is also extremely useful for selecting only the pixels actually affected by a light. The less you repaint your render targets, the better.
– an overly compressed G-Buffer can exhibit artifacts (e.g. Unity).

– difficult to integrate with MSAA; many deferred games just use post-AA (e.g. FXAA), but its quality is far from MSAA due to the lack of sub-pixel data. A more recent approach, also used in Crysis 3 [2][3], is to do edge detection into the stencil buffer and then run per-sample shading only on those edge pixels and plain per-pixel shading everywhere else; however, this approach suffers from the poor screen-space locality of edge pixels.

Tiled shading is an approach where we divide the screen into tiles that are bigger than a pixel (e.g. 8×8 or 16×16), test which lights affect each tile, and then shade every pixel with only the lights that belong to its tile [4]. Simply put: ‘divide and conquer’. However, for good culling quality, access to the depth buffer is necessary.
Tiled shading can be implemented in multiple ways and used with both forward and deferred approaches; notable examples are BF3 tiled deferred (with code!) [5], Forward+ [6], and Clustered shading [7].
I’ll divide all these approaches into two groups: tiled forward and tiled deferred.
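
To make the idea concrete, here is a minimal sketch of such a per-tile loop as a compute shader. All of the names here (g_Lights, g_LightCount, LightIntersectsTile, ShadeLight, BLOCKSIZE) are hypothetical placeholders rather than code from any of the referenced implementations:

#define BLOCKSIZE 16
#define MAX_LIGHTS_PER_TILE 256

StructuredBuffer<Light> g_Lights;   // Light struct defined elsewhere
uint g_LightCount;                  // in a constant buffer
RWTexture2D<float4> renderTarget;

groupshared uint tileLightCount;
groupshared uint tileLightIndices[MAX_LIGHTS_PER_TILE];

[numthreads(BLOCKSIZE, BLOCKSIZE, 1)]
void TiledShadingCS(uint3 id : SV_DispatchThreadID, uint3 groupId : SV_GroupID,
                    uint groupIndex : SV_GroupIndex)
{
    if (groupIndex == 0)
        tileLightCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Culling: each thread tests a slice of the light list against this tile's bounds
    for (uint i = groupIndex; i < g_LightCount; i += BLOCKSIZE * BLOCKSIZE)
    {
        if (LightIntersectsTile(g_Lights[i], groupId.xy)) // hypothetical helper
        {
            uint slot;
            InterlockedAdd(tileLightCount, 1, slot);
            tileLightIndices[slot] = i;
        }
    }
    GroupMemoryBarrierWithGroupSync();

    // Shading: each thread shades its own pixel with only the lights in its tile
    float3 color = 0;
    for (uint j = 0; j < tileLightCount; ++j)
        color += ShadeLight(g_Lights[tileLightIndices[j]], id.xy); // hypothetical helper
    renderTarget[id.xy] = float4(color, 1);
}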

Tiled forward pros:
– decouples lighting from materials and objects (like deferred).
– works with MSAA.
– can work with semi-transparency!
– each drawcall can implement its own lighting model (unlike deferred, where we have to fit all lighting models in one shader).

Tiled forward cons:
– requires a Z-prepass for good light culling (a lot of false positives otherwise).
– heavy shaders can be slow on small triangles [8][9].

Tiled deferred pros:
– reduces bandwidth cost by replacing the old multi-pass light accumulation.
– light can be accumulated at higher precision in a single register (in classic deferred you usually accumulate into 16- or 10-bit textures, because a full 32-bit float target is too heavy).
– can reuse the same per-tile data to shade transparency the tiled forward way.

Tiled deferred cons:
– still hard to do MSAA.
– still have to be careful with G-Buffer size.
————————-

Now back to the topic. I decided to develop a tiled deferred renderer with MSAA, so I packed my G-Buffer (best fit normals are your best friend [10]) and arrived in DirectCompute land.

The most fun thing is that you can actually perform ALL the rendering in a single compute shader once you have the G-Buffer. Light culling, shading, edge detection, AA resolve and everything else can fit into one CS, which is very nice, because we can reuse a lot of data without reloading it at every stage. Compute shaders are beautiful, and I really recommend looking into the BF3 paper [5] to see how you can switch from per-pixel to per-light processing and generally process data in any pattern imaginable.
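
As a very rough outline (just the shape of such a shader, not actual code from BF3 or anywhere else), the stages can live in one kernel with group barriers between them:

[numthreads(BLOCKSIZE, BLOCKSIZE, 1)]
void DeferredCS(uint3 id : SV_DispatchThreadID, uint groupIndex : SV_GroupIndex)
{
    // 1. Per-pixel: each thread loads its pixel of the G-Buffer and helps compute
    //    the tile's depth bounds in groupshared memory.
    GroupMemoryBarrierWithGroupSync();

    // 2. Per-light: the same threads now walk the light list and cull it against
    //    the tile, building a groupshared list of visible lights.
    GroupMemoryBarrierWithGroupSync();

    // 3. Per-pixel again: shade each pixel with the tile's lights and detect
    //    which pixels are edges that need per-sample shading.
    GroupMemoryBarrierWithGroupSync();

    // 4. Per-sample: redistribute the edge samples across all threads (see below),
    //    accumulate, and write the final anti-aliased color.
}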

Another must-read is Andrew Lauritzen’s “Deferred Rendering for Current and Future Rendering Pipelines” [8].

There is also very helpful sample code: http://visual-computing.intel-research.net/art/publications/deferred_rendering/

Lauritzen proposed an interesting way of dealing with MSAA: instead of branching on each pixel to select per-pixel or per-sample (edge) shading, you find all the edge pixels, collect them into an array and then distribute the processing of that array across all threads. This way it is more parallel: first all threads shade per-pixel, then they all process the remaining edge samples together.
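
Here is a rough sketch of that redistribution; isEdgePixel, PackCoord/UnpackCoord, MSAA_SAMPLES and the rest are hypothetical placeholders, and Lauritzen's sample does it in more detail:

groupshared uint edgeCount;                          // zeroed by thread 0 at the start
groupshared uint edgePixels[BLOCKSIZE * BLOCKSIZE];  // packed coordinates of edge pixels

// Per-pixel pass: each thread shades sample 0 of its own pixel and, if it detects
// an edge, appends the pixel to the shared list:
if (isEdgePixel)
{
    uint slot;
    InterlockedAdd(edgeCount, 1, slot);
    edgePixels[slot] = PackCoord(pixelCoord);
}
GroupMemoryBarrierWithGroupSync();

// Per-sample pass: the remaining (MSAA_SAMPLES - 1) samples of every edge pixel are
// spread across all threads of the group, regardless of which pixel they belong to:
uint workCount = edgeCount * (MSAA_SAMPLES - 1);
for (uint w = groupIndex; w < workCount; w += BLOCKSIZE * BLOCKSIZE)
{
    uint2 coord     = UnpackCoord(edgePixels[w / (MSAA_SAMPLES - 1)]);
    uint  sampleIdx = 1 + w % (MSAA_SAMPLES - 1);
    // shade sample 'sampleIdx' of pixel 'coord' here and accumulate the result
}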

Now onto my tricks.

Trick 1: resolve in the same CS.

Lauritzen’s method of redistributing per-sample shading is great; however, where do we output these sample values? Because we try to distribute samples uniformly across all threads in a thread group, each thread may now output into arbitrary samples of arbitrary pixels. In his sample code, Lauritzen addresses this with a ‘Flat’ framebuffer of size GBufferWidth * GBufferHeight * MSAASamples and element size uint2 (RG+BA, 16 bits per channel), which is resolved later. However, this can be quite costly.

Instead, we can allocate a small array for each thread group, like:

groupshared uint2 msaaAccumBuffer[BLOCKSIZE * BLOCKSIZE];

When you do per-pixel shading, you simply save the result there:

msaaAccumBuffer[groupIndex] = PackColor(color);

However, for each edge pixel found, where per-sample shading is required, you output a scaled value:

float weight = 1.0 / numSamples;
msaaAccumBuffer[groupIndex] = PackColor(color * weight);

And when you process the redistributed edge samples, you also scale them and accumulate them into this array:

uint2 packed = PackColor(color * weight);
InterlockedAdd(msaaAccumBuffer[g.y].x, packed.x, tmp1);
InterlockedAdd(msaaAccumBuffer[g.y].y, packed.y, tmp2);

A CS can do InterlockedAdd only on int/uint and can’t work with floats. Instead, we scale the float color channels into large uints and pack RGBA into a uint2 with 16 bits per channel. The trick is that even when packed, the addition still works correctly: each sample is pre-multiplied by its weight, so every 16-bit channel sums to at most 65535 and never carries into its neighbour. We can therefore accumulate all samples directly into one anti-aliased color without any further resolve, with one add covering two channels.

When all samples are shaded, you unpack:

renderTarget[id.xy] = UnpackColor(msaaAccumBuffer[groupIndex]);

Packing/Unpacking:

// Look for PackUint2/UnpackUint2 in Lauritzen's code
uint2 PackColor(float4 color)
{
    uint4 colori = color * 65535;
    return uint2(PackUint2(colori.xy), PackUint2(colori.zw));
}

float4 UnpackColor(uint2 packed)
{
    uint2 RG = UnpackUint2(packed.x);
    uint2 BA = UnpackUint2(packed.y);
    return float4(RG, BA) / 65535;
}
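
PackUint2/UnpackUint2 are not shown here; they simply put two 16-bit values into one 32-bit uint, presumably something along these lines:

uint PackUint2(uint2 v)
{
    // low 16 bits from v.x, high 16 bits from v.y
    return (v.x & 0xFFFF) | (v.y << 16);
}

uint2 UnpackUint2(uint p)
{
    return uint2(p & 0xFFFF, p >> 16);
}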

So it turns 1.0f into 65535 (uint). Why not just 255? Because we accumulate these small, scaled uints, and we need the extra precision on small values to get a correct-looking sum.

Note that I accumulate already tonemapped, clamped colors; this is required so as not to break anti-aliasing [11].


Trick 2: Ray vs Thickened Cone for spotlight culling

Culling of non-point lights for tiled shading seems to be a poorly documented area. From what I’ve heard, most people implement light culling by intersecting the tile frustum (a thin one passing through the tile’s corners) with some geometric shape, like an OBB or a sphere around the light; but a frustum-cone intersection is neither easy nor cheap to do, so you have to overestimate the number of tiles affected.

However, if you have a single ray instead of a tile frustum, things become much easier and computationally cheaper.

The biggest problem with replacing a thin frustum by a ray is that the ray is infinitely thin: it doesn’t cover the whole tile and can easily miss the primitive. We can solve this by ‘thickening’ the primitives based on distance.
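
To make this concrete before the actual culling code (linked below), here is a minimal sketch of the two ingredients; BLOCKSIZE, screenSize, invProjection and tanHalfFovY are assumed constants, and a standard (non-reversed) projection is assumed:

// View-space ray through the center of a tile; the camera sits at the view-space origin.
float3 TileCenterRayVS(uint2 tileId)
{
    float2 pixel = (float2(tileId) + 0.5) * BLOCKSIZE;           // tile center in pixels
    float2 ndc   = pixel / screenSize * float2(2, -2) + float2(-1, 1);
    float4 p     = mul(invProjection, float4(ndc, 1, 1));        // unproject a far-plane point
    return normalize(p.xyz / p.w);
}

// Half of the tile's footprint in view space at view-space depth d: roughly how far the
// ray can be from a point the full tile frustum would still hit, so grow (thicken) the
// tested primitive by this amount before intersecting it with the ray.
float TileHalfFootprint(float d)
{
    float pixelsPerUnit = 0.5 * screenSize.y / tanHalfFovY;      // pixels per view-space unit at d = 1
    return d * 0.5 * BLOCKSIZE / pixelsPerUnit;
}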

Here’s the code I came up with for this kind of culling. Note: it can be optimized further, e.g. by using something cheaper than matrix multiplies to transform between spaces, but you should get the idea:
http://pastebin.com/Ld7sfBbN

(Something’s very very wrong with wordpress text formatting. It makes code completely unreadable, so I had to use pastebin. Fuck you, wordpress).

The result should look like what you get from stencil light culling, but per tile:
[Image: coneculling]
If you’re interested in the math behind it, I actually found useful formulas in [12].

The above code works fine when you’re close to the light source, but it does not account for the thickening mentioned above and will look buggy when you move far enough away.
As I use matrices to transform into cone space and back, I actually do the thickening on the CPU by tweaking these matrices.
What I currently do is definitely not an ideal solution by any means, but it kind of works: the idea is to find the most distant point of the cone (we can approximate it with a sphere this time) and then somehow derive the thickening amount from the distance between this point and the camera.
It is C# Unity-specific code (yes, I’m trying to glue my new renderer to it), but you should be able to understand it:

// Finding scale for the unit cone from its length and angle - without thickening
// Can be done once, unless light dynamically changes its shape
float baseRadius = length * Mathf.Sin(angle * Mathf.Deg2Rad * 0.5f);
lightScale.Add(new Vector3(baseRadius*Mathf.PI, baseRadius*Mathf.PI, length));
float lightMaxScale = Mathf.Max(Mathf.Max(lightScale[i].x, lightScale[i].y), lightScale[i].z);
-----------------
// Thickening
Vector3 lightEnd = lights[i].transform.position + lights[i].transform.forward * lights[i].range;
Vector3 lightCenter = (lights[i].transform.position + lightEnd) * 0.5f;
Vector3 vecToLight = lightCenter - camera.transform.position;

// Black magic starts
float distToFarthestPoint = Mathf.Sqrt(vecToLight.magnitude + lightMaxScale * 0.5f); // don't ask me about the sqrt
float posOffset = distToFarthestPoint * 0.2f;
lights[i].transform.position -= lights[i].transform.forward * posOffset;
lights[i].transform.localScale = lightScale[i] + new Vector3(posOffset, posOffset, posOffset*2);
// Black magic ends. I don't like these 0.2 and 2 and sqrt and all, and will think further about making it all more meaningful. But it kinda thickens.


————————–
[1]
Wikipedia, Deferred shading
http://en.wikipedia.org/wiki/Deferred_shading

[2]
Tiago Sousa, Rendering Technologies from Crysis 3
http://www.slideshare.net/TiagoAlexSousa/rendering-technologies-from-crysis-3-gdc-2013

[3]
Nicolas Thibieroz, Deferred Shading Optimizations
http://developer.amd.com/gpu_assets/Deferred%20Shading%20Optimizations.pps

[4]
Ola Olsson and Ulf Assarsson, Tiled Shading
http://www.cse.chalmers.se/~uffe/tiled_shading_preprint.pdf

[5]
Johan Andersson, DirectX 11 Rendering in Battlefield 3
http://dice.se/wp-content/uploads/GDC11_DX11inBF3_Public.pdf

[6]
Jay McKee, Technology Behind AMD’s “Leo Demo”
http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/AMD_Demos_LeoDemoGDC2012.ppsx

[7]
Emil Persson, Practical Clustered Shading
http://www.humus.name/Articles/PracticalClusteredShading.pdf

[8]
Andrew Lauritzen, Deferred Rendering for Current and Future Rendering Pipelines
http://bps10.idav.ucdavis.edu/talks/12-lauritzen_DeferredShading_BPS_SIGGRAPH2010.pdf

[9]
Kayvon Fatahalian, Solomon Boulos, James Hegarty
Reducing Shading on GPUs using Quad-Fragment Merging
http://graphics.stanford.edu/papers/fragmerging/shade_sig10.pdf

[10]
Anton Kaplanyan, CryENGINE 3: reaching the speed of light
http://www.crytek.com/cryengine/presentations/CryENGINE3-reaching-the-speed-of-light

[11]
Emil Persson, Custom Resolve
http://www.humus.name/index.php?page=3D&ID=77

[12]
Vjeux, Javascript Ray Tracer
http://blog.vjeux.com/2012/javascript/javascript-ray-tracer.html