Rendering painted world in JG

Here’s a little breakdown and implementation details of the real-time painted world in my last demo – JG.

Here’s the video

Here’s the demo


(Click for Russian version)

The “painted” effect wasn’t planned. Originally I only had an idea to render a natural scenery of a certain kind, and I wasn’t ready to spend a whole lot of time on it. It became clear to me, a “realistic” approach won’t work, resulting in either very mediocre visuals (due to engine limitations and the complexity of real-time vegetation modeling), or a whole year of trying to catch up with Crysis. So it wasn’t the way.

What I really wanted is to preserve the atmosphere, the feeling, avoiding ruining it with technical limitations.

So I have to render something very complex without killing myself and players’ computers, what do I do? Intuition said: “bake everything”. I recalled seeing outdoor 3D scans: even with bad geometry (or even as point clouds), they still looked quite convincing, thanks to right colors being in right places, with all nice and filtered real-life lighting already integrated into everything. Unfortunately, the time of year was absolutely the opposite of desired, so I wasn’t able to try my mad photogrammetry skills.
But what if we “scan” a realistic offline 3D scene? Vue surfaced in my memory as something that movie/exterior visualization folks use to produce nice renderings of nature. I had no idea what to expect from it, but I tried.

I took a sample scene, rendered it from several viewpoints and put those into Agisoft Photoscan to reconstruct some approximate geometry with baked lighting. And… alas, no luck. Complex vegetation structure and anti-aliasing weren’t the best traits for shape reconstruction.
Then it hit me. What does Agisoft do? It generates depth maps, then a point cloud out of multiple depths. But I can render a depth map right in Vue, so why do I need to reconstruct?

Being familiar with deferred rendering and depth->position conversion, I was able to create a point cloud out of Vue renderings. Not quite easily, though: Vue’s depth appeared to have some non-conventional encoding. Luckily, I finally found an answer to it.

And from this:


With some MaxScript magic, we get this:


Which is a solid single textured mesh.

Hard part is over, now I only needed to repeat the process until I get a relatively hole-free scene. Finally it’s time to have some fun with shaders 🙂

Each projected pixel acts as a camera-facing quad, textured with one of those stroke textures:


Almost. There was a bug in my atlas reading code, so some quads only had a fraction of stroke on them. However, it actually looked better, than the intended version, so I left the bug. It’s now a feature 🙂

Quads size obviously depends on depth, becoming larger with distance. It was quite important to not mix together small and large quads, so I had to carefully choose viewpoints.

Test scene looked promising, so I started to work on the one I wanted:


I made the house, fence and terrain from scratch. Plants were taken from various existing packs. Then I assembled the final composition out of this stuff. I lost count on the amount of renderings I had to do to cover all playable area:


Some had to be photoshopped a little to get rid of dark spots and to add more colors:


At first, I had troubles with getting the lighting right, so I had a lot of these black spots to fix, then I actually managed to tune it better. Final scene is actually a mix of different approaches, because I didn’t have the time to re-render everything with different settings, and because it actually looked less monotonous.

Some early screenshots:

At this moment I also had stroke direction set up properly, what was pretty important, as uniform strokes had very unnatural look. At first, I tried to generate stroke direction procedurally (similar to how you generate normal map from a height map), but it wasn’t sufficient. It was obvious to me how some strokes must lay, for example, I really wanted vertical strokes for the grass and fence strokes following the shape of the fence. Not being able to direct it with purely procedural approach, I simply decided to manually paint stroke direction in additional textures. Final version uses manual direction near the camera and procedural for distant quads. Here’re some examples of direction maps:


To be honest, painting vectors with colors in Photoshop wasn’t the most exciting thing to do, but still, it was the quickest way I could think of 😀

The difference was quite obvious. Here’s uniform direction on the left, changed on the right:


And this is it. The point cloud nature of the scene also allowed me to have some fun in the ending part, making quads behave like a surreal particle system. All motion was done in vertex shader.

I hope it was somewhat interesting to read, at least I’ll not forget the technique myself 🙂


Bonus information

Recently I was asked how to fill inevitable holes between quads. The way I did here is simple – I just used very rough underlying geometry:


GPU cubemap filtering

Prefiltered cubemaps are widely used in games. The idea is to match cubemap blurriness at each mip to appearance of the BRDF you use.

It’s all started with Modified Cubemapgen
Then there’s also cmft which is quite faster
Finally there’s Lys

All these tools are separate utilities which also have a command-line support (except Lys).

Sometimes you want to filter the cubemap in engine, without exporting stuff/running separate utils and you want it to be fast. Cubemapgen may take minutes to filter, cmft is much better here, but sometimes you still need more control and performance. Sometimes you also have to filter it in something like WebGL, where running separate utils isn’t very acceptable (yes, there is Emscripten, but the resulting code is often too fat).

At first I thought that this is a very slow and complicated process (why otherwise cubemapgen does it sooo slow?) and was quite afraid of implementing it myself, but turns out, it’s not THAT hard.

Here’s the result:

You may notice some blurred seam appearance on the rightmost sphere in all versions – this is the inevitable consequence of using the seamless cubemap filtering trick on a DX9-level GAPI. DX10+ doesn’t have this problem and looks smoother. For WebGL implementation I had to support DX9, because latest Firefox still uses it to emulate WebGL (Chrome and IE11 render through DX11).

As you can see, Blinn-Phong importance sampled version looks extremely close to cubemapgen offline output. This version takes around 1.5 sec on GTX560 to compute from 128×128 source cubemap. “Simple” version takes even less than 1 sec.

So, how it is done?
We don’t just blur the cubemap. We imagine a sphere, reflecting it and associate each mip with different roughness value. Rougher surfaces have more noisy microfacets reflecting light in many directions at each visible point. The rougher the surface, the wider will be the cone of incoming light:
So what we should do is to average this cone of lighting.
The simple averaging is not enough though – there is also a complicated weight factor depending on deviation from the original normal based on many factors. This factor actually makes your BRDF to have the falloff it has, instead of just being an averaged circle.

So what, do you have to cast a million rays and then multiply each by a weight? No.
Importance sampling is a technique, the idea of which is to just generate the rays with density depending on your BRDF’s shape. Usually you’ll end up having more rays with direction similar to the surface normal and less deviating rays. Simply averaging the lighting from each ray will naturally give you more intensity near the center, because there were more rays there.

Here’s a difference between “Simple” version (top), which uses simple cosine-weighted distribution of rays and Blinn-Phong version (bottom):

As you can see, getting the correct ray distribution can be important for getting nice highlight falloff instead of just circular spots.

The Blinn-Phong is, of course, not ideal and quite old. GGX is considered more realistic, but I haven’t used it yet.

Different BRDF have different requirements for the ray count. That’s the point of having a “Simple” version – despite of having less correct highlights, it requires MUCH less rays for an artifact-free result (because it’s more uniform).

So the basic algorithm is:
– generate uniformly random directions on the hemisphere;
– focus the directions depending on your BRDF intensity-direction relation at given roughness;
– render lower mip cubemap faces, while reading higher mip data;
– for each direction
—transform the direction to be relative to the original lookup vector;
—light += texCUBE(higher mip, transformed direction)
–light /= numDirs;

You don’t have to pre-generate the directions though – all this can be done in shader sufficiently fast.
Here is a good read on generating points on the hemisphere. It even has an interactive demo.
The Hammersley method relies on bitwise operations though, which are not always accessible (not in WebGL/DX9). In such old-school GAPIs you have to either precompute the directions, or use some other way to generate random numbers. There are many other ways, one you can see in my source code, others on ShaderToy or somewhere else. Such randoms will likely be less uniform.

Ideally, when writing each mip, you should probably read the highest available mip with different cones of rays, but it’s not very fast, and you’ll get some significant aliasing trying to sample a high resolution cubemap with a very wide cone of limited rays.
Instead, it is actually quite sufficient to read just a 2x larger mip. This larger mip must resemble the highest mip as much as possible, so something simple like automatic mipmap generation (bilinear downsample) will do. Note, that you must NOT cone-sample the mip that was already cone-sampled, because you’ll get noticeable over-blurring.

My version of filtering is now in the PlayCanvas engine repository, and it’s open source.
Sadly, there’s a bug in ANGLE, which prevents us from using it on Win Chrome/FF, ironically only IE11 works correctly.

The source in here:

The latest version with additional bells and whistles will be always here:

Chamfer normals trick


A little trick how to chamfer geometry edges with just one face and get good results. It was originally invented by Arenshi, a good friend of mine, I just made a script to automate it and decided to share the knowledge.

The problem is that default face smoothing behaviour in modeling programs (at least in 3dsmax) is simple averaging and it gives you quite buggy looking results if you only use one face for chamfer, so you are forced to use many polygons, and it is still not perfect:


The solution is to manually select each original face after chamfer and straigten it’s normals:


This way your faces remain straight, while chamfers interpolate between them. And you can use just a single face! This is very useful for game low-poly models.

Finally, here’s a tiny MaxScript that automatically chamfers all edges based on angle between polygons and fixes all normals this way:

I haven’t time to add any UI to it, so to use, you have to select an Editable Poly object and run it. Chamfer amount and threshold angle are hardcoded and set to 1 (chamferAmount variable) and 0.85 (chamferAngle variable, range is from 0 to 1).
Script is not greatly optimized, but usually runs quite fast.

Third person camera

Many people complained about jerky and jumpy camera behaviour in the prototype of my game, Faded. I wasn’t happy with it myself either, I just had to implement so many things with no enough time to make each one perfect. Recently I decided to finally fix it.

Third person cameras are very different in every game, from simple orbiting + collision to some attempts to make it more “cinematic”. The idea of making a “cinematic” one was also my original diploma thesis, however after a few tests I abandoned it and changed the topic of my thesis to something more familiar (real-time rendering) because I was unsure if those experiments will yield any good results, so it was just risky.

Let’s start with basic problems.

Problem 1: occlusion
95% of answers for it you’ll find up googling is “throw a ray from character to camera and position camera at picked point!“. It’s a good starting point of course, but you just can’t leave it this way, there are plenty of reasons why it’s a bad idea:
– your camera’s near plane has size, while ray has zero thickness, so you have a chance of seeing through walls;
– camera will jump from point to point abruptly.
Positioning camera to “pickedPosition + pickedNormal * radiusAroundNearPlane” is still insufficient, as can be seen here:

Luckily most physics engines support “thick” rays. If you use Unity/PhysX, use SphereCast.
There are still a few problems however:
– if spherecast already intersects a wall at its origin, it will move through it further;
– you still have abrupt jumps.


The alternative way is just to use a physical sphere and move it to the desired camera position accounting for all collisions, but the sphere can just get stuck in some concave level geometry.

To fix the first spherecast problem, you can do following:
– project the sphere to the opposite direction of the character-camera ray. So the origin of the ray is still character, by the direction is inverted;
– use picked point that is far enough as new ray origin. If nothing is picked, just use origin + invDir * farEnough;
– do SphereCast as usual, but with new origin. This way you will get rid of sphere intersecting nearby walls.
Code for Unity:

The remaining problem is abrupt camera teleportation. How do other games deal with it? Let’s see:

Watch Dogs seems to use the simplest method – just teleporting camera at thick ray’s projected position. I can also see a quick interpolation of camera distance from close-up back to default.

L.A. Noire has more pronounced smoothed distance interpolation when the occlusion is gone. Sudden appearance of occlusion still makes abrupt movement though. The most interesting thing in L.A. Noire is the way camera follows you when you don’t move mouse. It can move around corners very intelligently. Not sure how it’s implemented, perhaps it uses AI navigation system?

Hitman Absolution tries to move camera as smoothly as possible, sliding along obstacles, before they’re in front of camera.
I think it’s a good solution, and I decided to implement it.

So here’s the idea:


Use two spherecasts. One thin (with radius to encapsulate near plane) and one thick. Then:
– project thick collision point (green point) onto ray. You’ll get red point;
– get direction from thick collision point to projected point, multiply it by thin radius and offset projected point back by it. This way you’ll get thick collision point projected onto thin capsule (cyan point);
– Get distance from cyan point to green point. Divide it by (thickRadius – thinRadius). You’ll get the [0-1] number representing how close the obstacle is to thin spherecast. Use it for lerping camera distance.
Code for Unity:

I think that’s quite enough for camera occlusion. You can still try to make camera even smarter at walking around corners as in Noire, but I think it’s an overkill for now. Later I’ll maybe get back to this topic.

Problem 2: composition
Now onto some “cinematic” stuff. First 3rd person games had characters mostly centered on the screen. As games evolved, overall image aesthetics started to become more important. Many photographers will agree that it’s not always the best idea to lock objects dead center – it’s just doesn’t look interesting. The basic rule you (and most importantly, computer) can apply is The Rule of Thirds. Most games today use it to simply put the character a little bit to the side.


However, can we implement a more dynamic composition search, that is not just dead locked on character being aligned to one line? And how is it supposed to look?

The best references here, in my opinion, are steadicam shots, because these are most closely related to game third-person cameras.
Take a look at some:

As you can see camera changes the focus point and distance quite dynamically, and it looks very interesting. What is not great in context of games, is that camera lags behind characters, so they see something earlier, than the camera.
Camera mainly focuses on character’s points of interest. Also what should be noted is the height of the camera, which is mostly static and not orbiting around at different heights.

Here are results of my first tests (year ago) that implemented some of the ideas:

The middle part is boring and sucks though.
The idea was to mark important objects in the level and make camera adapt to them, aligning everything by rule of thirds together. That’s what debug view can reveal:

Unity 2014-10-14 16-32-19-43

As you can see, the “important” objects marked as green 2D boxes. These boxes are the actual input data for the algorithm. The first box always represents main character.

The algorithm itself is not ideal though and it takes designer’s time to decide which objects should be marked as important to ensure interesting camera movement. The code is a bit dirty and still work in progress, so I’m not sure about posting it here right now. However, if you find it interesting, just tell me, and I’ll post.

Here are the results so far together with smooth occlusion avoidance:

Designing an Ubershader system

OK, so you probably know what ubershaders are? Unfortunately there is no wiki on this term, but mostly by it we mean very fat shaders containing all possible features with compile-time branching that allows them to be then specialized into any kind of small shader with a limited amount of tasks. But it can be implemented very differently, so here I’ll share my experience on this.


So, you can use #ifdef, #define and #include in your shaders? Or you’re going to implement it yourself? Anyway, it’s the first idea anyone has.

Why it sucks:
  • Too many #ifdefs make your code hard to read. You have to scroll the whole ubershader to see some scattered compile-time logic.
  • How do you say “compile this shader with 1 spot light and that shader with 2 directional lights”? Or 2 decals instead of 6? One PCF shadow and one hard? You can’t specify it with #ifdefs elegantly, only by copy-pasting code making it even less readable.

Terrible real-life examples: 1, 2

Code generation from strings

Yet another approach I came across and have seen in some projects. Basically you use your language of choice and use branching and loops to generate new shader string.

Why it sucks:
  • Mixing shader language with other languages looks like total mess
  • Quotes, string additions, spaces inside strings and \n’s are EVERYWHERE flooding your vision
  • Still have to scroll a lot to understand the logic

Terrible real-life examples: 1, 2

Code generation from modules

So you take your string-based code generation and try to decouple all shader code from engine code as much as possible. And you definitely don’t want to have hundreds of files with 1-2 lines each, so you start to think how to accomplish it.
So you make some small code chunks like this one, some of them are interchangeable, some contain keywords to replace before adding.

Why naive approach sucks:
  • All chunks share the same scope, can lead to conflicts
  • You aren’t sure what data is available for each chunk
  • Takes time to understand what generated shader actually does

Code generation from modules 2.0

So you need some structure. The approach I found works best is:

struct internalData {
some data

void shaderChunk1(inout internalData data) {
float localVar;
read/write data

float4 main() {
internalData data;
return colorCombinerShaderChunk(data);

So you just declare an r/w struct for all intermediate and non local data, like diffuse/specular light accumulation, global UV offset or surface normal used for most effects.
Each shader chunk is then a processing function working with that struct and a call to it, put between other calls. Most compilers will optimize out unused struct members, so basically you should end up with some pretty fast code, and it’s easy to change the parts of your shader. Shader body also looks quite descriptive and doesn’t require you to scroll a lot.
The working example of such system is my contribution to PlayCanvas engine: 1, 2

Examples of generated code: vertex, pixel, pixel 2, pixel 3

So, I’m not saying this is the best approach. But for me, it’s the easiest one to use/debug/maintain so far.

On @femfreq and violence in games

I actually made this blog to write about purely technical stuff mostly, but all my twitter is so full of these hot debates, so I decided to declare my point of view, just in case.

If you have no idea what I’m talking about, the short story is:
There is a girl called Anita Sarkeesian, who makes videos, pointing to the problems of female characters in games, specifically objectification of women, over-sexualization. violence towards them and often lack of character development. You can watch it here:
The videos made kind of huge response, both positive and negative, from both gamers, developers and random bystanders.

My opinion is that the problem is not limited to just female characters in games, but actually much more global.
What I agree with Anita is the problem of characters. I played quite many games, but I can remember only 3 types of women I’ve seen:

1. Damsel in distress.
2. Ridiculously strong warriors, usually without much armor and visible muscles, but with big tits, of course. Think of Tomb Raider, Remember Me and so on.
3. Third plan character, who you’ll immediately forget.
A bit boring, isn’t it?

But the violence problem is much deeper. It’s not just about female characters. The problem is that actually most games are made of violence. Violence is the main option of progressing through the game in most AAA titles. It was this way for many years, since first characters appeared on the screen. The enemies had to die, and the player had to win. And since female characters appeared in games, it didn’t change too much. You always have the attack button, and you can attack anyone. And while Anita complains about violence towards over-sexualized women, I complain about violence in general, victim’s gender doesn’t matter much. In most games you would kill much more men (usually armed) than women, and men just can’t be “sexualized”, I have no idea how it may ever look, but instead I see the problem in violence itself.

I definitely don’t want too look like some Jack Thompson and I won’t blame games like this. After all, I’m a gamer myself, and I like parts of experience games give to me. I love this industry and I hate censorship.

In the beginning of 2000s, game industry was making huge leaps forward. I was amazed at how fast graphics, physics, and at the same time, plots and characters were evolving. There were new revelations in this medium, new genres, and I felt like I was witnessing a revolution. A revolution of art, a revolution, which will give us a new, incredibly beautiful, realistic yet interesting, interactive experience. Games like Mafia, Morrowind and Half-Life 2 made me feel so.
Did it happen? Not really.

While the technical side of games was evolving, the core ideology stayed the same. There were rare exceptions, like Quantic Dream games or Pathologic, the games that tried to do something out of familiar bounds of expectations.

But most games are still bent on murder. On routine murder that doesn’t affect the story much, that doesn’t suppose to invoke any emotions. I feel like in many RPGs, the only thing that changes on different play sessions, with different characters and different stats and abilities – is the way how you kill. As simple as that. Kill is your primary action and many games seem to differ mostly in the ways you do it.

And even for an indie game developer, it’s actually quite tempting to repeat this pattern. It’s easier to make characters shoot and get killed than to invent something, that really touches you and try to implement it in a game. And when I add characters to the game, that’s actually the first thing I do: shooting and killing. Because it feels like a mechanic, you can play it, it gives challenge, and you’ve seen it so many times, you realize its implementation quite well. Decals, particles, ragdolls, familiar stuff!

But then I feel it’s not the thing I want to leave after myself. What do I love about games? I love to feel myself in a different life, a different place, being a different person and experience emotions of living this way which will then become a part of me. I want to feel my abilities and the consequences of using them. I felt something like this in the latest Deus Ex, Fahrenheit and Pathologic, but it wasn’t absolutely perfect. I don’t want it to feel like a Mario game, I don’t want infinitely kill some enemies. And at the same time I want it to be interactive. I want to do what I want and not the designer. I want to have the ability to kill, because this way, it becomes YOUR choice, it provokes drama and emotions, but I don’t want it to be the only way. I don’t want to play as superhuman, who decides everyone’s fate.

I don’t even fucking know what I want. And this is very sad, considering I already made a demo of my game, which in the end turned out to be the same crap I don’t like in games in some of its aspects. I know it should be different.

I think this medium is still very young and we’re capable of making something totally new.

Tiled deferred shading tricks

Last update: 26 May 2014.

This post will cover some of my humble findings when implementing a tiled deferred rendering with MSAA support.
I will also update it occasionally.

Deferred shading is a well-known rendering technique: we first render the scene to a ‘G-Buffer’, containing geometry and material data (e.g. position/depth, normals, surface glossiness etc) and then compute all the lighting and shading in screen space [1].

– reduced shading overdraw: only one complex shader invocation per pixel (+ additional per pixel inside of each light’s influence area); you can do the same with Z-Prepass in forward, but it will cost you 2x drawcalls.
– lighting is decoupled from materials/objects.
– G-Buffer is anyway required for many advanced effects, which are difficult to substitute with something else (e.g. SSAO, SSR).

– doesn’t handle semi-transparency, you have to draw transparent stuff forward-style.
– can be bandwidth-heavy, requires tight G-Buffer packing, Crysis 3 is a good example [2], also stencil culling is extremely useful for selecting pixels, only affected by light. The less you repaint your render targets, the better.
– overly compressed G-Buffer can exhibit artifacts (e.g. Unity).

– difficult to integrate with MSAA, many deferred games just use post-AA (e.g. FXAA), however, quality is far from MSAA due to lack of sub-pixel data. The latest approach is to perform an edge detection into stencil and then do per-sample shading only on these edges and simple per-pixel everywhere else, also used in Crysis 3 [2][3], however this approach suffers from bad locality of edge pixels on the screen.

Tiled shading is an approach, where we divide our screen into tiles, that are bigger than pixels (e.g. 8×8, 16×16), test which lights affect these tiles and then shade pixels with only lights that belong to their tiles [4]. Simply, ‘divide and conquer’. However, for good culling quality, access to depth buffer is necessary.
Tiled shading can be implemented multiple ways and used with both forward and deferred approaches, notable examples are BF3 tiled deferred (with code!) [5], Forward+ [6], and Clustered shading [7].
I’ll divide all these approaches into 2 groups: tiled forward and tiled deferred.

Tiled forward pros:
– decouples lighting from materials and objects (like deferred).
– works with MSAA.
– can work with semi-transparency!
– each drawcall can implement its own lighting model (unlike deferred, where we have to fit all lighting models in one shader).

Tiled forward cons:
– requires Z-Prepass for good light culling (a lot of false positives otherwise).
– heavy shaders can be slow on small triangles [8][9].

Tiled deferred pros:
– reduces bandwidth cost by replacing old multi-pass light accumulation.
– light can be accumulated at better precision in a single register (in classic deferred you accumulate usually in 16 or 10 bit textures, because full 32-bit float is too heavy).
– can reuse the same per-tile data to shade transparency the tiled forward way.

Tiled deferred cons:
– still hard to do MSAA.
– still have to be careful with G-Buffer size.

Now back to topic. So I decided to develop a tiled deferred renderer with MSAA. I packed my G-Buffer (best fit normals is your best friend [10]) and arrived at DirectCompute land.

The most fun thing is that you can actually perform ALL rendering in a single compute shader, after you have G-Buffer. Light culling, shading, edge detection, AA resolve and everything can be fit into one CS, which is very nice, because we can reuse a lot of data without reloading it in every stage. Compute shaders are beautiful and I really recommend you to look into BF3 paper [5] to see how you can switch from per-pixel processing to per-light and generally process data in any unimaginable patterns.

The must read paper is also Andrew Lauritzen’s “Deferred Rendering for Current and Future Rendering Pipelines” [8].

There is also a very helpful code:

Lauritzen proposed an interesting idea of dealing with MSAA: instead of branching on each pixel and selecting per-pixel or per-sample (if it’s on the edge) shading, you find all edge pixels, collect them into some array and then distribute the processing of this array to all threads. This way it is more parallel: first all threads shade per-pixel, then they all process remaining edge samples.

Now onto my tricks.

Trick 1: resolve in the same CS.

Lauritzen’s method of redistributing per-sample shading is great, however, where do we output these sample values? As we try to distribute samples uniformly across all threads in a thread group, each thread now may output values completely randomly, into different samples of different pixels. In his sample code, Lauritzen addresses this by having a ‘Flat’ framebuffer, with the size of  GBufferWidth * GBufferHeight * MSAASamples and element size of uint2 (RG+BA 16 bit) which is resolved later. However, this can be quite costly.

Instead, we can allocate a small array for an each thread group, like

groupshared uint2 msaaAccumBuffer[BLOCKSIZE * BLOCKSIZE];

When you do per-pixel shading, you simply save the result there:

msaaAccumBuffer[groupIndex] = PackColor(color);

However, for each edge-pixel found, with per-sample shading required, you output scaled value:

float weight = 1.0 / numSamples;
msaaAccumBuffer[groupIndex] = PackColor(color * weight);

And when you process redistributed edge samples, you also scale them and accumulate in this array:

uint2 packed = PackColor(color * weight);
InterlockedAdd(msaaAccumBuffer[g.y].x, packed.x, tmp1);
InterlockedAdd(msaaAccumBuffer[g.y].y, packed.y, tmp2);

CS can do InterlockedAdd only for int/uint, and can’t work with floats. Instead, we scale float color channels into big uints and pack RGBA into uint2 with 16-bit per channel. The trick is that even when packed, addition will still work correctly and we can directly accumulate all samples into one anti-aliased color without any further resolve – 1 add per two channels.

When all samples are shaded, you unpack:

renderTarget[id.xy] = UnpackColor(msaaAccumBuffer[groupIndex]);


// Look for PackUint2/UnpackUint2 in Lauritzen's code
uint2 PackColor(float4 color)
    uint4 colori = color * 65535;
    return uint2(PackUint2(colori.xy), PackUint2(;

float4 UnpackColor(uint2 packed)
    uint2 RG = UnpackUint2(packed.x);
    uint2 BA = UnpackUint2(packed.y);
    return float4(RG, BA) / 65535;

So, it turns 1.0f into 65535 (uint). Why not just 255? Because, we accumulate these uints, small and scaled, and we have to have better precision for small values to get correctly looking sum.

Note, that I accumulate already tonemapped clamped colors – it is required to not break anti-aliasing [11].


Trick 2: Ray vs Thickened Cone for spotlight culling

Culling of non-point lights for tiled shading seems to be a poorly documented area. However, from what I’ve heard, most people implement light culling by checking intersections of tile frustum (a thin one, coming through tile vertices) with some geometric shape, like OBB or sphere around light, but frustum-cone intersection is not very easy and cheap thing to do, so you have to overestimate the number of tiles affected.

However, if you’d have a single ray and not a tile frustum, things become much easier and computationally cheaper.

The biggest problem of replacing a thin frustum with a ray is that ray is very small and doesn’t cover the whole tile and can easily miss the primitive, but we can solve it by ‘thickening’ primitives based on the distance.

Here’s the code, I came up for this kind of culling. Note: it can be optimized further, e.g. using something cheaper instead of matrix multiplies to transform from space to space, but you should get the idea:

(Something’s very very wrong with wordpress text formatting. It makes code completely unreadable, so I had to use pastebin. Fuck you, wordpress).

The result should look like what you get from stencil light culling, but per tile:
If you’re interested in math behind it, I actually found the useful formulas in [12]

The above code works fine when you’re close to the light source, but it does not account for mentioned thickening and will look buggy, when you move far enough.
As I use matrices to transform into cone space and back, I actually do the thickening on the CPU by tweaking these matrices.
What I currently do, is definitely not an ideal solution by any means, but it’s still kind of works: the idea is to find the most distant point on the cone (we can approximate it as a sphere this time) and then somehow calculate the thickening amount from distance between this point and the camera.
It is C# Unity-specific code (yes, I’m trying to glue my new renderer to it), but you should be able to understand it:

// Finding scale for the unit cone from its length and angle - without thickening
// Can be done once, unless light dynamically changes its shape
float baseRadius = length * Mathf.Sin(angle * Mathf.Deg2Rad * 0.5f);
lightScale.Add(new Vector3(baseRadius*Mathf.PI, baseRadius*Mathf.PI, length));
float lightMaxScale = Mathf.Max(Mathf.Max(lightScale[i].x, lightScale[i].y), lightScale[i].z);
// Thickening
Vector3 lightEnd = lights[i].transform.position + lights[i].transform.forward * lights[i].range;
Vector3 lightCenter = (lights[i].transform.position + lightEnd) * 0.5f;
Vector3 vecToLight = lightCenter - camera.transform.position;

// Black magic starts
float distToFarthestPoint = Mathf.Sqrt(vecToLight.magnitude + lightMaxScale * 0.5); // don't ask me about the sqrt
float posOffset = distToFarthestPoint * 0.2f;
lights[i].transform.position -= lights[i].transform.forward * posOffset;
lights[i].transform.localScale = lightScale[i] + new Vector3(posOffset, posOffset, posOffset*2);
// Black magic ends. I don't like these 0.2 and 2 and sqrt and all, and will think further about making it all more meaningful. But it kinda thickens.



Tiago Sousa, Rendering Technologies from Crysis 3

Nicolas Thibieroz, Deferred Shading Optimizations

Ola Olsson and Ulf Assarsson, Tiled Shading

Johan Andersson, DirectX 11 Rendering in Battlefield 3

Jay McKee, Technology Behind AMD’s “Leo Demo”

Emil Persson, Practical Clustered Shading

Andrew Lauritzen, Deferred Rendering for Current and Future Rendering Pipelines

Kayvon Fatahalian, Solomon Boulos, James Hegarty
Reducing Shading on GPUs using Quad-Fragment Merging

Anton Kaplanyan, CryENGINE 3: reaching the speed of light

Emil Persson – Custom Resolve

Vjeux – Javascript Ray Tracer