The original series is Reverse engineering the rendering of The Witcher 3: Index. The author is, by all appearances, Polish. The series has 15 posts in total, published over a span of almost two years. The algorithms themselves are not complicated, and the first few posts even feel quite basic, but the approach is refreshingly novel. Classic reverse-engineering write-ups mostly show a few captures and describe the technique in broad strokes; this series reverse engineers the shader assembly directly, which is eye-opening. The author even wrote a tool that converts HLSL straight into assembly for comparison. As expected of CD Projekt Red.
This post is the first half, covering parts 1-8 of the original; the rest will follow in the second half.
Part 1: Tonemapping
Pretty much every modern AAA game has a tonemapping pass somewhere in its pipeline. As a quick refresher: the real world has an enormous range of luminance, but our screens can only display a limited one, for example 8 bits per channel, just 0-255. That is why tonemapping exists: it lets us map a very wide range of luminance into a limited one. The pass usually has two inputs: a floating-point HDR image whose colors can exceed 1.0, and the average scene luminance (the latter can be computed in several ways, for example with eye adaptation to mimic how human vision behaves, but that is not important here).
The next step is to obtain an exposure value, compute the exposed color, and push it through a tonemapping curve. This is where things start to get messy and new concepts appear, such as the white point and middle gray. There are several popular curves, and MJP's article "A Closer Look at Tone Mapping" examines the subject.
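To make that pipeline concrete, here is a minimal sketch of such a pass (my own illustration, not the game's shader), with a plain Reinhard curve standing in for the filmic one; the texture and constant names are assumptions:

// Minimal tonemapping sketch - illustration only, not The Witcher 3's shader.
Texture2D TexHDRColor     : register (t0);
Texture2D TexAvgLuminance : register (t1);

float4 SimpleTonemapPS( float4 Position : SV_Position ) : SV_Target0
{
    // HDR color of this pixel (values may be well above 1.0)
    float3 hdrColor = TexHDRColor.Load( int3(Position.xy, 0) ).rgb;

    // Average scene luminance, stored in a 1x1 texture
    float avgLuminance = TexAvgLuminance.Load( int3(0, 0, 0) ).x;

    // Exposure: scale the scene so the average maps to a chosen middle gray
    const float middleGray = 0.18;
    float exposure = middleGray / max( avgLuminance, 1e-4 );
    float3 exposedColor = hdrColor * exposure;

    // A simple curve (Reinhard) squeezes [0, inf) into [0, 1)
    float3 ldrColor = exposedColor / (1.0 + exposedColor);
    return float4( ldrColor, 1.0 );
}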
To be honest, I had some trouble implementing tonemapping in my own code, but luckily a few online examples proved useful. Getting back to the point: some of them take HDR luminance, the white point and middle gray into account, and some don't. I wanted a battle-tested implementation.
Recently I started digging into the rendering of The Witcher 3. The game has some great rendering tricks, and its story, music, gameplay, basically everything else, is excellent too.

Before I start: this post opens a short series studying the rendering of The Witcher 3. It is definitely not meant to be as comprehensive as Adrian Courreges' GTA V graphics study, at least not for now. Let's begin by reverse engineering the tonemapping.
We will start from a frame captured with RenderDoc. It is a screenshot from a quest in the city of Novigrad, with all effects maxed out.

After some digging I found the tonemapping draw call. As described above, it takes an HDR color buffer (texture 0, full resolution) and the average scene luminance (texture 1, a 1x1 float texture computed with a compute shader).

Let's take a look at the compiled pixel shader:
ps_5_0 dcl_globalFlags refactoringAllowed dcl_constantbuffer cb3[17], immediateIndexed dcl_resource_texture2d (float,float,float,float) t0 dcl_resource_texture2d (float,float,float,float) t1 dcl_input_ps_siv v0.xy, position dcl_output o0.xyzw dcl_temps 4 0: ld_indexable(texture2d)(float,float,float,float) r0.x, l(0, 0, 0, 0), t1.xyzw 1: max r0.x, r0.x, cb3[4].y 2: min r0.x, r0.x, cb3[4].z 3: max r0.x, r0.x, l(0.000100) 4: mul r0.y, cb3[16].x, l(11.200000) 5: div r0.x, r0.x, r0.y 6: log r0.x, r0.x 7: mul r0.x, r0.x, cb3[16].z 8: exp r0.x, r0.x 9: mul r0.x, r0.y, r0.x 10: div r0.x, cb3[16].x, r0.x 11: ftou r1.xy, v0.xyxx 12: mov r1.zw, l(0, 0, 0, 0) 13: ld_indexable(texture2d)(float,float,float,float) r0.yzw, r1.xyzw, t0.wxyz 14: mul r0.xyz, r0.yzwy, r0.xxxx 15: mad r1.xyz, cb3[7].xxxx, r0.xyzx, cb3[7].yyyy 16: mul r2.xy, cb3[8].yzyy, cb3[8].xxxx 17: mad r1.xyz, r0.xyzx, r1.xyzx, r2.yyyy 18: mul r0.w, cb3[7].y, cb3[7].z 19: mad r3.xyz, cb3[7].xxxx, r0.xyzx, r0.wwww 20: mad r0.xyz, r0.xyzx, r3.xyzx, r2.xxxx 21: div r0.xyz, r0.xyzx, r1.xyzx 22: mad r0.w, cb3[7].x, l(11.200000), r0.w 23: mad r0.w, r0.w, l(11.200000), r2.x 24: div r1.x, cb3[8].y, cb3[8].z 25: add r0.xyz, r0.xyzx, -r1.xxxx 26: max r0.xyz, r0.xyzx, l(0, 0, 0, 0) 27: mul r0.xyz, r0.xyzx, cb3[16].yyyy 28: mad r1.y, cb3[7].x, l(11.200000), cb3[7].y 29: mad r1.y, r1.y, l(11.200000), r2.y 30: div r0.w, r0.w, r1.y 31: add r0.w, -r1.x, r0.w 32: max r0.w, r0.w, l(0) 33: div o0.xyz, r0.xyzx, r0.wwww 34: mov o0.w, l(1.000000) 35: ret |
A few things worth noting. First, the loaded average luminance is not used directly: it is clamped to artist-defined minimum and maximum values. Simple, and it prevents the scene from becoming over- or under-exposed. It sounds obvious, yet I had never done it myself. Second, anyone familiar with tonemapping curves will recognize the 11.2: it is the white point John Hable defined for the Uncharted 2 tonemapping curve.
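For reference, this is the classic curve from Hable's "Filmic Tonemapping Operators" talk with his published constants, just as a reminder of where the 11.2 comes from; The Witcher 3 loads A-F from a constant buffer instead, as we will see below:

// John Hable's Uncharted 2 filmic curve with his original constants.
// The game feeds A-F from a cbuffer rather than hardcoding them.
static const float A = 0.15; // shoulder strength
static const float B = 0.50; // linear strength
static const float C = 0.10; // linear angle
static const float D = 0.20; // toe strength
static const float E = 0.02; // toe numerator
static const float F = 0.30; // toe denominator
static const float W = 11.2; // linear white point

float3 Uncharted2Tonemap( float3 x )
{
    return ((x*(A*x+C*B)+D*E) / (x*(A*x+B)+D*F)) - E/F;
}

float3 TonemapWithWhitePoint( float3 exposedColor )
{
    // Normalize by the curve evaluated at the white point, so W maps to 1.0
    return Uncharted2Tonemap( exposedColor ) / Uncharted2Tonemap( W );
}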

The A-F parameters are loaded from a constant buffer. That leaves three more values, cb3_v16.xyz. Let's figure out what they are.
A few guesses: I believe x is the white scale, or middle gray, since it gets multiplied by 11.2 and then used to compute the exposure adjustment. y I call the U2 numerator multiplier; we will get to it shortly. z is an exponent used in the log/mul/exp sequence.
In addition: cb3_v4.yz are the allowed minimum and maximum luminance, cb3_v7.xyz are the A-C values of the Uncharted 2 curve, and cb3_v8.xyz are the D-F values.
Now comes the hardest part: writing an HLSL shader that compiles to exactly this assembly. It takes some practice, and the longer the code, the harder it gets. Fortunately, a while ago I wrote a tool that quickly turns HLSL into assembly. Ladies and gentlemen, please welcome D3DShaderDisassembler!
Here is the final code:
cbuffer cBuffer : register (b3) { float4 cb3_v0; float4 cb3_v1; float4 cb3_v2; float4 cb3_v3; float4 cb3_v4; float4 cb3_v5; float4 cb3_v6; float4 cb3_v7; float4 cb3_v8; float4 cb3_v9; float4 cb3_v10; float4 cb3_v11; float4 cb3_v12; float4 cb3_v13; float4 cb3_v14; float4 cb3_v15; float4 cb3_v16, cb3_v17; } Texture2D TexHDRColor : register (t0); Texture2D TexAvgLuminance : register (t1); struct VS_OUTPUT_POSTFX { float4 Position : SV_Position; }; float3 U2Func( float A, float B, float C, float D, float E, float F, float3 x ) { return ((x*(A*x+C*B)+D*E)/(x*(A*x+B)+D*F)) - E/F; } float3 ToneMapU2Func( float A, float B, float C, float D, float E, float F, float3 color, float numMultiplier ) { float3 numerator = U2Func( A, B, C, D, E, F, color ); numerator = max( numerator, 0 ); numerator.rgb *= numMultiplier; float3 denominator = U2Func( A, B, C, D, E, F, 11.2 ); denominator = max( denominator, 0 ); return numerator / denominator; } float4 ToneMappingPS( VS_OUTPUT_POSTFX Input) : SV_Target0 { float avgLuminance = TexAvgLuminance.Load( int3(0, 0, 0) ); avgLuminance = clamp( avgLuminance, cb3_v4.y, cb3_v4.z ); avgLuminance = max( avgLuminance, 1e-4 ); float scaledWhitePoint = cb3_v16.x * 11.2; float luma = avgLuminance / scaledWhitePoint; luma = pow( luma, cb3_v16.z ); luma = luma * scaledWhitePoint; luma = cb3_v16.x / luma; float3 HDRColor = TexHDRColor.Load( uint3(Input.Position.xy, 0) ).rgb; float3 color = ToneMapU2Func( cb3_v7.x, cb3_v7.y, cb3_v7.z, cb3_v8.x, cb3_v8.y, cb3_v8.z, luma*HDRColor, cb3_v16.y); return float4(color, 1); } |
A screenshot of my tool as proof:

I believe this is a sensible reconstruction of The Witcher 3's tonemapping, at least from the assembly's point of view. I have implemented it in my own framework and it works quite well.
I say "quite", because I still don't know why the denominator in ToneMapU2Func is clamped with max against 0. Isn't division by zero undefined anyway?
We could basically wrap up here, but I found another variant of the tonemapping in this scene: a beautiful sunset, at the lowest graphics settings.

Let's inspect the shader assembly:
ps_5_0 dcl_globalFlags refactoringAllowed dcl_constantbuffer cb3[18], immediateIndexed dcl_resource_texture2d (float,float,float,float) t0 dcl_resource_texture2d (float,float,float,float) t1 dcl_input_ps_siv v0.xy, position dcl_output o0.xyzw dcl_temps 5 0: ld_indexable(texture2d)(float,float,float,float) r0.x, l(0, 0, 0, 0), t1.xyzw 1: max r0.y, r0.x, cb3[9].y 2: max r0.x, r0.x, cb3[4].y 3: min r0.x, r0.x, cb3[4].z 4: min r0.y, r0.y, cb3[9].z 5: max r0.xy, r0.xyxx, l(0.000100, 0.000100, 0.000000, 0.000000) 6: mul r0.z, cb3[17].x, l(11.200000) 7: div r0.y, r0.y, r0.z 8: log r0.y, r0.y 9: mul r0.y, r0.y, cb3[17].z 10: exp r0.y, r0.y 11: mul r0.y, r0.z, r0.y 12: div r0.y, cb3[17].x, r0.y 13: ftou r1.xy, v0.xyxx 14: mov r1.zw, l(0, 0, 0, 0) 15: ld_indexable(texture2d)(float,float,float,float) r1.xyz, r1.xyzw, t0.xyzw 16: mul r0.yzw, r0.yyyy, r1.xxyz 17: mad r2.xyz, cb3[11].xxxx, r0.yzwy, cb3[11].yyyy 18: mul r3.xy, cb3[12].yzyy, cb3[12].xxxx 19: mad r2.xyz, r0.yzwy, r2.xyzx, r3.yyyy 20: mul r1.w, cb3[11].y, cb3[11].z 21: mad r4.xyz, cb3[11].xxxx, r0.yzwy, r1.wwww 22: mad r0.yzw, r0.yyzw, r4.xxyz, r3.xxxx 23: div r0.yzw, r0.yyzw, r2.xxyz 24: mad r1.w, cb3[11].x, l(11.200000), r1.w 25: mad r1.w, r1.w, l(11.200000), r3.x 26: div r2.x, cb3[12].y, cb3[12].z 27: add r0.yzw, r0.yyzw, -r2.xxxx 28: max r0.yzw, r0.yyzw, l(0, 0, 0, 0) 29: mul r0.yzw, r0.yyzw, cb3[17].yyyy 30: mad r2.y, cb3[11].x, l(11.200000), cb3[11].y 31: mad r2.y, r2.y, l(11.200000), r3.y 32: div r1.w, r1.w, r2.y 33: add r1.w, -r2.x, r1.w 34: max r1.w, r1.w, l(0) 35: div r0.yzw, r0.yyzw, r1.wwww 36: mul r1.w, cb3[16].x, l(11.200000) 37: div r0.x, r0.x, r1.w 38: log r0.x, r0.x 39: mul r0.x, r0.x, cb3[16].z 40: exp r0.x, r0.x 41: mul r0.x, r1.w, r0.x 42: div r0.x, cb3[16].x, r0.x 43: mul r1.xyz, r1.xyzx, r0.xxxx 44: mad r2.xyz, cb3[7].xxxx, r1.xyzx, cb3[7].yyyy 45: mul r3.xy, cb3[8].yzyy, cb3[8].xxxx 46: mad r2.xyz, r1.xyzx, r2.xyzx, r3.yyyy 47: mul r0.x, cb3[7].y, cb3[7].z 48: mad r4.xyz, cb3[7].xxxx, r1.xyzx, r0.xxxx 49: mad r1.xyz, r1.xyzx, r4.xyzx, r3.xxxx 50: div r1.xyz, r1.xyzx, r2.xyzx 51: mad r0.x, cb3[7].x, l(11.200000), r0.x 52: mad r0.x, r0.x, l(11.200000), r3.x 53: div r1.w, cb3[8].y, cb3[8].z 54: add r1.xyz, -r1.wwww, r1.xyzx 55: max r1.xyz, r1.xyzx, l(0, 0, 0, 0) 56: mul r1.xyz, r1.xyzx, cb3[16].yyyy 57: mad r2.x, cb3[7].x, l(11.200000), cb3[7].y 58: mad r2.x, r2.x, l(11.200000), r3.y 59: div r0.x, r0.x, r2.x 60: add r0.x, -r1.w, r0.x 61: max r0.x, r0.x, l(0) 62: div r1.xyz, r1.xyzx, r0.xxxx 63: add r0.xyz, r0.yzwy, -r1.xyzx 64: mad o0.xyz, cb3[13].xxxx, r0.xyzx, r1.xyzx 65: mov o0.w, l(1.000000) 66: ret |
It looks intimidating, but it's not that bad. A quick analysis shows that the Uncharted 2 equation is evaluated twice. Translated to HLSL:
cbuffer cBuffer : register (b3) { float4 cb3_v0; float4 cb3_v1; float4 cb3_v2; float4 cb3_v3; float4 cb3_v4; float4 cb3_v5; float4 cb3_v6; float4 cb3_v7; float4 cb3_v8; float4 cb3_v9; float4 cb3_v10; float4 cb3_v11; float4 cb3_v12; float4 cb3_v13; float4 cb3_v14; float4 cb3_v15; float4 cb3_v16, cb3_v17; } Texture2D TexHDRColor : register (t0); Texture2D TexAvgLuminance : register (t1); float3 U2Func( float A, float B, float C, float D, float E, float F, float3 x ) { return ((x*(A*x+C*B)+D*E)/(x*(A*x+B)+D*F)) - E/F; } float3 ToneMapU2Func( float A, float B, float C, float D, float E, float F, float3 color, float numMultiplier ) { float3 numerator = U2Func( A, B, C, D, E, F, color ); numerator = max( numerator, 0 ); numerator.rgb *= numMultiplier; float3 denominator = U2Func( A, B, C, D, E, F, 11.2 ); denominator = max( denominator, 0 ); return numerator / denominator; } struct VS_OUTPUT_POSTFX { float4 Position : SV_Position; }; float getExposure(float avgLuminance, float minLuminance, float maxLuminance, float middleGray, float powParam) { avgLuminance = clamp( avgLuminance, minLuminance, maxLuminance ); avgLuminance = max( avgLuminance, 1e-4 ); float scaledWhitePoint = middleGray * 11.2; float luma = avgLuminance / scaledWhitePoint; luma = pow( luma, powParam); luma = luma * scaledWhitePoint; float exposure = middleGray / luma; return exposure; } float4 ToneMappingPS( VS_OUTPUT_POSTFX Input) : SV_Target0 { float avgLuminance = TexAvgLuminance.Load( int3(0, 0, 0) ); float exposure1 = getExposure( avgLuminance, cb3_v9.y, cb3_v9.z, cb3_v17.x, cb3_v17.z); float exposure2 = getExposure( avgLuminance, cb3_v4.y, cb3_v4.z, cb3_v16.x, cb3_v16.z); float3 HDRColor = TexHDRColor.Load( uint3(Input.Position.xy, 0) ).rgb; float3 color1 = ToneMapU2Func( cb3_v11.x, cb3_v11.y, cb3_v11.z, cb3_v12.x, cb3_v12.y, cb3_v12.z, exposure1*HDRColor, cb3_v17.y); float3 color2 = ToneMapU2Func( cb3_v7.x, cb3_v7.y, cb3_v7.z, cb3_v8.x, cb3_v8.y, cb3_v8.z, exposure2*HDRColor, cb3_v16.y); float3 finalColor = lerp( color2, color1, cb3_v13.x ); return float4(finalColor, 1); } |
So there are two sets of parameters, two tonemapped colors are computed, and the results are blended at the end. Clever.
Part 2: Eye Adaptation
Welcome to the second post of this mini-series in which I take apart the rendering of The Witcher 3. This time it's an easy one.
In part 1 I showed how the tonemapping is done. While explaining the theory I briefly mentioned eye adaptation; this time let's see how it is handled.
But what is eye adaptation, and why do we need it? Wikipedia knows. Imagine being in a dark room, or a cave. When you step outside, it is bright, the source of that brightness being the Sun.
In the dark our pupils dilate so that more light reaches the retina. When it gets bright, the pupils shrink and we squint, partly because it "hurts". The change is not instantaneous; the eyes need time to adapt to the new level of brightness. That is why we simulate eye adaptation in real-time rendering.
A good example of what happens without it is HDRToneMappingCS11 from the official DirectX SDK: sudden jumps in average luminance are unpleasant and feel unnatural.
Let's get started. For consistency, we will analyze the same frame as before.

Eye adaptation usually happens right before tonemapping, and The Witcher 3 is no exception.

Looking at the pixel shader state, there are two inputs, both R32_FLOAT and 1x1 pixels: texture 0 is the previous frame's average luminance and texture 1 is the current frame's.

Let's look at the code:
ps_5_0 dcl_globalFlags refactoringAllowed dcl_constantbuffer cb3[1], immediateIndexed dcl_sampler s0, mode_default dcl_sampler s1, mode_default dcl_resource_texture2d (float,float,float,float) t0 dcl_resource_texture2d (float,float,float,float) t1 dcl_output o0.xyzw dcl_temps 1 0: sample_l(texture2d)(float,float,float,float) r0.x, l(0, 0, 0, 0), t1.xyzw, s1, l(0) 1: sample_l(texture2d)(float,float,float,float) r0.y, l(0, 0, 0, 0), t0.yxzw, s0, l(0) 2: ge r0.z, r0.y, r0.x 3: add r0.x, -r0.y, r0.x 4: movc r0.z, r0.z, cb3[0].x, cb3[0].y 5: mad o0.xyzw, r0.zzzz, r0.xxxx, r0.yyyy 6: ret |
Just seven lines. Let's go through them:
- Load the current frame's average luminance
- Load the previous frame's average luminance
- Test whether the current frame got brighter or darker than the previous one
- Compute the luminance difference between the two frames
- Pick the adaptation speed; there are separate values for brightening and darkening. Clever, since you can imagine the eye adapting at different rates in each direction. In practice the two values are nearly identical every frame, roughly 0.11 to 0.35. Finally compute the adapted luminance: adaptedLuminance = speed * difference + previousLuminance
- Done
And finally the HLSL code:

// The Witcher 3 eye adaptation shader

cbuffer cBuffer : register (b3)
{
    float4 cb3_v0;
}

struct VS_OUTPUT_POSTFX
{
    float4 Position : SV_Position;
};

SamplerState samplerPointClamp  : register (s0);
SamplerState samplerPointClamp2 : register (s1);

Texture2D TexPreviousAvgLuminance : register (t0);
Texture2D TexCurrentAvgLuminance  : register (t1);

float4 TW3_EyeAdaptationPS(VS_OUTPUT_POSTFX Input) : SV_TARGET
{
    // Get the current and previous average luminance
    float currentAvgLuminance  = TexCurrentAvgLuminance.SampleLevel( samplerPointClamp2, float2(0.0, 0.0), 0 );
    float previousAvgLuminance = TexPreviousAvgLuminance.SampleLevel( samplerPointClamp, float2(0.0, 0.0), 0 );

    // Adaptation speed factor - it differs depending on whether luminance is rising or falling
    float adaptationSpeedFactor = (currentAvgLuminance <= previousAvgLuminance) ? cb3_v0.x : cb3_v0.y;

    // Calculate the adapted luminance
    float adaptedLuminance = lerp( previousAvgLuminance, currentAvgLuminance, adaptationSpeedFactor );

    return adaptedLuminance;
}
The generated assembly is identical, though I would suggest returning a float instead of a float4 here; no point wasting bandwidth.
So that is how eye adaptation is done. Simple, isn't it?
Part 3: Chromatic Aberration
Welcome to the third episode of uncovering the rendering secrets of The Witcher 3.
Today we look at chromatic aberration.
Chromatic aberration is an artifact of cheap lenses: when a lens refracts different wavelengths by different amounts, this kind of distortion appears. Not everyone likes it, but here it is very subtle and doesn't get in the way of gameplay. You can also simply turn it off.


Can you spot the difference? Neither can I. Let's look at another scene.

This one is better; the effect is subtle. Still, I was curious how it is implemented.
Implementation
First, find the draw call with the relevant pixel shader. It turns out chromatic aberration is only a small piece of a big final post-processing pass: a single pixel shader that handles chromatic aberration, vignette and gamma correction.
Let's look at the code:
ps_5_0 dcl_globalFlags refactoringAllowed dcl_constantbuffer cb3[18], immediateIndexed dcl_sampler s1, mode_default dcl_resource_texture2d (float,float,float,float) t0 dcl_input_ps_siv v0.xy, position dcl_input_ps linear v1.zw dcl_output o0.xyzw dcl_temps 4 0: mul r0.xy, v0.xyxx, cb3[17].zwzz 1: mad r0.zw, v0.xxxy, cb3[17].zzzw, -cb3[17].xxxy 2: div r0.zw, r0.zzzw, cb3[17].xxxy 3: dp2 r1.x, r0.zwzz, r0.zwzz 4: sqrt r1.x, r1.x 5: add r1.y, r1.x, -cb3[16].y 6: mul_sat r1.y, r1.y, cb3[16].z 7: sample_l(texture2d)(float,float,float,float) r2.xyz, r0.xyxx, t0.xyzw, s1, l(0) 8: lt r1.z, l(0), r1.y 9: if_nz r1.z 10: mul r1.y, r1.y, r1.y 11: mul r1.y, r1.y, cb3[16].x 12: max r1.x, r1.x, l(0.000100) 13: div r1.x, r1.y, r1.x 14: mul r0.zw, r0.zzzw, r1.xxxx 15: mul r0.zw, r0.zzzw, cb3[17].zzzw 16: mad r0.xy, -r0.zwzz, l(2.000000, 2.000000, 0.000000, 0.000000), r0.xyxx 17: sample_l(texture2d)(float,float,float,float) r2.x, r0.xyxx, t0.xyzw, s1, l(0) 18: mad r0.xy, v0.xyxx, cb3[17].zwzz, -r0.zwzz 19: sample_l(texture2d)(float,float,float,float) r2.y, r0.xyxx, t0.xyzw, s1, l(0) 20: endif ... |
And the constant buffer values:

All right, let's make sense of it.
cb3_v17.xy is the heart of the effect. The shader computes the vector from the pixel to the aberration center and its length, then derives a few values, performs a test and branches.
When chromatic aberration is applied, an offset is computed from the cbuffer values and used to distort the R and G channels.
As usual, the effect gets stronger towards the corners of the screen. Line 10 is interesting: it keeps the samples closer to the pixel as the aberration is strengthened.
I'm happy to share my implementation. Just don't mind the variable names, and note that this effect runs before gamma correction.

void ChromaticAberration( float2 uv, inout float3 color )
{
    // User-defined values
    float2 chromaticAberrationCenter = float2(0.5, 0.5);
    float chromaticAberrationCenterAvoidanceDistance = 0.2;
    float fA = 1.25;
    float fChromaticAbberationIntensity = 30;
    float fChromaticAberrationDistortionSize = 0.75;

    // Calculate the vector from the center and its length
    float2 chromaticAberrationOffset = uv - chromaticAberrationCenter;
    chromaticAberrationOffset = chromaticAberrationOffset / chromaticAberrationCenter;
    float chromaticAberrationOffsetLength = length(chromaticAberrationOffset);

    // To avoid starting right at the center, subtract a small distance
    float chromaticAberrationOffsetLengthFixed = chromaticAberrationOffsetLength - chromaticAberrationCenterAvoidanceDistance;
    float chromaticAberrationTexel = saturate(chromaticAberrationOffsetLengthFixed * fA);

    float fApplyChromaticAberration = (0.0 < chromaticAberrationTexel);
    if (fApplyChromaticAberration)
    {
        chromaticAberrationTexel *= chromaticAberrationTexel;
        chromaticAberrationTexel *= fChromaticAberrationDistortionSize;

        chromaticAberrationOffsetLength = max(chromaticAberrationOffsetLength, 1e-4);
        float fMultiplier = chromaticAberrationTexel / chromaticAberrationOffsetLength;

        chromaticAberrationOffset *= fMultiplier;
        chromaticAberrationOffset *= g_Viewport.zw;
        chromaticAberrationOffset *= fChromaticAbberationIntensity;

        float2 offsetUV = -chromaticAberrationOffset * 2 + uv;
        color.r = TexColorBuffer.SampleLevel(samplerLinearClamp, offsetUV, 0).r;

        offsetUV = uv - chromaticAberrationOffset;
        color.g = TexColorBuffer.SampleLevel(samplerLinearClamp, offsetUV, 0).g;
    }
}
Here I added fChromaticAberrationIntensity to amplify the size of the offset, so the strength of the effect is governed by it. In The Witcher 3 it equals 1.
This is what it looks like with an intensity of 40:

Part 4: Vignette
Vignette is one of the most widely used post-processing effects in games, and it is common in photography as well. A subtle vignette can look really good. There are several kinds of vignette; Unreal Engine, for instance, uses a "natural" one.
Back to The Witcher 3. Here is an interactive comparison, taken from NVIDIA's performance guide.
Note how the sky in the top-left corner is darker than the rest of the image; I'll get back to that later.
Implementation details

First of all, the vignette is implemented slightly differently in the original The Witcher 3 and in the Blood and Wine expansion. The former computes an inverse gradient in the pixel shader, while the latter uses a precomputed 256x256 texture.

I'll use the Blood and Wine shader. Like in many other games, the vignette in The Witcher 3 is computed in the final post-processing pass. Let's look at the assembly:
... 44: log r0.xyz, r0.xyzx 45: mul r0.xyz, r0.xyzx, l(0.454545, 0.454545, 0.454545, 0.000000) 46: exp r0.xyz, r0.xyzx 47: mul r1.xyz, r0.xyzx, cb3[9].xyzx 48: sample_indexable(texture2d)(float,float,float,float) r0.w, v1.zwzz, t2.yzwx, s2 49: log r2.xyz, r1.xyzx 50: mul r2.xyz, r2.xyzx, l(2.200000, 2.200000, 2.200000, 0.000000) 51: exp r2.xyz, r2.xyzx 52: dp3 r1.w, r2.xyzx, cb3[6].xyzx 53: add_sat r1.w, -r1.w, l(1.000000) 54: mul r1.w, r1.w, cb3[6].w 55: mul_sat r0.w, r0.w, r1.w 56: mad r0.xyz, -r0.xyzx, cb3[9].xyzx, cb3[7].xyzx 57: mad r0.xyz, r0.wwww, r0.xyzx, r1.xyzx ... |
Interesting: the vignette apparently operates on both gamma-space (line 46) and linear-space (line 51) values. Line 48 samples the vignette texture.

cb3[9].xyz has nothing to do with the vignette; it is (1, 1, 1) in every frame I checked. It is probably related to some fade in/out effect.
The vignette has three main parameters:
- Opacity (cb3[6].w), which controls the strength: 0 means no vignette, 1 is the strongest. From what I have seen it is 1 in The Witcher 3 and around 0.15 in Blood and Wine.
- Color (cb3[7].xyz): the vignette color can change and doesn't have to be black; usually it is around (3/255, 4/255, 5/255).
- Weights (cb3[6].xyz), used to weight the linear-space color when computing the mask.
This is interesting. I usually see "flat" vignettes, like this one:

But with the weights you get a more interesting result:

The weights are close to 1 (here is a cbuffer capture from Blood and Wine), and this is why bright pixels are barely affected.
The computed mask is then used to interpolate between the image color and the vignette color.
Code-wise, here is my implementation:

/*
// The Witcher 3 vignette.
//
// Input color is in gamma space, and so is the output.
*/
float3 Vignette_TW3( in float3 gammaColor, in float3 vignetteColor, in float3 vignetteWeights,
                     in float vignetteOpacity, in Texture2D texVignette, in float2 texUV )
{
    // Calculate the vignette weight from the linear-space color
    float vignetteWeight = dot( GammaToLinear( gammaColor ), vignetteWeights );

    // Keep the vignette weight in [0, 1]
    vignetteWeight = saturate( 1.0 - vignetteWeight );

    // Multiply by opacity
    vignetteWeight *= vignetteOpacity;

    // Fetch the vignette mask - you could also compute it here yourself
    float sampledVignetteMask = texVignette.Sample( samplerLinearClamp, texUV ).x;

    // Final vignette mask
    float finalInvVignetteMask = saturate( vignetteWeight * sampledVignetteMask );

    // Blend in gamma space
    float3 Color = lerp( gammaColor, vignetteColor, finalInvVignetteMask );

    // Return final color
    return Color;
}
Bonus: computing the gradient. The original The Witcher 3 uses an inverse gradient instead of sampling a precomputed texture. Let's look at the assembly:
35: add r2.xy, v1.zwzz, l(-0.500000, -0.500000, 0.000000, 0.000000) 36: dp2 r1.w, r2.xyxx, r2.xyxx 37: sqrt r1.w, r1.w 38: mad r1.w, r1.w, l(2.000000), l(-0.550000) 39: mul_sat r2.w, r1.w, l(1.219512) 40: mul r2.z, r2.w, r2.w 41: mul r2.xy, r2.zwzz, r2.zzzz 42: dp4 r1.w, l(-0.100000, -0.105000, 1.120000, 0.090000), r2.xyzw 43: min r1.w, r1.w, l(0.940000) |
Luckily, this one is simple:
float TheWitcher3_2015_Mask( in float2 uv ) { float distanceFromCenter = length( uv - float2(0.5, 0.5) ); float x = distanceFromCenter * 2.0 - 0.55; x = saturate( x * 1.219512 ); // 1.219512 = 100/82 float x2 = x * x; float x3 = x2 * x; float x4 = x2 * x2; float outX = dot( float4(x4, x3, x2, x), float4(-0.10, -0.105, 1.12, 0.09) ); outX = min( outX, 0.94 ); return outX; } |
Basically: compute the distance to the center, do some magic on it (multiplies, saturates...) and feed the result into a polynomial.

Part 5: Drunk effect
Evening:
Night:
The first thing we see is the image doubled and rotating, which is pretty common when we are not sober. The farther from the center, the stronger the rotation. I included the second video because the rotation is clearly visible on the stars.
The second part, maybe harder to spot at first glance, is a slight zooming in and out.
This is obviously a post-process, but its place in the pipeline is not necessarily clear. It turns out it sits right after tonemapping and before motion blur.
Let's look at the assembly:
ps_5_0 dcl_globalFlags refactoringAllowed dcl_constantbuffer cb0[2], immediateIndexed dcl_constantbuffer cb3[3], immediateIndexed dcl_sampler s0, mode_default dcl_resource_texture2d (float,float,float,float) t0 dcl_input_ps_siv v1.xy, position dcl_output o0.xyzw dcl_temps 8 0: mad r0.x, cb3[0].y, l(-0.100000), l(1.000000) 1: mul r0.yz, cb3[1].xxyx, l(0.000000, 0.050000, 0.050000, 0.000000) 2: mad r1.xy, v1.xyxx, cb0[1].zwzz, -cb3[2].xyxx 3: dp2 r0.w, r1.xyxx, r1.xyxx 4: sqrt r1.z, r0.w 5: mul r0.w, r0.w, l(10.000000) 6: min r0.w, r0.w, l(1.000000) 7: mul r0.w, r0.w, cb3[0].y 8: mul r2.xyzw, r0.yzyz, r1.zzzz 9: mad r2.xyzw, r1.xyxy, r0.xxxx, -r2.xyzw 10: mul r3.xy, r0.xxxx, r1.xyxx 11: mad r3.xyzw, r0.yzyz, r1.zzzz, r3.xyxy 12: add r3.xyzw, r3.xyzw, cb3[2].xyxy 13: add r2.xyzw, r2.xyzw, cb3[2].xyxy 14: mul r0.x, r0.w, cb3[0].x 15: mul r0.x, r0.x, l(5.000000) 16: mul r4.xyzw, r0.xxxx, cb3[0].zwzw 17: mad r5.xyzw, r4.zwzw, l(1.000000, 0.000000, -1.000000, -0.000000), r2.xyzw 18: sample_indexable(texture2d)(float,float,float,float) r6.xyzw, r5.xyxx, t0.xyzw, s0 19: sample_indexable(texture2d)(float,float,float,float) r5.xyzw, r5.zwzz, t0.xyzw, s0 20: add r5.xyzw, r5.xyzw, r6.xyzw 21: mad r6.xyzw, r4.zwzw, l(0.707000, 0.707000, -0.707000, -0.707000), r2.xyzw 22: sample_indexable(texture2d)(float,float,float,float) r7.xyzw, r6.xyxx, t0.xyzw, s0 23: sample_indexable(texture2d)(float,float,float,float) r6.xyzw, r6.zwzz, t0.xyzw, s0 24: add r5.xyzw, r5.xyzw, r7.xyzw 25: add r5.xyzw, r6.xyzw, r5.xyzw 26: mad r6.xyzw, r4.zwzw, l(0.000000, 1.000000, -0.000000, -1.000000), r2.xyzw 27: mad r2.xyzw, r4.xyzw, l(-0.707000, 0.707000, 0.707000, -0.707000), r2.xyzw 28: sample_indexable(texture2d)(float,float,float,float) r7.xyzw, r6.xyxx, t0.xyzw, s0 29: sample_indexable(texture2d)(float,float,float,float) r6.xyzw, r6.zwzz, t0.xyzw, s0 30: add r5.xyzw, r5.xyzw, r7.xyzw 31: add r5.xyzw, r6.xyzw, r5.xyzw 32: sample_indexable(texture2d)(float,float,float,float) r6.xyzw, r2.xyxx, t0.xyzw, s0 33: sample_indexable(texture2d)(float,float,float,float) r2.xyzw, r2.zwzz, t0.xyzw, s0 34: add r5.xyzw, r5.xyzw, r6.xyzw 35: add r2.xyzw, r2.xyzw, r5.xyzw 36: mul r2.xyzw, r2.xyzw, l(0.062500, 0.062500, 0.062500, 0.062500) 37: mad r5.xyzw, r4.zwzw, l(1.000000, 0.000000, -1.000000, -0.000000), r3.zwzw 38: sample_indexable(texture2d)(float,float,float,float) r6.xyzw, r5.xyxx, t0.xyzw, s0 39: sample_indexable(texture2d)(float,float,float,float) r5.xyzw, r5.zwzz, t0.xyzw, s0 40: add r5.xyzw, r5.xyzw, r6.xyzw 41: mad r6.xyzw, r4.zwzw, l(0.707000, 0.707000, -0.707000, -0.707000), r3.zwzw 42: sample_indexable(texture2d)(float,float,float,float) r7.xyzw, r6.xyxx, t0.xyzw, s0 43: sample_indexable(texture2d)(float,float,float,float) r6.xyzw, r6.zwzz, t0.xyzw, s0 44: add r5.xyzw, r5.xyzw, r7.xyzw 45: add r5.xyzw, r6.xyzw, r5.xyzw 46: mad r6.xyzw, r4.zwzw, l(0.000000, 1.000000, -0.000000, -1.000000), r3.zwzw 47: mad r3.xyzw, r4.xyzw, l(-0.707000, 0.707000, 0.707000, -0.707000), r3.xyzw 48: sample_indexable(texture2d)(float,float,float,float) r4.xyzw, r6.xyxx, t0.xyzw, s0 49: sample_indexable(texture2d)(float,float,float,float) r6.xyzw, r6.zwzz, t0.xyzw, s0 50: add r4.xyzw, r4.xyzw, r5.xyzw 51: add r4.xyzw, r6.xyzw, r4.xyzw 52: sample_indexable(texture2d)(float,float,float,float) r5.xyzw, r3.xyxx, t0.xyzw, s0 53: sample_indexable(texture2d)(float,float,float,float) r3.xyzw, r3.zwzz, t0.xyzw, s0 54: add r4.xyzw, r4.xyzw, r5.xyzw 55: add r3.xyzw, r3.xyzw, r4.xyzw 56: mad r2.xyzw, r3.xyzw, l(0.062500, 0.062500, 0.062500, 0.062500), r2.xyzw 
57: mul r0.x, cb3[0].y, l(8.000000) 58: mul r0.xy, r0.xxxx, cb3[0].zwzz 59: mad r0.z, cb3[1].y, l(0.020000), l(1.000000) 60: mul r1.zw, r0.zzzz, r1.xxxy 61: mad r1.xy, r1.xyxx, r0.zzzz, cb3[2].xyxx 62: mad r3.xy, r1.zwzz, r0.xyxx, r1.xyxx 63: mul r0.xy, r0.xyxx, r1.zwzz 64: mad r0.xy, r0.xyxx, l(2.000000, 2.000000, 0.000000, 0.000000), r1.xyxx 65: sample_indexable(texture2d)(float,float,float,float) r1.xyzw, r1.xyxx, t0.xyzw, s0 66: sample_indexable(texture2d)(float,float,float,float) r4.xyzw, r0.xyxx, t0.xyzw, s0 67: sample_indexable(texture2d)(float,float,float,float) r3.xyzw, r3.xyxx, t0.xyzw, s0 68: add r1.xyzw, r1.xyzw, r3.xyzw 69: add r1.xyzw, r4.xyzw, r1.xyzw 70: mad r2.xyzw, -r1.xyzw, l(0.333333, 0.333333, 0.333333, 0.333333), r2.xyzw 71: mul r1.xyzw, r1.xyzw, l(0.333333, 0.333333, 0.333333, 0.333333) 72: mul r0.xyzw, r0.wwww, r2.xyzw 73: mad o0.xyzw, cb3[0].yyyy, r0.xyzw, r1.xyzw 74: ret |
Two constant buffers are used. Let's look at the values:


A few interesting ones:
- cb0_v0.x: elapsed time
- cb0_v1.xyzw: viewport and texel size
- cb3_v0.x: per-pixel rotation, always 1
- cb3_v0.y: intensity of the drunk effect; when the effect kicks in it ramps from 0 up to 1
- cb3_v1.xy: pixel offsets. This is a sin/cos pair, so it can be produced with sincos(time)
- cb3_v2.xy: center of the effect, usually (0.5, 0.5)
Here I'm more interested in understanding how it works than in rewriting the assembly line by line. Let's start with the first few lines:
ps_5_0 0: mad r0.x, cb3[0].y, l(-0.100000), l(1.000000) 1: mul r0.yz, cb3[1].xxyx, l(0.000000, 0.050000, 0.050000, 0.000000) 2: mad r1.xy, v1.xyxx, cb0[1].zwzz, -cb3[2].xyxx 3: dp2 r0.w, r1.xyxx, r1.xyxx 4: sqrt r1.z, r0.w |
Line 0 is a zoom factor; you will see why in a moment. Then come the rotation offsets, which are just the sin/cos pair multiplied by 0.05.
Lines 2-4 compute the vector from the texel to the effect center, followed by its squared length and its length.
Scaling the texture coordinates
Moving on:
8: mul r2.xyzw, r0.yzyz, r1.zzzz 9: mad r2.xyzw, r1.xyxy, r0.xxxx, -r2.xyzw 10: mul r3.xy, r0.xxxx, r1.xyxx 11: mad r3.xyzw, r0.yzyz, r1.zzzz, r3.xyxy 12: add r3.xyzw, r3.xyzw, cb3[2].xyxy 13: add r2.xyzw, r2.xyzw, cb3[2].xyxy |
Because of the way the values are packed, we only need to analyze one pair of floats.
To recap: r0.yz holds the rotation offsets, r1.z the distance from the center, r1.xy the vector from the center, and r0.x the zoom factor.
Assuming for a moment that the zoom factor is 1, we can write:

r2.xy = (texel - center) * zoomFactor - rotationOffsets * distanceFromCenter + center;

But zoomFactor = 1.0:

r2.xy = texel - center - rotationOffsets * distanceFromCenter + center;
r2.xy = texel - rotationOffsets * distanceFromCenter;

Similarly for r3.xy:

10: mul r3.xy, r0.xxxx, r1.xyxx
11: mad r3.xyzw, r0.yzyz, r1.zzzz, r3.xyxy
12: add r3.xyzw, r3.xyzw, cb3[2].xyxy
r3.xy = rotationOffsets * distanceFromCenter + zoomFactor * (texel - center) + center

But zoomFactor = 1.0:

r3.xy = rotationOffsets * distanceFromCenter + texel - center + center
r3.xy = texel + rotationOffsets * distanceFromCenter

Good. So we have the texture UVs and the rotation offsets, but what about the zoom factor? Looking back at line 0, it is basically zoomFactor = 1.0 - 0.1 * drunkAmount.
At maximum drunkenness that is 0.9, so:

baseTexcoordsA = 0.9 * texel + 0.1 * center + rotationOffsets * distanceFromCenter
baseTexcoordsB = 0.9 * texel + 0.1 * center - rotationOffsets * distanceFromCenter
Put more intuitively, this simply blends the normalized coordinates towards the center in order to zoom the image in. The best way to understand it is to write it yourself; there is a shadertoy to experiment with here.
Texcoord offsets
2: mad r1.xy, v1.xyxx, cb0[1].zwzz, -cb3[2].xyxx 3: dp2 r0.w, r1.xyxx, r1.xyxx 5: mul r0.w, r0.w, l(10.000000) 6: min r0.w, r0.w, l(1.000000) 7: mul r0.w, r0.w, cb3[0].y 14: mul r0.x, r0.w, cb3[0].x 15: mul r0.x, r0.x, l(5.000000) // texcoords offset intensity 16: mul r4.xyzw, r0.xxxx, cb3[0].zwzw // texcoords offset |
This produces a gradient; let's call it the offset intensity mask. There are actually two of them: one in r0.w, and another one, five times stronger, in r0.x (line 15). The latter is then multiplied by the texel size (line 16).
Sampling - the rotation part
Next comes a series of texture fetches: two sets of 8 samples each, in fact.
static const float2 pointsAroundPixel[8] = { float2(1.0, 0.0), float2(-1.0, 0.0), float2(0.707, 0.707), float2(-0.707, -0.707), float2(0.0, 1.0), float2(0.0, -1.0), float2(-0.707, 0.707), float2(0.707, -0.707) }; float4 colorA = 0; float4 colorB = 0; int i=0; [unroll] for (i = 0; i < 8; i++) { colorA += TexColorBuffer.Sample( samplerLinearClamp, baseTexcoordsA + texcoordsOffset * pointsAroundPixel[i] ); } colorA /= 16.0; [unroll] for (i = 0; i < 8; i++) { colorB += TexColorBuffer.Sample( samplerLinearClamp, baseTexcoordsB + texcoordsOffset * pointsAroundPixel[i] ); } colorB /= 16.0; float4 rotationPart = colorA + colorB; |
We add offsets to the UVs based on points on a unit circle around the pixel, multiplied by the offset intensity computed earlier; the farther from the center, the larger the radius. Eight samples per set, which is what makes it so visible on the stars. The values are stored in pointsAroundPixel:

Sampling - the zooming part
The second part is the zooming. Let's look at the assembly:
56: mad r2.xyzw, r3.xyzw, l(0.062500, 0.062500, 0.062500, 0.062500), r2.xyzw // the rotation part is stored in r2 register 57: mul r0.x, cb3[0].y, l(8.000000) 58: mul r0.xy, r0.xxxx, cb3[0].zwzz 59: mad r0.z, cb3[1].y, l(0.020000), l(1.000000) 60: mul r1.zw, r0.zzzz, r1.xxxy 61: mad r1.xy, r1.xyxx, r0.zzzz, cb3[2].xyxx 62: mad r3.xy, r1.zwzz, r0.xyxx, r1.xyxx 63: mul r0.xy, r0.xyxx, r1.zwzz 64: mad r0.xy, r0.xyxx, l(2.000000, 2.000000, 0.000000, 0.000000), r1.xyxx 65: sample_indexable(texture2d)(float,float,float,float) r1.xyzw, r1.xyxx, t0.xyzw, s0 66: sample_indexable(texture2d)(float,float,float,float) r4.xyzw, r0.xyxx, t0.xyzw, s0 67: sample_indexable(texture2d)(float,float,float,float) r3.xyzw, r3.xyxx, t0.xyzw, s0 68: add r1.xyzw, r1.xyzw, r3.xyzw 69: add r1.xyzw, r4.xyzw, r1.xyzw |
There are three texture fetches here, with three different coordinates. Let's work out how the coordinates are computed. But first, the inputs to this part are:

float zoomInOutScalePixels = drunkEffectAmount * 8.0; // line 57
float2 zoomInOutScaleNormalizedScreenCoordinates = zoomInOutScalePixels * texelSize.xy; // line 58
float zoomInOutAmplitude = 1.0 + 0.02*cos(time); // line 59
float2 zoomInOutfromCenterToTexel = zoomInOutAmplitude * fromCenterToTexel; // line 60
A pixel offset (8 texels) is computed and later added to the base UVs. The amplitude oscillates between 0.98 and 1.02, which gives the zooming in and out, just like the zoom factor in the rotation part.
Let's start with the first pair, r1.xy (line 61):

r1.xy = fromCenterToTexel * amplitude + center
r1.xy = (TextureUV - Center) * amplitude + Center // you can insert zoomInOutfromCenterToTexel here
r1.xy = TextureUV * amplitude - Center * amplitude + Center
r1.xy = TextureUV * amplitude + Center * 1.0 - Center * amplitude
r1.xy = TextureUV * amplitude + Center * (1.0 - amplitude)
r1.xy = lerp( Center, TextureUV, amplitude );

So:

float2 zoomInOutBaseTextureUV = lerp( Center, TextureUV, amplitude );
The second pair, r3.xy (line 62):

r3.xy = (amplitude * fromCenterToTexel) * zoomInOutScaleNormalizedScreenCoordinates + zoomInOutBaseTextureUV

So:

float2 zoomInOutAddTextureUV0 = zoomInOutBaseTextureUV + zoomInOutfromCenterToTexel*zoomInOutScaleNormalizedScreenCoordinates;
And the third pair, r0.xy (lines 63-64):

r0.xy = zoomInOutScaleNormalizedScreenCoordinates * (amplitude * fromCenterToTexel) * 2.0 + zoomInOutBaseTextureUV

So:

float2 zoomInOutAddTextureUV1 = zoomInOutBaseTextureUV + 2.0*zoomInOutfromCenterToTexel*zoomInOutScaleNormalizedScreenCoordinates;
The three texture fetches are summed up and the result lands in the r1 register. Worth noting: the pixel shader samples with a clamp addressing mode.
Putting it all together
We now have the rotation part in r2 and the three zooming samples in r1. Let's look at the end of the assembly:

70: mad r2.xyzw, -r1.xyzw, l(0.333333, 0.333333, 0.333333, 0.333333), r2.xyzw
71: mul r1.xyzw, r1.xyzw, l(0.333333, 0.333333, 0.333333, 0.333333)
72: mul r0.xyzw, r0.wwww, r2.xyzw
73: mad o0.xyzw, cb3[0].yyyy, r0.xyzw, r1.xyzw
74: ret
The remaining inputs: r0.w is the intensity mask and cb3[0].y is the drunkenness amount.
Let's see how this works.
At first I wrote it the brute-force way:

float4 finalColor = intensityMask * (rotationPart - zoomingPart);
finalColor = drunkIntensity * finalColor + zoomingPart;

return finalColor;
Of course nobody writes it like that, so I took pen and paper and worked out the equations:

finalColor = effectAmount * [intensityMask * (rotationPart - zoomPart)] + zoomPart
finalColor = effectAmount * intensityMask * rotationPart - effectAmount * intensityMask * zoomPart + zoomPart

Let t = effectAmount * intensityMask

So we have:
finalColor = t * rotationPart - t * zoomPart + zoomPart
finalColor = t * rotationPart + zoomPart - t * zoomPart
finalColor = t * rotationPart + (1.0 - t) * zoomPart
finalColor = lerp( zoomPart, rotationPart, t )

Finally:
finalColor = lerp( zoomingPart, rotationPart, intensityMask * drunkIntensity );
Part 6: Sharpening
Sharpening in The Witcher 3 has two presets, low and high; I'll get to the difference later.




If you want an interactive comparison, go here.
As you can see, the effect is most visible on grass and vegetation.
In this post we will study a frame from the very beginning of the game; I picked it because it contains both terrain and sky.
The inputs are the color buffer (LDR, after tonemapping) and the depth buffer.
Let's look at the pixel shader assembly:
ps_5_0 dcl_globalFlags refactoringAllowed dcl_constantbuffer cb3[3], immediateIndexed dcl_constantbuffer cb12[23], immediateIndexed dcl_sampler s0, mode_default dcl_resource_texture2d (float,float,float,float) t0 dcl_resource_texture2d (float,float,float,float) t1 dcl_input_ps_siv v0.xy, position dcl_output o0.xyzw dcl_temps 7 0: ftoi r0.xy, v0.xyxx 1: mov r0.zw, l(0, 0, 0, 0) 2: ld_indexable(texture2d)(float,float,float,float) r0.x, r0.xyzw, t1.xyzw 3: mad r0.x, r0.x, cb12[22].x, cb12[22].y 4: mad r0.y, r0.x, cb12[21].x, cb12[21].y 5: max r0.y, r0.y, l(0.000100) 6: div r0.y, l(1.000000, 1.000000, 1.000000, 1.000000), r0.y 7: mad_sat r0.y, r0.y, cb3[1].z, cb3[1].w 8: add r0.z, -cb3[1].x, cb3[1].y 9: mad r0.y, r0.y, r0.z, cb3[1].x 10: add r0.y, r0.y, l(1.000000) 11: ge r0.x, r0.x, l(1.000000) 12: movc r0.x, r0.x, l(0), l(1.000000) 13: mul r0.z, r0.x, r0.y 14: round_z r1.xy, v0.xyxx 15: add r1.xy, r1.xyxx, l(0.500000, 0.500000, 0.000000, 0.000000) 16: div r1.xy, r1.xyxx, cb3[0].zwzz 17: sample_l(texture2d)(float,float,float,float) r2.xyz, r1.xyxx, t0.xyzw, s0, l(0) 18: lt r0.z, l(0), r0.z 19: if_nz r0.z 20: div r3.xy, l(0.500000, 0.500000, 0.000000, 0.000000), cb3[0].zwzz 21: add r0.zw, r1.xxxy, -r3.xxxy 22: sample_l(texture2d)(float,float,float,float) r4.xyz, r0.zwzz, t0.xyzw, s0, l(0) 23: mov r3.zw, -r3.xxxy 24: add r5.xyzw, r1.xyxy, r3.zyxw 25: sample_l(texture2d)(float,float,float,float) r6.xyz, r5.xyxx, t0.xyzw, s0, l(0) 26: add r4.xyz, r4.xyzx, r6.xyzx 27: sample_l(texture2d)(float,float,float,float) r5.xyz, r5.zwzz, t0.xyzw, s0, l(0) 28: add r4.xyz, r4.xyzx, r5.xyzx 29: add r0.zw, r1.xxxy, r3.xxxy 30: sample_l(texture2d)(float,float,float,float) r1.xyz, r0.zwzz, t0.xyzw, s0, l(0) 31: add r1.xyz, r1.xyzx, r4.xyzx 32: mul r3.xyz, r1.xyzx, l(0.250000, 0.250000, 0.250000, 0.000000) 33: mad r1.xyz, -r1.xyzx, l(0.250000, 0.250000, 0.250000, 0.000000), r2.xyzx 34: max r0.z, abs(r1.z), abs(r1.y) 35: max r0.z, r0.z, abs(r1.x) 36: mad_sat r0.z, r0.z, cb3[2].x, cb3[2].y 37: mad r0.x, r0.y, r0.x, l(-1.000000) 38: mad r0.x, r0.z, r0.x, l(1.000000) 39: dp3 r0.y, l(0.212600, 0.715200, 0.072200, 0.000000), r2.xyzx 40: dp3 r0.z, l(0.212600, 0.715200, 0.072200, 0.000000), r3.xyzx 41: max r0.w, r0.y, l(0.000100) 42: div r1.xyz, r2.xyzx, r0.wwww 43: add r0.y, -r0.z, r0.y 44: mad r0.x, r0.x, r0.y, r0.z 45: max r0.x, r0.x, l(0) 46: mul r2.xyz, r0.xxxx, r1.xyzx 47: endif 48: mov o0.xyz, r2.xyzx 49: mov o0.w, l(1.000000) 50: ret |
50 lines. Not too bad.
Computing the sharpen amount
The first step loads the depth buffer. Note that The Witcher 3 uses reversed depth (1 = near, 0 = far); also, as you probably know, hardware depth is non-linear, see this article.
Lines 3-6 turn the hardware depth into a view-space [near, far] distance, and the way it is done is interesting. Looking at the cbuffer values:

With a near plane of 0.2 and a far plane of 5000, I believe cb12_v21.xy can be computed like this:

cb12_v21.y = 1/near
cb12_v21.x = -(1.0 / near) + (1.0 / near) * (near / far)

This pattern shows up all over The Witcher 3's shaders, so I guess there is a helper function for it.
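As a sketch of what such a helper might look like (my guess, based on the scale/bias pair derived above; the function and parameter names are made up):

// Hypothetical helper: reconstruct view-space distance from hardware depth.
// depthScaleFactors remaps reversed depth [1..0] to [0..1];
// cameraNearFar holds (x = -1/near + 1/far, y = 1/near) as derived above.
float GetViewSpaceDistance( float hwDepth, float2 depthScaleFactors, float2 cameraNearFar )
{
    // Reversed depth [1..0] -> [0..1]
    float d = hwDepth * depthScaleFactors.x + depthScaleFactors.y;

    // 1 / (d*x + y): d = 0 gives near, d = 1 gives far
    return 1.0 / max( d * cameraNearFar.x + cameraNearFar.y, 1e-4 );
}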
Once we have the view-space distance, line 7 uses a scale and bias (with a saturate) to build a blend factor:

cb3_v1.xy are the sharpening intensities at near and far distances; let's call them sharpenNear and sharpenFar. They are the only difference between the high and low sharpening presets.
Now for the blend factor: lines 8-9 simply lerp between those two intensities. What is it for? With it, the sharpening strength can differ between things close to Geralt and things far away.

It may not be obvious from the screenshot, but we interpolate between a near intensity (2.177151) and a far one (1.913) based on distance. Once that is computed, 1.0 is added to it. Why? That will be explained in more detail later.
We don't want sharpening to touch the sky, which is easy to achieve with a depth test, since the (remapped) skybox depth is 1:

// Do not perform sharpen on sky
float fSkyboxTest = (fDepth >= 1.0) ? 0 : 1;
We multiply by this sky mask:

This multiplication happens at line 13:

// Calculate final sharpen amount
float fSharpenAmount = fSharpenIntensity * fSkyboxTest;
Sampling the center of the pixel
This is where one important property of SV_Position comes into play: the half-pixel offset. The top-left pixel is not (0, 0) but (0.5, 0.5).
Since we want to sample the center of the pixel, look at lines 14-16:
// Sample the center of the pixel. // Get rid of "half-pixel" offset from SV_Position.xy. float2 uvCenter = trunc( Input.Position.xy ); // Add half-pixel to make sure we will sample the center of the pixel uvCenter += float2(0.5, 0.5); uvCenter /= g_Viewport.xy |
After that the texture is sampled with uvCenter. Don't worry, the result is equivalent to the usual SV_Position.xy / ViewportSize.xy.
Whether we sharpen or not depends on fSharpenAmount:
// Get the value of current pixel float3 colorCenter = TexColorBuffer.SampleLevel( samplerLinearClamp, uvCenter, 0 ).rgb; // Final result float3 finalColor = colorCenter; if ( fSharpenAmount > 0 ) { // do the sharpening here... } return float4( finalColor, 1 ); |
Sharpening
Let's look at the core algorithm. Basically:
- Sample the input color buffer four times, at the four corners of the pixel
- Average them
- Compute the difference between the center and the corner average
- Take the maximum absolute component of that difference
- Scale and bias it
- Use the result to decide how strongly the pixel gets sharpened
- Compute the luminance of the center color and of the average color
- Divide the center color by its luminance
- Compute a new luminance from the values above
- Multiply the new luminance by the normalized center color
That's quite a few steps, and it wasn't easy for me to follow since I had never written a sharpening filter before.
Let's start with the simplest part. The four texture fetches are placed like this:

All sampling uses a bilinear sampler (D3D11_FILTER_MIN_MAG_LINEAR_MIP_POINT).
So how does it look in HLSL?
float2 uvCorner; float2 uvOffset = float2( 0.5, 0.5 ) / g_Viewport.xy; // remember about division! float3 colorCorners = 0; // Top left corner // -0,5, -0.5 uvCorner = uvCenter - uvOffset; colorCorners += TexColorBuffer.SampleLevel( samplerLinearClamp, uvCorner, 0 ).rgb; // Top right corner // +0.5, -0.5 uvCorner = uvCenter + float2(uvOffset.x, -uvOffset.y); colorCorners += TexColorBuffer.SampleLevel( samplerLinearClamp, uvCorner, 0 ).rgb; // Bottom left corner // -0.5, +0.5 uvCorner = uvCenter + float2(-uvOffset.x, uvOffset.y); colorCorners += TexColorBuffer.SampleLevel( samplerLinearClamp, uvCorner, 0 ).rgb; // Bottom right corner // +0.5, +0.5 uvCorner = uvCenter + uvOffset; colorCorners += TexColorBuffer.SampleLevel( samplerLinearClamp, uvCorner, 0 ).rgb; |
Now that we have the four samples:
// Calculate the average of four corners float3 averageColorCorners = colorCorners / 4.0; // Calculate the color difference float3 diffColor = colorCenter - averageColorCorners; // Find max absolute RGB component of the difference float fDiffColorMaxComponent = max( abs(diffColor.x), max( abs(diffColor.y), abs(diffColor.z) ) ); // Adjust this factor float fDiffColorMaxComponentScaled = saturate( fDiffColorMaxComponent * sharpenLumScale + sharpenLumBias ); // Calculate how much pixel will be sharpened. // Note the "1.0" here - this is why we added "1.0" before to fSharpenIntensity. float fPixelSharpenAmount = lerp(1.0, fSharpenAmount, fDiffColorMaxComponentScaled); // Calculate luminance of "center" of the pixel and luminance of average value. float lumaCenter = dot( LUMINANCE_RGB, finalColor ); float lumaCornersAverage = dot( LUMINANCE_RGB, averageColorCorners ); // divide "centerColor" by its luma float3 fColorBalanced = colorCenter / max( lumaCenter, 1e-4 ); // Calc the new luma float fPixelLuminance = lerp(lumaCornersAverage, lumaCenter, fPixelSharpenAmount); // Calc the output color finalColor = fColorBalanced * max(fPixelLuminance, 0.0); } return float4(finalColor, 1.0); |
This essentially detects edges from the maximum absolute difference between the center and the corners.
Here is what it looks like:

The final code:
struct VS_OUTPUT_POSTFX { float4 Position : SV_Position; float2 TextureUV : TEXCOORD0; }; cbuffer SharedPixelConsts : register (b12) { // not important here, just to get propers offsets // in the final assembly float4 cb12_v0, cb12_v1, cb12_v2, cb12_v3, cb12_v4; float4 cb12_v5, cb12_v6, cb12_v7, cb12_v8; float4 cb12_v9, cb12_v10, cb12_v11, cb12_v12, cb12_v13; float4 cb12_v14, cb12_v15, cb12_v16, cb12_v17, cb12_v18; float4 cb12_v19, cb12_v20; float2 g_cameraNearFar; float2 pad01; float2 g_depthScaleFactors; float2 pad02; } // This effects uses bilateral sampling, because // of use D3D11_FILTER_MIN_MAG_LINEAR_MIP_POINT SamplerState samplerLinearClamp : register (s0); static float3 LUMINANCE_RGB = float3(0.2126, 0.7152, 0.0722); Texture2D TexColorBuffer : register (t0); Texture2D TexDepthBuffer : register (t1); cbuffer cbSharpen : register (b3) { // High settings: float4 g_Viewport; // 2.0, 1.80, 0.025, -0.25 float sharpenNear; float sharpenFar; float sharpenDistanceScale; float sharpenDistanceBias; float sharpenLumScale; float sharpenLumBias; float2 pad001; } float getFrustumDepth(in float fDepth) { // Map to wider range fDepth = fDepth * g_cameraNearFar.x + g_cameraNearFar.y; return 1.0 / max(fDepth, 0.0001); } // Sharpen filter // Perform this after tonemapping. float4 SharpenPS( in VS_OUTPUT_POSTFX Input ) : SV_TARGET { // Load hardware depth value in this pixel int3 pixelPosition = int3(Input.Position.xy, 0); float fDepth = TexDepthBuffer.Load( pixelPosition ).x; // from [1,0] to [0,1] range fDepth = fDepth * g_depthScaleFactors.x + g_depthScaleFactors.y; // Get view space distance [near-far] [0.2 - 8000.0] float fScaledDepth = getFrustumDepth(fDepth); // Create near/far sharpen mask float fNearFarSharpenMask = saturate( fScaledDepth * sharpenDistanceScale + sharpenDistanceBias ); // Interpolate between near and far intensity using mask calculated above float fSharpenIntensity = lerp(sharpenNear, sharpenFar, fNearFarSharpenMask); // We add 1.0 here which is value of sharpening which does not affect pixel. fSharpenIntensity += 1.0; // Do not perform sharpen on skybox float fSkyboxTest = (fDepth >= 1.0) ? 0 : 1; // Calculate final sharpen amount float fSharpenAmount = fSharpenIntensity * fSkyboxTest; // Sample the center of the pixel. // Get rid of "half-pixel" offset from SV_Position.xy. float2 uvCenter = trunc( Input.Position.xy ); // Add half-pixel to make sure we will sample the center of the pixel uvCenter += float2(0.5, 0.5); uvCenter /= g_Viewport.zw; // Get the value of current pixel float3 colorCenter = TexColorBuffer.SampleLevel( samplerLinearClamp, uvCenter, 0 ).rgb; // Final result, currently the current pixel float3 finalColor = colorCenter; if ( fSharpenAmount > 0 ) { // The pixel passes to sharpening phase. // Sample the four corners around the center of the pixel, calculate the average value, // then calculate the difference and the max abs component of the difference. 
float2 uvCorner; float2 uvOffset = float2( 0.5, 0.5 ) / g_Viewport.xy; float3 colorCorners = 0; // Top left corner // -0,5, -0.5 uvCorner = uvCenter - uvOffset; colorCorners += TexColorBuffer.SampleLevel( samplerLinearClamp, uvCorner, 0 ).rgb; // Top right corner // +0.5, -0.5 uvCorner = uvCenter + float2(uvOffset.x, -uvOffset.y); colorCorners += TexColorBuffer.SampleLevel( samplerLinearClamp, uvCorner, 0 ).rgb; // Bottom left corner // -0.5, +0.5 uvCorner = uvCenter + float2(-uvOffset.x, uvOffset.y); colorCorners += TexColorBuffer.SampleLevel( samplerLinearClamp, uvCorner, 0 ).rgb; // Bottom right corner // +0.5, +0.5 uvCorner = uvCenter + uvOffset; colorCorners += TexColorBuffer.SampleLevel( samplerLinearClamp, uvCorner, 0 ).rgb; // Calculate the average of four corners float3 averageColorCorners = colorCorners / 4.0; // Calculate the color difference float3 diffColor = colorCenter - averageColorCorners; // Find max absolute RGB component of the difference float fDiffColorMaxComponent = max( abs(diffColor.x), max( abs(diffColor.y), abs(diffColor.z) ) ); // Adjust this factor float fDiffColorMaxComponentScaled = saturate( fDiffColorMaxComponent * sharpenLumScale + sharpenLumBias ); // Calculate how much pixel will be sharpened. // Note the "1.0" here - this we we added "1.0" before to fSharpenIntensity. float fPixelShapenAmount = lerp(1.0, fSharpenAmount, fDiffColorMaxComponentScaled); // Calculate luminance of "center" of the pixel and luminance of average value. float lumaCenter = dot( LUMINANCE_RGB, finalColor ); float lumaCornersAverage = dot( LUMINANCE_RGB, averageColorCorners ); // divide "centerColor" by its luma float3 fColorBalanced = colorCenter / max( lumaCenter, 1e-4 ); // Calc the new luma float fPixelLuminance = lerp(lumaCornersAverage, lumaCenter, fPixelShapenAmount); // Calc the output color finalColor = fColorBalanced * max(fPixelLuminance, 0.0); } return float4( finalColor, 1 ); } |
I'm happy to report that this code produces the same assembly.
To sum up, the sharpening in The Witcher 3 is nicely written. The main knob is the near/far intensity pair, which is not constant but changes with the location in the game. Here are some typical values:
Skellige:
     | sharpenNear | sharpenFar | sharpenDistanceScale | sharpenDistanceBias | sharpenLumScale | sharpenLumBias
low  | 0.4         | 0.2        | 0.025                | -0.25               | -13.3333        | 1.33333
high | 2           | 1.8        | 0.025                | -0.25               | -13.3333        | 1.33333
Kaer Morhen:
     | sharpenNear | sharpenFar | sharpenDistanceScale | sharpenDistanceBias | sharpenLumScale | sharpenLumBias
low  | 0.57751     | 0.31303    | 0.06665              | -0.33256            | -1              | 2
high | 2.17751     | 1.91303    | 0.06665              | -0.33256            | -1              | 2
Part 7.1: Average luminance
Almost every modern game computes the average luminance of the frame, which is later used for eye adaptation and tonemapping. Simple approaches, such as generating a mip chain and reading its last level, usually work but have limitations; more elaborate ones use compute shaders, for instance parallel reduction.
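As a reminder of what the simple approach looks like (a sketch, not what The Witcher 3 does): store per-pixel log-luminance in a texture, generate its full mip chain, and read the 1x1 mip. The texture and sampler names here are assumptions:

// Sketch of the "mip chain" average luminance - not The Witcher 3's method.
// TexLuminance is assumed to hold per-pixel log-luminance with all mips
// generated beforehand (e.g. with GenerateMips on the API side).
Texture2D    TexLuminance       : register (t0);
SamplerState samplerLinearClamp : register (s0);

float GetAverageLuminance( float lowestMipLevel )
{
    // The last (1x1) mip contains the average of all texels
    float avgLogLuma = TexLuminance.SampleLevel( samplerLinearClamp, float2(0.5, 0.5), lowestMipLevel ).x;

    // Undo the log to get a geometric mean of the luminance
    return exp( avgLogLuma );
}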
Let's see how The Witcher 3 does it. We have already covered eye adaptation, so the average luminance is the only missing piece.
The Witcher 3 computes it in two passes. For clarity I decided not to describe them together; today we focus on the first one: the luminance distribution (a histogram).
Finding the passes is easy. They are two consecutive dispatches right before eye adaptation.

Let's look at the inputs. There are two: a quarter-resolution HDR color buffer and the full-resolution depth buffer.

Note a small trick here: this is just a small region of a much bigger buffer. Reusing buffers is good practice.

Why downscale the color buffer? Performance, I guess.
The output of this pass is a structured buffer of 256 4-byte elements. Since the shader comes without debug info, let's assume they are uints.
Important: the first step is a ClearUnorderedAccessViewUint call that zeroes the structured buffer.
Let's look at the assembly; note that this is the first compute shader in the series:
cs_5_0 dcl_globalFlags refactoringAllowed dcl_constantbuffer cb0[3], immediateIndexed dcl_resource_texture2d (float,float,float,float) t0 dcl_resource_texture2d (float,float,float,float) t1 dcl_uav_structured u0, 4 dcl_input vThreadGroupID.x dcl_input vThreadIDInGroup.x dcl_temps 6 dcl_tgsm_structured g0, 4, 256 dcl_thread_group 64, 1, 1 0: store_structured g0.x, vThreadIDInGroup.x, l(0), l(0) 1: iadd r0.xyz, vThreadIDInGroup.xxxx, l(64, 128, 192, 0) 2: store_structured g0.x, r0.x, l(0), l(0) 3: store_structured g0.x, r0.y, l(0), l(0) 4: store_structured g0.x, r0.z, l(0), l(0) 5: sync_g_t 6: ftoi r1.x, cb0[2].z 7: mov r2.y, vThreadGroupID.x 8: mov r2.zw, l(0, 0, 0, 0) 9: mov r3.zw, l(0, 0, 0, 0) 10: mov r4.yw, l(0, 0, 0, 0) 11: mov r1.y, l(0) 12: loop 13: utof r1.z, r1.y 14: ge r1.z, r1.z, cb0[0].x 15: breakc_nz r1.z 16: iadd r2.x, r1.y, vThreadIDInGroup.x 17: utof r1.z, r2.x 18: lt r1.z, r1.z, cb0[0].x 19: if_nz r1.z 20: ld_indexable(texture2d)(float,float,float,float) r5.xyz, r2.xyzw, t0.xyzw 21: dp3 r1.z, r5.xyzx, l(0.212600, 0.715200, 0.072200, 0.000000) 22: imul null, r3.xy, r1.xxxx, r2.xyxx 23: ld_indexable(texture2d)(float,float,float,float) r1.w, r3.xyzw, t1.yzwx 24: eq r1.w, r1.w, cb0[2].w 25: and r1.w, r1.w, cb0[2].y 26: add r2.x, -r1.z, cb0[2].x 27: mad r1.z, r1.w, r2.x, r1.z 28: add r1.z, r1.z, l(1.000000) 29: log r1.z, r1.z 30: mul r1.z, r1.z, l(88.722839) 31: ftou r1.z, r1.z 32: umin r4.x, r1.z, l(255) 33: atomic_iadd g0, r4.xyxx, l(1) 34: endif 35: iadd r1.y, r1.y, l(64) 36: endloop 37: sync_g_t 38: ld_structured r1.x, vThreadIDInGroup.x, l(0), g0.xxxx 39: mov r4.z, vThreadIDInGroup.x 40: atomic_iadd u0, r4.zwzz, r1.x 41: ld_structured r1.x, r0.x, l(0), g0.xxxx 42: mov r0.w, l(0) 43: atomic_iadd u0, r0.xwxx, r1.x 44: ld_structured r0.x, r0.y, l(0), g0.xxxx 45: atomic_iadd u0, r0.ywyy, r0.x 46: ld_structured r0.x, r0.z, l(0), g0.xxxx 47: atomic_iadd u0, r0.zwzz, r0.x 48: ret |
And the cbuffer:

We already know the first input is the downscaled HDR buffer; at 1080p it is 480x270. The dispatch is (270, 1, 1), so there are 270 thread groups. Simply put, one thread group per row of the image.

Now that we have the context, let's work out what the shader does. Each thread group has 64 threads in X (dcl_thread_group 64, 1, 1) and a shared array of 256 elements of 4 bytes each (dcl_tgsm_structured g0, 4, 256).
Note that the shader uses SV_GroupThreadID (vThreadIDInGroup.x, range [0-63]) and SV_GroupID (vThreadGroupID.x, range [0-269]).
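Based on those declarations, the skeleton of the shader presumably looks something like this (the resource and function names are my own guesses):

// Assumed skeleton matching the declarations in the assembly.
Texture2D<float4>        TexDownscaledHDR : register (t0); // 480x270 color
Texture2D<float>         TexDepth         : register (t1); // full-res depth
RWStructuredBuffer<uint> g_buffer         : register (u0); // 256-bin histogram

groupshared uint shared_data[256];

[numthreads(64, 1, 1)]
void LuminanceHistogramCS( uint3 GroupID : SV_GroupID, uint3 GroupThreadID : SV_GroupThreadID )
{
    const uint threadID = GroupThreadID.x; // lane within the row, [0..63]
    const uint groupID  = GroupID.x;       // row of the downscaled buffer, [0..269]

    // ... body discussed below ...
}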
1) The first step zeroes the shared memory. Since there are 256 elements and 64 threads per group, a simple loop does it:

// The first step is to set whole shared data to zero.
// Because each thread group has 64 threads, each one can zero 4 elements using a simple offset.
[unroll] for (uint idx=0; idx < 4; idx++)
{
    const uint offset = threadID + idx*64;
    shared_data[ offset ] = 0;
}
2) Then a barrier, GroupMemoryBarrierWithGroupSync (sync_g_t), synchronizes the threads and guarantees that all the zeroing has completed.
3) After that there is a loop that looks roughly like this:

// cb0_v0.x is width of downscaled color buffer. For 1920x1080, it's 1920/4 = 480;
float ViewportSizeX = cb0_v0.x;

[loop] for ( uint PositionX = 0; PositionX < ViewportSizeX; PositionX += 64 )
{
   ...
It advances by 64 each iteration.
The next step is to compute the texel position.
For Y we have SV_GroupID.x (one of the 270 thread groups). For X it roughly goes like this:
Because each group has 64 threads, 64 pixels are covered per iteration. For example:
- thread 0 of group y processes pixels (0, y), (64, y), (128, y) ... (448, y)
- thread 1 processes (1, y), (65, y), (129, y) ... (449, y)
- ...and thread 63 processes (63, y), (127, y) ... (447, y)
This way every pixel of the row gets processed.
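In HLSL, the texel position inside the loop presumably comes out to something like this (matching lines 16-21 of the assembly; the variable names are mine):

// Matches "iadd r2.x, r1.y, vThreadIDInGroup.x" and "mov r2.y, vThreadGroupID.x"
uint2 colorPos = uint2( PositionX + threadID, groupID );

// Guard against running past the end of the row (the "lt/if_nz" at lines 18-19),
// then load the color and compute its luminance (lines 20-21).
if ( PositionX + threadID < ViewportSizeX )
{
    float3 hdrColor = TexDownscaledHDR.Load( int3(colorPos, 0) ).rgb;
    float  luma     = dot( hdrColor, float3(0.2126, 0.7152, 0.0722) );
    // ...
}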
The luminance is computed there as well (line 21).
Good. Now we have the luminance of every pixel, and the next step is to load the corresponding depth value, but the depth buffer is full resolution. How is that handled?
It turns out to be very simple: just multiply the color coordinates by a constant (cb0_v2.z). The color buffer is downscaled 4x, so the constant is 4:

const int iDepthTextureScale = (int) cb0_v2.z;

uint2 depthPos = iDepthTextureScale * colorPos;
float depth = texture1.Load( int3(depthPos, 0) ).x;
So far so good. Let's look at lines 24-25:

24: eq r2.x, r2.x, cb0[2].w
25: and r2.x, r2.x, cb0[2].y
So we have a float equality test whose result goes into r2.x, followed by a bitwise AND... on floats? What is going on here?
Float equality plus bitwise AND
I have to admit this was the hardest part for me; I even considered crazy asint/asfloat combinations. Is there another way? Let's try a plain float-to-float equality comparison:

float DummyPS() : SV_Target0
{
    float test = (cb0_v0.x == cb0_v0.y);
    return test;
}
The result:

0: eq r0.x, cb0[0].y, cb0[0].x
1: and o0.x, r0.x, l(0x3f800000)
2: ret
Something looks off. Why an AND? 0x3f800000 is just 1.0f. What if we replace the 1.0 with something else?

float DummyPS() : SV_Target0
{
    float test = (cb0_v0.x == cb0_v0.y) ? cb0_v0.z : 0.0;
    return test;
}
Which gives:

0: eq r0.x, cb0[0].y, cb0[0].x
1: and o0.x, r0.x, cb0[0].z
2: ret
That matches. But if you change the 0.0 to anything else, the compiler emits a movc instead.
Back to the compute shader: the next step checks whether the depth equals cb0_v2.w, which is 0. Simply put, it checks whether the pixel lies on the far plane (the sky); if it does, the value becomes roughly 0.5 (judging by the frames I inspected).
That factor is then used to interpolate between the pixel's luminance and a "sky luminance" (cb0_v2.x, close to 0). I guess it controls how much the sky contributes to the average luminance, usually by reducing its importance. Clever.

// Check whether the pixel belongs to the sky and how strongly it should blend with our value
float value = (depth == cb0_v2.w) ? cb0_v2.y : 0.0;

// If value is 0 we keep the pixel's luminance; otherwise the luminance matters less (cb0_v2.x is usually close to 0).
float lumaOk = lerp( luma, cb0_v2.x, value );
Finally, we work out which bin of the logarithmically distributed luminance histogram the value falls into, and increment that bin:

// Calculate the bin index, avoiding out-of-bounds access
uint uLuma = (uint) lumaOk;
uLuma = min(uLuma, 255);

// Add 1 to the counter of this luminance bin
InterlockedAdd( shared_data[uLuma], 1 );
Next, another barrier makes sure the whole row has been processed, and then the shared-memory values are added to the structured buffer:

// Wait until all pixels of the row have been processed
GroupMemoryBarrierWithGroupSync();

// Add the results to the structured buffer
[unroll] for (uint idx = 0; idx < 4; idx++)
{
    const uint offset = threadID + idx*64;

    uint data = shared_data[offset];
    InterlockedAdd( g_buffer[offset], data );
}
Once the 64 threads of a group have filled the shared memory, each thread adds 4 values to the output buffer.
As for that output buffer, think about it: its elements should sum to the total number of pixels (480 x 270 = 129,600). In other words, we now know how many pixels have each particular luminance.
If you are not very familiar with compute shaders this may not be intuitive; read it a couple of times, and work it through on paper to understand what is going on.
Part 7.2: Average luminance
Before we resume our fight with compute shaders, a quick recap of what we did so far: we processed a 4x downscaled HDR color buffer and built a luminance histogram (a structured buffer of 256 uints). For each pixel we computed a logarithmically distributed luminance, mapped it to one of the 256 bins and incremented that bin, so in the end the 256 elements sum to the number of pixels.

For example, with a full-screen buffer downscaled to 480x270, the 256 elements sum to 480 x 270 = 129,600.
With that short introduction, let's look at the final computation. There is only one thread group this time (Dispatch(1, 1, 1)).
Let's look at the assembly:
cs_5_0 dcl_globalFlags refactoringAllowed dcl_constantbuffer cb0[1], immediateIndexed dcl_uav_structured u0, 4 dcl_uav_typed_texture2d (float,float,float,float) u1 dcl_input vThreadIDInGroup.x dcl_temps 4 dcl_tgsm_structured g0, 4, 256 dcl_thread_group 64, 1, 1 0: ld_structured_indexable(structured_buffer, stride=4)(mixed,mixed,mixed,mixed) r0.x, vThreadIDInGroup.x, l(0), u0.xxxx 1: store_structured g0.x, vThreadIDInGroup.x, l(0), r0.x 2: iadd r0.xyz, vThreadIDInGroup.xxxx, l(64, 128, 192, 0) 3: ld_structured_indexable(structured_buffer, stride=4)(mixed,mixed,mixed,mixed) r0.w, r0.x, l(0), u0.xxxx 4: store_structured g0.x, r0.x, l(0), r0.w 5: ld_structured_indexable(structured_buffer, stride=4)(mixed,mixed,mixed,mixed) r0.x, r0.y, l(0), u0.xxxx 6: store_structured g0.x, r0.y, l(0), r0.x 7: ld_structured_indexable(structured_buffer, stride=4)(mixed,mixed,mixed,mixed) r0.x, r0.z, l(0), u0.xxxx 8: store_structured g0.x, r0.z, l(0), r0.x 9: sync_g_t 10: if_z vThreadIDInGroup.x 11: mul r0.x, cb0[0].y, cb0[0].x 12: ftou r0.x, r0.x 13: utof r0.y, r0.x 14: mul r0.yz, r0.yyyy, cb0[0].zzwz 15: ftoi r0.yz, r0.yyzy 16: iadd r0.x, r0.x, l(-1) 17: imax r0.y, r0.y, l(0) 18: imin r0.y, r0.x, r0.y 19: imax r0.z, r0.y, r0.z 20: imin r0.x, r0.x, r0.z 21: mov r1.z, l(-1) 22: mov r2.xyz, l(0, 0, 0, 0) 23: loop 24: breakc_nz r2.x 25: ld_structured r0.z, r2.z, l(0), g0.xxxx 26: iadd r3.x, r0.z, r2.y 27: ilt r0.z, r0.y, r3.x 28: iadd r3.y, r2.z, l(1) 29: mov r1.xy, r2.yzyy 30: mov r3.z, r2.x 31: movc r2.xyz, r0.zzzz, r1.zxyz, r3.zxyz 32: endloop 33: mov r0.w, l(-1) 34: mov r1.yz, r2.yyzy 35: mov r1.xw, l(0, 0, 0, 0) 36: loop 37: breakc_nz r1.x 38: ld_structured r2.x, r1.z, l(0), g0.xxxx 39: iadd r1.y, r1.y, r2.x 40: utof r2.x, r2.x 41: utof r2.w, r1.z 42: add r2.w, r2.w, l(0.500000) 43: mul r2.w, r2.w, l(0.011271) 44: exp r2.w, r2.w 45: add r2.w, r2.w, l(-1.000000) 46: mad r3.z, r2.x, r2.w, r1.w 47: ilt r2.x, r0.x, r1.y 48: iadd r2.w, -r2.y, r1.y 49: itof r2.w, r2.w 50: div r0.z, r3.z, r2.w 51: iadd r3.y, r1.z, l(1) 52: mov r0.y, r1.z 53: mov r3.w, r1.x 54: movc r1.xzw, r2.xxxx, r0.wwyz, r3.wwyz 55: endloop 56: store_uav_typed u1.xyzw, l(0, 0, 0, 0), r1.wwww 57: endif 58: ret |
And the single cbuffer:

A quick glance at the assembly: there are two UAVs, the buffer from the previous pass and a 1x1 R32_FLOAT texture. Again there are 64 threads per group and a groupshared array of 256 4-byte elements.
First the shared memory is filled with the values from the input buffer. With 64 threads it works much like before, and a barrier guarantees that all the writes have finished:

// The first step is to copy the values from the previous pass into shared memory; each thread copies 4 elements.
[unroll] for (uint idx=0; idx < 4; idx++)
{
    const uint offset = threadID + idx*64;
    shared_data[ offset ] = g_buffer[offset];
}

// A barrier to make sure all threads have finished writing
GroupMemoryBarrierWithGroupSync();
So the actual computation happens on a single thread; the other ones are only used to move the input buffer into shared memory.
The computing thread has ID 0. Why? In theory any thread would do, but comparing against 0 avoids an extra integer-integer comparison (an ieq instruction).
The algorithm operates on a specific range of pixels. At line 11 we multiply width x height to get the total pixel count, and then multiply it by two [0-1] values that define where the range starts and ends.
Clamps afterwards guarantee that 0 <= start <= end <= totalPixels - 1:

// Only thread 0 performs the computation.
[branch] if (threadID == 0)
{
    // Total number of pixels
    uint totalPixels = cb0_v0.x * cb0_v0.y;

    // Range of pixels used for computing the average luminance
    int pixelsToConsiderStart = totalPixels * cb0_v0.z;
    int pixelsToConsiderEnd = totalPixels * cb0_v0.w;

    int pixelsMinusOne = totalPixels - 1;

    pixelsToConsiderStart = clamp( pixelsToConsiderStart, 0, pixelsMinusOne );
    pixelsToConsiderEnd = clamp( pixelsToConsiderEnd, pixelsToConsiderStart, pixelsMinusOne );
As you can see, there are two more loops, and the problem is that both end with strange conditional moves, which made them tricky for me to reconstruct. Also note the -1 at line 21. Why is it there? We will find out shortly.
The purpose of the first loop is to skip the first pixelsToConsiderStart pixels, leaving us at the bin that contains pixel number pixelsToConsiderStart + 1.
For example, if the start is 30,000 and the very first bin already holds 37,000 pixels (say, at night), the analysis would begin at pixel 30,001 of that bin: the loop exits immediately and the number of skipped pixels stays 0.
// Number of already processed pixels int numProcessedPixels = 0; // Luma cell [0-255] int lumaValue = 0; // Whether to continue execution of loop bool bExitLoop = false; // The purpose of the first loop is to omit "pixelsToConsiderStart" pixels. // We keep number of omitted pixels from previous cells and lumaValue to use in the next loop. [loop] while (!bExitLoop) { // Get number of pixels with specific luma value. uint numPixels = shared_data[lumaValue]; // Check how many pixels we would have with lumaValue int tempSum = numProcessedPixels + numPixels; // If more than pixelsToConsiderStart, exit the loop. // Therefore, we will start calculating luminance from lumaValue. // Simply speaking, pixelsToConsiderStart is number of "darken" pixels to omit before starting calculation. [flatten] if (tempSum > pixelsToConsiderStart) { bExitLoop = true; } else { numProcessedPixels = tempSum; lumaValue++; } } |
The mysterious -1 from line 21 is related to the loop's boolean exit condition.
With the number of pixels skipped so far and the current lumaValue, we can enter the second loop, which computes the average luminance:
float finalAvgLuminance = 0.0f; // Number of omitted pixels in the first loop uint numProcessedPixelStart = numProcessedPixels; // The purpose of this loop is to calculate contribution of pixels and average luminance. // We start from point calculated in the previous loop, keeping number of omitted pixels and starting lumaValue positon. // We decode luma value from [0-255] range, multiply it by number of pixels which have this specific luma, and sum it up until // we process pixelsToConsiderEnd pixels. // After that, we divide total contribution by number of analyzed pixels. bExitLoop = false; [loop] while (!bExitLoop) { // Get number of pixels with specific luma value. uint numPixels = shared_data[lumaValue]; // Add to all processed pixels numProcessedPixels += numPixels; // Currently processed luma, distributed in [0-255] range (uint) uint encodedLumaUint = lumaValue; // Number of pixels with currently processed luma float numberOfPixelsWithCurrentLuma = numPixels; // Currently processed, encoded [0-255] luma (float) float encodedLumaFloat = encodedLumaUint; |
This is where the luminance, which was encoded into the [0-255] range, gets decoded: simply by reversing the encoding.
A quick recap of the encoding:
float luma = dot( hdrPixelColor, float3(0.2126, 0.7152, 0.0722) ); ... float outLuma; // because log(0) is undef and log(1) = 0 outLuma = luma + 1.0; // logarithmically distribute outLuma = log( outLuma ); // scale by 128, which means log(1) * 128 = 0, log(2,71828) * 128 = 128, log(7,38905) * 128 = 256 outLuma = outLuma * 128 // to uint uint outLumaUint = min( (uint) outLuma, 255); |
To decode the luminance, we reverse those operations:
// we start by adding 0.5f (we don't want to have zero result) float fDecodedLuma = encodedLumaFloat + 0.5; // and decode luminance: // Divide by 128 fDecodedLuma /= 128.0; // exp(x) which cancels log(x) fDecodedLuma = exp(fDecodedLuma); // Subtract 1.0 fDecodedLuma -= 1.0; |
Then each bin's contribution is the decoded luminance multiplied by the number of pixels with that luminance; we accumulate it until we have processed pixelsToConsiderEnd pixels, and finally divide by the total number of analyzed pixels:
// Calculate contribution of this luma float fCurrentLumaContribution = numberOfPixelsWithCurrentLuma * fDecodedLuma; // (Temporary) contribution from all previous passes and current one. float tempTotalContribution = fCurrentLumaContribution + finalAvgLuminance; [flatten] if (numProcessedPixels > pixelsToConsiderEnd ) { // to exit the loop bExitLoop = true; // We already processed all pixels we wanted, so perform final division here. // Number of all processed pixels from user-selected start int diff = numProcessedPixels - numProcessedPixelStart; // Calculate final average luminance finalAvgLuminance = tempTotalContribution / float(diff); } else { // Pass current contribution further and increase lumaValue finalAvgLuminance = tempTotalContribution; lumaValue++; } } // Save average luminance g_avgLuminance[uint2(0,0)] = finalAvgLuminance; |
The full code is available here.