BVH 是一种空间存储数据结构,便于空间查询。类似的空间查找方法还有BSP,Octree等等。
BVH的构造/查询比较快,空间冗余也比较少,在实时图形学领域用的比较广泛。比如NVidia的RTX技术,就是硬件固定管线实现了BVH便于RayTracing加速。
构造BVH或其它空间查询结构对于渲染剔除,物理计算,粒子/boid/swarm系统通信等等都是很重要的,ECSPhysics这个项目是实现了BVH的,不过抄完测了一下感觉这个BVH构造的不对啊。于是找到BVH有并行算法实现,非常适合用Unity JobSystem多线程构造。于是基本参考了Thinking Parallel, Part III: Tree Construction on the GPU这篇经典文章自己在Unity中实现。
构造BVH的基本流程:1. 构造Z-Order 2. 排序 3 构造子节点 4 构造内部节点 5 更新AABB
1 ZOrder Curve & Morton Code

是一种将多维数据降为一维的方法,降为一维的好处是便于排序,便于存储。用这种方法将场景物体排序后再并行构造BVH会比较方便。
其原理可以参考Morton encoding/decoding through bit interleaving这篇文章,把xyz三个维度的bit展开交织在一起。文章里面讲了loop/magic bit/LUT三种方法。magic bit比较好写,速度也比较快,一般也就采用这种方法。ECSPhysics里面照抄的Thinking Parallel, Part III: Tree Construction on the GPU的写法.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
[MethodImpl(MethodImplOptions.AggressiveInlining)] public static uint ExpandBits(uint v) { v = (v * 0x00010001u) & 0xFF0000FFu; v = (v * 0x00000101u) & 0x0F00F00Fu; v = (v * 0x00000011u) & 0xC30C30C3u; v = (v * 0x00000005u) & 0x49249249u; return v; } [MethodImpl(MethodImplOptions.AggressiveInlining)] public static uint CalculateMortonCode(float3 vector) { vector.x = math.min(math.max(vector.x * 1024.0f, 0.0f), 1023.0f); vector.y = math.min(math.max(vector.y * 1024.0f, 0.0f), 1023.0f); vector.z = math.min(math.max(vector.z * 1024.0f, 0.0f), 1023.0f); uint xx = ExpandBits((uint) vector.x); uint yy = ExpandBits((uint) vector.y); uint zz = ExpandBits((uint) vector.z); return (xx * 4 + yy * 2 + zz); } |
需要注意的是xyz只有10bit的空间,对大量物体是不够用的,这也会出现mortoncode相同的情况,后面会遇到这个问题。
注意CalculateMortonCode要输入一个0-1的向量,ECSPhysics似乎代码写错了
对于Y轴变化比较少的情况也可以用2D的版本,这样xz各有16bit,大大减少morton code相同的情况。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
[MethodImpl(MethodImplOptions.AggressiveInlining)] public static uint ExpandBits2D(uint v) { v &= 0x0000ffff; v |= (v << 8); v &= 0x00ff00ff; v |= (v << 4); v &= 0x0f0f0f0f; v |= (v << 2); v &= 0x33333333; v |= (v << 1); v &= 0x55555555; return v; } [MethodImpl(MethodImplOptions.AggressiveInlining)] public static uint CalculateMortonCode2D(float3 vector) { vector.x = math.min(math.max(vector.x * 65536.0f, 0.0f), 65536.0f); vector.z = math.min(math.max(vector.z * 65536.0f, 0.0f), 65535.0f); uint xx = ExpandBits2D((uint)vector.x); uint zz = ExpandBits2D((uint)vector.z); return (xx * 2 + zz); } |
2. 排序
ECSPhysics代码里用Radix单线程排的序,Profile看这也是性能瓶颈之一,一核有难七核围观
并行排序可以考虑Bitonic Sort,时间复杂度是o(log(n)^2)。很好的是每个pass不会有data racing的。原理和双调序列的特性相关,有Batcher定理:这篇文章讲的很好,三十分钟理解:双调排序Bitonic Sort,适合并行计算的排序算法
于是笔者用Job实现了Bitonic Sort。比较麻烦的是Job好像不能递归调用,只能一个pass开一个job了
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 |
[BurstCompile] struct BitonicMergeJob : IJobParallelFor { [NativeDisableParallelForRestriction] public NativeArray<uint> values; [NativeDisableParallelForRestriction] public NativeArray<int> indexConverter; public int strideSwap; public int strideRisingGroup; public void Execute(int index) { int swapPairId = index / strideSwap; int swapGroupId = index - swapPairId * strideSwap; int swapGroupStartId = swapPairId * 2 * strideSwap; int swapIdFirst = swapGroupStartId + swapGroupId; int swapIdSecond = swapIdFirst + strideSwap; int risingGroupId = swapPairId / strideRisingGroup; bool rising = risingGroupId % 2 == 0 ? true : false; if (values[swapIdFirst] > values[swapIdSecond] == rising) { uint tempValue = values[swapIdFirst]; int tempId = indexConverter[swapIdFirst]; values[swapIdFirst] = values[swapIdSecond]; indexConverter[swapIdFirst] = indexConverter[swapIdSecond]; values[swapIdSecond] = tempValue; indexConverter[swapIdSecond] = tempId; } } } [BurstCompile] struct BitonicSortJob : IJobParallelFor { [NativeDisableParallelForRestriction] public NativeArray<uint> values; [NativeDisableParallelForRestriction] public NativeArray<int> indexConverter; public int strideSwap; public void Execute(int index) { int swapPairId = index / strideSwap; int swapGroupId = index - swapPairId * strideSwap; int swapGroupStartId = swapPairId * 2 * strideSwap; int swapIdFirst = swapGroupStartId + swapGroupId; int swapIdSecond = swapIdFirst + strideSwap; if (values[swapIdFirst] > values[swapIdSecond]) { uint tempValue = values[swapIdFirst]; int tempId = indexConverter[swapIdFirst]; values[swapIdFirst] = values[swapIdSecond]; indexConverter[swapIdFirst] = indexConverter[swapIdSecond]; values[swapIdSecond] = tempValue; indexConverter[swapIdSecond] = tempId; } } } |
调用是在某个JobComponentSystem的OnUpdate里
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
var bitonicMergeJob = new BitonicMergeJob() { values = mortonCodes, indexConverter = indexConverter }; var bitonicSortJob = new BitonicSortJob() { indexConverter = indexConverter, values = mortonCodes }; int pass = (int)math.log2(mortonCodes.Length); for (int i = 0; i < pass - 1; i++) { for (int j = 0; j <= i; j++) { bitonicMergeJob.strideSwap = 1 << (i - j); bitonicMergeJob.strideRisingGroup = 1 << j; deps = bitonicMergeJob.Schedule(mortonCodes.Length / 2, 64, deps); } } for (int i = 0; i < pass; i++) { bitonicSortJob.strideSwap = 1 << (pass - i - 1); deps = bitonicSortJob.Schedule(mortonCodes.Length / 2, 64, deps); } |
Profile看确实比原来的单线程快一些,但可能有线程调用和开销,也不是很理想。猜测另一个原因是理论时间复杂度o(n*log(n)^2),GPU上n核并行才能做到o(log(n)^2),CPU上要看log(n)/core了,8核的话n小于10^8还是比quicksort快的,10000个元素估计能快一倍吧。4核估计速度就不是很显著了。
3 构造子节点
没有什么技术含量,这里就不放代码了
4 构造内部节点
Tero Karras 论文里放了伪码,Thinking Parallel, Part III: Tree Construction on the GPU里面也放了C++的代码。但是determineRange(i),这个函数代码没给出来。
其实主要划分成三步
- determinrange
- findsplit
- output
第一步找到split可能出现的范围,第二步binary search找split,第三步输出
其实也好实现,照抄伪码就行。这步是很神的。先binary往前找最远的range,再binary往回找最近的range。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
int2 determineRange(int i) { int d = (clz_safe(i, i + 1) - clz_safe(i, i - 1)) > 0 ? 1 : -1; int commonPrefixMin = clz_safe(i, i - d); int l_max = 2; while (clz_safe(i, i + d * l_max) > commonPrefixMin) { l_max *= 2; } int l = 0; int t = l_max; do { t = (t + 1) >> 1; // exponential decrease if (clz_safe(i, i + d * (l + t)) > commonPrefixMin) { l += t; } } while (t > 1); int j = i + l * d; int2 range = d > 0 ? new int2(i, j) : new int2(j, i); return range; } |
这时会遇到一个坑,clz的计算,common leading zero。__clz是C++原生的,c#里只能面向stackoverflow
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
//https://stackoverflow.com/questions/10439242/count-leading-zeroes-in-an-int32 [MethodImpl(MethodImplOptions.AggressiveInlining)] private int clz_value(uint value1, uint value2) { uint value = value1 ^ value2; //do the smearing value |= value >> 1; value |= value >> 2; value |= value >> 4; value |= value >> 8; value |= value >> 16; //count the ones value -= ((value >> 1) & 0x55555555); value = (((value >> 2) & 0x33333333) + (value & 0x33333333)); value = (((value >> 4) + value) & 0x0f0f0f0f); value += (value >> 8); value += (value >> 16); return (int)(32 - (value & 0x0000003f)); } |
然后又会遇到一个大坑,mortoncode一样怎么办???
如果mortoncode一样的话构造bvh时,determinrange和findsplit都会出错,最后树节点上会有环,没法更新aabb了。
Tero Karras 原始论文Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees里第四节有讲, 大致就是如果mortoncode一样的话用index来fallback算一下。
于是就有了下面的代码
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
private int clz_safe(int idx, int idy) { if (idy < 0 || idy > NumObjects - 1) return -1; return clz_index(idx, idy); } [MethodImpl(MethodImplOptions.AggressiveInlining)] private int clz_index(int idx, int idy) { //rely on morton being unique, otherwise clz its index //https://devblogs.nvidia.com/parallelforall/wp-content/uploads/2012/11/karras2012hpg_paper.pdf //see section 4. BVHs, Octrees, and k-d Trees return mortonCodes[idx] == mortonCodes[idy] ? (NumObjects - math.max(idx, idy)) + 32 : clz_value(mortonCodes[idx], mortonCodes[idy]); } |
5 更新AABB
bottom-up更新,第一个到的节点退出。这里理应有一锁,避免datarace。实际应用中并不会写,用Interlock造成了unity死锁。目前这样偶尔会出错但一般问题不大,如果有谁会写请告诉我。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
[BurstCompile] unsafe struct UpdateAABBJob : IJobParallelFor { [NativeDisableParallelForRestriction] public NativeArray<BVHNode> BVHArray; [NativeDisableUnsafePtrRestriction] public long* locks; public void Execute(int i) { int halfLength = BVHArray.Length / 2; int leafNodeId = halfLength + i; AABB leafNodeAABB = BVHArray[leafNodeId].aabb; int parentIndex = BVHArray[leafNodeId].ParentNodeIndex; while (parentIndex != -1) { //todo locks! BVHNode parent = BVHArray[parentIndex]; if (parent.IsValid < 1) { parent.aabb = leafNodeAABB; parent.IsValid = 1; BVHArray[parentIndex] = parent; break; } else { parent.aabb = Utils.GetEncompassingAABB(parent.aabb, leafNodeAABB); parent.IsValid = 2; BVHArray[parentIndex] = parent; } leafNodeAABB = parent.aabb; parentIndex = parent.ParentNodeIndex; } } } |
最后水平大概是30k物体用10ms.
相关代码在我的Github项目OSMTrafficSim里,可以直接看Assets/Code/BVH这部分
参考资料
Thinking Parallel, Part III: Tree Construction on the GPU,本文的主要参考。有一些源码,解释也较为通俗。作者Tero Karras 原始论文可见Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees
Morton encoding/decoding through bit interleaving,有关MortonCode的计算. 作者也是libmorton的作者
三十分钟理解:双调排序Bitonic Sort,适合并行计算的排序算法,有关Bitonic Sort双调排序
ECSPhysics, 一个用ECS做物理的项目,可以看代码学习ECS,想抄它的BVH发现它可能写错了

TA代码玩的这么溜,厉害啊。
是做无人驾驶测试吗?
哈哈没那么高级,就是学习一下ECS
你这评论还要审核的吗?刚才的评论没显示。
是的好像之前规则是默认没评论过的用户都需要审核,刚给取消掉了试试看
我正在寻找用ecs构建类似megacity的框架,有幸看到博主的blog. 非常感谢!!!
谢谢支持!