在 Unity 中使用 Burst Compiler 可以有效提高计算密集型任务的运算速度，问题是 Burst Compiler 做了什么魔法？我们还有什么优化空间？

1. Burst Compiler 是啥

Burst 可以认为是一个 Unity 的 LLVM 编译器。而 LLVM (Low Level Virtual Machine) 是一个编译器基础设施，提供了丰富的中间表示（Intermediate Representation, IR）和高级优化工具。

一般的 C# 代码，会经过如下编译步骤 (针对IL2CPP编译)

C# → CLR IL → CPP → 机器码

而 Burst 的代码，会经过如下编译步骤

C# → CLR IL → LLVM IR → 机器码

看似只是改变了一个 LLVM 的步骤，但其实因为 LLVM 的自动优化特性获得了更高的运算速度，包括

LLVM 的 SIMD 优化
LLVM 的很多编译 Pass
IL2CPP 还是带托管内存的，而 LLVM 没有这个机制，限制了内存管理灵活性，但优化了性能。

LLVM 非常多的编译优化机制

Burst 的编译 Pass

图源：Deep dive into the Burst compiler - Unite LA 2018

LLVM 优化生成机器码

2. Burst 的优化建议

2.1 别从 “实现抽象” 的角度理解 SIMD，而是把它理解成 128 bit 的单元

来自（Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019）

math.dot

手动实现dot

上面这个例子，3D点乘，直接自己写实现可能都比 math.dot 快

最后还是要看机器码和自己 profile 看结果

2.2 让编译器完成 Loop Vectorization 和提升 Aliasing

Loop Vectorization 优化中，编译器会一次 loop 多个值，而不是一个值，这样利用 SIMD 机制优化

Loop vectorization | Burst | 1.8.10 (unity3d.com)

一个典型的机器码，会发现编辑器自动 unroll，一次 loop 计算了很多值。

.LBB1_4:
    vmovdqu    ymm0, ymmword ptr [rdx + 4*rax]
    vmovdqu    ymm1, ymmword ptr [rdx + 4*rax + 32]
    vmovdqu    ymm2, ymmword ptr [rdx + 4*rax + 64]
    vmovdqu    ymm3, ymmword ptr [rdx + 4*rax + 96]
    vpaddd     ymm0, ymm0, ymmword ptr [rcx + 4*rax]
    vpaddd     ymm1, ymm1, ymmword ptr [rcx + 4*rax + 32]
    vpaddd     ymm2, ymm2, ymmword ptr [rcx + 4*rax + 64]
    vpaddd     ymm3, ymm3, ymmword ptr [rcx + 4*rax + 96]
    vmovdqu    ymmword ptr [rcx + 4*rax], ymm0
    vmovdqu    ymmword ptr [rcx + 4*rax + 32], ymm1
    vmovdqu    ymmword ptr [rcx + 4*rax + 64], ymm2
    vmovdqu    ymmword ptr [rcx + 4*rax + 96], ymm3
    add        rax, 32
    cmp        r8, rax
    jne        .LBB1_4

Enhanced Aliasing with Burst | Unity Blog

Memory Aliasing 是指不同的内存地址被用来访问内存中的同一位置，这些不同的地址互相为 Aliasing（别名）

告诉编译器 No Aliasing 后，主要好处是

减少加载和存储操作
让编辑器更容易进行 SIMD 优化

2.3 用 Burst Intrinsic 指令优化

来自（Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019）

一开始的代码

[BurstCompile]
public struct DoorTest_Reference : IJob
{
    public NativeArray<Door> Doors;
    public NativeArray<DoorTestPos> TestPos;
    public NativeArray<int> DoorOpenStates;

    public void Execute()
    {
        for (int j = 0; j < Doors.Length; ++j)
        {
            bool shouldOpen = false;

            for (int i = 0; i < TestPos.Length; ++i)
            {
                float3 delta = TestPos[i].Pos - Doors[j].Pos;
                float dsq = math.csum(delta * delta);

                if (dsq < Doors[j].RadiusSquared && Doors[j].Team == TestPos[i].Team)
                {
                    shouldOpen = true;
                    break;
                }
            }

            DoorOpenStates[j] = shouldOpen ? 1 : 0;
        }
    }
}

之前机器码

.LBB0_6:
    vmovsd xmm2, qword ptr [rsi - 12]
    vinsertps xmm2, xmm2, dword ptr [rsi - 4], 32
    vsubps xmm2, xmm2, xmm0
    vmulps xmm2, xmm2, xmm2
    vmovshdup xmm3, xmm2
    vpermilpd xmm4, xmm2, 1
    vaddss xmm3, xmm3, xmm4
    vaddss xmm2, xmm2, xmm3
    vucomiss xmm2, xmm1
    jae .LBB0_10               ; not inside radius?

    mov ebx, dword ptr [rdx]
    cmp ebx, dword ptr [rsi]
    je .LBB0_8                 ; break out of loop

.LBB0_10:
    inc rdi
    add rsi, 16
    cmp rdi, rax
    jl .LBB0_6

优化后

for (int j = 0; j < Doors.Length; ++j) {
    m128 openMask = new m128(~0u);

    for (int i = 0; i < TestPos.Length; ++i) {
        m128 tx = new m128(TestPos[i].X);
        m128 ty = new m128(TestPos[i].Y);
        m128 tz = new m128(TestPos[i].Z);
        m128 tt = new m128(TestPos[i].Team);

        m128 xDeltas = sub_ps(Doors[j].Xs, tx);
        m128 yDeltas = sub_ps(Doors[j].Ys, ty);
        m128 zDeltas = sub_ps(Doors[j].Zs, tz);

        m128 xdsq = mul_ps(xDeltas, xDeltas);
        m128 ydsq = mul_ps(yDeltas, yDeltas);
        m128 zdsq = mul_ps(zDeltas, zDeltas);

        m128 dsq = add_ps(xdsq, add_ps(ydsq, zdsq));
        m128 rangeMask = cmple_ps(dsq, Doors[j].RadiiSquared);

        rangeMask = and_ps(rangeMask, cmpeq_epi32(Doors[j].Teams, tt));
        openMask = or_ps(openMask, rangeMask);
    }

    DoorOpenStates.ReinterpretStore(j * 4, openMask);
}

对应机器码

.LBB1_3:
    vbroadcastss xmm4, dword ptr [rax - 12]
    vbroadcastss xmm5, dword ptr [rax - 8]
    vbroadcastss xmm6, dword ptr [rax - 4]
    vpbroadcastd xmm7, dword ptr [rax]
    vpcmpeqd xmm7, xmm3, xmm7
    vsubps xmm4, xmm1, xmm4
    vsubps xmm5, xmm1, xmm5
    vsubps xmm6, xmm1, xmm6
    vmulps xmm4, xmm4, xmm4
    vmulps xmm5, xmm5, xmm5
    vmulps xmm6, xmm6, xmm6
    vaddps xmm4, xmm5, xmm4
    vaddps xmm4, xmm6, xmm4
    vcmpleps xmm4, xmm4, xmm2
    vpand xmm4, xmm7, xmm4
    vpor xmm0, xmm4, xmm0
    inc rsi
    add rax, 16
    cmp rsi, rdx
    jl .LBB1_3

Arm @ GDC 2021 : Supercharging mobile performance with Arm Neon and Unity Burst Compiler - YouTube 在 Unity 里直接写 ARM 的硬件指令优化了一倍性能

2.4 忘记上述原则，做 Profile

这个是笔者总结的，

看指令不能完全确信进行了优化，上面原则也有不成立的时候。

最后落到比较还是还是靠 Profile。

3. Burst 内存管理

众所周知，Burst Compiler 作用域中不能创建托管对象，那么创建对象时候，发生了什么？

3.1 New stuct

我们测试一个 Job

[StructLayout(LayoutKind.Sequential)]
    public struct TestStruct
    {
        public int id;
        public float value;
        public float value1;
        public float value2;
        public float value3;
        public float value4;
        public float value5;
        public float value6;
    }
    
    [BurstCompile]
    public struct TestNewStructJob1 : IJob
    {
        [WriteOnly] public NativeArray<TestStruct> outputs;
        public int x;
        public float y;

        public void Execute()
        {
            var newStruct = new TestStruct { id = x, value = y };
            outputs[0] = newStruct;
        }
    }

看到编译后的机器码

mov        eax, dword ptr [rcx + 48] // x 放进 eax
vmovss        xmm0, dword ptr [rcx + 52] // y 放进 xmm0
mov        rcx, qword ptr [rcx] // rcx 获取到 outputs array 的首地址
mov        dword ptr [rcx], eax // outputs array 前四个bytes，置成 eax，即 x
vmovss        dword ptr [rcx + 4], xmm0 // outputs array 下面四个bytes，置成 xmm0 的前四个bytes，即 y
vxorps        xmm0, xmm0, xmm0 // 把 xmm 用 xor 操作置成 0
vmovups        xmmword ptr [rcx + 8], xmm0 // outputs array 下面 16 个bytes 置成 xmm0，即 0
mov        qword ptr [rcx + 24], 0 // outputs array 下面 8 个bytes 置成 0

所以说，new struct 可能没有分配堆内存，而是直接用寄存器了。同样array的复制也就是按地址拷贝。

3.2 New Array

编译出来的代码看不太懂，但可以反编译进 NativeArray.cs 看看

private static unsafe void Allocate(int length, Allocator allocator, out NativeArray<T> array)
{
  long size = (long) UnsafeUtility.SizeOf<T>() * (long) length;
  NativeArray<T>.CheckAllocateArguments(length, allocator);
  array = new NativeArray<T>();
  NativeArray<T>.IsUnmanagedAndThrow();
  array.m_Buffer = UnsafeUtility.MallocTracked(size, UnsafeUtility.AlignOf<T>(), allocator, 0);
  array.m_Length = length;
  array.m_AllocatorLabel = allocator;
  array.m_MinIndex = 0;
  array.m_MaxIndex = length - 1;
  AtomicSafetyHandle.CreateHandle(out array.m_Safety, allocator);
  NativeArray<T>.InitStaticSafetyId(ref array.m_Safety);
  NativeArray<T>.InitNestedNativeContainer(array.m_Safety);
}

ok, 所以是用 malloc 做的堆上内存分配，但是打了标记，防止之后泄露。

3.3 New List

我们看看 NativeList Add 会发生什么

[BurstCompile]
public struct TestNativeListJob : IJob
{
    [WriteOnly] public NativeList<int3> outputs;

    public void Execute()
    {
        outputs.Add(new int3(1,2,3));
    }
}

编译出来

# NativeList.cs(332, 1)            m_ListData->Add(value);
        mov               rsi, qword ptr [rcx + 32]
.Ltmp93:
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Add:idx <- 0
        .cv_inline_site_id 57 within 56 inlined_at 7 332 0
# UnsafeList.cs(489, 1)            var idx = m_length;
        movsxd            rdi, dword ptr [rsi + 8]
.Ltmp94:
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Add:idx <- $edi
# UnsafeList.cs(491, 1)            if (m_length + 1 > Capacity)
        lea               ebx, [rdi + 1]
.Ltmp95:
        .cv_inline_site_id 58 within 57 inlined_at 5 491 0
# UnsafeList.cs(117, 1)                return CollectionHelper.AssumePositive(m_capacity);
        mov               eax, dword ptr [rsi + 12]
.Ltmp96:
# UnsafeList.cs(491, 1)            if (m_length + 1 > Capacity)
        cmp               ebx, eax
        jle               .LBB6_3
.Ltmp97:
# %bb.1:                                # %BL.0015.i.i.i.i
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Add:idx <- $edi
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Resize:num <- 0
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Resize:ptr <- 0
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Resize:sizeOf <- 0
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.SetCapacity<Unity.Collections.AllocatorManager.AllocatorHandle>:sizeOf <- 0
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.SetCapacity<Unity.Collections.AllocatorManager.AllocatorHandle>:newCapacity <- 0
        .cv_inline_site_id 59 within 57 inlined_at 5 493 0
        .cv_inline_site_id 60 within 59 inlined_at 5 351 0
        .cv_inline_site_id 61 within 60 inlined_at 5 419 0
# UnsafeList.cs(402, 1)            var newCapacity = math.max(capacity, 64 / sizeOf);
        cmp               ebx, 6
        mov               ecx, 5
        cmovge            ecx, ebx
.Ltmp98:
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.SetCapacity<Unity.Collections.AllocatorManager.AllocatorHandle>:newCapacity <- $ecx
        .cv_inline_site_id 62 within 61 inlined_at 5 403 0
# math.cs(5315, 1)            x -= 1;
        dec               ecx
.Ltmp99:
# math.cs(5316, 1)            x |= x >> 1;
        mov               edx, ecx
        shr               edx
        or                edx, ecx
# math.cs(5317, 1)            x |= x >> 2;
        mov               ecx, edx
        shr               ecx, 2
        or                ecx, edx
# math.cs(5318, 1)            x |= x >> 4;
        mov               edx, ecx
        shr               edx, 4
        or                edx, ecx
# math.cs(5319, 1)            x |= x >> 8;
        mov               ecx, edx
        shr               ecx, 8
        or                ecx, edx
# math.cs(5320, 1)            x |= x >> 16;
        mov               r8d, ecx
        shr               r8d, 16
        or                r8d, ecx
# math.cs(5321, 1)            return x + 1;
        inc               r8d
.Ltmp100:
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.SetCapacity<Unity.Collections.AllocatorManager.AllocatorHandle>:newCapacity <- $r8d
# UnsafeList.cs(405, 1)            if (newCapacity == Capacity)
        cmp               r8d, eax
.Ltmp101:
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Resize:oldLength <- undef
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.SetCapacity<Unity.Collections.AllocatorManager.AllocatorHandle>:sizeOf <- undef
        je                .LBB6_3
.Ltmp102:
# %bb.2:                                # %BL.0037.i.i.i.i.i.i
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Add:idx <- $edi
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Resize:num <- 0
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Resize:ptr <- 0
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Resize:sizeOf <- 0
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.SetCapacity<Unity.Collections.AllocatorManager.AllocatorHandle>:newCapacity <- $r8d
# UnsafeList.cs(419, 1)            SetCapacity(ref Allocator, capacity);
        lea               rdx, [rsi + 16]
.Ltmp103:
# UnsafeList.cs(410, 1)            Realloc(ref allocator, newCapacity);
        mov               rcx, rsi
        call              "Unity.Collections.LowLevel.Unsafe.UnsafeList`1[[Unity.Mathematics.int3, Unity.Mathematics, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]], Unity.Collections, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null.Realloc<Unity.Collections.AllocatorManager+AllocatorHandle, Unity.Collections, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null>(Unity.Collections.LowLevel.Unsafe.UnsafeList`1[[Unity.Mathematics.int3, Unity.Mathematics, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]*, Unity.Collections, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null this, Unity.Collections.AllocatorManager+AllocatorHandle&, Unity.Collections, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null allocator, System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089 newCapacity) -> System.Void, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089_1b1e5bd2b95e1e579075e9cb12b5342e from Unity.Collections, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null@@24"
.Ltmp104:
.LBB6_3:                                # %"IndexMethod+TestNativeListJob, Tests, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null.Execute(IndexMethod+TestNativeListJob*, Tests, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null this) -> System.Void, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089_1b1e5bd2b95e1e579075e9cb12b5342e from Tests, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null.exit"
        #DEBUG_VALUE: Unity.Collections.LowLevel.Unsafe.UnsafeList`1<Unity.Mathematics.int3>.Add:idx <- $edi
# UnsafeList.cs(500, 1)            UnsafeUtility.WriteArrayElement(Ptr, idx, value);
        mov               dword ptr [rsi + 8], ebx
        mov               rax, qword ptr [rsi]
        lea               rcx, [rdi + 2*rdi]
        movabs            rdx, 8589934593
        mov               qword ptr [rax + 4*rcx], rdx
        mov               dword ptr [rax + 4*rcx + 8], 3

看代码看不太懂，想必也是 malloc 出来的内存，然后 Add 时候动态扩容。

3.4 stackalloc

Burst现在可以直接分配栈内存，我们试试直接 stackalloc 一个

[BurstCompile]
public unsafe struct TestNewAlloc : IJob
{
    [WriteOnly] public NativeArray<float> outputs;
    public float x;
    public float y;
    public float z;
    public void Execute()
    {
        float* data = stackalloc float[4];
        data[0] = x;
        data[1] = y;
        data[3] = z;
        for (int i = 0; i < 4; i++)
        {
            outputs[i] = data[i];
        }
    }
}

看一下机器码，

vmovss        xmm0, dword ptr [rcx + 56]
mov        rax, qword ptr [rcx]
vmovsd        xmm1, qword ptr [rcx + 48]
vmovsd        qword ptr [rax], xmm1
mov        dword ptr [rax + 8], 0
vmovss        dword ptr [rax + 12], xmm0

好家伙，直接没分配内存，编译器优化掉了，赋值直接进寄存器了

但如果我没给一个编译期不固定的尺寸，

[BurstCompile]
public unsafe struct TestNewAlloc : IJob
{
    [WriteOnly] public NativeArray<float> outputs;
    public float x;
    public float y;
    public float z;
    public int size;
    public void Execute()
    {
        float* data = stackalloc float[size];
        data[0] = x;
        data[1] = y;
        data[3] = z;
        for (int i = 0; i < 4; i++)
        {
            outputs[i] = data[i];
        }
    }
}

看一下机器码

mov        r8d, dword ptr [rcx + 60]
shl        r8d, 2
lea        rax, [r8 + 15]
and        rax, -16
// 上面在计算 stackalloc 预留的大小
call        __chkstk
sub        rsp, rax
mov        rdi, rsp
// 上面检查一下 栈尺寸还够不够，然后移动一下栈顶
sub        rsp, 32
mov        rcx, rdi
xor        edx, edx
xor        r9d, r9d
call        burst.memset.inline.AVX2.i32@@32
add        rsp, 32
// 上面在初始化这块分配的内存，下面就在赋值了
vmovsd        xmm0, qword ptr [rsi + 48]
vmovsd        qword ptr [rdi], xmm0
vmovss        xmm0, dword ptr [rsi + 56]
vmovss        dword ptr [rdi + 12], xmm0
mov        rax, qword ptr [rsi]
vmovups        xmm0, xmmword ptr [rdi]
vmovups        xmmword ptr [rax], xmm0
mov        rsp, rbp

可以看到这次确实分配了栈空间，机器码上用寄存器代表栈顶位置。

3.5 内存分配的结论

在 burst compiler 作用域中，new 一个固定尺寸的 stackalloc，或者一个 struct，是有可能被优化成直接使用寄存器的，并不会分配内存。

而 stackalloc 一个不定长的内存，确实发生了栈内存的分配。如果要局部缓存控件，确实这种方式更快，但受栈内存大小限制。

NativeArray 和 NativeList 都是 malloc 申请的堆内存，好在 burst 的机制一般要求使用完 dispose，减少堆内存碎片。

burst 中无法使用别的托管对象，一般不用担心内存分配和GC问题。

4. 优化例子

我们构造一个 loop voxel 的任务，

voxel 是 3D 的，但我们排进了一个 1D array，现在我们想从中抽取一小部分。

inputs是所有 voxel 的数组，而 outputs 是抽取出的一部分。chunkStartPos 是我们抽取出的部分的起始位置。

4.1 基础版

一个最基础版本的如下，per voxel 计算 id 然后读取

[BurstCompile]
public struct IndexJob1 : IJob
{
    [ReadOnly] public NativeArray<float> inputs;
    [WriteOnly] public NativeArray<float> outputs;
    public int dim;
    public int3 chunkStartPos;
    public int numYX;
    public int numX;

    private int LocalXYZToGlobalIndex(int3 localXyz)
    {
        var voxelPos = localXyz + chunkStartPos;
        return voxelPos.x + voxelPos.y * numX + voxelPos.z * numYX;
    }

    public void Execute()
    {
        for (var z = 0; z < dim; z++)
        for (var y = 0; y < dim; y++)
        for (var x = 0; x < dim; x++)
        {
            var index = LocalXYZToGlobalIndex(new int3(x, y, z));
            outputs[index] = inputs[index] * 2.0f;
        }
    }
}

其中最内层 loop 的机器码是这样

.LBB0_4:
        vmovd        xmm1, esi  // new int3(x,y,z), 把 x 放进 128bit 寄存器
        vpinsrd        xmm1, xmm1, edx, 1  // new int3(x,y,z), 把 y 放进 128bit 寄存器
        vpinsrd        xmm1, xmm1, eax, 2  // new int3(x,y,z), 把 z 放进 128bit 寄存器
        vpaddd        xmm1, xmm1, xmm0 // var voxelPos = localXyz + chunkStartPos;
        vmovd        edi, xmm1 // voxelPos.x 放进 ebi 寄存器
        vpextrd        ebx, xmm1, 1 // voxelPos.y 放进 ebx 寄存器
        imul        ebx, r9d // mul，ebx 寄存器变成 voxelPos.y * numX 放进 ebx 寄存器
        add        ebx, edi // add，ebx 寄存器变成 voxelPos.x + voxelPos.y * numX
        vpextrd        edi, xmm1, 2 // voxelPos.z 放进 ebi 寄存器
        imul        edi, r10d // mul，ebx 寄存器变成 voxelPos.z * numYX 放进 ebi 寄存器
        add        edi, ebx // add，ebi voxelPos.x + voxelPos.y * numX + voxelPos.z * numYX
        movsxd        rdi, edi // var index = ..., 寄存器获取到地址
        vmovss        xmm1, dword ptr [r11 + 4*rdi] // 取值 inputs[index]
        vaddss        xmm1, xmm1, xmm1 // inputs[index] * 2.0f;
        vmovss        dword ptr [rcx + 4*rdi], xmm1 // 赋值 outputs[index]
        inc        esi

有意思的是，new int3 并并没有构造栈内存，而是直接放进寄存器了。

在 dim = 16，运行 10000 次，耗时 152.03 ms

4.2 尝试点乘优化 index 计算

我们优化一下函数计算，

idMultiplier = new int3(1, Dim, Dim * Dim)

private int LocalXYZToGlobalIndex(int3 localXyz)
{
    var voxelPos = localXyz + chunkStartPos;
    return math.dot(idMultiplier, voxelPos);
}

看一下生成的机器码

.LBB0_4:
        vmovd        xmm1, edi
        vpinsrd        xmm1, xmm1, esi, 1
        vpinsrd        xmm1, xmm1, edx, 2
				// 上面和以前一样，就是构造 new int3(x,y,z) 放进 xmm1 寄存器
        vpaddd        xmm1, xmm1, xmm0 // var voxelPos = localXyz + chunkStartPos;
				// 下面开始是点乘计算 
        vmovd        ebx, xmm1
        imul        ebx, r9d
        vpextrd        eax, xmm1, 1
        imul        eax, r10d
        add        eax, ebx
        vpextrd        ebx, xmm1, 2
        imul        ebx, r11d
        add        ebx, eax
        movsxd        rax, ebx
        vmovss        xmm1, dword ptr [r14 + 4*rax]
        vaddss        xmm1, xmm1, xmm1
        vmovss        dword ptr [rcx + 4*rax], xmm1

可以发现，点乘计算和之前手动写的没啥区别，甚至还多了一次乘1 的计算指令

如前面 1.1 讲的，这个 math.dot 这里确实没啥用

最后时间 155.57 ms，确实没什么用

4.3 尝试强制 SIMD乘法优化 index 计算

再优化一下计算 LocalXYZToGlobalIndex, 强制用 simd 乘法

private int LocalXYZToGlobalIndex(int3 localXyz)
{
    v128 id = new v128(localXyz.x + chunkStartPos.x, localXyz.y + chunkStartPos.y, localXyz.z + chunkStartPos.z,
        0);
    v128 result = Unity.Burst.Intrinsics.X86.Sse4_1.mul_epi32(id, idMultiplier128);
    return result.SInt0 + result.SInt1 + result.SInt2;
}

看一下生成的机器码，

.LBB0_4:
        vmovd        xmm1, edi
        vpinsrd        xmm1, xmm1, eax, 2 // new v128(...)
        vpmuldq        xmm1, xmm1, xmm0 // mul_epi32(id, idMultiplier128)
        vpshufd        xmm2, xmm1, 85 // 下面是计算 result.SInt0 + result.SInt1 + result.SInt2
        vpshufd        xmm3, xmm1, 238
        vpaddd        xmm1, xmm1, xmm3
        vpaddd        xmm1, xmm2, xmm1
        vmovd        esi, xmm1
        movsxd        rsi, esi
        vmovss        xmm1, dword ptr [rdx + 4*rsi]
        vaddss        xmm1, xmm1, xmm1
        vmovss        dword ptr [rcx + 4*rsi], xmm1

看到确实是少了一些指令，

最后， 144.25ms，快了一点点

4.4 Index 计算不构造 int3

再优化一下计算 LocalXYZToGlobalIndex，可以发现构造 int3 到寄存器还是花了一些时间的，改成

private int LocalXYZToGlobalIndex(int x, int y, int z) // 这里之前参数是 int3
{
    return (chunkStartPos.x + x) + (chunkStartPos.y + y) * numX + (chunkStartPos.z + z) * numYX;
}

看一下生成的机器码，

.LBB3_9:
        mov        ecx, r9d
        imul        ecx, edi
        add        ecx, r8d
        lea        esi, [rcx + r11]
        cmp        esi, ecx
        jl        .LBB3_6
        mov        rcx, r11
        shr        rcx, 32
        mov        esi, 0
        jne        .LBB3_7
        mov        esi, ebx
        mov        r13, r14
        .p2align        4, 0x90
.LBB3_12:
        movsxd        rsi, esi
        vmovups        ymm0, ymmword ptr [rdx + 4*rsi]
        vmovups        ymm1, ymmword ptr [rdx + 4*rsi + 32]
        vmovups        ymm2, ymmword ptr [rdx + 4*rsi + 64]
        vmovups        ymm3, ymmword ptr [rdx + 4*rsi + 96]
        vaddps        ymm0, ymm0, ymm0
        vaddps        ymm1, ymm1, ymm1
        vaddps        ymm2, ymm2, ymm2
        vaddps        ymm3, ymm3, ymm3
        vmovups        ymmword ptr [rax + 4*rsi], ymm0
        vmovups        ymmword ptr [rax + 4*rsi + 32], ymm1
        vmovups        ymmword ptr [rax + 4*rsi + 64], ymm2
        vmovups        ymmword ptr [rax + 4*rsi + 96], ymm3

LBB3_9 这里应该是在计算 ID，

LBB3_12 这里，很神奇，出现了 Loop Vectorization 的痕迹！四个一组一起unroll赋值了，

最后时间 131.56 ms，比上面还快。这里因为没构造 int3，莫名其妙被编译器弃用了 Loop Vectorization 做了加速。

4.5 手动 Unroll

上文基础上，我们直接在 loop 里面手动 unroll 试试

public void Execute()
{
    for (var z = 0; z < dim; z++)
    for (var y = 0; y < dim; y++)
    {
        for (var x = 0; x < dim; x += 4) // 这里之前是 x++
        {
            var index = LocalXYZToGlobalIndex(x, y, z);
            outputs[index] = inputs[index] * 2.0f;
            index++;
            outputs[index] = inputs[index] * 2.0f; 
            index++;
            outputs[index] = inputs[index] * 2.0f;
            index++;
            outputs[index] = inputs[index] * 2.0f;
        }
    }
}

看一下生成机器码，

.LBB3_4:
        lea        ecx, [rsi + rbx]
        movsxd        rcx, ecx
        vmovss        xmm0, dword ptr [rdx + 4*rcx]
        vaddss        xmm0, xmm0, xmm0
        vmovss        dword ptr [rax + 4*rcx], xmm0
        lea        ecx, [rsi + rbx + 1]
        movsxd        rcx, ecx
        vmovss        xmm0, dword ptr [rdx + 4*rcx]
        vaddss        xmm0, xmm0, xmm0
        vmovss        dword ptr [rax + 4*rcx], xmm0
        lea        ecx, [rsi + rbx + 2]
        movsxd        rcx, ecx
        vmovss        xmm0, dword ptr [rdx + 4*rcx]
        vaddss        xmm0, xmm0, xmm0
        vmovss        dword ptr [rax + 4*rcx], xmm0
        lea        ecx, [rsi + rbx + 3]
        movsxd        rcx, ecx
        vmovss        xmm0, dword ptr [rdx + 4*rcx]
        vaddss        xmm0, xmm0, xmm0
        vmovss        dword ptr [rax + 4*rcx], xmm0

赋值的部分确实和上面 Loop Vectorization 一样，一次赋四个了，

然后感觉主要省略的计算是 index 的计算，

这时，耗时变成了 102 ms。这或许告诉我们，

有时候手动 unroll 可能效果和编译器 Loop Vectorization 差不多。
减少 ALU 计算量可能是优化的最方便途径

4.6 Index 局部自增

我们直接在 Loop X 这层不算index了，直接自增。另外我们预先计算 chunkStartPos


globalStartIndex = chunkStartPos.x + chunkStartPos.y * numX + chunkStartPos.z * numYX;

private int LocalXYZToGlobalIndex(int x, int y, int z)
{
    return globalStartIndex + x + y * numX + z * numYX;
}

public void Execute()
{
    for (var z = 0; z < dim; z++)
    for (var y = 0; y < dim; y++)
    {
        var index = LocalXYZToGlobalIndex(0, y, z);
        for (var x = 0; x < dim; x++)
        {
            outputs[index] = inputs[index] * 2.0f;
            index++;
        }
    }
}

看一下代码，没问题自动做了 Loop Vectorization，

最后耗时 96.02 ms，比上面快一点。

4.7 Baseline

这个我们发挥编译器最大优势，直接写个最简单的

public void Execute()
{
    for (var z = 0; z < inputs.Length; z++)
    {
        outputs[z] = inputs[z] * 2.0f;
    }
}

最后耗时 85.02 ms，这个应该是极限了。

4.8 优化案例的总结

ID	方法描述	耗时	性能
1	原始方法	152.03	1
2	尝试点乘计算 Index	155.57	0.98x
3	尝试 SIMD点乘计算Index	144.25	1.05x
4	Index 计算直接使用XYZ	131.56	1.18x
5	手动 unroll	102.00	1.49x
6	内层 Index 自增	96.02	1.58x
7	基准版本	85.02	1.79x

相比最初版本，理论极限性能应该能提升 1.79x，而我们符合业务逻辑的构造里面，性能提升了 1.58 倍还算不错。

这个例子中我们学到的：

不要相信 Unity.Mathematic 会自动做 SIMD 运算，而且SIMD 运算不一定能变快
new 一个 struct 不一定会分配栈内存，有可能直接进寄存器了。
loop vectorization 确实能变快，手动 unroll loop 可能也可以
减少 ALU 计算量，做缓存永远是值得相信的

5. 结论

从上面测试和分析我们可以看出，虽然使用 Burst Compiler 能较大提升计算密集型任务的性能，但了解其实现机制仍然有助于性能优化，在例子中我们直接将性能提升了 1.6 倍。

Burst Compiler 是基于 LLVM 的，因此编译器自身的优化能力较强，但也有一些限制使得编译器无法自动优化。

从上面的结论中我们学到的：

就算加上了 [BurstCompile] 的 Attribute 也还有不少优化空间
和 C# 不太一样，Burst 里面没有托管内存，有 SIMD，有人叫它 HPC# ，它优化起来更像 shader
不要相信一些数学计算会自动做 SIMD 运算，而且 SIMD 运算不一定能变快。但使用 intrinsic SIMD 计算说不定会变快
尽量利用好编译器的机制，比如 Loop Vectorization，如果利用不好，手动 unroll 可能也行
减少 ALU 计算量，做缓存永远是值得相信的
上面原则都可以忘了，但别忘了看 Profile 结果

参考资料

Burst Manual

Supercharging mobile performance with ARM Neon and Unity Burst Compiler

Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 (slides)

Deep dive into the Burst compiler - Unite LA 2018

Enhanced aliasing with Burst