
Unity Job System and Burst: A Foolproof Guide to Using Them Correctly

If the sole purpose is multithreading, using Async for thread switching or System.Threading is the most convenient and straightforward method. Burst can also be called directly, not necessarily requiring Jobs. Refer to the articles on Async Programming and Direct Calling. However, if you have a large number of small computations, that’s when you should consider the Job System.

Unity’s Job documentation is extremely incomplete. This article is based on long-term usage experience and should be relatively comprehensive.

Purpose and Limitations of Jobs

Unlike general multithreading approaches, the Unity Job System uses worker threads to simulate GPU-like high throughput, meaning the overhead of dispatching each task is extremely low.

Jobs can only handle computations that fit within a single frame. Like CUDA or shader functions, they are small kernels. Heavy computations will still hurt FPS, because Jobs are sometimes scheduled onto the main thread. If you need long-running calculations, either break them into many small Jobs, or plain .NET threads remain the better choice.

So the question arises: if it is so similar to the GPU, why not just use a ComputeShader? Indeed, a ComputeShader can do everything a Job can, often more conveniently, except in these cases:

  1. If frequent data exchange with the CPU is needed, which GPUs are not good at, CPU-based Jobs have an advantage.
  2. The ECS (DOTS) system is based on Jobs, for the same reason as above.

Note: WebGL does not support Jobs or ComputeShader, nor Burst acceleration. However:

  • WebGL is being replaced by WebGPU, which does support ComputeShader, and newer browsers already support it (iOS 17 requires enabling it in settings; preview version 19 has it enabled by default).
  • In settings, the WebAssembly 2023 target can enable multithreading support for Jobs, and browser support for it is fairly broad. But officially it is not even considered an experimental feature and should not be used; enabling it in Unity 6 LTS causes a crash. Unity has struggled with wasm multithreading for many years. Moreover, DOTS currently receives more attention, and since .NET 8 supports wasm multithreading, perhaps they are waiting for Unity 7 to support .NET 8.

Performance

Providing performance metrics helps understand the design intent and applicability. You can also read the article first and then come back to the performance tests.

A set of benchmark projects was tested. First, the performance of the three allocators:

  • BenchAllocatorTemp: executes 100,000 Allocator.Temp allocations.
  • BenchAllocatorTempJob: same as above, using Allocator.TempJob.
  • BenchAllocatorPersistent: same as above, using Allocator.Persistent.

The result: TempJob is the fastest, followed by Persistent.

Then, the performance of the Job scheduling patterns was tested against a baseline:

  • BenchBaseLine: Uses a For loop to perform 100,000 simple calculations as a reference baseline.
  • BenchIJob: Time to schedule 100,000 Jobs.
  • BenchIJobParallelFor: Time to batch schedule 100,000 Jobs in parallel mode.
  • BenchIJobParallelForBurst: Same as above, but with Burst enabled.
  • BenchIJobParallelForBurstLoopVectorization: Schedules 10 Jobs, each Job uses For to calculate 10,000 times, with Burst loop vectorization enabled.

Here are the results from my PC test:

[Screenshot: benchmark results table]

Median is the median time taken by the test project, in milliseconds.

It’s evident that Job scheduling overhead is small, designed for executing a large number of tasks. Of course, I only performed simple multiplication calculations here, so the Job improvement is limited.

Data Types

First, Burst does not support C#'s managed types. It can only use types with the same layout as in C, which can be directly memcpy'd (no serialization or marshalling needed), known as blittable types. These include primitive types such as int (char, string, and bool are sometimes managed; avoid them) and one-dimensional C-style arrays of blittable types (e.g. new int[5]). Since Jobs are almost always used with Burst, follow this restriction.
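As an illustrative sketch (the struct and field names are my own), a Burst-compatible job struct may only contain blittable fields and native containers:

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

[BurstCompile]
public struct BlittableJob : IJob {
    public int count;                // blittable primitive: OK
    public float3 offset;            // Unity.Mathematics struct of blittables: OK
    public NativeArray<float> data;  // native container: OK

    // public string label;          // managed type: rejected by Burst
    // public int[] managedArray;    // managed array: not allowed inside a job

    public void Execute() {
        for (int i = 0; i < data.Length; i++)
            data[i] += offset.x * count;
    }
}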

Unity provides a thread-safe type, NativeArray, specifically for Job use. These types can share data with the main thread without copying, because assignment only copies the data pointer; multiple copies reference the same memory region. Derivatives include NativeList, NativeQueue, NativeHashMap, NativeHashSet, NativeText, etc., but by default these can only be written from a single thread.

Note: You cannot use code like nativeArray[0].x = 1.0f; it will not compile, because the indexer returns a copy of the element rather than a reference.
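The workaround is to read the element into a local, modify it, and write it back; a minimal sketch (variable names are illustrative):

var positions = new NativeArray<float3>(8, Allocator.Temp);

// positions[0].x = 1.0f;  // rejected by the compiler: the indexer returns a copy
var p = positions[0];      // read a copy of the element
p.x = 1.0f;                // modify the copy
positions[0] = p;          // write it back through the indexer

positions.Dispose();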

Thread Safety

Thread safety is achieved through scheduling restrictions. The same NativeArray instance can only have one Job writing to it; otherwise an exception is thrown. If the data can be partitioned, you can use IJobParallelFor to operate on the NativeArray in parallel batches. If the data is read-only, mark the member variable accordingly, e.g. [ReadOnly] public NativeArray<int> input;.

When a Job is writing, the main thread cannot read from the NativeArray; it will cause an error. You must wait for completion.
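As a sketch of these rules (the job and field names are my own): the [ReadOnly] input may be shared by several jobs at once, while the output is writable by this job only, one index per Execute call:

[BurstCompile]
public struct ScaleJob : IJobParallelFor {
    [ReadOnly] public NativeArray<float> input;  // several jobs may read this concurrently
    public NativeArray<float> output;            // only this job may write, and only at index i

    public void Execute(int i) {
        output[i] = input[i] * 2f;
    }
}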

Memory Allocation

First, Native types need to be manually Dispose()d after use; they are not automatically destroyed. For this, Unity has added memory leak tracking.

When creating a Native type, you need to choose one of three allocators: Temp, TempJob, or Persistent. According to the documentation, allocation speed runs from fastest to slowest in that order. Temp has a one-frame lifecycle; TempJob has four frames. What does this mean?

  • Temp means it’s for use within the current function and should be Dispose()d before the function ends. Therefore, if you forget to Dispose, Unity will report an error immediately during the next rendering, but this allocation is actually quite slow.
  • TempJob has more lenient error conditions. It’s still intended for use within 1 frame, but you can Dispose it in the next frame.
  • Persistent won’t report errors; you need to be careful yourself.

The BenchAllocator projects from the earlier performance test measured all three. As shown, Allocator.Temp took four times longer than TempJob. The documentation says Temp is the fastest, so this is either a bug or an Editor-mode artifact.

Executing Single-Threaded Jobs

The entire process involves writing your own IJob class, having the main thread Schedule it, and then calling Complete to block and wait for the Job to finish.

public struct MyJob : IJob {
    public NativeArray<float> result;

    public void Execute() {
        for (int j = 0; j < result.Length; j++)
            result[j] = result[j] * result[j];
    }
}

NativeArray<float> result;
JobHandle handle;

void Update() {
    result = new NativeArray<float>(100000, Allocator.TempJob);

    MyJob jobData = new MyJob {
        result = result
    };

    handle = jobData.Schedule();
}

private void LateUpdate() {
    handle.Complete();
    result.Dispose();
}

But the problem is that we use Jobs for large numbers of tasks, so this single-task approach isn't very useful on its own. The GPU-style parallel mode is more helpful.

Parallel Mode (Parallel Job)

Changing the above code to implement IJobParallelFor instead of IJob switches it to parallel mode.

public struct MyJob : IJobParallelFor {
    public NativeArray<float> result;

    public void Execute(int i) {
        result[i] = result[i] * result[i];
    }
}

NativeArray<float> result;
JobHandle handle;

void Update() {
    result = new NativeArray<float>(100000, Allocator.TempJob);

    MyJob jobData = new MyJob {
        result = result
    };

    handle = jobData.Schedule(result.Length, result.Length / 10);
}

private void LateUpdate() {
    handle.Complete();
    result.Dispose();
}

Parallel mode doesn’t require writing your own For loop; it executes Execute once for each element, similar to Shaders.

Schedule(result.Length, result.Length / 10) means Execute is called once for each index from 0 to result.Length - 1. The second argument is the batch size: here the work is split into 10 batches of result.Length / 10 elements each, which the worker threads pick up.

For the performance difference between IJob and IJobParallelFor, refer to the earlier performance test.

Parallel Restrictions

In IJobParallelFor, the safety system only allows writes to index i. It cannot tell which member array you intend to write to, so every NativeArray member is restricted to index i. However, you can add the [NativeDisableParallelForRestriction] attribute to a NativeArray field to disable the safety check, taking responsibility for avoiding write conflicts yourself.

Read-only access has no such restrictions for any Native container.

Additionally, IJobParallelFor cannot benefit from loop vectorization unless your calculations are already vectorized (i.e. they call vectorized functions); otherwise performance is still not optimal.

Using NativeList and Other Containers in Parallel

Containers other than NativeArray, such as NativeList, can only be read in parallel Jobs. So how do you write to them? By design, NativeList separates Add and Set operations. The correct usage pattern is: one Job performs the Add operations, and a second Job performs the Set operations.

For Add operations, you can use ParallelWriter and AsParallelWriter, as shown below:

public struct AddListJob : IJobParallelFor {
    public NativeList<float>.ParallelWriter result;

    public void Execute(int i) {
        result.AddNoResize(i);
    }
}

public void RunIJobParallelForList() {
    var results = new NativeList<float>(10, Allocator.TempJob);
    var jobData = new AddListJob {
        result = results.AsParallelWriter(),
    };
    var handle = jobData.Schedule(10, 1);
    handle.Complete();
    Debug.Log(string.Join(",", results.AsArray()));  // AsArray avoids the extra allocation of ToArray
    results.Dispose();
}

In this state, the NativeList has fixed capacity: memory must be pre-allocated before starting, and only AddNoResize() can be used. This is implemented by atomically updating the list's length, which incurs significant performance overhead.

Then, use the lossless conversion from NativeList to NativeArray: NativeList.AsDeferredJobArray(). The returned NativeArray is lazy: the conversion happens only when the Job actually runs, so it can be passed around before either Job executes:

var addJob = new AddListJob { result = results.AsParallelWriter() };
var jobHandle = addJob.Schedule(10, 1);

var setJob = new SetListJob { array = results.AsDeferredJobArray() };
setJob.Schedule(10, 1, jobHandle).Complete();

Note: Both AsDeferredJobArray and AsArray return views of the original data; the source NativeList must still be Disposed.
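SetListJob itself is not shown above; here is a minimal sketch of what it could look like, assuming it simply overwrites the values added by AddListJob (the body is my own guess):

[BurstCompile]
public struct SetListJob : IJobParallelFor {
    public NativeArray<float> array;  // the deferred view from AsDeferredJobArray()

    public void Execute(int i) {
        array[i] = array[i] * 2f;     // an ordinary indexed Set, legal in parallel mode
    }
}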

Parallel Mode for Two-Dimensional Arrays

IJobParallelFor can only parallelize per single element of an Array. However, parallelizing per row of a two-dimensional array is more useful and can enable loop vectorization for higher performance. Use IJobParallelForBatch for this.

First, we create a flattened 2D array of size [10*15], then schedule it via IJobParallelForBatch.Schedule(int length, int batchCount). batchCount is how many elements each batch covers; Execute will be called length/batchCount times.

var results = new NativeArray<float>(10*15, Allocator.TempJob);
var jobData = new MyJob2D {
    result = results
};
var handle = jobData.Schedule(10*15, 15);
handle.Complete();
Debug.Log(String.Join(",", results));
results.Dispose();

Then, the implementation of MyJob2D.

[BurstCompile]
public struct MyJob2D : IJobParallelForBatch {
    public NativeArray<float> result;

    public void Execute(int startIndex, int count) {
        int row = startIndex / count;  // which batch (row) this call handles: 0..9
        for (int j = startIndex; j < startIndex + count; j++) {
            result[j] = row;
        }
    }
}

Execution result:

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9

This method can automatically enable loop vectorization via Burst, so in the performance test, the time for 100,000 calculations was 0.09ms, making it the fastest.

Other Limitations

  • You cannot start a Job from within a Job.
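The usual workaround is to build the chain on the main thread with JobHandle dependencies; a sketch assuming two hypothetical job structs JobA and JobB:

// Instead of scheduling JobB from inside JobA.Execute(), chain them here:
var handleA = new JobA { data = buffer }.Schedule();
var handleB = new JobB { data = buffer }.Schedule(handleA);  // starts only after JobA finishes
handleB.Complete();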

Combining with Async

The above examples schedule Jobs in Update and Complete them in LateUpdate, aiming to speed up per-frame code. For one-off tasks it doesn't need to be that complicated: you can await the Job asynchronously without blocking rendering, using the extension method CompleteAsync from the package:

async void GenerateMesh() {
    var result = new NativeArray<float>(100000, Allocator.Persistent);

    var jobData = new MyJob {
        result = result
    };

    var handle = jobData.Schedule();

    await handle.CompleteAsync();

    // ... use result, then:
    result.Dispose();
}

Note: This pattern requires the Persistent allocator because you are not guaranteed to finish within 1 frame.

Burst

Burst is an LLVM-based compiler for a subset of C# called "High-Performance C#" (HPC#), essentially C-like code. It is typically 10 to 100 times faster than Mono, which also shows how slow Mono is.

Burst can further enhance Job execution speed. For the examples above, just add this line:

 [BurstCompile]
 public struct MyJob : IJobParallelFor {
     ...
 }

For the IJobParallelFor performance test, just with this line, execution time improved from 5.16ms to 0.21ms. At this point, Job execution speed finally surpassed the For loop.

Note: The above performance tests were conducted with 10 Workers. Fine-tuning the number of Workers may yield different performance results.

Vectorization

Vectorization packs multiple calculations into one instruction. For example, float3 calculations are naturally vectorized. To get vectorization, it's best to use the types and functions from the Unity.Mathematics library; otherwise it may fail.
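As a sketch, a job written with Unity.Mathematics types so Burst can emit SIMD instructions (the job name is my own):

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

[BurstCompile]
public struct NormalizeJob : IJobParallelFor {
    public NativeArray<float3> vectors;

    public void Execute(int i) {
        // math.normalize on a float3 maps cleanly to vector instructions under Burst
        vectors[i] = math.normalize(vectors[i]);
    }
}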

Even if your calculations are not written with vector types, loops can still be vectorized. The earlier performance test improved to 0.09 ms for exactly this reason; see the earlier section on two-dimensional arrays. Loop vectorization lets parallelizable For-loop iterations complete within a single SIMD instruction; Burst judges and applies this optimization automatically.

How to Know if a Job is Correctly Vectorized?

Open the Burst Inspector tool (in the Jobs menu).

Select your function, check the Assembly output for AVX instructions, and look for warnings under IR Optimisation. If vectorization failed, you will see something like:

---------------------------
Remark Type: Analysis
Message:     test.cs:30:0: loop not vectorized: call instruction cannot be vectorized
Pass:        loop-vectorize
Remark:      CantVectorizeInstructionReturnType

Common ones include:

  • "loop not vectorized: call instruction cannot be vectorized". The loop calls an external function that cannot be vectorized.
  • "loop not vectorized: instruction return type cannot be vectorized". Generally this means calling an already-optimized function, which cannot be vectorized a second time; this is normal.

Converting Between Job and Unity Data

The most painful part of using Jobs and Burst is converting various data to NativeArray.

For example, Vector3 needs to be converted to float3. Since they are the same size, they can be reinterpreted directly. Example:

var float3s = new NativeArray<float3>(100, Allocator.TempJob);
NativeArray<Vector3> vertices = float3s.Reinterpret<Vector3>();
Vector3[] verticesArray = vertices.ToArray();
float3s.Dispose();

You can also Reinterpret into larger structures, for example converting three floats into one Vector3:

var floats = new NativeArray<float>(new float[] {1, 2, 3}, Allocator.TempJob);
NativeArray<Vector3> vectors = floats.Reinterpret<Vector3>(sizeof(float));
Debug.Log(string.Join("\n", vectors.Select(v => v.ToString())));
floats.Dispose();

Output: (1.00, 2.00, 3.00)

For conversions that change element size, like NativeArray<int> to NativeArray<ushort>, you need to write your own conversion Job.
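A minimal sketch of such a conversion job (the names are my own):

[BurstCompile]
public struct IntToUShortJob : IJobParallelFor {
    [ReadOnly] public NativeArray<int> source;
    public NativeArray<ushort> target;

    public void Execute(int i) {
        target[i] = (ushort)source[i];  // per-element narrowing cast
    }
}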
