[11/03/2018]: added the Unity Jobs System version and updated timings. Please check the end of the article.
With my previous article on Svelto.Tasks and multi-threaded cache-friendly code, I failed to show visually the power of Svelto.Tasks, because I didn't know how to upload a huge amount of data to the GPU without stalling the main thread. Honestly, I stopped being a graphics programmer right before DX 11 was introduced, so my knowledge of modern pipelines is limited. I also thought that finding a good solution would have required writing some kind of low-level workaround to overcome the Unity API limitations, but Unity actually already has what I needed; it just took some investigation and help to find it out :).
To my great astonishment, using the compute buffer API it is possible to upload a huge amount of data every frame without affecting the CPU too much. I still have to understand how this works and which DX 11 function ComputeBuffer.SetData maps onto, so if you know, please leave a comment, as I'd like to understand it, although it is not a priority for this demo. As a matter of fact, it was enough for me to know that uploading the vertices transformed on the CPU would not affect the final performance in a significant way.
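For reference, this is roughly the pattern the demo relies on. The following is a minimal sketch with illustrative names (the `_ParticleDataBuffer` shader property, the float3 stride and the `UploadAndDraw` helper are my assumptions, not the project's actual initialization code): a structured compute buffer refreshed every frame from the CPU, plus an arguments buffer for DrawMeshInstancedIndirect.

```csharp
ComputeBuffer _particleBuffer;
ComputeBuffer _argsBuffer;

void InitBuffers()
{
    //one float3 per particle (assumed layout for this sketch)
    _particleBuffer = new ComputeBuffer(_particleCount, sizeof(float) * 3);

    //standard DrawMeshInstancedIndirect arguments: index count, instance count, offsets
    var args = new uint[] { _pointMesh.GetIndexCount(0), (uint) _particleCount, 0, 0, 0 };
    _argsBuffer = new ComputeBuffer(1, args.Length * sizeof(uint), ComputeBufferType.IndirectArguments);
    _argsBuffer.SetData(args);

    //the rendering shader reads the per-particle positions from this buffer
    _material.SetBuffer("_ParticleDataBuffer", _particleBuffer);
}

void UploadAndDraw(Vector3[] cpuComputedPositions, Bounds bounds)
{
    //this is the call that, surprisingly, barely hurts the main thread
    _particleBuffer.SetData(cpuComputedPositions);
    Graphics.DrawMeshInstancedIndirect(_pointMesh, 0, _material, bounds, _argsBuffer);
}
```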
I immediately threw together some lines of code to show off how simple and efficient it is to work with multiple threads with Svelto.Tasks, and after combining it with the new IL2CPP feature of Unity 2018, I achieved the incredible number of 1 million particles transformed on the CPU at >30fps! If someone had told me this was possible, to be honest, I would have had my doubts. However, let me be clear: this kind of demo is pretty lame, because it doesn't make much sense to do these kinds of operations on the CPU. This is exactly what GPUs are for. It's more of a show-off than something practical, but the library, obviously, can be used for better cases.
I was also lucky enough to find a good and, most of all, simple demo on GitHub, called MillionPoints, which does what I needed: transforming 1 million points on the GPU using compute shaders and the Graphics.DrawMeshInstancedIndirect function. All I had to do was convert the simple compute shader code into pure C# and make it run on the CPU.
While I still don't consider myself a multi-threaded code expert, because one never stops learning and I haven't had the chance to use multi-threading in sophisticated algorithms yet, I dare say I am starting to have a good understanding of all the problems involved, and consequently I designed Svelto.Tasks to be very simple to use in a multi-threaded environment, exactly like the new Unity C# Job System does. Since at Freejam we hire a lot of junior programmers, I have had the chance to see first hand all the problems that can arise when advanced tools are put in inexpert hands. That's why I had to design something very straightforward to use and, most of the time, worry-free. This is what I hope to have achieved with Svelto.Tasks, improving it constantly over the years.
There are two key elements that make Svelto.Tasks powerful: the runners (or schedulers) and the continuation. The runners are designed to run any kind of IEnumerator (often simply called a task) on any kind of defined runner. Svelto.Tasks already ships with a lot of Unity-related runners, but two are platform agnostic: the MultithreadRunner and the SyncRunner.
The concept of continuation (similar to async/await) is even more powerful. It allows starting a task on a specific scheduler from another task running on a different scheduler, and resuming from there once the new task is finished. This is much simpler to code than to explain. Although there are similarities, Tasks.Net and Svelto.Tasks are obviously designed differently, as the latter is designed around the problems intrinsic to game development production, including performance.
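Before diving into the real code, here is a minimal sketch of the pattern (the HeavyWork/MainThreadTask names are invented for illustration; the ThreadSafeRunOnSchedule and StandardSchedulers calls are the same ones used by the demo code below):

```csharp
IEnumerator HeavyWork()
{
    //CPU intensive work, running on the multi-threaded runner
    yield return null;
}

IEnumerator MainThreadTask()
{
    //start HeavyWork on another scheduler and yield its continuation:
    //this task resumes, still on its own scheduler, only once HeavyWork has completed
    yield return HeavyWork().ThreadSafeRunOnSchedule(StandardSchedulers.multiThreadScheduler);

    //back here, the results produced by HeavyWork can be consumed safely
}
```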
Enough talk; let's dive into the details. Open the scene main.unity and click on the MillionPoints GameObject. Be sure that the MillionPointsCPU MonoBehaviour is enabled and the GPU one is disabled. Run it, but don't expect great performance in the editor: there is a huge difference between the editor and standalone clients in this case. I will show all the differences in a bit. BTW, I am currently using Unity 2018.1b3.
The goal is to perform the following cache-friendly CPU instructions inside the inner loop of the ParticlesCPUKernel MoveNext() method, on 1 million points, every frame, using tasks running on multiple cores:
```csharp
class ParticlesCPUKernel : IEnumerator
{
    int startIndex;
    int endIndex;
    CPUParticleData[] _particleDataArr;
    GPUParticleData[] _gpuparticleDataArr;

    static uint Hash(uint s)
    {
        s ^= 2747636419u;
        s *= 2654435769u;
        s ^= s >> 16;
        s *= 2654435769u;
        s ^= s >> 16;
        s *= 2654435769u;
        return s;
    }

    static float Randomf(uint seed)
    {
        return Hash(seed) / 4294967295.0f; // 2^32-1
    }

    static void RandomUnitVector(uint seed, out Vector3 result)
    {
        float PI2 = 6.28318530718f;
        float z = 1.0f - 2.0f * Randomf(seed);
        float xy = (float) Math.Sqrt(1.0 - z * z);
        float sn, cs;
        var value = PI2 * Randomf(seed + 1);
        sn = (float) Math.Sin(value);
        cs = (float) Math.Cos(value);
        result.x = sn * xy;
        result.y = cs * xy;
        result.z = z;
    }

    static void RandomVector(uint seed, out Vector3 result)
    {
        //return RandomUnitVector(seed) * sqrt(Random(seed + 2));
        RandomUnitVector(seed, out result);
        var sqrt = (float) Math.Sqrt(Randomf(seed + 2));
        result.x = result.x * sqrt;
        result.y = result.y * sqrt;
        result.z = result.z * sqrt;
    }

    static float quat_from_axis_angle(ref Vector3 axis, float angle, out Vector3 result)
    {
        float half_angle = (angle * 0.5f) * 3.14159f / 180.0f;
        var sin = (float) Math.Sin(half_angle);
        result.x = axis.x * sin;
        result.y = axis.y * sin;
        result.z = axis.z * sin;
        return (float) Math.Cos(half_angle);
    }

    static void Cross(ref Vector3 lhs, ref Vector3 rhs, out Vector3 result)
    {
        result.x = lhs.y * rhs.z - lhs.z * rhs.y;
        result.y = lhs.z * rhs.x - lhs.x * rhs.z;
        result.z = lhs.x * rhs.y - lhs.y * rhs.x;
    }

    static void rotate_position(ref Vector3 position, ref Vector3 axis, float angle, out Vector3 result)
    {
        Vector3 q;
        var w = quat_from_axis_angle(ref axis, angle, out q);

        Cross(ref q, ref position, out result);
        result.x = result.x + w * position.x;
        result.y = result.y + w * position.y;
        result.z = result.z + w * position.z;

        Vector3 otherResult;
        Cross(ref q, ref result, out otherResult);
        result.x = position.x + 2.0f * otherResult.x;
        result.y = position.y + 2.0f * otherResult.y;
        result.z = position.z + 2.0f * otherResult.z;
    }

    public ParticlesCPUKernel(int startIndex, int numberOfParticles, MillionPointsCPU t)
    {
        this.startIndex = startIndex;
        endIndex = startIndex + numberOfParticles;
        _particleDataArr = t._cpuParticleDataArr;
        _gpuparticleDataArr = t._gpuparticleDataArr;
    }

    public bool MoveNext()
    {
        for (int i = startIndex; i < endIndex; i++)
        {
            Vector3 randomVector;
            RandomVector((uint) i + 1, out randomVector);
            Cross(ref randomVector, ref _particleDataArr[i].BasePosition, out randomVector);

            var magnitude = 1.0f / randomVector.magnitude;
            randomVector.x *= magnitude;
            randomVector.y *= magnitude;
            randomVector.z *= magnitude;

            rotate_position(ref _particleDataArr[i].BasePosition, ref randomVector,
                            _particleDataArr[i].rotationSpeed * MillionPointsCPU._time,
                            out _gpuparticleDataArr[i].Position);
        }

        //is it actually working?
        //Utility.Console.Log("startIndex ".FastConcat(startIndex).FastConcat(" endIndex ").FastConcat(endIndex));
        return false;
    }

    public void Reset() {}

    public object Current { get; private set; }
}
```
I skip all the compute buffer initialization stuff because it is not relevant for this article, and I start directly from the function StartSveltoCPUWork(). First of all, I decided to split the job into 16 threads. As you know, when it comes to CPU intensive work, increasing the number of threads beyond the number of available cores gives only diminishing, logarithmic-like gains, so even 16 is probably already too much on my 8-core machine. Since the operations to apply to the vertices are quite straightforward, we can simply have each thread operate on a specific segment of the vertices array, and this is what the PrepareParallelTasks method does.
```csharp
private void PrepareParallelTasks()
{
    //calculate the number of particles per thread
    var particlesPerThread = _particleCount / NUM_OF_SVELTO_THREADS;

    //create a collection of tasks that will run in parallel on several threads.
    //the number of threads and the number of tasks to perform are independent.
    _multiParallelTask = new MultiThreadedParallelTaskCollection(NUM_OF_SVELTO_THREADS, false);

    //in this case though we just want to perform one task for each thread.
    //ParticlesCPUKernel is a task (IEnumerator) that executes the
    //algebra operations on the particles. Each task performs the operation
    //on particlesPerThread particles
    for (int i = 0; i < NUM_OF_SVELTO_THREADS; i++)
        _multiParallelTask.Add(new ParticlesCPUKernel((int) (particlesPerThread * i), (int) particlesPerThread, this));
}
```
The MultiThreadedParallelTaskCollection has an interface similar to the simpler ParallelTaskCollection, but it is able to run N tasks on M threads using M ParallelTaskCollections. You may have figured out how this works already: it basically creates M threads and runs one ParallelTaskCollection on each. The N tasks are split among the M ParallelTaskCollections, so it is the M ParallelTaskCollections, not the N tasks, that truly run in parallel. When N coincides with M, all the tasks run in parallel, like in this example: we initialize 16 ParallelTaskCollections running 1 task each, and each task applies the ParticlesCPUKernel instructions to 1,000,000 / 16 particles.
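Nothing forces N to be equal to M, though. As an illustration (the numbers and the particlesPerTask variable are hypothetical, but the constructor and Add calls are the same ones shown above), 64 kernels could be distributed over 8 threads:

```csharp
//64 tasks distributed over 8 threads: each of the 8 internal ParallelTaskCollections
//runs 8 kernels sequentially, while the 8 collections run truly in parallel
var collection = new MultiThreadedParallelTaskCollection(8, false);
var particlesPerTask = _particleCount / 64;

for (int i = 0; i < 64; i++)
    collection.Add(new ParticlesCPUKernel((int) (particlesPerTask * i), (int) particlesPerTask, this));
```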
Remember, the MultiThreadedParallelTaskCollection is just an IEnumerator that must be run like any other task in Svelto.Tasks, so it doesn't just start on its own. For the purpose of this article, I show two different ways to start the collection execution. All the following code assumes that you know well how to work with the yield instruction.
Let's start by enabling the #Test2 define, as it is the simplest scenario. In this case, everything is driven by a single loop running on the main thread.
```csharp
IEnumerator MainThreadLoopWithNaiveSynchronization()
{
    var bounds = new Bounds(_BoundCenter, _BoundSize);
    var syncRunner = new SyncRunner();

    while (_breakIt == false)
    {
        _time = Time.time;

        //exploit continuation here. Note that we are using the SyncRunner:
        //this will actually stall the main thread and its execution until
        //the multiParallelTask is done
        yield return _multiParallelTask.ThreadSafeRunOnSchedule(syncRunner);

        //then it resumes here, copying the result to the particleDataBuffer.
        //remember, multiParallelTask is not executing anymore until the next frame,
        //so the array is safe to use
        _particleDataBuffer.SetData(_gpuparticleDataArr);

        //render the particles. I use DrawMeshInstancedIndirect but
        //there aren't any compute shaders running. This is so cool!
        Graphics.DrawMeshInstancedIndirect(_pointMesh, 0, _material, bounds, _GPUInstancingArgsBuffer);

        //continue the cycle on the next frame
        yield return null;
    }

    //the application is shutting down. This is not strictly necessary in a
    //standalone client, but it is necessary to stop all the threads when the
    //application is stopped in the Editor.
    _multiParallelTask.ClearAndKill();
    TaskRunner.Instance.StopAndCleanupAllDefaultSchedulerTasks();
}
```
As I said, this is the simplest flow to understand. The intention is to wait for the other threads to finish; that's why I run the multi-threaded parallel collection on the SyncRunner, which is designed to stall the current thread. Once the execution is completed, the rest of the code runs on the main thread.
This is probably not the best solution, because the other threads could start computing the particle operations for the next frame between the end of this function and its next execution. Let's see this visually:
In the way we are running the loop, the multi-threaded collection starts during the Update phase, as the main thread loop runs on the update runner. While it's running, it stalls the main thread, so nothing else can be executed; then it finishes and the process continues. The red vertical line gives an idea of where the threads start and complete their run. However, why should the Update phase wait for the threads to finish their operations, when those could have run outside the Update phase, in parallel with it?
We could invert the way we trigger the multi-threaded operations, so that we compute the next values in between updates rather than inside the Update phase.
There is more than one way to achieve this. Let's see an example by enabling the #Test1 define:
```csharp
IEnumerator MainThreadOperations()
{
    var bounds = new Bounds(_BoundCenter, _BoundSize);
    var syncRunner = new SyncRunner(true);

    //these will help with synchronization between threads
    WaitForSignalEnumerator _waitForSignal = new WaitForSignalEnumerator();
    WaitForSignalEnumerator _otherwaitForSignal = new WaitForSignalEnumerator();

    //Start the operations on other threads
    OperationsRunningOnOtherThreads(_waitForSignal, _otherwaitForSignal)
        .ThreadSafeRunOnSchedule(StandardSchedulers.multiThreadScheduler);

    //start the main loop
    while (true)
    {
        _time = Time.time;

        //wait until the other thread tells us that the data is ready to be used.
        //Note that I am stalling the main thread here! It is entirely up to you
        //whether to stall it or not, as you can see with the other use cases
        yield return _otherwaitForSignal.RunOnSchedule(syncRunner);

        _particleDataBuffer.SetData(_gpuparticleDataArr);

        //render the particles. I use DrawMeshInstancedIndirect but
        //there aren't any compute shaders running. This is so cool!
        Graphics.DrawMeshInstancedIndirect(_pointMesh, 0, _material, bounds, _GPUInstancingArgsBuffer);

        //tell the other thread that it can now perform the operations
        //for the next frame.
        _waitForSignal.Signal();

        //continue the cycle on the next frame
        yield return null;
    }
}

IEnumerator OperationsRunningOnOtherThreads(WaitForSignalEnumerator waitForSignalEnumerator,
                                            WaitForSignalEnumerator otherWaitForSignalEnumerator)
{
    //a SyncRunner stops the execution of the thread until the task is completed.
    //the parameter true means that the runner will sleep in between yields
    var syncRunner = new SyncRunner(true);

    while (_breakIt == false)
    {
        //execute the tasks. The MultiThreadedParallelTaskCollection is a special collection
        //that uses N threads of its own to execute the tasks. This thread
        //doesn't need to do anything else meanwhile and will yield until
        //the collection is done. That's why the SyncRunner can sleep between yields,
        //so that this thread won't take much CPU just waiting for the parallel
        //tasks to finish
        yield return _multiParallelTask.ThreadSafeRunOnSchedule(syncRunner);

        //the 1 million particle operations are done, let's signal that the
        //result can now be used
        otherWaitForSignalEnumerator.Signal();

        //wait until the application is over, or until the main thread tells
        //us that we can perform the particle operations again. This is an
        //explicit while instead of a yield only because of the _breakIt
        //condition, which is needed because, when this application runs
        //in the editor, the spawned threads would not stop until the Editor
        //is shut down.
        while (_breakIt == false && waitForSignalEnumerator.MoveNext() == true)
            ThreadUtility.SleepZero();
    }

    //the application is shutting down. This is not strictly necessary in a
    //standalone client, but it is necessary to stop all the threads when the
    //application is stopped in the Editor.
    _multiParallelTask.ClearAndKill();
    TaskRunner.Instance.StopAndCleanupAllDefaultSchedulerTasks();
}
```
In this case we have two loops: one running on the main thread, as we start the MainThreadOperations task using the standard update runner, and the other on another thread, by starting OperationsRunningOnOtherThreads on the standard multi-thread runner before the main loop starts. Note that the standard threaded runner is used just to check when the _multiParallelTask completes, as the _multiParallelTask uses its own set of threads to run.
At this point _waitForSignal and _otherwaitForSignal are used to signal when the operations on each thread are completed. I hope this is intuitive: the main thread first waits for the other threads to finish; when this happens, the draw mesh call is issued and the main thread signals the other thread to start the operations for the next frame. Since the other thread can finish before the next update, it yields its execution to other threads until it is signalled to compute the next values.
The _breakIt part is a bit awkward at the moment. It's necessary to be sure that the threads stop when the code is executed in the editor, as stopping the execution in the editor won't kill the running threads like it normally happens with a standalone application (note: it has been eliminated with the latest updates).
The last scenario is an alternative possibility:
```csharp
IEnumerator MainLoopOnOtherThread()
{
    var syncRunner = new SyncRunner();
    var then = DateTime.Now;

    RenderingOnCoroutineRunner().ThreadSafeRun();

    var CopyBufferOnUpdateRunner = new SimpleEnumerator(this); //let's avoid useless allocations

    while (_breakIt == false)
    {
        _time = (float) (DateTime.Now - then).TotalSeconds;

        //exploit continuation here. Note that we are using the SyncRunner:
        //this will actually stall this thread and its execution until
        //the multiParallelTask is done
        yield return _multiParallelTask.ThreadSafeRunOnSchedule(syncRunner);

        //then it resumes here, copying the result to the particleDataBuffer.
        //remember, multiParallelTask is not executing anymore until the next iteration,
        //so the array is safe to use
        var continuator = CopyBufferOnUpdateRunner.ThreadSafeRunOnSchedule(StandardSchedulers.updateScheduler);

        while (_breakIt == false && continuator.MoveNext() == true)
            ThreadUtility.Yield();
    }

    //the application is shutting down. This is not strictly necessary in a
    //standalone client, but it is necessary to stop all the threads when the
    //application is stopped in the Editor.
    _multiParallelTask.ClearAndKill();
    TaskRunner.Instance.StopAndCleanupAllDefaultSchedulerTasks();
}

IEnumerator RenderingOnCoroutineRunner()
{
    var bounds = new Bounds(_BoundCenter, _BoundSize);

    while (true)
    {
        //render the particles. I use DrawMeshInstancedIndirect but
        //there aren't any compute shaders running. This is so cool!
        Graphics.DrawMeshInstancedIndirect(_pointMesh, 0, _material, bounds, _GPUInstancingArgsBuffer);

        //continue the cycle on the next frame
        yield return null;
    }
}

class SimpleEnumerator : IEnumerator
{
    MillionPointsCPU _million;

    public SimpleEnumerator(MillionPointsCPU million)
    {
        _million = million;
    }

    public bool MoveNext()
    {
        _million._particleDataBuffer.SetData(_million._gpuparticleDataArr);
        return false;
    }

    public void Reset()
    {
        throw new NotImplementedException();
    }

    public object Current { get; }
}
```
So what's the deal here? Again two loops, although this time the one running on the main thread just keeps issuing the draw mesh call with the last updated particles. In the multi-threaded loop, instead, continuation is exploited to set the freshly computed particles into the compute buffer. Since Unity throws an exception if this is done on other threads, we have to run that task on the main thread and wait for it to finish.
The frame rate for this case is much higher, but don't be fooled: the reason is that the main thread this time is not stalled. The particles, however, are still updated at more or less the same rate as in the other cases.
Before testing the performance, let me spend a few words on the cache-friendly code created to compute the particle values. As you may notice, I use a lot of ref and out. This is because it's very important to avoid copying structs when not necessary, as it can hit the stack hard. This is also why C# 7.2 has recently introduced ref returns and ref locals, so that I can write simpler code that avoids copies of structs on the stack. You should always pass your Vector[N] values by ref or out.
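To make the point concrete, here is a small standalone illustration (not taken from the project) of the difference between passing a struct by value and by ref/out in a hot loop:

```csharp
//a struct of four Vector3 is 48 bytes: passing it by value copies all 48 bytes
//on every single call, while ref/out passes only a managed pointer to it
struct BigStruct { public Vector3 a, b, c, d; }

static float LengthByValue(BigStruct s)   { return s.a.magnitude; } //copies 48 bytes per call
static float LengthByRef(ref BigStruct s) { return s.a.magnitude; } //no copy

//out also avoids returning a struct by value: the caller's variable is written directly
static void Normalize(ref Vector3 v, out Vector3 result)
{
    float inv = 1.0f / v.magnitude;
    result.x = v.x * inv;
    result.y = v.y * inv;
    result.z = v.z * inv;
}
```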
Let’s do some benchmarking now, using the #Test1 case:
| Runtime | Frame rate |
| --- | --- |
| Mono/.Net 4.6 | ~20fps |
| IL2CPP | ~48fps |
| UWP | ~23fps |
| UWP .Net Native | ~59fps(*) |
The last test was just an experiment. I knew that UWP .Net Core code can be compiled to native code through the .Net Native toolchain, therefore I obviously had to compare it against IL2CPP. The result makes me think that a future integration in Unity for standalone platforms, if possible, would be beneficial (update: now available in Unity 2018; * I wasn't able to reproduce the UWP .Net Native timings, so I may have done something wrong then. They are in line with the IL2CPP timings in my new tests).
And finally the project can be downloaded from github as usual: https://github.com/sebas77/Svelto.Tasks.MillionPoints
P.S.: If you notice that the Svelto.Tasks code can perform better, please tell me; I am sure there are areas to improve, as I improve it continuously.
Update: Svelto.Tasks and Unity Jobs System comparison
Obviously I was curious to see how the Unity Jobs System compares with Svelto.Tasks. Both systems have been designed to make writing multi-threaded code worry-free, but Svelto.Tasks relies only on the power of C#, while Unity Jobs is allegedly written mainly in native code, exploiting the internal job workers of the Unity Engine.
There isn't any difference between threads in C# and threads in C++: threads are handled by the operating system anyway, so in both languages what is implemented is just a wrapper over the underlying system. For this reason I had some concerns about the Unity Jobs System, as we have noticed in the past how marshalling can affect the performance of C# code.
However, I can confirm that, once compiled, Unity Jobs runs at the same speed as Svelto.Tasks, with very similar results. The IL2CPP path still needs some optimization, as Svelto.Tasks is actually faster there. Since IL2CPP is not affected by the marshalling issue, it's very likely that the Unity Jobs work on IL2CPP is not complete yet.
Let's start with the standard Mono version. I have updated the code on GitHub and the Unity Jobs System version is in the UnityJobsKernel folder.
I have compiled the "naive" version (#define TEST2) to execute the Svelto tasks and the signal-based version (#define TEST1). Both of them execute at the same speed as the Unity Jobs version. It's important to note that the Unity Jobs version maps almost exactly to the "naive" version of the Svelto solution.
Now you may wonder: why is the naive version not slower than the signal-based version, if it is so naive? I will come to that later; it actually behaves exactly as I assumed, meaning that in a real-world scenario the "advanced" version could be faster than both the naive version and the Unity Jobs System version.
Let's see the results of the 1 Million Points simulation (Unity Jobs is not available in UWP yet, thus I couldn't test the performance there). This time I am measuring ranges in milliseconds (lower is better). Note that, in order to achieve these results, I targeted a 32-bit CPU for the Mono version and 64-bit for the native versions. I didn't have the time to investigate why Mono is faster with 32 bits and IL2CPP with 64 bits, but I guess it's just about the sheer amount of data to handle and its layout.
| Platform | Svelto Tasks (ms) | Svelto Tasks Naive (ms) | Unity Jobs System (ms) |
| --- | --- | --- | --- |
| Mono | 56-59 | 57-59 | 56-57 |
| IL2CPP | 24-26 | 24-25 | 29-30 |
| UWP | 45-49 | 45-49 | n/a |
| UWP Native | 23-24 | 23-24 | n/a |
Pretty close, right? The difference in IL2CPP could be significant, but Unity will very likely improve it. I am not sure what happened with the UWP .Net Native toolchain profiling there: somehow I am not able to reproduce the timings of the first profiling. Maybe I did something wrong then? I don't care to investigate, as the platform is not a priority for me, and it would actually mean that IL2CPP is as good as the .Net Native toolchain.
I will not dig into the Unity Jobs details: I don't have the time and it's out of the scope of this article. However, I will explain why it maps to the naive version of the Svelto.Tasks solution. As explained in the first part of this article, the naive version stalls the main thread until the offloaded operations are completed. Working on Svelto.Tasks I have learned that the concept of a main thread is, honestly, obsolete. I would love to work with an engine where the main thread is not a thing; after all, that is also the mentality behind DX 12 and Vulkan. Even with the Svelto.Tasks Vanilla example the application runs in its own thread, which is not the "main thread".
However, following the "naive" approach, the code, not being optimal, must rely exclusively on the number of CPU cores and their power, more than on the fact that the code is multi-threaded. This is what I explained above when I showed the Unity frame rendering. We don't want just to trigger a "burst" of operations; we should actually be able to run code in parallel with the rest of the Unity pipeline.
As shown, the "advanced" and "naive" updates have similar results, but what would happen if the main thread executed other heavy operations outside the job update? Let's see what happens if we add a Thread.Sleep(10) in the main update, simulating a main thread taking 10 extra milliseconds for other operations:
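On the Svelto.Tasks side, this is a sketch of where the simulated cost would go, assuming the same #Test1 loop shown earlier (DO_SOMETHING_SERIOUSLY_SLOW is the same define used in the Unity Jobs kernel below; the exact placement in my test may differ):

```csharp
while (_breakIt == false)
{
    _time = Time.time;

#if DO_SOMETHING_SERIOUSLY_SLOW
    Thread.Sleep(10); //pretend the main thread has 10ms of unrelated work to do
#endif

    //with the signal-based version the particle kernels have been running on the
    //other threads during those 10ms; with the naive version they could only start afterwards
    yield return _otherwaitForSignal.RunOnSchedule(syncRunner);

    _particleDataBuffer.SetData(_gpuparticleDataArr);
    Graphics.DrawMeshInstancedIndirect(_pointMesh, 0, _material, bounds, _GPUInstancingArgsBuffer);

    _waitForSignal.Signal();

    yield return null;
}
```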
Errata Corrige:
Actually, there is a way to kick off the jobs just scheduled without needing to call the Complete() method: the static ScheduleBatchedJobs() method of the JobHandle class. I changed the Unity Jobs System kernel in this way:
```csharp
void Update()
{
    Time = UnityEngine.Time.time / 10;

    var jobSchedule = _job.Schedule(_particleCount, 32);

    JobHandle.ScheduleBatchedJobs();

    //do something seriously slow
#if DO_SOMETHING_SERIOUSLY_SLOW
    Thread.Sleep(10);
#endif

    jobSchedule.Complete();

    _particleDataBuffer.SetData(_gpuparticleDataArr);

    Graphics.DrawMeshInstancedIndirect(_pointMesh, 0, _material, _bounds, _GPUInstancingArgsBuffer);
}
```
and now the new timings are more in line with Svelto.Tasks:
| Platform | Svelto Tasks (ms) | Unity Jobs System (ms) |
| --- | --- | --- |
| Mono | 57-62 | 59-62 |
This is closer to a real-life scenario than all the Unity Jobs demos shown so far. I understand, anyway, that Unity wants to keep things simple. However, I don't think things are that hard with Svelto, so personally I will keep on using Svelto (for this and many other reasons), but in the future I will integrate Unity Jobs because of the upcoming "Burst" technology; it will, however, have to perform faster than IL2CPP, otherwise there is no real point in using it.
As usual, don't quote me without testing yourself! I write these articles only when I have time, which is usually late at night, so it's always better to double check by running my example, which you can find on GitHub 😉