Write a software CUDA path tracer from scratch: the road to better parallelism

CUDA-PT

I. Intro

I decided to write this blog post in English, which might help more people, I guess. It's been a while since I finished the backbone of CUDA-PT, and I think it's time to honor the original purpose of developing this repo: write your own wheel, profile and benchmark it, feel the challenges and understand things better. I can't say I have done a great job at benchmarking, since the commit history makes it difficult to rewind to the previous, sub-optimal implementations. Nevertheless, I'll try to make my point and offer my two cents, even if they are insignificant to the CUDA geeks out there.


II. Preliminaries

I've already written a path tracing renderer, AdaPT, so why yet another path tracer? The answer is simple: I didn't dive deep into the IR layer of Taichi lang, so it's difficult to max out GPU performance with Taichi's JIT. My experiments show that handcrafted CUDA kernels can be 2~3x as fast as the Taichi-based code. The more access we have to the basic building blocks, the more we, as humans, can do to manually improve the implementation. I am not saying that I am smarter than the compilers. I am just saying that if I don't know how the stuff in the bottom layer works, it's highly unlikely that I'd write anything better than a good compiler.

Write your own wheel, profile and benchmark it, feel the challenges and understand things better along the way.

So much for the self-hypnosis; let's cut to the chase. The tide of AI is surging ever more fiercely, creating a huge demand for computation power. However, AI-related computation like deep learning is extremely regular, that is, the parallelism is a natural fit. Take General Matrix Multiply (GEMM) as an example. It is the most important building block of deep learning: linear layers and attention both need it. For GEMM, the threads on a GPU can pretty much do the same thing at any given time: there need be no branching (flow divergence), and the memory access patterns are regular enough (coalesced, and can even be vectorized). We even have specialized cores for this op (Tensor Cores, and yes, I do know we have RT Cores for ray tracing). Balanced workloads and regular memory access make it possible to compute a theoretical upper bound on efficiency, and through optimization techniques we can progressively approach it.
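To make that regularity concrete, here is a deliberately naive GEMM kernel sketch (no tiling, no Tensor Cores): every thread executes the same instruction stream, and consecutive threads in a warp touch consecutive addresses.

```cuda
// Naive GEMM sketch: C = A * B, all N x N, row-major.
// Every thread runs the exact same loop with no data-dependent branching;
// consecutive threads in a warp read consecutive elements of B and write
// consecutive elements of C (coalesced accesses).
__global__ void gemm_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;   // only diverges at the matrix edge
    float acc = 0.f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}
```

This is far from an optimized GEMM, but the point stands: the control flow is uniform and the access pattern is predictable, which is exactly what a path tracer lacks.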

But... it's rather hard for a software path tracing program: in a classic path tracing pipeline, each thread processes the tracing of one ray (from a pixel). The ray bounces stochastically through the scene, meaning that:


  • Neighboring threads might access memory locations distant from each other (random, uncoalesced accesses).
  • Neighboring threads might behave differently, e.g. one ray gets reflected while the other gets refracted (breaks SIMT).
  • Tail effect: some threads are dead while others survive (the warp is neither full nor empty).

All these problems can be fatal for the speed of the renderer.
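The three problems above all show up in a per-ray kernel. Here is a hypothetical sketch: `Ray`, `Hit`, `scene_intersect()`, `reflect_ray()` and `sample_brdf()` are placeholders standing in for a real renderer's types and functions, not actual CUDA-PT APIs.

```cuda
// Hypothetical per-ray kernel illustrating the three problems above.
struct Ray { /* origin, direction, throughput, ... */ };
struct Hit { bool valid; int material; /* ... */ };

__device__ Hit scene_intersect(const Ray& r);            // placeholder
__device__ Ray reflect_ray(const Ray& r, const Hit& h);  // placeholder
__device__ Ray sample_brdf(const Ray& r, const Hit& h);  // placeholder

constexpr int MAX_BOUNCES = 8;
constexpr int MIRROR      = 0;

__global__ void trace_kernel(Ray* rays, int num_rays) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_rays) return;
    Ray ray = rays[tid];
    for (int bounce = 0; bounce < MAX_BOUNCES; ++bounce) {
        // (1) Neighboring rays hit distant primitives: scattered reads.
        Hit hit = scene_intersect(ray);
        // (3) Tail effect: this thread exits while its warp mates keep going.
        if (!hit.valid) break;
        // (2) SIMT breaks: both sides of the branch are serialized in a warp.
        if (hit.material == MIRROR)
            ray = reflect_ray(ray, hit);
        else
            ray = sample_brdf(ray, hit);
    }
}
```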

OK, so there are two main kinds of kernels (renderers):

All the logic lives in the same, big kernel. We can use registers to hold some intermediate results.