Performance Analysis of Multithreaded Renderer
The renderer of a game engine is often a CPU-side performance bottleneck. Multithreading the rendering step is an effective way to address this without sacrificing content detail. This project analyzes the primary factors affecting multithreaded rendering performance and explores relevant optimization methods.
Role
Graphics Programmer
Game Engine
Personal Phoenix Engine
Platform
PC (Windows)
Development Time
Ongoing, 5 Months, 2020
Team Size
Solo Developer

Contents
-
Current Progress Demonstration
-
Screenshots
-
Goals, Features and Even Better if
-
Dynamic Directional Light & Spot Light Shadow Maps
-
Job System, Worker Threads and Atomic Mutex
-
3-D Billboarded Particles
-
Particle Optimizations, Limitations and Data Decomposition Models
-
Other Features
-
Multithreading Models, Profiling, Tracking Memory Leaks and Data Dumping
-
Development Insights and Deep Learning
-
Post Mortem
Current Progress Demonstration
Screenshots




What did I plan to accomplish?
In simple terms - rendering more objects on screen. High-fidelity graphics are a big topic of interest for most hardcore gamers, so models and particle effects, augmented with realistic lighting and shadows, were added to give a greater sense of depth to the scene. Another goal was understanding the intricacies of synchronization between multiple threads and between the update and render loops.
What does the artifact currently feature?
-
Multiple Directional Light and Spotlight Shadows
-
CPU side Multithreaded 3D Billboarded Particle System
-
Multithreaded View Frustum Culling
What would I do to make it better?
-
Find more spots for parallelization
-
Add transparency to the particles and GPU-instanced particles
-
Implement 4-Split Cascaded shadow maps and scene graphs
Dynamic Directional Light & Spot Light Shadow Maps
-
Supports multiple shadow-casting spot lights and directional lights.
-
Shadow casting on any light source can be toggled at runtime via the UI.
-
Non-shadow-casting point lights are also supported, with options to update various light parameters at runtime via the UI.
-
The purpose of dynamic shadows here is to stress the rendering engine's performance bottleneck and provide a testbed for exploring optimization techniques.

Demonstration of Multiple Directional Light and Spot Light Shadows
The Light Structure used for Lighting with or Without Shadows
Lighting With Shadows Fragment Shader Code
Method for Rendering Shadow Maps for all Shadow Casting Directional Lights.
-
There were a number of challenges getting shadow maps working. A major one that still remains is that the shadows suffer from artifacts like peter panning and shadow acne, but the main focus of this project was optimization via multithreading, not perfectly clean shadows.
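The acne/peter-panning trade-off comes down to one comparison and one bias constant. A minimal CPU-side sketch of that core shadow test (the constant and function names here are illustrative, not the engine's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A fragment is lit if its light-space depth is not farther than the depth
// stored in the shadow map, minus a small bias. Too little bias causes
// shadow acne (self-shadowing); too much pushes shadows away from their
// casters (peter panning). kShadowBias is an assumed tuning value.
constexpr float kShadowBias = 0.005f;

bool IsLitByShadowCaster(const std::vector<float>& shadowMap,
                         size_t texelIndex, float fragmentLightSpaceDepth)
{
    float occluderDepth = shadowMap[texelIndex];
    return fragmentLightSpaceDepth - kShadowBias <= occluderDepth;
}
```

In the fragment shader the same comparison runs per pixel against the depth texture rendered from the light's point of view.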
Job System, Worker Threads and Atomic Mutex
-
The Job System is created as a global singleton and utilizes (total number of cores on the machine - 1) worker threads, reserving one core for the main thread.
-
Each thread sets its thread affinity mask over the cores to reduce resource contention and context switching.
-
The system is capable of handling child dependencies and indirect blocker dependencies on shared resources.
-
It follows a lock-free threading model using std::atomic.
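The worker-count and affinity scheme above can be sketched as follows. This is a reconstruction, not the engine's code: on Windows the mask would be passed to SetThreadAffinityMask; only the sizing and mask math is shown here.

```cpp
#include <cstdint>
#include <thread>

// (cores - 1) workers, each pinned to its own core, leaving core 0 free
// for the main thread.
unsigned int NumWorkerThreads(unsigned int numCores)
{
    return numCores > 1 ? numCores - 1 : 1;
}

uint64_t WorkerAffinityMask(unsigned int workerIndex)
{
    // Worker i gets core (i + 1); core 0 is reserved for the main thread.
    return uint64_t{1} << (workerIndex + 1);
}
```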

Job System Worker Threads - Parallel Call Stack
Job System Worker Threads
The Job System - Overview
Atomic Mutex - a mutex created using atomics and Acquire and Release Semantics
(Reference - Which is more efficient, basic mutex lock or atomic integer?)
-
A great job system might be very similar to an OS-level scheduler: supporting multiple types of queues with job priorities, giving jobs fixed time slices, and handling starvation and other problems that arise from concurrency.
3-D Billboarded Particles
-
A 3-D billboarded particle system was implemented as a scalable workload, used in conjunction with the shadow maps to test performance limitations.
-
The particles are CPU instantiated, view frustum culled and billboarded.
-
The number of particles spawned per frame can be tuned as a controllable knob via the on screen UI.

3D Billboarded Particles Demonstration
Particle Emitter 3D Data Members
3D Particle System
Particle 3D
Particle Optimizations, Limitations and Data Decomposition Models
Particles Spawning
-
Using a std::vector of particles on the emitter turned out to be slow due to the vector's dynamic memory allocations. So the 3D particle emitter was switched to a fixed-size array, allocated dynamically once at construction time.
-
The fixed-size array on the Particle Emitter 3D is treated like a circular/ring buffer, and the last position where a particle was spawned is cached in the atomic variable m_lastSearchPos.
-
A particle can only be spawned if space is available in the buffer.
-
Before spawning particles, a pre-check is performed to see if space is available for reuse; another atomic variable, m_numAliveParticles, tracks the number of alive particles against the buffer limit.
-
If the pre-check passes, another check finds a slot available for reuse using a parallel array of atomic bools, m_isParticleGarbage, with an atomic exchange on the slot index.

Particle 3D in a Circular/Ring Buffer on Particle Emitter 3D
(Image Source: Wikipedia - Circular_buffer)
Method to insert particles into the Circular/Ring Particle Buffer
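The slot-claiming logic can be sketched as below. The member names mirror those mentioned above, but this is a reconstruction under assumptions, not the engine's actual method: exchange() claims a slot atomically, so two spawning jobs can never reuse the same index.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

class ParticleRingBuffer
{
public:
    explicit ParticleRingBuffer(size_t capacity)
        : m_isParticleGarbage(capacity)
    {
        // All slots start out free (garbage) and available for reuse.
        for (auto& g : m_isParticleGarbage) g.store(true);
    }

    // Returns the claimed slot index, or -1 if no slot is free.
    int ClaimSlot()
    {
        const size_t capacity = m_isParticleGarbage.size();
        const size_t start = m_lastSearchPos.load();
        for (size_t i = 0; i < capacity; ++i)
        {
            size_t slot = (start + i) % capacity; // ring traversal
            // exchange(false) returns the previous value; only the thread
            // that observed 'true' owns the slot.
            if (m_isParticleGarbage[slot].exchange(false))
            {
                m_lastSearchPos.store(slot); // cache for the next search
                return static_cast<int>(slot);
            }
        }
        return -1;
    }

private:
    std::vector<std::atomic<bool>> m_isParticleGarbage;
    std::atomic<size_t> m_lastSearchPos{0};
};
```

Caching m_lastSearchPos means successive spawns usually find a free slot immediately instead of rescanning from index 0.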
Particles Data Decomposition Models
Model 1 - Game handles Particle Spawning Job while Engine handles the Update Job
-
In this model the game posts one or more Game Emitter Spawn Jobs to the Job System, depending on the number of emitters in the scene and their slot availability.
-
Each Game Emitter Spawn Job spawns Emitter Spawn Sub Jobs, each covering (number of particles to spawn) / (number of cores - 1) particles; (number of cores - 1) is also the number of worker threads the job system creates at startup, reserving one core for the main thread.
-
These Spawn sub jobs are responsible for the actual spawning of the particles.
-
The job creates sub jobs each time so the system scales as the number of particles on a single emitter grows.
-
While the Job System worker threads complete all the Particle Spawning Jobs, the main thread idle-spins, constantly checking whether they are complete.
-
After the spawning jobs complete, the main thread posts a new Particle System Job, which posts an Emitter Update Job for every emitter it owns.
-
The Emitter Update Job will update the age, position, color, view frustum culling etc. of the particles in the particle buffer on the emitter.
-
Once the particle buffer update is complete, the Emitter Update Job spawns Create Particle Vertices Jobs, split the same way as the Game Emitter Spawn Job: (number of particles in the view frustum) / (number of cores - 1) particles per sub job, with (number of cores - 1) again matching the worker thread count.
-
The major reason for again splitting the Emitter Update Job into Particle Vertices Sub Jobs was the time-consuming billboarding calculations for the tens to hundreds of thousands of particles on a single emitter.
-
For efficiency and a lock-free threading model without synchronization primitives, the particle vertices sub jobs are divided using divide and conquer: each thread's access is limited to the start and end indices of the slots its job defines within the particle circular/ring buffer.
-
The main thread needed to idle-spin in this model because the job system lacked support for job dependencies.

Particle System update dependency order with spin wait on Main thread

Particle System 3D Update Job
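The divide-and-conquer split used by both the spawn and vertices sub jobs can be sketched as a pure range calculation (a reconstruction; the engine's actual job code is not shown here). Each sub job receives a disjoint [start, end) slice of the particle buffer, so no synchronization primitive is needed:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Cut numParticles into numWorkers contiguous [start, end) index ranges.
// Earlier workers absorb the remainder so every particle is covered
// exactly once.
std::vector<std::pair<size_t, size_t>> SplitIntoRanges(size_t numParticles,
                                                       size_t numWorkers)
{
    std::vector<std::pair<size_t, size_t>> ranges;
    if (numWorkers == 0 || numParticles == 0) return ranges;

    const size_t chunk = numParticles / numWorkers;
    const size_t remainder = numParticles % numWorkers;
    size_t start = 0;
    for (size_t w = 0; w < numWorkers; ++w)
    {
        size_t end = start + chunk + (w < remainder ? 1 : 0);
        if (end > start) ranges.emplace_back(start, end); // skip empty ranges
        start = end;
    }
    return ranges;
}
```

With numWorkers = (cores - 1), this matches the "particles / (cores - 1)" split described above.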
Model 2 - Game handles Particle Spawning Job And Game handles the Update Job
-
In this model the game posts one or more Game Emitter Spawn Jobs to the Job System, depending on the number of emitters in the scene and their slot availability.
-
Each Game Emitter Spawn Job, in its constructor, spawns Emitter Spawn Sub Jobs, split the same way as in Model 1: (number of particles to spawn) / (number of cores - 1) particles per sub job, matching the worker thread count.
-
These spawn sub jobs perform the actual spawning of the particles, and the Emitter Spawn Job depends on the completion of the sub jobs it spawned.
-
The job creates sub jobs every time for scalability with increase in number of particles over a single emitter.
-
Once all spawn sub jobs complete, the Game Emitter Spawn Job is ready to execute, and during its execution it spawns an Update Emitter Job.
-
The Update Emitter Job then follows the same execution and communication flow as Model 1.
-
This dependency-based model, where the job system handles the dependencies, reduces the amount of main-thread idle spinning.

Particle Emitter spawn and update dependency based jobs
Particles Tighter Data Packing
-
The original Particle 3D data structure used 80 bytes per particle, i.e. on a 64 KB L1 cache a total of 819 particles could be loaded at once.
-
The Particle 3D data structure was updated to use 60 bytes per particle by making the following changes -
-
Is-garbage and is-view-frustum-culled booleans moved to the emitter as arrays parallel to the particle array.
-
Age variables changed from float to uint16_t; age is now tracked as a frame count rather than a time.
-
UVs stored as a single sprite index integer rather than 4 floats for UV mins and maxes.
-
On a 64 KB L1 cache, a total of 1092 particles can now be loaded at once.
-
This is a big improvement when you have tens or hundreds of thousands of particles in the scene as more data will now be cache warm or cache local.

Size and Alignment of initial implementation of Particle 3D

Size and Alignment of updated implementation of Particle 3D for tighter data packing
Original Data members in initial implementation of Particle 3D
Data members in final implementation of Particle 3D for tighter data packing
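The two layouts can be reconstructed roughly as below. These field lists are hypothetical, chosen so the sizes match the figures above (80 and 60 bytes); the engine's actual members may differ. The packed version drops the per-particle bools (moved to parallel arrays on the emitter), shrinks the ages to uint16_t frame counts, and replaces four UV floats with one sprite index.

```cpp
#include <cstdint>

struct Particle3D_Original        // 80 bytes
{
    float position[3];            // 12
    float velocity[3];            // 12
    float color[4];               // 16
    float uvMins[2];              //  8
    float uvMaxs[2];              //  8
    float age;                    //  4
    float lifetime;               //  4
    float startSize;              //  4
    float endSize;                //  4
    float rotation;               //  4
    bool  isGarbage;              //  1
    bool  isViewFrustumCulled;    //  1  (+2 bytes padding)
};

struct Particle3D_Packed          // 60 bytes
{
    float    position[3];         // 12
    float    velocity[3];         // 12
    float    color[4];            // 16
    uint32_t spriteIndex;         //  4  (replaces 16 bytes of UVs)
    uint16_t ageFrames;           //  2
    uint16_t lifetimeFrames;      //  2
    float    startSize;           //  4
    float    endSize;             //  4
    float    rotation;            //  4
};

static_assert(sizeof(Particle3D_Original) == 80, "expected 80-byte layout");
static_assert(sizeof(Particle3D_Packed)   == 60, "expected 60-byte layout");

// Particles that fit in a 64 KB L1 data cache:
constexpr int kOriginalPerL1 = (64 * 1024) / sizeof(Particle3D_Original); // 819
constexpr int kPackedPerL1   = (64 * 1024) / sizeof(Particle3D_Packed);   // 1092
```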
Particle Limitations
-
Only additive blending is supported at present.
-
A particle can last a maximum lifetime of 65535 frames, i.e. at 60 FPS a maximum lifetime of about 18.2 minutes.
-
Particles do not support lighting or physics.
-
An emitter's particle buffer cannot be resized after construction.
Other features
-
View frustum culling has been implemented as one of the optimization techniques.
-
Screen-space post-processing effects, such as grayscale and sepia tone mapping, have been implemented via both pixel shader and compute shader for the purposes of performance analysis.
-
XML-based data-driven scene setup supports the following:
-
Light settings for all lights
-
Player Camera start Position
-
Models to load
-
Model Instances
-
Particle Emitter placements and effect settings
-
Scene startup settings and multithreading parameters
-
Minor features include switching to debug shaders, such as UV, surface normals, vertex tangents, bitangents and more, at runtime via the UI.
-
Other supported effect shaders include triplanar, Fresnel and discard shaders.
-
Integration of Dear ImGUI for on screen UI.
-
Frame rate and frame time are plotted over 2.5 minutes of playtime on a rolling basis using the ImGui ImPlot extension library integration.
-
The ImGui File Dialog extension library was also integrated to support file dialog boxes and runtime file loading for testing certain features.
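The per-pixel math behind the grayscale and sepia effects mentioned above is small enough to show on the CPU (in the engine it runs in a pixel or compute shader). The luminance weights are the common Rec. 601 values and the sepia matrix is a widely used approximation; whether the engine uses these exact coefficients is an assumption.

```cpp
#include <algorithm>
#include <array>

std::array<float, 3> ToGrayscale(const std::array<float, 3>& rgb)
{
    // Rec. 601 luma weights; the same value is written to all channels.
    float luma = 0.299f * rgb[0] + 0.587f * rgb[1] + 0.114f * rgb[2];
    return {luma, luma, luma};
}

std::array<float, 3> ToSepia(const std::array<float, 3>& rgb)
{
    // 3x3 color matrix; rows can overshoot 1.0, so clamp each channel.
    auto clamp01 = [](float v) { return std::min(1.0f, std::max(0.0f, v)); };
    return {
        clamp01(0.393f * rgb[0] + 0.769f * rgb[1] + 0.189f * rgb[2]),
        clamp01(0.349f * rgb[0] + 0.686f * rgb[1] + 0.168f * rgb[2]),
        clamp01(0.272f * rgb[0] + 0.534f * rgb[1] + 0.131f * rgb[2]),
    };
}
```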
Multithreading Models, Profiling, Tracking Memory Leaks and Data Dumping
Multithreading Models and Profiling
-
Choosing the right profiler was one of the hardest tasks, as Nvidia Nsight and Intel VTune did not support simultaneous instrumented CPU and GPU profiling at the fine granularity needed. They also had more overhead than expected.
-
An open source profiler - Tracy Profiler was selected after testing various profilers, as it allowed fine grained instrumented profiling on both the CPU and the GPU.
-
Two single-threaded and two multithreaded models were profiled with the same scene setup as shown in the demo.
Model 1 - Single Threaded Interleaved Update and Render loops

Single Threaded Interleaved Update and Render Loop Model
-
This is the Single threaded interleaved Update and Render model.
-
In this model we complete all the frame updates first and only then we start rendering.
-
Performance in this model is the worst of all models.

Tracy Profile Capture for this Model
Model 2 - Single Threaded Interlaced Update and Render loops

Single Threaded Interlaced Update and Render Loop Model
-
This is the Single threaded interlaced Update and Render model.
-
In this model, once something is done updating, rendering commands for that part of the loop are pushed immediately.
-
It can also be observed that, in this model the GPU becomes busy much earlier in the frame.
-
Performance in this model was observed to be better than the interleaved model.

Tracy Profile Capture for this Model
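The difference between the two single-threaded models is purely one of ordering, which can be sketched with stand-in work items (the logged strings are placeholders for the actual update and render passes):

```cpp
#include <string>
#include <vector>

// Interleaved: finish every update before any rendering starts, so the
// GPU sits idle for the whole update phase.
void RunInterleavedFrame(std::vector<std::string>& log)
{
    log.push_back("update:lights");
    log.push_back("update:particles");
    log.push_back("render:lights");
    log.push_back("render:particles");
}

// Interlaced: push render commands for each part as soon as its update is
// done, letting the GPU start working earlier in the frame.
void RunInterlacedFrame(std::vector<std::string>& log)
{
    log.push_back("update:lights");
    log.push_back("render:lights");
    log.push_back("update:particles");
    log.push_back("render:particles");
}
```

The earlier "render:lights" in the interlaced ordering is what the Tracy captures show as the GPU becoming busy sooner.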
Model 3 - Multithreaded Interleaved Update and Render loops

Multi Threaded Interleaved Update and Render Loop Model
-
This is the Multithreaded interleaved Update and Render model.
-
In this model, light view frustum culling, particles over one or more emitters, and player camera view frustum culling are updated concurrently over multiple threads. Once all updates are completed, rendering starts.
-
Update starts again once present is completed.
-
It can also be observed that, in this model the GPU becomes busy much earlier in the frame.
-
Performance in this model was observed to be significantly better than the Single threaded interleaved and the Single threaded interlaced models.

Tracy Profile Capture for this Model
Model 4 - Multithreaded Interlaced Update and Render loops

Multi Threaded Interlaced Update and Render Loop Model
-
This is the Multithreaded interlaced Update and Render model.
-
In this model we update light view frustum culling via a worker thread and signal the main thread to render the shadow map once a culling job on a light is completed.
-
Similarly, particles over one or more emitters are updated concurrently and signal the main thread upon completion, while player camera view frustum culling also runs over multiple threads concurrently.
-
It can also be observed that, in this model the GPU becomes busy much earlier in the frame.
-
This model has a lot of dependencies and state blockers.
-
Performance in this model was observed to be significantly better than the single-threaded interleaved and interlaced models. It also performs very competitively against the multithreaded interleaved model.

Tracy Profile Capture for this Model
Tracking Memory Leaks
-
Visual Studio Diagnostic Tools were used to track memory leaks by taking memory usage snapshots at program start, while the program was running, and right before exit.
-
Deleaker, a C++ memory leak detection and tracking Visual Studio plugin, was also used occasionally.

Tracking Memory with Visual Studio Diagnostic Tools
Tracking Memory with Deleaker

Development Insights and Deep Learning
-
A well-behaved program does not need to change its process class or the thread priorities assigned to it by the operating system.
-
Lockless or lock-free programming is less effective than better algorithm choices. - Fedor Pikus, CppCon 2017
-
A great job system is very similar to an OS-level scheduler.
-
Async bugs are hard to track due to varying timing of execution on the threads.
-
Writing out the release and acquire semantics explicitly helps convey intent more clearly than relying on the default memory order.
-
Thread memory leaks were easier to track because of the exponential increase in memory footprint.
-
There is no easy way to choose between different profilers and debuggers; each has its own use cases.
Post Mortem
What Went Well?
-
Most features that were originally planned were accomplished.
-
Understanding of Low level programming and OS increased.
-
Data Driving helped save time on future iterations.
What Went Wrong?
-
Debugging took longer than expected due to assumptions.
-
Decision on certain features could have been made earlier.
-
Until XML-based data driving was added, iterating on the scene setup was hard.
What I Learned?
-
Go in assuming as little as possible - let the data tell you where to look.
-
Don't write optimized code all the time - but write optimizable code.
-
Use the right tool for the right job.