DX12 Do's And Don'ts

https://developer.nvidia.com/dx12-dos-and-donts

 

Introduction

The DX12 API places more responsibilities on the programmer than any former DirectX™ API. This starts with resource state barriers and continues with the use of fences to synchronize command queues. Likewise, illegal API usage won’t be caught or corrected by the DX-runtime or the driver. To stay on top of things, the developer needs to lean heavily on the debug runtime and pay close attention to any errors that get reported. Also make sure to be thoroughly familiar with the DX12 feature specifications.

Engine Architecture/Structure

Do’s

  • Prefer a task graph architecture for parallel draw submission
    • This way you may achieve sufficient parallelism in terms of draw submission whilst making sure that resource and command queue dependencies get respected

  • Consider a ‘Master Render Thread’ for work submission with a couple of ‘Worker Threads’ for command list recording, resource creation and Pipeline State Object (PSO) compilation
    • The idea is to have the worker threads generate command lists and the master thread pick those up and submit them (see the sketch after this list)
  • Expect to maintain separate render paths for each IHV at minimum
    • The app has to replace driver reasoning about how to most efficiently drive the underlying hardware
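
A minimal sketch of this pattern, assuming a hypothetical RecordSceneChunk() helper and per-thread command lists/allocators created up front (error handling omitted):

```cpp
#include <d3d12.h>
#include <thread>
#include <vector>

// Hypothetical helper: records one chunk of the scene and calls Close() on the list.
void RecordSceneChunk(ID3D12GraphicsCommandList* cl, size_t chunkIndex);

void SubmitFrame(ID3D12CommandQueue* queue,
                 const std::vector<ID3D12GraphicsCommandList*>& workerLists)
{
    // Worker threads record command lists in parallel.
    std::vector<std::thread> workers;
    for (size_t i = 0; i < workerLists.size(); ++i)
        workers.emplace_back([&workerLists, i] { RecordSceneChunk(workerLists[i], i); });
    for (auto& t : workers) t.join();

    // The master thread picks up the recorded lists and submits them in one call.
    std::vector<ID3D12CommandList*> lists(workerLists.begin(), workerLists.end());
    queue->ExecuteCommandLists(static_cast<UINT>(lists.size()), lists.data());
}
```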

Don’ts

  • Don’t rely on the driver to parallelize any Direct3D 12 work in driver threads
    • On DX11 the driver does farm off asynchronous tasks to driver worker threads where possible – this doesn’t happen anymore under DX12
    • While the total cost of work submission in DX12 has been reduced, the amount of work measured on the application’s thread may be larger due to the loss of driver threading. The more efficiently one can use parallel hardware cores of the CPU to submit work in parallel, the more benefit in terms of draw call submission performance can be expected.

Work Submission – Command Lists & Bundles

Do’s

  • Accept the fact that you are responsible for achieving and controlling GPU/CPU parallelism
    • Submitting work to command lists doesn’t start any work on the GPU
    • Only calls to ExecuteCommandLists() actually start work on the GPU
  • Submit work in parallel and evenly across several threads/cores to multiple command lists
    • Recording commands is a CPU intensive operation and no driver threads come to the rescue
    • Command lists are not free threaded so parallel work submission means submitting to multiple command lists
  • Be aware of the fact that there is a cost associated with setup and reset of a command list
    • You still need a reasonable number of command lists for efficient parallel work submission
    • Fences force the splitting of command lists for various reasons (multiple command queues, picking up the results of queries)
  • Try to aim at a reasonable number of command lists, in the range of 15-30 or below, and try to batch those CLs into 5-10 ExecuteCommandLists() calls per frame
  • Reuse fragments recorded in bundles if you can
    • No need to spend CPU time once again
  • Use bundle resource binding inheritance sparingly
    • This allows bundles to be reused with less overhead as it facilitates more thoroughly cooked bundles
  • Check carefully if the use of a separate compute command queue really is advantageous
    • Even for compute tasks that can in theory run in parallel with graphics tasks, the actual scheduling details of the parallel work on the GPU may not generate the results you hope for
    • Be conscious of which asynchronous compute and graphics workloads can be scheduled together – use fences to pair up the right workloads (see the sketch after this list)
    • Make sure to use just one CBV/SRV/UAV descriptor heap as a ring-buffer for all frames if you want to aim at running parallel asynchronous compute and graphics workloads
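
A minimal sketch, assuming pre-created queues, command lists and a fence, of using a fence to pair an asynchronous compute workload with the graphics workload it is meant to overlap (all names are illustrative):

```cpp
#include <d3d12.h>

void SubmitPairedWork(ID3D12CommandQueue* gfxQueue,
                      ID3D12CommandQueue* computeQueue,
                      ID3D12Fence* fence, UINT64& fenceValue,
                      ID3D12CommandList* gfxWork,
                      ID3D12CommandList* computeWork)
{
    // Kick off the graphics work, then signal once it has been submitted.
    gfxQueue->ExecuteCommandLists(1, &gfxWork);
    gfxQueue->Signal(fence, ++fenceValue);

    // The compute queue waits on the GPU (not the CPU) before starting its
    // paired workload, so the right workloads overlap.
    computeQueue->Wait(fence, fenceValue);
    computeQueue->ExecuteCommandLists(1, &computeWork);
}
```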

Don’ts

  • Don’t use bundles to record more than a few draw calls (e.g. ~12 draw calls is fine)
    • Otherwise you typically limit the reusability of the bundle
  • Don’t overlap compute work on the 3D queue with compute work on a dedicated asynchronous compute queue
    • This may lead to bubbles in the asynchronous compute queue
    • Switch compute workload to graphics workloads in this case if possible
  • Don't submit extremely small command lists.
    • Small command lists can sometimes complete faster than the OS scheduler on the CPU can submit new ones. This can result in wasted idle GPU cycles.
    • The OS takes 50-80 microseconds to schedule command lists after the previous ExecuteCommandLists call. If all the command lists in a call execute faster than that, there will be a bubble in the HW queue
    • Check for bubbles using GPUView
  • Don’t record everything or big scene parts in just very few command lists
    • This limits your ability to fully utilize all your CPU cores
    • Also building a few large command lists means you’ll potentially find it harder to keep the GPU from going idle
  • Don’t submit only at the end of frame after you have recorded everything
    • You may waste the opportunity to keep the GPU working in parallel with the recording of other command lists
  • Don’t expect lots of list reuse
    • There are usually many per-frame changes in terms of objects visibility etc.
    • Post-processing may be an exception
  • Don’t create too many threads or too many command lists
    • Too many threads will oversubscribe your CPU resources, whilst too many command lists may accumulate too much overhead

Pipeline State Objects (PSOs)

Do’s

  • Create PSOs on worker threads asynchronously
    • PSO creation is where shader compilation and related stalls happen
  • Start using more general PSOs (with generic shaders that compile quickly) first and generate specializations later
    • Gets you up and running faster even if you are not yet running the most optimal PSO/shader
    • It is your job to generate shader specializations – the driver will not generate constant optimized shader variants behind your back
  • Avoid runtime PSO compilations as they most likely will lead to stalls
    • The driver-managed shader disk cache may come to the rescue though
  • Minimize state changes between PSOs where possible
    • A PSO doesn’t necessarily map to an atomic state change on the GPU
  • Use identical sensible defaults for don’t care fields wherever possible
    • This allows for more possibilities for PSO reuse
  • Use the /all_resources_bound / D3DCOMPILE_ALL_RESOURCES_BOUND compile flag if possible (see the sketch below)
    • This allows the compiler to do a better job at optimizing texture accesses. We have seen frame rate improvements of > 1% when toggling this flag on.
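
A minimal sketch of setting this flag at compile time; the file name and entry point are placeholders:

```cpp
#include <d3dcompiler.h>
#include <wrl/client.h>
#pragma comment(lib, "d3dcompiler.lib")

Microsoft::WRL::ComPtr<ID3DBlob> CompilePixelShader()
{
    Microsoft::WRL::ComPtr<ID3DBlob> bytecode, errors;
    // Promise the compiler that all resources referenced by the shader are
    // bound for the entire duration of shader execution.
    UINT flags = D3DCOMPILE_ALL_RESOURCES_BOUND;
    D3DCompileFromFile(L"shader.hlsl", nullptr, D3D_COMPILE_STANDARD_FILE_INCLUDE,
                       "PSMain", "ps_5_1", flags, 0, &bytecode, &errors);
    return bytecode;
}
```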

 

Don’ts

  • Don’t toggle between compute and graphics on the same command queue more than absolutely necessary
    • This is still a heavyweight switch to make

  • Don’t toggle tessellation on/off more than absolutely necessary
    • Again, this is still a heavyweight switch to make

  • Don’t forget that PSO creation is where shaders get compiled and stalls get introduced
    • It is really important to create PSOs asynchronously and early enough before they get used
    • Tread carefully with thread priorities for PSO compilation threads (see the sketch below)
      • Use Idle priority if there is no ‘hurry’ to prevent slowdowns for game threads
      • Consider temporarily boosting priorities when there is a ‘hurry’
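
A minimal sketch of an idle-priority PSO compilation worker that can be boosted when a PSO is needed urgently; CompileQueuedPSOs() is a hypothetical function that pops PSO descriptions off a queue and calls CreateGraphicsPipelineState():

```cpp
#include <windows.h>
#include <thread>

void CompileQueuedPSOs(); // hypothetical PSO compilation loop

std::thread StartPsoWorker()
{
    return std::thread([] {
        // Idle priority: don't steal cycles from game threads when there is no hurry.
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_IDLE);
        CompileQueuedPSOs();
    });
}

// When a PSO is needed soon, temporarily boost the worker via its native handle:
//   SetThreadPriority(worker.native_handle(), THREAD_PRIORITY_NORMAL);
```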

Root Signatures

Do’s

  • Place constants and CBVs directly into the root signature if possible on NVIDIA hardware (SRVs and UAVs only if you have a lot of draw/dispatch calls that can make use of them – see the sketch after this list)
    • Start with the entries for the pixel stage
      • Constants that sit directly in root can speed up pixel shaders significantly on NVIDIA hardware – specifically consider shader constants that toggle parts of uber-shaders
      • CBVs that sit in the root signature can also speed up pixel shaders significantly on NVIDIA hardware
    • Carry on with decreasing execution frequency of the shader stages
    • Using root signature CBVs does not require a descriptor heap for storing CBV descriptors, versioning entries or extra indirection (=> no need to call CreateConstantBufferView() )
    • Remember root views don’t do bounds checking and have other limitations
  • Cache the current values of root constants, CBVs, SRVs and UAVs in CPU memory and only change the contents of the root signature when a true change is detected
    • We have seen significant speedups through managing changes properly
  • Limit the shader visibility of CBVs, SRVs and UAVs to only the necessary stages
    • There is overhead in the driver and on the GPU for each stage that needs to see those views
    • Use the DENY_*_ACCESS flags to explicitly limit resource-shader visibility
  • Minimize the number of Root Signature changes
    • The problem is not the change of the RS itself; there is usually a follow-up cost of initializing the root signature entries after such a change
  • Gracefully handle CBV, UAV, SRV and Sampler descriptors on Tier 1 and CBV and UAV descriptors on Tier 2 hardware
    • For these Tiers, the application must fill in all descriptors defined in the root signature (and descriptor tables used) by the time the command list executes. This is even the case if the used shaders may not reference all these descriptors.
    • For Tier 3 do keep your unused descriptors bound – don’t waste time unbinding them as this can easily introduce state thrashing bottlenecks
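
A minimal sketch, assuming the d3dx12.h helper header from the D3D12 samples, of a root signature that places constants and a CBV directly in the root, limits their visibility to the pixel stage, and uses DENY_*_ACCESS flags for the other stages:

```cpp
#include "d3dx12.h" // helper header shipped with the D3D12 samples

CD3DX12_ROOT_PARAMETER params[2];
// Root constants, e.g. uber-shader toggles (4 x 32-bit values, register b0).
params[0].InitAsConstants(4, 0, 0, D3D12_SHADER_VISIBILITY_PIXEL);
// Root CBV (register b1): no descriptor heap entry or CreateConstantBufferView() needed.
params[1].InitAsConstantBufferView(1, 0, D3D12_SHADER_VISIBILITY_PIXEL);

CD3DX12_ROOT_SIGNATURE_DESC desc(2, params, 0, nullptr,
    D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT |
    D3D12_ROOT_SIGNATURE_FLAG_DENY_VERTEX_SHADER_ROOT_ACCESS |
    D3D12_ROOT_SIGNATURE_FLAG_DENY_HULL_SHADER_ROOT_ACCESS |
    D3D12_ROOT_SIGNATURE_FLAG_DENY_DOMAIN_SHADER_ROOT_ACCESS |
    D3D12_ROOT_SIGNATURE_FLAG_DENY_GEOMETRY_SHADER_ROOT_ACCESS);
// Serialize with D3D12SerializeRootSignature() and create the root signature from the blob.
```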


Don’ts

  • Don’t group CBVs into CBV descriptor tables that have a different update frequency
    • Ideally all CBVs in a table would need updating at the same time
  • Don’t bloat your root signature and descriptor tables to be able to reuse them
    • Try to aim at using a minimum set of entries for each set of materials
  • Don't simultaneously set visible and deny flags for the same shader stages on root table entries
    • For current drivers the deny flags only work when D3D12_SHADER_VISIBILITY_ALL is set
  • Don’t place constants, SRVs and UAVs directly into the root signature unless you have a lot of draw/dispatch calls that can make use of them
  • Don’t leave resource bindings undefined after a change of Root Signature
    • A change in root signatures removes/clears all resource binding used in the previous root signature

Allocators and Lists

Do's

  • Reuse allocators for similarly sized sequences of draw calls
    • Allocations are fast when the list has been pre-warmed

  • Use 2*T + N allocators minimum
    • 2* - one set of lists/allocators from last frame is still being consumed by the GPU and the second set is being built/used in the current frame
    • T = #threads creating command lists – please note that allocators are not free threaded!
    • N = extra pool for bundles
  • Call Allocator::Reset before reusing it in another frame (see the sketch after this list)
    • Otherwise the allocator will keep on growing until you run out of memory
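
A minimal sketch of the 2*T + N scheme above: two per-frame sets of one allocator per recording thread, plus a pool for bundles, with Reset called once the GPU is done with a frame:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
#include <vector>
using Microsoft::WRL::ComPtr;

struct AllocatorPool
{
    std::vector<ComPtr<ID3D12CommandAllocator>> frameAllocators[2]; // 2 * T
    std::vector<ComPtr<ID3D12CommandAllocator>> bundleAllocators;   // + N

    void Create(ID3D12Device* device, UINT numThreads, UINT numBundleAllocators)
    {
        for (auto& set : frameAllocators)
            for (UINT t = 0; t < numThreads; ++t) {
                ComPtr<ID3D12CommandAllocator> a;
                device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                               IID_PPV_ARGS(&a));
                set.push_back(a);
            }
        for (UINT b = 0; b < numBundleAllocators; ++b) {
            ComPtr<ID3D12CommandAllocator> a;
            device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_BUNDLE,
                                           IID_PPV_ARGS(&a));
            bundleAllocators.push_back(a);
        }
    }

    // Call only after a fence confirms the GPU has finished the frame that
    // last used this set; resetting earlier is illegal, never resetting leaks.
    void ResetFrameSet(UINT frameIndex)
    {
        for (auto& a : frameAllocators[frameIndex % 2])
            a->Reset();
    }
};
```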

 

Don’ts

  • Don’t forget that allocators and lists consume GPU memory
    • Overly large allocators may limit your GPU working set in other undesirable ways
  • Don’t create/destroy allocators but reuse allocators
    • Save the overhead for allocator creation/destruction

  • Don’t reuse allocators for differently sized sequences of draw calls
    • This leads to worst-case-sized allocators

  • Don't forget to reset the corresponding allocator when resetting a set of command lists
    • Not resetting an allocator means leaking memory!
  • Don’t free/reuse an Allocator still in use by active command lists
    • This is illegal and may free or overwrite memory that the command list is still using


Resources

Do's

  • Avoid vidmem overcommitment
    • Use IDXGIAdapter3::QueryVideoMemoryInfo() to gain accurate information about the available video memory (see the sketch after this list)
    • Foreground app isn’t necessarily allocated all, or even a high %, of vidmem
    • Respond to budget changes from OS
      • Consider using IDXGIAdapter3::RegisterVideoMemoryBudgetChangeNotificationEvent
      • Consider capping graphics settings based on memory available
    • Create overflow heaps in sysmem and move resources over from vidmem heaps
    • DX12 gives the app a memory management advantage over the DX11 driver here
      • Break up command lists so that the amount of memory referenced in each one fits in vidmem.
      • Keep track of what's used per CL
      • Consider using MakeResident/Evict before/after executing command lists when you are going over the vidmem budget
  • Use committed resources where possible to give the driver more knowledge
    • This allows the driver to better manage GPU memory
    • A good use case for placed resources are resource heaps that are e.g. used during streaming and hold different sets of read-only textures over their lifetime
  • Batch up MakeResident calls (expect a CPU and GPU cost for page table updates)
    • This lowers the overhead inside the driver and the GPU
  • Work to a given memory budget using MakeResident/MakeUnresident
    • Do drop mip levels of tiled resources as needed
    • Need to handle the case when MakeResident fails
  • Be aware of the fact that certain resource types have different alignment rules within a heap
  • Make sure to devise ways to deal with varying resource binding Tiers within a device feature level
    • UAV count across all stages may be limited to 8 or 64
    • CBV count may be limited to 14 per stage
    • Sampler count may be limited to 16 per stage
  • Be aware of the aliasing rules for heaps
    • See tiled resource specification for a good roll-up
  • Be aware of the fact that there are different heap types for resources, SRVs, DSVs etc.
    • On some heap tiers there may be more restrictions than on others
    • Check resource heap tier capabilities
  • Do fill D3D12_TEXTURE_COPY_LOCATION with care when using CopyTextureRegion() when copying depth stencil textures
    • Copying only the depth part of the resource may hit a slow path
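
A minimal sketch of querying the local (vidmem) budget so the app can react before overcommitting:

```cpp
#include <dxgi1_4.h>

// Returns true when current usage approaches the OS-assigned vidmem budget;
// the 90% threshold is an illustrative choice, not a recommendation.
bool NearVidmemBudget(IDXGIAdapter3* adapter)
{
    DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
    adapter->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info);
    // When this trips: evict resources, drop mips of tiled resources,
    // or cap graphics settings as described above.
    return info.CurrentUsage > (info.Budget * 9) / 10;
}
```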

Don'ts

  • Don’t go overboard with your re-use count for placed resources for depth stencil and render target resources
    • On top of the need to clear those resources before they can be rendered to, there may be other hardware dependent book-keeping operations that make those switches expensive
  • Don’t rely on the availability of tiled resources (check cap bits)
    • Still need to think about different DX12 hardware classes
  • Don’t rely on being able to allocate all GPU memory in one go
    • Depending on the underlying GPU architecture the memory may or may not be segmented
  • Don’t expect an immediate cost for a MakeUnresident call
    • Cost might be deferred until another MakeResident call utilizes the memory
      • Use GPUView analysis to find out about deferred paging requests
  • Don’t destroy and create resources if it can be avoided
    • Better to use MakeUnresident and MakeResident where possible
    • Saves the overhead of creation and destruction of resources

Barriers, Fences & Hazards

Do's

  • Minimize the use of barriers and fences
    • We have seen redundant barriers and associated wait for idle operations as a major performance problem for DX11 to DX12 ports
      • The DX11 driver is doing a great job of reducing barriers – now under DX12 you need to do it
    • Any barrier or fence can limit parallelism
  • Make sure to always use the minimum set of resource usage flags
    • Stay away from using D3D12_RESOURCE_STATE_GENERIC_READ unless you really need every single flag that is set in this combination of flags
    • Redundant flags may trigger redundant flushes and stalls and slow down your game unnecessarily
    • To reiterate: We have seen redundant and/or overly conservative barrier flags and their associated wait for idle operations as a major performance problem for DX11 to DX12 ports.
  • Specify the minimum set of targets in ID3D12GraphicsCommandList::ResourceBarrier
    • Adding false dependencies adds redundancy
  • Group barriers in one call to ID3D12GraphicsCommandList::ResourceBarrier
    • This way the worst case can be picked instead of sequentially going through all barriers
  • Use split barriers when possible (see the sketch after this list)
    • Use the _BEGIN_ONLY/_END_ONLY flags
    • This helps the driver do a more efficient job
  • Do use fences to signal events/advance across calls to ExecuteCommandLists
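
A minimal sketch of both recommendations: two transitions grouped into one ResourceBarrier call, with one of them issued as a split barrier ('commandList', 'texA' and 'texB' are placeholders):

```cpp
#include <d3d12.h>

void TransitionExample(ID3D12GraphicsCommandList* commandList,
                       ID3D12Resource* texA, ID3D12Resource* texB)
{
    D3D12_RESOURCE_BARRIER barriers[2] = {};
    barriers[0].Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barriers[0].Transition.pResource   = texA;
    barriers[0].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barriers[0].Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barriers[0].Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    // Begin-only half of a split barrier: texB's transition may overlap
    // unrelated work until the matching END_ONLY barrier below.
    barriers[1] = barriers[0];
    barriers[1].Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
    barriers[1].Transition.pResource = texB;

    commandList->ResourceBarrier(2, barriers); // one grouped call

    // ... work that does not touch texB ...

    barriers[1].Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
    commandList->ResourceBarrier(1, &barriers[1]);
}
```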

Don'ts

  • Don’t insert redundant barriers
    • This limits parallelism
    • A transition from D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE to D3D12_RESOURCE_STATE_RENDER_TARGET and back without any draw calls in-between is redundant
    • Avoid read-to-read barriers
      • Get the resource in the right state for all subsequent reads
  • Don’t use D3D12_RESOURCE_STATE_GENERIC_READ without good reason.
    • For transitions from write-to-read states, ensure the transition target is inclusive of all required read states needed before the next transition to write. This is done from the API by combining read state flags, and is preferred over transitioning from read-to-read in subsequent ResourceBarrier calls.
  • Don’t sequentially call ID3D12GraphicsCommandList::ResourceBarrier with just one barrier
    • This doesn’t allow the driver to pick the worst case of a set of barriers
  • Don’t expect fences to trigger signals/advance at a finer granularity than once per ExecuteCommandLists call.

Multi GPU

Do's

  • Use the DX12 standard checks to find out how many GPUs are in your system
    • No need to use vendor specific APIs anymore
    • Make sure to check the CROSS_NODE_SHARING tier (see the sketch after this list)

  • Take full control over which surface syncs need to happen and which don’t
    • Make full use of the explicit control over resources
    • Create resources that need to be synchronized on each node
      • Use the proper CreationNodeMask
      • Make them visible on other nodes that need access
    • Copy them to the current node when needed

  • Minimize the number of necessary syncs
  • If the device supports tier 2 cross node sharing
    • Always compare performance to a tier 1 type implementation
  • Use designated copy queues to do cross node copy operations
    • Keep the main queue open to do rendering work in parallel
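
A minimal sketch of the standard checks: the node (GPU) count and the cross-node sharing tier ('device' is an already-created ID3D12Device*):

```cpp
#include <d3d12.h>

void CheckMultiGpuCaps(ID3D12Device* device)
{
    UINT nodeCount = device->GetNodeCount(); // number of GPUs behind this device

    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                &options, sizeof(options));
    // NOT_SUPPORTED, TIER_1_EMULATED, TIER_1 or TIER_2 determine which
    // resources can be shared and copied across nodes.
    D3D12_CROSS_NODE_SHARING_TIER tier = options.CrossNodeSharingTier;
    (void)nodeCount; (void)tier; // feed these into your MGPU path selection
}
```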


Don'ts

  • Don’t try to benefit from implicit MGPU scaling
  • Don’t rely on any surface syncs to be done automatically (implicitly behind your back)
    • You should take full control over what syncs happen if you need them


Swap Chains

Do's

  • Do use flip mode swap-chains
  • Do use SetFullScreenState(TRUE) along with a (borderless) fullscreen window and a non-windowed flip model swap-chain to switch to true immediate independent flip mode
    • This is at the moment, according to Microsoft, the only mode you can get unleashed frame rates with tearing out of D3D12 when calling Present(0,0)
    • Any other mode doesn’t allow unlimited frame rates with tearing
  • Do use the DXGI_SWAP_CHAIN_FLAG_ALLOW_MODE_SWITCH flag consciously
    • The flag is not necessary to achieve unlimited frame rates (see above) if your window size matches the current screen resolution
    • If this flag is set, trying to change resolution using ResizeTarget() before calling SetFullScreenState(TRUE) works fine and you’ll achieve uncapped FPS
    • If this flag is not set, trying to change resolution using ResizeTarget() before calling SetFullScreenState(TRUE) results in no change of display resolution. Your target will get stretched to the current resolution and FPS won’t be uncapped.
  • If not in fullscreen state (true immediate independent flip mode) do control your latency and buffer count in your swap-chain carefully for the desired FPS and latency
      • Use IDXGISwapChain2::SetMaximumFrameLatency(MaxLatency) to set the desired latency
        • For this to work you need to create your swap-chain with the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag set.
      • A sync interval of 0 indicates that "the buffer I am presenting now is the newest buffer available next time composition happens" and discards all previous presents. However, the present does not go through until composition happens, which currently is only at VSync.
      • DXGI will start to block in Present() after you have presented MaxLatency-1 times
        • At the default latency of 3 this means that your FPS can’t go higher than 2 * RefreshRate. So for a 60Hz monitor the FPS can’t go above 120 FPS.
      • Try using about 1-2 more swap-chain buffers than you are intending to queue frames (in terms of command allocators and dynamic data and the associated frame fences) and set the "max frame latency" to this number of swap-chain buffers.
  • If not in fullscreen state (true immediate independent flip mode) consider using a waitable object swap-chain along with WaitForSingleObjectEx() to generate higher FPS (see the sketch after this list)
    • Please note that this will lead to some frames never being even partially visible, but it may be a good solution for benchmarking
    • Using the waitable object swap-chain and GetFrameLatencyWaitableObject(), one can test if a buffer is available before rendering to it or presenting it – the following options are available:
    1. Use an additional off-screen surface
      • Render to the off-screen surface. Test the waitable object with timeout 0 to check if a buffer is available. If so copy to the swap-chain back buffer and Present(). If no buffer is available start the frame over again.
      • At the beginning of the frame, test the waitable object. If it succeeds, render to the available swapchain buffer. If it fails, render to the offscreen surface.
    2. Use a 3 or 4 buffer swapchain
      • Render directly to a back buffer. Before calling Present(), test the waitable object. If it succeeds, call Present(), if not, start over.
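
A minimal sketch of testing the waitable object with a timeout of 0 before presenting ('swapChain' is an IDXGISwapChain2 created with the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag):

```cpp
#include <windows.h>
#include <dxgi1_3.h>

void PresentIfBufferAvailable(IDXGISwapChain2* swapChain, HANDLE waitable)
{
    // 'waitable' comes from swapChain->GetFrameLatencyWaitableObject(), after
    // setting the desired latency via swapChain->SetMaximumFrameLatency(n).
    if (WaitForSingleObjectEx(waitable, 0, FALSE) == WAIT_OBJECT_0) {
        // A back buffer is available: render to it and present.
        // ... record and execute command lists ...
        swapChain->Present(0, 0);
    } else {
        // No buffer available: start the frame over, or render to an
        // off-screen surface as in option 1 above.
    }
}
```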

Don'ts

  • Don’t forget that there's a per swap-chain limit of 3 queued frames before DXGI will start to block in Present().
    • Set the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag on swapchain creation and use IDXGISwapChain2::SetMaximumFrameLatency to modify this default value
  • Don’t forget to call ResizeBuffers() after you have switched to true immediate independent flip mode using SetFullScreenState(TRUE).

SetStablePowerState

Don’t ever call SetStablePowerState(TRUE) from game engine code.

Do consider carefully whether or not you need highly stable results at the expense of lower performance. See the discussion in our blog.

If and only if you want its stable results, do call SetStablePowerState from a separate, standalone application, as sketched below.
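
A minimal sketch of such a standalone tool (note that on recent Windows 10 builds SetStablePowerState may require developer mode to be enabled):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdio>
#pragma comment(lib, "d3d12.lib")

int main()
{
    Microsoft::WRL::ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));
    device->SetStablePowerState(TRUE); // clocks stay stable while the device lives
    printf("Stable power state enabled; press Enter to restore.\n");
    getchar(); // the state reverts when the device is destroyed
    return 0;
}
```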

To avoid confusion, do make it crystal clear when the function is in effect or not. (One way to make it obvious is to record clocks along with performance results. We often do that. Our blog has a code snippet showing how to query GPU clocks on NVIDIA.)

Do use the DX12 API and our standalone program to stabilize the clocks when testing other APIs.

We have a separate blog post with more discussion: SetStablePowerState.exe: Disabling GPU Boost on Windows 10 for more deterministic timestamp queries on NVIDIA GPUs

DirectX12 Hardware Features and other Maxwell Features

Do's

  • Use hardware conservative raster for full-speed conservative rasterization
  • Make use of NvAPI (when available) to access other Maxwell features
    • Advanced Rasterization features
      • Bounding box rasterization mode for quad based geometry
      • New MSAA features like post depth coverage mask and overriding the coverage mask for routing of data to sub-samples
      • Programmable MSAA sample locations
    • Fast Geometry Shader features
      • Render to cube maps in one geometry pass without geometry amplification
      • Render to multiple viewports without geometry amplification
      • Use the fast pass-through geometry shader for techniques that need per-triangle data in the pixel shader
    • New interlocked operations
    • Enhanced blending ops
    • New texture filtering ops

Don’ts

  • Don’t use Raster Order View (ROV) techniques pervasively
    • Guaranteeing order doesn’t come for free
    • Always compare with alternative approaches like advanced blending ops and atomics

NVIDIA DirectX12 Hardware Features table
