Performance Optimizations (Direct3D 9)



General Performance Tips


1. Clear only when you must. 
2. Minimize state changes and group the remaining state changes. 
3. Use smaller textures, if you can do so. 
4. Draw objects in your scene from front to back. 
5. Use triangle strips instead of lists and fans. For optimal vertex cache performance, arrange strips to reuse triangle vertices sooner, rather than later. 
6. Gracefully degrade special effects that require a disproportionate share of system resources. 
7. Constantly test your application's performance. 
8. Minimize vertex buffer switches. 
9. Use static vertex buffers where possible. 
10. Use one large static vertex buffer per FVF for static objects, rather than one per object. 
11. If your application needs random access into the vertex buffer in AGP memory, choose a vertex format size that is a multiple of 32 bytes. Otherwise, select the smallest appropriate format. 
12. Draw using indexed primitives. This can allow for more efficient vertex caching within hardware. 
13. If the depth buffer format contains a stencil channel, always clear the depth and stencil channels at the same time. 
14. Combine the shader instruction and the data output where possible. For example: 
// Rather than doing a multiply and add, and then output the data with 
//   two instructions:
mad r2, r1, v0, c0
mov oD0, r2
// Combine both in a single instruction, because this eliminates an  
//   additional register copy.
mad oD0, r1, v0, c0 
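Tip 11 can be checked at compile time. The sketch below is only an illustration: the `Vertex32` layout and the `IsMultipleOf32` helper are hypothetical names invented for this example, and the layout (position, normal, one texture coordinate set) is just one way to reach 32 bytes.

```cpp
#include <cstddef>

// Hypothetical vertex layout: 3 + 3 + 2 floats = 8 floats = 32 bytes.
struct Vertex32 {
    float x, y, z;     // position
    float nx, ny, nz;  // normal
    float u, v;        // one set of texture coordinates
};

// Returns true when a vertex stride is a multiple of 32 bytes.
constexpr bool IsMultipleOf32(std::size_t stride) {
    return stride % 32 == 0;
}

// Fail the build if the layout drifts away from a 32-byte multiple.
static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");
static_assert(IsMultipleOf32(sizeof(Vertex32)), "Vertex32 should be 32 bytes");
```

If a field is added later, the `static_assert` fails and the layout can be padded or trimmed deliberately instead of by accident.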


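General tips (notes)
1. Clear only when necessary.
2. Keep state changes to a minimum, and group the ones that remain so they are set together.
3. Keep textures as small as possible.
4. Render the objects in the scene from front to back, so that hidden objects and pixels are rejected as early as possible.
5. Use triangle strips instead of triangle lists and fans. To make better use of the vertex cache, arrange strips so that vertices are reused as soon as possible.
6. Scale back special effects step by step according to the system resources they consume.
7. Measure the application's performance regularly; this makes it easier to spot the change that caused a sudden drop.
8. Minimize vertex buffer switches.
9. Use static vertex buffers wherever possible.
10. For static objects, use one large static vertex buffer per FVF that holds the vertex data of many objects, rather than one vertex buffer per object. This also reduces vertex buffer switches.
11. If the application needs random access to a vertex buffer in AGP memory, make the vertex size a multiple of 32 bytes; otherwise choose the smallest suitable format. 32 bytes is 8 floats, or two 4-component float vectors.
12. Render with indexed primitives, which makes better use of the vertex cache.
13. If the depth buffer format includes a stencil channel, always clear both together.
14. Merge the shader instruction that computes a result with the one that outputs it, as in the mad example above.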




Databases and Culling
Building a reliable database of the objects in your world is key to excellent performance in Direct3D. It is more important than improvements to rasterization or hardware.
You should maintain the lowest polygon count you can possibly manage. Design for a low polygon count by building low-polygon models from the start. Add polygons if you can do so without sacrificing performance later in the development process. Remember, the fastest polygons are the ones you don't draw.


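When building the data for scene objects, start with the lowest-detail models and move to higher-detail models step by step only while performance holds. Keep a close eye on the total number of triangles rendered.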




Batching Primitives
To get the best rendering performance during execution, try to work with primitives in batches and keep the number of render-state changes as low as possible. For example, if you have an object with two textures, group the triangles that use the first texture and follow them with the necessary render state to change the texture. Then group all the triangles that use the second texture. The simplest hardware support for Direct3D is called with batches of render states and batches of primitives through the hardware abstraction layer (HAL). The more effectively the instructions are batched, the fewer HAL calls are performed during execution.


Render in batches; for example, draw objects that share a texture together.
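The grouping described above can be sketched as a sort on texture before submission. This is a minimal illustration, not Direct3D code: `DrawItem`, `CountTextureChanges`, and `SortByTexture` are hypothetical names, and texture ids are assumed to be non-negative integers.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical draw record: which texture a group of triangles uses.
struct DrawItem {
    int textureId;
    int firstTriangle;
    int triangleCount;
};

// Counts how many times the bound texture would change if the items
// were submitted in the given order (the first bind counts as one).
std::size_t CountTextureChanges(const std::vector<DrawItem>& items) {
    std::size_t changes = 0;
    int bound = -1;  // assumes real texture ids are non-negative
    for (const DrawItem& item : items) {
        if (item.textureId != bound) {
            bound = item.textureId;
            ++changes;
        }
    }
    return changes;
}

// Reorders items so that all triangles sharing a texture are adjacent.
// stable_sort keeps the original order within each texture group.
void SortByTexture(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.textureId < b.textureId;
                     });
}
```

With four items that alternate between two textures, the unsorted order needs four texture binds and the sorted order needs two.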


Lighting Tips
Because lights add a per-vertex cost to each rendered frame, you can improve performance significantly by being careful about how you use them in your application. Most of the following tips derive from the maxim, "the fastest code is code that is never called."


1. Use as few light sources as possible. To increase the overall lighting level, for example, use the ambient light instead of adding a new light source. 
2. Directional lights are more efficient than point lights or spotlights. For directional lights, the direction to the light is fixed and doesn't need to be calculated on a per-vertex basis. 
3. Spotlights can be more efficient than point lights, because the area outside the cone of light is calculated quickly. Whether spotlights are more efficient or not depends on how much of your scene is lit by the spotlight. 
4. Use the range parameter to limit your lights to only the parts of the scene you need to illuminate. All the light types exit fairly early when they are out of range. 
5. Specular highlights almost double the cost of a light. Use them only when you must. Set the D3DRS_SPECULARENABLE render state to 0, the default value, whenever possible. When defining materials, you must set the specular power value to zero to turn off specular highlights for that material; just setting the specular color to 0,0,0 is not enough. 


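Lights add a per-vertex cost to every rendered frame, so using them carefully can improve performance significantly. Remember: "the fastest code is code that is never called."
1. Use as few light sources as possible. To raise the overall lighting level, for example, use ambient light instead of adding a new light source.
2. Directional lights are the most efficient.
3. Spotlights can be more efficient than point lights.
4. Use the range parameter so that only the parts of the scene within a light's range take part in the lighting calculation.
5. Specular highlights almost double the cost of a light. Setting D3DRS_SPECULARENABLE to 0 turns them off. Setting only the material's specular color to 0 does not avoid the cost of the specular calculation.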




Texture Size
Texture-mapping performance is heavily dependent on the speed of memory. There are a number of ways to maximize the cache performance of your application's textures.
1. Keep the textures small. The smaller the textures are, the better chance they have of being maintained in the main CPU's secondary cache. 
2. Do not change the textures on a per-primitive basis. Try to keep polygons grouped in order of the textures they use. 
3. Use square textures whenever possible. Textures whose dimensions are 256x256 are the fastest. If your application uses four 128x128 textures, for example, try to ensure that they use the same palette and place them all into one 256x256 texture. This technique also reduces the amount of texture swapping. Of course, you should not use 256x256 textures unless your application requires that much texturing because, as mentioned, textures should be kept as small as possible. 
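Texture-mapping performance depends heavily on memory speed. There are several ways to maximize the cache performance of an application's textures.
1. Keep textures small. The smaller a texture is, the more likely it is to stay in the CPU's secondary cache.
2. As far as possible, render primitives grouped by texture.
3. Use square textures where possible. Give small textures the same palette and merge them into one larger texture; this reduces texture switches.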
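Combining four 128x128 textures into one 256x256 texture means the texture coordinates of each mesh have to be remapped into the quarter it now occupies. The following is a minimal sketch of that remapping; `UV` and `TileToAtlas` are hypothetical names, and the row-by-row tile numbering is an assumption of this example. It does not address filtering across tile borders or wrap addressing, which a real atlas has to handle.

```cpp
struct UV {
    float u, v;
};

// Maps a coordinate in tile space ([0,1] x [0,1]) to atlas space for a
// 2x2 grid of equally sized tiles. Tiles are numbered row by row:
// 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
UV TileToAtlas(int tile, UV in) {
    const int col = tile % 2;
    const int row = tile / 2;
    UV out;
    out.u = (static_cast<float>(col) + in.u) * 0.5f;  // each tile spans half the width
    out.v = (static_cast<float>(row) + in.v) * 0.5f;  // and half the height
    return out;
}
```

For example, the center of tile 3 (0.5, 0.5) lands at (0.75, 0.75) in the atlas.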


Matrix Transforms
Direct3D uses the world and view matrices that you set to configure several internal data structures. Each time you set a new world or view matrix, the system recalculates the associated internal structures. Setting these matrices frequently - for example, thousands of times per frame - is computationally time-consuming. You can minimize the number of required calculations by concatenating your world and view matrices into a world-view matrix that you set as the world matrix, and then setting the view matrix to the identity. Keep cached copies of individual world and view matrices so that you can modify, concatenate, and reset the world matrix as needed. For clarity in this documentation, Direct3D samples rarely employ this optimization.
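The concatenation works because transforming a vertex by the world matrix and then the view matrix gives the same result as transforming it once by their product. The sketch below demonstrates this with a small stand-alone matrix type using Direct3D's row-vector convention. `Mat4`, `Vec4`, and the helper functions are hypothetical names for this example; in an application the product would be passed to IDirect3DDevice9::SetTransform with D3DTS_WORLD and the identity with D3DTS_VIEW.

```cpp
struct Mat4 {
    float m[4][4];
};

struct Vec4 {
    float v[4];
};

Mat4 Identity() {
    Mat4 r = {};
    for (int i = 0; i < 4; ++i) r.m[i][i] = 1.0f;
    return r;
}

// Row-vector translation matrix: the offsets live in the fourth row.
Mat4 Translation(float x, float y, float z) {
    Mat4 r = Identity();
    r.m[3][0] = x;
    r.m[3][1] = y;
    r.m[3][2] = z;
    return r;
}

// Returns a * b, so that (p * a) * b == p * Multiply(a, b).
Mat4 Multiply(const Mat4& a, const Mat4& b) {
    Mat4 r = {};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Transforms a row vector: result = p * m.
Vec4 Transform(const Vec4& p, const Mat4& m) {
    Vec4 r = {};
    for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k)
            r.v[j] += p.v[k] * m.m[k][j];
    return r;
}
```

The cached world and view matrices stay on the application side; only their product is handed to the device whenever either one changes.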


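Every time the world, view, or projection matrix is set, the internal data structures are recalculated, so setting them frequently is expensive. You can reduce the work by pre-multiplying the view and projection matrices, and you should avoid resetting the world matrix when it has not changed. (In practice, separating framemove from render already achieves this optimization.)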


Using Dynamic Textures
To find out if the driver supports dynamic textures, check the D3DCAPS2_DYNAMICTEXTURES flag of the D3DCAPS9 structure.
Keep the following things in mind when working with dynamic textures.
1. They cannot be managed. For example, their pool cannot be D3DPOOL_MANAGED. 
2. Dynamic textures can be locked, even if they are created in D3DPOOL_DEFAULT. 
3. D3DLOCK_DISCARD is a valid lock flag for dynamic textures. 
It is a good idea to create only one dynamic texture per format and possibly per size. Dynamic mipmaps, cubes, and volumes are not recommended because of the additional overhead in locking every level. For mipmaps, D3DLOCK_DISCARD is allowed only on the top level. All levels are discarded by locking just the top level. This behavior is the same for volumes and cubes. For cubes, the top level and face 0 are locked.
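The following sketch shows an example of using a dynamic texture. The overwrite step is left as a placeholder, and nPrimitiveCount is a hypothetical name for the number of primitives to draw.
void DrawProceduralTexture(IDirect3DDevice9* pDev, IDirect3DTexture9* pTex)
{
    // pTex should not be very small because the overhead of
    //   calling the driver on every D3DLOCK_DISCARD will not
    //   justify the performance gain. Experimentation is encouraged.
    D3DLOCKED_RECT lockedRect;
    if( FAILED( pTex->LockRect( 0, &lockedRect, NULL, D3DLOCK_DISCARD ) ) )
        return;
    // <Overwrite the *entire* texture through lockedRect.pBits,
    //   advancing by lockedRect.Pitch bytes per row>
    pTex->UnlockRect( 0 );
    pDev->SetTexture( 0, pTex );
    pDev->DrawPrimitive( D3DPT_TRIANGLELIST, 0, nPrimitiveCount );
}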


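Keep the following in mind when working with dynamic textures.
1. They cannot be managed; that is, their pool cannot be D3DPOOL_MANAGED.
2. Dynamic textures can be locked even if they are created in D3DPOOL_DEFAULT.
3. D3DLOCK_DISCARD is a valid lock flag for dynamic textures.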


Using Dynamic Vertex and Index Buffers
Locking a static vertex buffer while the graphics processor is using the buffer can have a significant performance penalty. The lock call must wait until the graphics processor is finished reading vertex or index data from the buffer before it can return to the calling application, a significant delay. Locking and then rendering from a static buffer several times per frame also prevents the graphics processor from buffering rendering commands, since it must finish commands before returning the lock pointer. Without buffered commands, the graphics processor remains idle until after the application is finished filling the vertex buffer or index buffer and issues a rendering command.


Ideally the vertex or index data would never change, but this is not always possible. There are many situations where the application needs to change vertex or index data every frame, perhaps even multiple times per frame. For these situations, the vertex or index buffer should be created with D3DUSAGE_DYNAMIC. This usage flag causes Direct3D to optimize for frequent lock operations. D3DUSAGE_DYNAMIC is only useful when the buffer is locked frequently; data that remains constant should be placed in a static vertex or index buffer.


To receive a performance improvement when using dynamic vertex buffers, the application must call IDirect3DVertexBuffer9::Lock or IDirect3DIndexBuffer9::Lock with the appropriate flags. D3DLOCK_DISCARD indicates that the application does not need to keep the old vertex or index data in the buffer. If the graphics processor is still using the buffer when Lock is called with D3DLOCK_DISCARD, a pointer to a new region of memory is returned instead of the old buffer data. This allows the graphics processor to continue using the old data while the application places data in the new buffer. No additional memory management is required in the application; the old buffer is reused or destroyed automatically when the graphics processor is finished with it. Note that locking a buffer with D3DLOCK_DISCARD always discards the entire buffer; specifying a nonzero offset or a limited size does not preserve information in unlocked areas of the buffer.


There are cases where the amount of data the application needs to store per lock is small, such as adding four vertices to render a sprite. D3DLOCK_NOOVERWRITE indicates that the application will not overwrite data already in use in the dynamic buffer. The lock call will return a pointer to the old data, allowing the application to add new data in unused regions of the vertex or index buffer. The application should not modify vertices or indices used in a draw operation as they might still be in use by the graphics processor. The application should then use D3DLOCK_DISCARD after the dynamic buffer is full to receive a new region of memory, discarding the old vertex or index data after the graphics processor is finished.


The asynchronous query mechanism is useful to determine if vertices are still in use by the graphics processor. Issue a query of type D3DQUERYTYPE_EVENT after the last DrawPrimitive call that uses the vertices. The vertices are no longer in use when IDirect3DQuery9::GetData returns S_OK. Locking a buffer with D3DLOCK_DISCARD or with no flags always guarantees that the vertices are synchronized properly with the graphics processor; however, calling Lock without flags incurs the performance penalty described earlier. Other API calls such as IDirect3DDevice9::BeginScene, IDirect3DDevice9::EndScene, and IDirect3DDevice9::Present do not guarantee the graphics processor is finished using vertices.


Below are ways to use dynamic buffers and the proper lock flags.




    // USAGE STYLE 1
    // Discard the entire vertex buffer and refill with thousands of vertices.
    // Might contain multiple objects and/or require multiple DrawPrimitive 
    //   calls separated by state changes, etc.
 
    // Determine the size of data to be moved into the vertex buffer.
    UINT nSizeOfData = nNumberOfVertices * m_nVertexStride;
 
    // Discard and refill the used portion of the vertex buffer.
    CONST DWORD dwLockFlags = D3DLOCK_DISCARD;
    
    // Lock the vertex buffer.
    BYTE* pBytes;
    // IDirect3DVertexBuffer9::Lock takes a VOID**, so the BYTE* needs a cast.
    if( FAILED( m_pVertexBuffer->Lock( 0, 0, (VOID**)&pBytes, dwLockFlags ) ) )
        return false;
    
    // Copy the vertices into the vertex buffer.
    memcpy( pBytes, pVertices, nSizeOfData );
    m_pVertexBuffer->Unlock();
 
    // Render the primitives.
    m_pDevice->DrawPrimitive( D3DPT_TRIANGLELIST, 0, nNumberOfVertices/3 );




    // USAGE STYLE 2
    // Reusing one vertex buffer for multiple objects
 
    // Determine the size of data to be moved into the vertex buffer.
    UINT nSizeOfData = nNumberOfVertices * m_nVertexStride;
 
    // No overwrite will be used if the vertices can fit into 
    //   the space remaining in the vertex buffer.
    DWORD dwLockFlags = D3DLOCK_NOOVERWRITE;
    
    // Check to see if the entire vertex buffer has been used up yet.
    //   (Written as an addition so that unsigned sizes cannot underflow.)
    if( m_nNextVertexData + nSizeOfData > m_nSizeOfVB )
    {
        // No space remains. Start over from the beginning 
        //   of the vertex buffer.
        dwLockFlags = D3DLOCK_DISCARD;
        m_nNextVertexData = 0;
    }
    
    // Lock the vertex buffer.
    BYTE* pBytes;
    // IDirect3DVertexBuffer9::Lock takes a VOID**, so the BYTE* needs a cast.
    if( FAILED( m_pVertexBuffer->Lock( (UINT)m_nNextVertexData, nSizeOfData, 
               (VOID**)&pBytes, dwLockFlags ) ) )
        return false;
    
    // Copy the vertices into the vertex buffer.
    memcpy( pBytes, pVertices, nSizeOfData );
    m_pVertexBuffer->Unlock();
 
    // Render the primitives.
    m_pDevice->DrawPrimitive( D3DPT_TRIANGLELIST, 
               m_nNextVertexData/m_nVertexStride, nNumberOfVertices/3 );
 
    // Advance to the next position in the vertex buffer.
    m_nNextVertexData += nSizeOfData;
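The bookkeeping in usage style 2 can be isolated from the Direct3D calls and exercised on its own. The sketch below is a stand-alone model of that logic, not part of the API: `LockMode`, `LockPlan`, and `DynamicBufferCursor` are hypothetical names, and `LockMode` merely stands in for D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD. It assumes every request fits in the buffer.

```cpp
#include <cstddef>

// Stand-ins for the two lock flags; the real values come from d3d9.h.
enum class LockMode { NoOverwrite, Discard };

// Where to lock, and with which flag.
struct LockPlan {
    LockMode mode;
    std::size_t offsetBytes;
};

class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t bufferSizeBytes)
        : size_(bufferSizeBytes), next_(0) {}

    // Reserves sizeBytes and reports the lock offset and flag to use.
    // Assumes sizeBytes is no larger than the buffer.
    LockPlan Reserve(std::size_t sizeBytes) {
        LockPlan plan = { LockMode::NoOverwrite, next_ };
        if (next_ + sizeBytes > size_) {
            // Not enough room left: discard and start from the beginning.
            plan.mode = LockMode::Discard;
            plan.offsetBytes = 0;
        }
        // Advance past the region just handed out.
        next_ = plan.offsetBytes + sizeBytes;
        return plan;
    }

private:
    std::size_t size_;  // total buffer size in bytes
    std::size_t next_;  // first unused byte
};
```

With a 100-byte buffer and 40-byte requests, the first two reservations append at offsets 0 and 40, the third wraps to offset 0 with a discard, and the fourth appends at 40 again.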






Z-Buffer Performance
Applications can increase performance when using z-buffering and texturing by ensuring that scenes are rendered from front to back. Textured z-buffered primitives are pretested against the z-buffer on a scan line basis. If a scan line is hidden by a previously rendered polygon, the system rejects it quickly and efficiently. Z-buffering can improve performance, but the technique is most useful when a scene draws the same pixels more than once. This is difficult to calculate exactly, but you can often make a close approximation. If the same pixels are drawn less than twice, you can achieve the best performance by turning z-buffering off and rendering the scene from back to front.
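Both orderings mentioned above come down to sorting objects by their distance from the camera before drawing. A minimal sketch, assuming each object carries a precomputed depth along the view direction; `SceneObject`, `viewDepth`, and the two sort functions are hypothetical names for this example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical scene entry with a precomputed camera-space depth
// (larger values are farther from the camera).
struct SceneObject {
    int id;
    float viewDepth;
};

// Nearest objects first: suits rendering with the z-buffer enabled,
// so later, hidden pixels can be rejected early.
void SortFrontToBack(std::vector<SceneObject>& objects) {
    std::sort(objects.begin(), objects.end(),
              [](const SceneObject& a, const SceneObject& b) {
                  return a.viewDepth < b.viewDepth;
              });
}

// Farthest objects first: suits rendering with the z-buffer disabled,
// where nearer objects simply overwrite farther ones.
void SortBackToFront(std::vector<SceneObject>& objects) {
    std::sort(objects.begin(), objects.end(),
              [](const SceneObject& a, const SceneObject& b) {
                  return a.viewDepth > b.viewDepth;
              });
}
```

Sorting whole objects by a single depth value is only an approximation for large or interpenetrating geometry, so it is worth measuring both modes on a representative scene.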


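When rendering from front to back, the z-buffer rejects pixels that are occluded, which improves efficiency. However, when each pixel is drawn fewer than two times on average, it is more efficient to turn the z-buffer off and render from back to front.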