Introduction to Turing Mesh Shaders

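Introduction

The Turing architecture introduces a new programmable geometric shading pipeline through mesh shaders. Mesh shaders bring the compute programming model to the graphics pipeline: threads cooperate to generate compact meshes (meshlets) on the chip, which the rasterizer consumes directly. Applications and games that deal with high geometric complexity benefit from the flexibility of this two-stage approach, which allows efficient culling, level-of-detail (LOD) techniques, and procedural generation.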

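Motivation

The real world is visually rich and geometrically complex, with objects of every shape in every arrangement. Outdoor scenes in particular can contain millions or even tens of millions of elements (rocks, trees, small plants, and so on). CAD models pose similar challenges on both fronts: complex surface shapes, and objects assembled from many small parts, such as spaceships. Figure 1 shows some examples where today's graphics pipeline, with vertex, tessellation, and geometry shaders, instancing, and multi draw indirect, is very effective but can still be limited once the full-resolution geometry reaches tens of millions of triangles and hundreds of thousands of objects.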

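Figure 1. Increasing realism through massive amounts of complex geometry.

Other use cases not shown above include geometry from scientific computing (particles, glyphs, proxy objects, point clouds) or procedural shapes (electrical engineering layouts, VFX particles, ribbons and trails, path rendering).

In the following, we look at using mesh shaders to accelerate the rendering of heavy triangle meshes. The original mesh is segmented into smaller meshlets, as Figure 2 shows. Ideally, each meshlet is built to maximize vertex re-use within it. Using the new hardware stages and this segmentation scheme, we can render more geometry in parallel while fetching less data overall.

Figure 2. Large meshes are decomposed into meshlets, which are rendered by mesh shaders.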

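For example, CAD data can reach tens or even hundreds of millions of triangles, dense enough that several triangles may fall within a single pixel. Even after occlusion culling, a large number of triangles can remain. In this scenario, some fixed-function steps in the pipeline may do wasteful work and memory loads:

The hardware's primitive distributor scans the index buffer every time to create vertex batches, even if the topology has not changed.
Vertex and attribute data are fetched even when they are not visible (back-facing, outside the view frustum, or removed by sub-pixel culling).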

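Mesh shaders give developers new ways to avoid such bottlenecks. The new approach allows memory to be read once and kept on-chip, as opposed to previous approaches such as compute-shader-based primitive culling (see [3], [4], [5]), where index buffers of the visible triangles are computed and then drawn indirectly.

The mesh shader stage produces triangles for the rasterizer, but internally it uses a cooperative thread model, similar to compute shaders, instead of a single-thread program model. In the new pipeline, the stage ahead of the mesh shader is the task shader. The task shader operates similarly to the control stage of tessellation (the tessellation control shader), in that it can dynamically generate work. However, it uses a cooperative thread model, and instead of taking a patch as input and producing tessellation decisions as output, its input and output are user defined.

This simplifies on-chip geometry creation compared with the previous rigid and limited tessellation and geometry shaders, where threads could only be used for specific tasks, as shown in Figure 3.

Figure 3. Mesh shaders represent the next step in handling geometric complexity.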

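Mesh Shading Pipeline

A new two-stage pipeline offers an alternative to the classic attribute fetch, vertex, tessellation, and geometry shader pipeline. This new pipeline consists of a task shader and a mesh shader:

Task shader: a programmable unit that operates in workgroups and allows each workgroup to emit (or not emit) mesh shader workgroups.
Mesh shader: a programmable unit that operates in workgroups and allows each workgroup to generate primitives.

The mesh shader stage produces triangles for the rasterizer, internally using the cooperative thread model mentioned above. The task shader operates similarly to the hull shader stage of tessellation, in that it can dynamically generate work. However, like the mesh shader, the task shader also uses a cooperative thread model. Its input and output are user defined, instead of taking a patch as input and producing tessellation decisions as output.

The interface to the pixel/fragment shader is unaffected. The traditional pipeline is still available and, depending on the use case, can still deliver very good results. Figure 4 highlights the differences between the two pipeline styles.

Figure 4. The mesh shading pipeline replaces the regular VTG pipeline (VTG = Vertex / Tessellation / Geometry).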

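The new mesh shader pipeline provides several benefits for developers:

Higher scalability: more of the primitive processing moves into the shader units, reducing the impact of fixed-function hardware. Because modern GPUs are general purpose, a wide variety of applications benefit from added cores and from improvements in the shaders' generic memory and arithmetic performance.
Bandwidth reduction: de-duplication of vertices (vertex re-use) can be done up front and reused over many frames, whereas the current API model means the hardware scans the index buffer every time. Larger meshlets mean higher vertex re-use, which also lowers bandwidth requirements. Developers can also introduce their own compression or procedural generation schemes. The optional expansion/filtering via task shaders makes it possible to skip fetching some data entirely.
Flexibility in defining the mesh topology and creating graphics work: previous tessellation shaders were limited to fixed tessellation patterns, while geometry shaders suffered from inefficient threading and an unfriendly programming model that created triangle strips per thread.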

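Mesh shading follows the programming model of compute shaders, giving developers the freedom to use threads for different purposes and to share data among them. When rasterization is disabled, the two stages can also be used for generic compute work.

Figure 5. Mesh shaders behave similarly to compute shaders, using a cooperative thread model.

Both mesh and task shaders follow the compute shader programming model: they use cooperative threads to compute their results and have no inputs other than a workgroup index. They execute in the graphics pipeline, so the hardware directly manages the memory passed between stages and keeps it on-chip.

We will show an example of how this can be used for primitive culling, since the threads can access all vertices within a workgroup. Figure 6 illustrates the ability of task shaders to take care of early culling.

Figure 6. Task shaders are optional; when used, they enable early culling to improve throughput.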

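The optional expansion via task shaders allows a group of primitives to be culled early, or LOD decisions to be made up front. This mechanism supersedes instancing or multi draw indirect for drawing small meshes. The configuration is similar to a tessellation control shader setting how finely a patch (~task workgroup) is tessellated, and thereby how many tessellation evaluation invocations (~mesh workgroups) are created.

There is a limit on how many mesh workgroups a single task workgroup can emit: first-generation hardware supports a maximum of 64K children per task. There is no limit on the total number of mesh children across all tasks of a draw call. Likewise, if no task shader is used, there is no limit on how many mesh workgroups a draw call generates. Figure 7 illustrates how this works.

Figure 7. Mesh shader workgroup flow

Children of task T are guaranteed to be launched after the children of task T-1. However, task and mesh workgroups are fully pipelined, so there is no waiting for previous children or tasks to complete.

Task shaders should be used for dynamic work generation or filtering. Static setups benefit from using the mesh shader alone.

The rasterization output order of the meshes, and of the primitives within them, is preserved. With rasterization disabled, task and mesh shaders can be used to implement basic compute trees.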

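Meshlets and Mesh Shading

Each meshlet represents a variable number of vertices and primitives. There are no restrictions on how these primitives are connected. However, their counts must stay below the maximums specified in the shader code.

We recommend using up to 64 vertices and 126 primitives. The '6' in 126 is not a typo. First-generation hardware allocates primitive indices in 128-byte granules and reserves 4 bytes for the primitive count. With one byte per index, 126 triangles need 3 * 126 + 4 = 382 bytes, which fits into a 3 * 128 = 384-byte block. Going beyond 126 triangles would allocate the next 128 bytes. 84 and 40 are other maxima that work well for triangles.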
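The byte budget above can be checked with a few lines of arithmetic. This C++ sketch assumes only what the paragraph states (one byte per index, a 4-byte primitive count, 128-byte allocation granules); it models that rule, not the actual hardware allocator.

```cpp
#include <cassert>
#include <cstdint>

// Assumed allocation model, taken from the text: one byte per index,
// a 4-byte primitive count, and allocation in 128-byte granules.
const uint32_t kGranuleBytes = 128;
const uint32_t kCountBytes = 4;

// Bytes needed for a triangle meshlet's index list plus its count.
uint32_t primitiveBytes(uint32_t triangles) {
    return triangles * 3 + kCountBytes;
}

// Number of 128-byte granules that allocation occupies (rounded up).
uint32_t granulesNeeded(uint32_t triangles) {
    return (primitiveBytes(triangles) + kGranuleBytes - 1) / kGranuleBytes;
}

// Largest triangle count whose indices and count fit into `granules` blocks.
uint32_t maxTrianglesFor(uint32_t granules) {
    assert(granules >= 1);
    return (granules * kGranuleBytes - kCountBytes) / 3;
}
```

Under this model maxTrianglesFor(3) is 126 and maxTrianglesFor(2) is 84, matching the recommended values, and going from 126 to 127 triangles raises granulesNeeded from 3 to 4. For a single block the rule yields 41, so the 40 quoted above is slightly more conservative than this simple model.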

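In GLSL mesh shader code, a fixed amount of mesh memory is allocated in the graphics pipeline for every workgroup.

The allocation size of each meshlet depends on sizes fixed at compile time as well as on which output attributes the shader references. The smaller the allocation, the more workgroups can run in parallel on the hardware. As with compute shaders, workgroups share a section of on-chip memory that their threads can access. We therefore recommend using outputs and shared memory as efficiently as possible. This is already true for current shaders; however, the memory footprint can be higher here, since the programming model allows larger numbers of vertices and primitives than before.

The maximum counts, sizes, and primitive output are defined as follows: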

// Set the number of threads per workgroup (always one-dimensional).
// The limitations may be different than in actual compute shaders.
layout(local_size_x=32) in;

// the primitive type (points, lines or triangles)
layout(triangles) out;
// maximum allocation size for each meshlet
layout(max_vertices=64, max_primitives=126) out;

// the actual amount of primitives the workgroup outputs ( <= max_primitives)
out uint gl_PrimitiveCountNV;
// an index buffer, using list type indices (strips are not supported here)
out uint gl_PrimitiveIndicesNV[]; // [max_primitives * 3 for triangles]

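Turing also supports another new GLSL extension, NV_fragment_shader_barycentric, which enables the fragment shader to fetch the raw data of the three vertices that make up a primitive and interpolate it manually. This raw access means we can output "uint" vertex attributes and use various pack/unpack functions to store floats as fp16, unorm8, or snorm8. This can greatly reduce the per-vertex footprint of normals, texture coordinates, and basic color values, and it benefits both the standard and the mesh shading pipelines.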
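To illustrate the "uint attribute plus pack/unpack" idea, the following C++ sketch mirrors the documented behavior of GLSL's packUnorm4x8 and packSnorm4x8 on the CPU (the fp16 path is omitted). The names follow the GLSL built-ins for readability, but this is a host-side illustration, not the shader implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// CPU-side mirror of GLSL packUnorm4x8: clamp each component to [0,1],
// scale to 0..255, round, and store component i in byte i (lowest first).
uint32_t packUnorm4x8(const float v[4]) {
    uint32_t packed = 0;
    for (int i = 0; i < 4; ++i) {
        float c = std::min(std::max(v[i], 0.0f), 1.0f);
        uint32_t q = static_cast<uint32_t>(std::lround(c * 255.0f));
        packed |= q << (8 * i);
    }
    return packed;
}

// CPU-side mirror of GLSL packSnorm4x8: clamp to [-1,1], scale to
// -127..127, round, and store each value as a two's-complement byte.
uint32_t packSnorm4x8(const float v[4]) {
    uint32_t packed = 0;
    for (int i = 0; i < 4; ++i) {
        float c = std::min(std::max(v[i], -1.0f), 1.0f);
        long q = std::lround(c * 127.0f);
        packed |= (static_cast<uint32_t>(q) & 0xFFu) << (8 * i);
    }
    return packed;
}

// Decode byte i of a packSnorm4x8 word, as unpackSnorm4x8 does per component.
float unpackSnorm8(uint32_t packed, int i) {
    int q = static_cast<int>((packed >> (8 * i)) & 0xFFu);
    if (q > 127) q -= 256;  // reinterpret the byte as two's complement
    return std::max(static_cast<float>(q) / 127.0f, -1.0f);
}
```

For example, a unit normal stored as snorm8 occupies one 32-bit word instead of three floats (12 bytes), at the cost of 8-bit precision per component.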

Additional vertex and primitive attributes are defined as follows:

out gl_MeshPerVertexNV {
  vec4  gl_Position;
  float gl_PointSize;
  float gl_ClipDistance[];
  float gl_CullDistance[];
} gl_MeshVerticesNV[];            // [max_vertices]

// define your own vertex output blocks as usual
out Interpolant {
  vec2 uv;
} OUT[];                          // [max_vertices]

// special purpose per-primitive outputs
perprimitiveNV out gl_MeshPerPrimitiveNV {
  int gl_PrimitiveID;
  int gl_Layer;
  int gl_ViewportIndex;
  int gl_ViewportMask[];          // [1]
} gl_MeshPrimitivesNV[];          // [max_primitives]

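One goal is to have as few meshlets as possible, which maximizes vertex re-use within the meshlets and therefore wastes less of the allocated space. It is beneficial to apply a vertex cache optimizer to the index buffer before generating the meshlet data; Tom Forsyth's linear-speed optimizer, for example, can be used for this. Optimizing the vertex locations along with the index buffer is also beneficial, because the ordering of the original triangles is preserved when using the mesh shader. CAD models are often "naturally" generated with strips and can therefore already have good data locality. Changing the index buffer can have negative side effects on the cluster culling properties of a meshlet (see task-level culling).

Pre-Computed Meshlets

As an example, we render static content whose index buffer does not change. The cost of generating the meshlet data can therefore be hidden when uploading the vertex and index data to device memory. Further benefits are possible when the vertex data is static as well (no per-vertex animation, no changes to vertex positions): this allows precomputing data that is useful for quickly culling entire meshlets.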
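As one example of such precomputed data, the C++ sketch below computes an axis-aligned bounding box per meshlet from static vertex positions, reading the meshlet's vertices through its slice of the vertex index buffer. The choice of an AABB and the names Vec3, Aabb and meshletBounds are illustrative assumptions; the text only says that culling data can be precomputed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Axis-aligned bounding box of one meshlet, computed offline so that a
// culling pass could test the whole meshlet at once.
struct Aabb { Vec3 min, max; };

// Grow a box over the meshlet's unique vertices, which live in
// vertexIndices[vertexBegin, vertexBegin + vertexCount).
Aabb meshletBounds(const std::vector<Vec3>& positions,
                   const std::vector<uint32_t>& vertexIndices,
                   uint32_t vertexBegin, uint32_t vertexCount) {
    assert(vertexCount > 0);
    const Vec3& first = positions[vertexIndices[vertexBegin]];
    Aabb box{first, first};
    for (uint32_t v = 1; v < vertexCount; ++v) {
        const Vec3& p = positions[vertexIndices[vertexBegin + v]];
        box.min = {std::min(box.min.x, p.x), std::min(box.min.y, p.y),
                   std::min(box.min.z, p.z)};
        box.max = {std::max(box.max.x, p.x), std::max(box.max.y, p.y),
                   std::max(box.max.z, p.z)};
    }
    return box;
}
```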

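Data Structure

In later samples we will provide a meshlet builder with a basic implementation that scans through the provided indices and starts a new meshlet each time either size limit (vertex count or primitive count) is hit.

For an input triangle mesh, it generates the following data: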

#include <cstdint>
#include <vector>

struct MeshletDesc {
  uint32_t vertexCount; // number of vertices used
  uint32_t primCount;   // number of primitives (triangles) used
  uint32_t vertexBegin; // offset into vertexIndices
  uint32_t primBegin;   // offset into primitiveIndices
};

std::vector<MeshletDesc>  meshletInfos;
std::vector<uint8_t>      primitiveIndices;

// use uint16_t when shorts are sufficient
std::vector<uint32_t>     vertexIndices;

Why are there two index buffers?

Consider the original triangle index buffer:

// let's look at the first two triangles of a batch of many more
triangleIndices = { 4,5,6,  8,4,6, ...}

It is split into two new index buffers.

While traversing the triangle indices, we build a set of unique vertex indices. This process is also known as vertex de-duplication.

vertexIndices = { 4,5,6,  8, ...}
// For the second triangle only vertex 8 must be added
// and the other vertices are re-used.

The primitive indices are adjusted to be relative to the vertexIndices entries.

// original data
triangleIndices  = { 4,5,6,  8,4,6, ...}
// new data
primitiveIndices = { 0,1,2,  3,0,2, ...}
// the primitive indices are local per meshlet

As soon as a size limit is reached (too many unique vertices or too many primitives), a new meshlet is started. Subsequent meshlets then build their own sets of unique vertices.
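The builder itself is not listed here, so the following is a minimal greedy sketch in C++ that follows the description above: one pass over the triangle indices, per-meshlet vertex de-duplication, and a new meshlet whenever the next triangle would exceed a limit. It is an illustration, not NVIDIA's sample code; it treats primBegin as an offset counted in indices and does not pad the primitive indices for alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct MeshletDesc {
    uint32_t vertexCount; // number of vertices used
    uint32_t primCount;   // number of primitives (triangles) used
    uint32_t vertexBegin; // offset into vertexIndices
    uint32_t primBegin;   // offset into primitiveIndices (counted in indices)
};

struct MeshletData {
    std::vector<MeshletDesc> meshletInfos;
    std::vector<uint8_t>     primitiveIndices;
    std::vector<uint32_t>    vertexIndices;
};

// Greedy single pass: de-duplicate vertices per meshlet and start a new
// meshlet whenever the next triangle would exceed either limit.
MeshletData buildMeshlets(const std::vector<uint32_t>& triangleIndices,
                          uint32_t maxVertices = 64, uint32_t maxPrims = 126) {
    assert(triangleIndices.size() % 3 == 0);
    assert(maxVertices >= 3 && maxVertices <= 256 && maxPrims >= 1);
    MeshletData out;
    MeshletDesc cur{0, 0, 0, 0};
    std::unordered_map<uint32_t, uint8_t> local; // global -> local index

    auto closeMeshlet = [&]() {
        if (cur.primCount > 0) out.meshletInfos.push_back(cur);
        cur = {0, 0, static_cast<uint32_t>(out.vertexIndices.size()),
               static_cast<uint32_t>(out.primitiveIndices.size())};
        local.clear();
    };

    for (size_t t = 0; t < triangleIndices.size(); t += 3) {
        // Count vertices of this triangle not yet in the current meshlet
        // (a corner repeated within the triangle is counted once).
        uint32_t newVerts = 0;
        for (size_t c = 0; c < 3; ++c) {
            uint32_t g = triangleIndices[t + c];
            bool repeated = (c > 0 && triangleIndices[t] == g) ||
                            (c > 1 && triangleIndices[t + 1] == g);
            if (!repeated && local.count(g) == 0) ++newVerts;
        }
        if (cur.vertexCount + newVerts > maxVertices || cur.primCount == maxPrims)
            closeMeshlet();

        for (size_t c = 0; c < 3; ++c) {
            uint32_t g = triangleIndices[t + c];
            auto it = local.find(g);
            if (it == local.end()) {
                it = local.emplace(g, static_cast<uint8_t>(cur.vertexCount++)).first;
                out.vertexIndices.push_back(g);
            }
            out.primitiveIndices.push_back(it->second);
        }
        ++cur.primCount;
    }
    closeMeshlet();
    return out;
}
```

Fed the two example triangles { 4,5,6, 8,4,6 }, this sketch produces vertexIndices { 4,5,6,8 } and primitiveIndices { 0,1,2, 3,0,2 }, matching the walk-through above.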

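Rendering Resources and Data Flow

For rendering, we use the original vertex buffers. However, instead of the original triangle index buffer, we use three new buffers, shown in Figure 8 below:

Vertex Index Buffer: as explained above. Each meshlet references a set of unique vertices. The indices of these vertices are stored sequentially, for all meshlets, in one buffer.
Primitive Index Buffer: as explained above. Each meshlet represents a varying number of primitives. Each triangle requires three primitive indices, which are stored in a single buffer. Note: extra indices may be added after each meshlet to obtain four-byte alignment.
Meshlet Desc Buffer: stores the workload information and buffer offsets of each meshlet, as well as cluster culling information.

These three buffers are actually smaller than the original index buffer, because mesh shading allows higher vertex re-use. We observed a reduction to around 75% of the original index buffer size.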
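To see where the savings come from, the following C++ sketch tallies the three buffers for a list of meshlets. The widths are assumptions chosen for illustration: 32-bit vertex indices, 1-byte primitive indices padded to primAlignment bytes after each meshlet, and one 128-bit (16-byte) descriptor per meshlet. It does not reproduce the ~75% figure, which is a measurement on real meshes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round n up to a multiple of `alignment` (must be a power of two).
size_t alignUp(size_t n, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

struct MeshletCounts { uint32_t vertexCount, primCount; };

// Estimated bytes for the three buffers, assuming 32-bit vertex indices,
// 8-bit primitive indices padded per meshlet, and a 16-byte descriptor.
size_t meshletBufferBytes(const std::vector<MeshletCounts>& meshlets,
                          size_t primAlignment = 4) {
    size_t vertexIndexBytes = 0, primIndexBytes = 0;
    for (const MeshletCounts& m : meshlets) {
        vertexIndexBytes += size_t(m.vertexCount) * sizeof(uint32_t);
        primIndexBytes   += alignUp(size_t(m.primCount) * 3, primAlignment);
    }
    return vertexIndexBytes + primIndexBytes + meshlets.size() * 16;
}

// The original 32-bit triangle index buffer, for comparison.
size_t originalIndexBytes(size_t triangleCount) {
    return triangleCount * 3 * sizeof(uint32_t);
}
```

The alignment is a parameter because the RG32-based shader later in the text computes primBegin / 8, which only addresses the right texel when each meshlet's primitive indices start on an 8-index boundary.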

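Figure 8. Meshlet buffer structure

Meshlet Vertices: vertexBegin stores the position in the vertex index buffer from which we start fetching. vertexCount stores the number of consecutive vertices involved. Vertices are unique within a meshlet; there are no duplicate index values.
Meshlet Primitives: primBegin stores the position in the primitive index buffer from which we start fetching. primCount stores the number of primitives in the meshlet. Note that the number of indices depends on the primitive type (here 3, for triangles). Importantly, the indices reference vertices relative to vertexBegin, which means index '0' refers to the vertex index located at vertexBegin.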
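The relative indexing can be made concrete with a small decoder. This C++ sketch resolves each local primitive index through vertexBegin into a global vertex index and rebuilds a plain triangle list, which is also a handy way to validate a meshlet builder. It assumes primBegin counts indices (3 per triangle); globalVertex and decodeTriangles are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct MeshletDesc {
    uint32_t vertexCount, primCount, vertexBegin, primBegin;
};

// Resolve one corner of one triangle to a global vertex index. A local
// index of 0 refers to vertexIndices[vertexBegin].
uint32_t globalVertex(const MeshletDesc& m,
                      const std::vector<uint8_t>& primitiveIndices,
                      const std::vector<uint32_t>& vertexIndices,
                      uint32_t prim, uint32_t corner) {
    assert(prim < m.primCount && corner < 3);
    uint32_t localIndex = primitiveIndices[m.primBegin + prim * 3 + corner];
    assert(localIndex < m.vertexCount);
    return vertexIndices[m.vertexBegin + localIndex];
}

// Rebuild a plain triangle list from all meshlets.
std::vector<uint32_t> decodeTriangles(const std::vector<MeshletDesc>& meshlets,
                                      const std::vector<uint8_t>& primitiveIndices,
                                      const std::vector<uint32_t>& vertexIndices) {
    std::vector<uint32_t> tris;
    for (const MeshletDesc& m : meshlets)
        for (uint32_t p = 0; p < m.primCount; ++p)
            for (uint32_t c = 0; c < 3; ++c)
                tris.push_back(globalVertex(m, primitiveIndices, vertexIndices, p, c));
    return tris;
}
```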

The following pseudo code describes, in principle, what each mesh shader workgroup executes. It is serial and only meant for illustration.

// This code is just a serial pseudo code,
// and doesn't reflect actual GLSL code that would
// leverage the workgroup's local thread invocations.

for (int v = 0; v < meshlet.vertexCount; v++){
  int vertexIndex = texelFetch(vertexIndexBuffer, meshlet.vertexBegin + v).x;
  vec4 vertex = texelFetch(vertexBuffer, vertexIndex);
  gl_MeshVerticesNV[v].gl_Position = transform * vertex;
}

for (int p = 0; p < meshlet.primCount; p++){
  uvec3 triangle = getTriIndices(primitiveIndexBuffer, meshlet.primBegin + p);
  gl_PrimitiveIndicesNV[p * 3 + 0] = triangle.x;
  gl_PrimitiveIndicesNV[p * 3 + 1] = triangle.y;
  gl_PrimitiveIndicesNV[p * 3 + 2] = triangle.z;
}

// one thread writes the output primitive count
gl_PrimitiveCountNV = meshlet.primCount;

Written to run in parallel across the workgroup's threads, the mesh shader looks like this:

void main() {
  ...

  // As the workgroupSize may be less than the max_vertices/max_primitives
  // we still require an outer loop. Given their static nature
  // they should be unrolled by the compiler in the end.

  // Resolved at compile time
  const uint vertexLoops =
    (MAX_VERTEX_COUNT + GROUP_SIZE - 1) / GROUP_SIZE;

  for (uint loop = 0; loop < vertexLoops; loop++){
    // distribute execution across threads
    uint v = gl_LocalInvocationID.x + loop * GROUP_SIZE;

    // Avoid branching to get pipelined memory loads.
    // Downside is we may redundantly compute the last
    // vertex several times
    v = min(v, meshlet.vertexCount-1);
    {
      int vertexIndex = texelFetch( vertexIndexBuffer,
                                    int(meshlet.vertexBegin + v)).x;
      vec4 vertex = texelFetch(vertexBuffer, vertexIndex);
      gl_MeshVerticesNV[v].gl_Position = transform * vertex;
    }
  }

  // Let's pack 8 indices into RG32 bit texture
  // (primBegin / 8 assumes primBegin is a multiple of 8)
  uint primreadBegin = meshlet.primBegin / 8;
  uint primreadIndex = meshlet.primCount * 3 - 1;
  uint primreadMax   = primreadIndex / 8;

  // resolved at compile time and typically just 1
  const uint primreadLoops =
    (MAX_PRIMITIVE_COUNT * 3 + GROUP_SIZE * 8 - 1)
      / (GROUP_SIZE * 8);

  for (uint loop = 0; loop < primreadLoops; loop++){
    uint p = gl_LocalInvocationID.x + loop * GROUP_SIZE;
    p = min(p, primreadMax);

    uvec2 topology = texelFetch(primitiveIndexBuffer,
                                int(primreadBegin + p)).rg;

    // use a built-in function, we took special care before when
    // sizing the meshlets to ensure we don't exceed the
    // gl_PrimitiveIndicesNV array here

    writePackedPrimitiveIndices4x8NV(p * 8 + 0, topology.x);
    writePackedPrimitiveIndices4x8NV(p * 8 + 4, topology.y);
  }

  if (gl_LocalInvocationID.x == 0) {
    gl_PrimitiveCountNV = meshlet.primCount;
  }
}
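The compile-time constants in this shader are easy to check on the CPU. The C++ sketch below evaluates vertexLoops and primreadLoops for the 32-thread, 64-vertex, 126-primitive configuration declared earlier, mirrors the min() clamp, and shows one way to pack four 8-bit indices into a 32-bit word. With these sizes primreadLoops comes out as 2 rather than the "typically just 1" in the shader comment (it is 1 for, e.g., 84 primitives). The byte order, first index in the lowest byte, is an assumption about how writePackedPrimitiveIndices4x8NV unpacks its argument.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors of the shader's constants (assumed 32 / 64 / 126 configuration).
const uint32_t GROUP_SIZE = 32;
const uint32_t MAX_VERTEX_COUNT = 64;
const uint32_t MAX_PRIMITIVE_COUNT = 126;

// One vertex per thread per iteration.
const uint32_t vertexLoops = (MAX_VERTEX_COUNT + GROUP_SIZE - 1) / GROUP_SIZE;

// One RG32 texel (8 one-byte indices) per thread per iteration.
const uint32_t primreadLoops =
    (MAX_PRIMITIVE_COUNT * 3 + GROUP_SIZE * 8 - 1) / (GROUP_SIZE * 8);

// Vertex slot a thread handles in a given iteration, clamped like the
// shader's min(), so threads past the end redo the last vertex.
uint32_t vertexSlot(uint32_t thread, uint32_t loop, uint32_t vertexCount) {
    assert(vertexCount > 0);
    uint32_t v = thread + loop * GROUP_SIZE;
    return v < vertexCount - 1 ? v : vertexCount - 1;
}

// Pack four 8-bit primitive indices into one 32-bit word, first index in
// the lowest byte (assumed to match writePackedPrimitiveIndices4x8NV).
uint32_t pack4x8(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return uint32_t(a) | (uint32_t(b) << 8) | (uint32_t(c) << 16) |
           (uint32_t(d) << 24);
}
```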

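The mesh shader above is just a simple implementation. Since all data fetching is done by the developer, custom encodings, decompression via subgroup intrinsics or shared memory, or temporarily using the vertex outputs can all save additional bandwidth.

Cluster Culling with Task Shader

We try to squeeze more information into the meshlet descriptor to perform early culling. We experimented with 128-bit descriptors that encode the values mentioned above, plus a relative bounding box and a cone for backface cluster culling, as presented by G. Wihlidal. When generating meshlets, cluster-culling properties have to be balanced against better vertex re-use, since favoring one can hurt the other.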
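The cone part of such a descriptor can be illustrated with a normal-cone backface test in the spirit of Wihlidal's cluster culling. In this C++ sketch each meshlet stores an apex, a unit axis (the average facing direction), and the half-angle that bounds all triangle normals; the cluster is rejected when the direction from the eye to the apex lies within 90 degrees minus the half-angle of the axis, that is, when its dot product with the axis is at least sin(halfAngle). The struct layout and names are assumptions for illustration, the apex is assumed to be precomputed so the test stays conservative, and published variants of this test differ in detail.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 normalized(Vec3 v) {
    float len = std::sqrt(dot(v, v));
    assert(len > 0.0f);
    return {v.x / len, v.y / len, v.z / len};
}

// Normal cone of one meshlet: every triangle normal lies within halfAngle
// (radians) of the unit vector axis.
struct NormalCone {
    Vec3  apex;
    Vec3  axis;
    float halfAngle;
};

// Returns true if all triangles of the cluster face away from the eye.
bool coneBackfaceCull(const NormalCone& cone, Vec3 eye) {
    const float kHalfPi = 1.57079632679f;
    if (cone.halfAngle >= kHalfPi) return false;  // normals too spread out
    Vec3 view = normalized(sub(cone.apex, eye));  // eye -> cluster direction
    return dot(view, cone.axis) >= std::sin(cone.halfAngle);
}
```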

The following task shader culls 32 meshlets per workgroup, one per thread.

layout(local_size_x=32) in;

taskNV out Task {
  uint      baseID;
  uint8_t   subIDs[GROUP_SIZE];
} OUT;

void main() {
  // we padded the buffer to ensure we don't access it out of bounds
  uvec4 desc = meshletDescs[gl_GlobalInvocationID.x];

  // implement some early culling function
  bool render = gl_GlobalInvocationID.x < meshletCount && !earlyCull(desc);

  uvec4 vote  = subgroupBallot(render);
  uint  tasks = subgroupBallotBitCount(vote);

  if (gl_LocalInvocationID.x == 0) {
    // write the number of surviving meshlets, i.e.
    // mesh workgroups to spawn
    gl_TaskCountNV = tasks;

    // where the meshletIDs started from for this task workgroup
    OUT.baseID = gl_WorkGroupID.x * GROUP_SIZE;
  }

  {
    // write which children survived into a compact array
    uint idxOffset = subgroupBallotExclusiveBitCount(vote);
    if (render) {
      OUT.subIDs[idxOffset] = uint8_t(gl_LocalInvocationID.x);
    }
  }
}

The corresponding mesh shader now uses the information from the task shader to select its meshlet.

taskNV in Task {
  uint      baseID;
  uint8_t   subIDs[GROUP_SIZE];
} IN;

void main() {
  // We can no longer use gl_WorkGroupID.x directly
  // as it now encodes which child this workgroup is.
  uint meshletID = IN.baseID + IN.subIDs[gl_WorkGroupID.x];
  uvec4 desc = meshletDescs[meshletID];
  ...
}
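The ballot-based compaction in these two shaders can be emulated on the CPU to see which meshlet each mesh workgroup ends up processing. This C++ sketch assumes a 32-thread subgroup that equals the task workgroup, as the GLSL above implicitly does; TaskOutput, compactTask and meshletForChild are hypothetical names, and the culling decision itself (including the meshletCount bounds check) is left to the caller.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

const uint32_t GROUP_SIZE = 32;

// What one task workgroup hands to its mesh workgroups (cf. the Task block).
struct TaskOutput {
    uint32_t taskCount;           // value written to gl_TaskCountNV
    uint32_t baseID;              // first meshlet ID covered by this workgroup
    uint8_t  subIDs[GROUP_SIZE];  // local IDs of surviving meshlets, compacted
};

// CPU emulation of the ballot-based compaction; render[i] is thread i's vote.
TaskOutput compactTask(uint32_t workGroupID, const bool render[GROUP_SIZE]) {
    TaskOutput out{};
    out.baseID = workGroupID * GROUP_SIZE;

    uint32_t vote = 0;  // like subgroupBallot: bit i set if thread i votes true
    for (uint32_t i = 0; i < GROUP_SIZE; ++i)
        if (render[i]) vote |= 1u << i;

    // like subgroupBallotBitCount
    out.taskCount = static_cast<uint32_t>(std::bitset<32>(vote).count());

    for (uint32_t i = 0; i < GROUP_SIZE; ++i) {
        if (!render[i]) continue;
        // like subgroupBallotExclusiveBitCount: set bits below thread i
        uint32_t below = vote & ((1u << i) - 1u);
        uint32_t idxOffset = static_cast<uint32_t>(std::bitset<32>(below).count());
        out.subIDs[idxOffset] = static_cast<uint8_t>(i);
    }
    return out;
}

// The meshlet a mesh workgroup processes, given its gl_WorkGroupID.x.
uint32_t meshletForChild(const TaskOutput& t, uint32_t child) {
    assert(child < t.taskCount);
    return t.baseID + t.subIDs[child];
}
```

If every even thread of task workgroup 3 survives, 16 mesh workgroups are launched and child 5 processes meshlet 96 + 10 = 106.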

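In the context of rendering huge triangle models, we only cull meshlets in the task shader. Other scenarios may involve selecting different meshlet data depending on level-of-detail decisions, or generating the geometry entirely (particles, ribbons, etc.). Figure 9 below shows a demo that uses task shaders for level-of-detail computation.

Figure 9. NVIDIA Asteroids demo using mesh shading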

[1] Art by Rens
[2] Photo by Chris Christian, model by Russell Berkoff
[3] Optimizing Graphics Pipeline with Compute, Graham Wihlidal
[4] GPU-Driven Rendering Pipelines, Ulrich Haar & Sebastian Aaltonen
[5] The filtered and culled Visibility Buffer, Wolfgang Engel

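After translating this article, I found the original text hard to follow in places. Recommended further reading:

extensions
- NV_mesh_shader.txt
- GLSL/extensions/nv/GLSL_NV_mesh_shader.txt
Geek3D
Advanced Mesh Shaders | Martin Fuller | DirectX Developer Day: Microsoft's DirectX 12 video presentation, 2020.03.19.
【Vulkan/MeshShader】第一個MeshShader 程序 ("[Vulkan/MeshShader] My first MeshShader program"): a mesh shader written by a Zhihu user.

Noting this down for now; once I have a GPU that supports mesh shaders, I will implement an OpenGL mesh shader demo.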
