trt is perf_client

perf_client

A critical part of optimizing the inference performance of your model is being able to measure changes in performance as you experiment with different optimization strategies. The perf_client application performs this task for the Triton Inference Server. It is included with the client examples, which are available from several sources.

The perf_client generates inference requests to your model and measures the throughput and latency of those requests. To get representative results, perf_client measures throughput and latency over a time window and then repeats the measurement until the values are stable. By default perf_client uses average latency to determine stability, but you can use the --percentile flag to stabilize results on a latency percentile instead. With --percentile=95, the results are stabilized using the 95th-percentile request latency. (Translator's note: it is worth running the test several times to see how this setting affects the latency figures.) For example:

$ perf_client -m resnet50_netdef --percentile=95
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Stabilizing using p95 latency

Request concurrency: 1
  Client:
    Request count: 809
    Throughput: 161.8 infer/sec
    p50 latency: 6178 usec
    p90 latency: 6237 usec
    p95 latency: 6260 usec
    p99 latency: 6339 usec
    Avg HTTP time: 6153 usec (send/recv 72 usec + response wait 6081 usec)
  Server:
    Request count: 971
    Avg request latency: 4824 usec (overhead 10 usec + queue 39 usec + compute 4775 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, 161.8 infer/sec, latency 6260 usec
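To make the idea of stabilizing on a percentile concrete, here is a minimal sketch. It is not perf_client's implementation: the nearest-rank percentile, the `is_stable` rule (last three windows within 10% of each other), and the latency samples are all assumptions made up for illustration.

```python
import math

def percentile(latencies_usec, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(latencies_usec)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def is_stable(window_values, tolerance=0.10):
    """Hypothetical rule: the last three windows agree within `tolerance`."""
    if len(window_values) < 3:
        return False
    recent = window_values[-3:]
    return (max(recent) - min(recent)) / max(recent) <= tolerance

# Made-up request latencies (usec) for three consecutive measurement windows.
windows = [
    [6100, 6150, 6200, 6250, 9000],  # one slow outlier in the first window
    [6120, 6160, 6210, 6240, 6300],
    [6110, 6170, 6190, 6260, 6310],
]
p95_per_window = [percentile(w, 95) for w in windows]
print(p95_per_window, "stable:", is_stable(p95_per_window))
```

In this sketch the outlier in the first window keeps the p95 series from being considered stable, so a further window would be measured.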

Request Concurrency

By default perf_client measures your model's latency and throughput using the lowest possible load on the model. To do this, perf_client sends one inference request to the server and waits for the response. When that response is received, perf_client immediately sends another request, and it repeats this process for the duration of the measurement window. The number of outstanding inference requests is referred to as the request concurrency, so by default perf_client uses a request concurrency of 1.

Using the --concurrency-range <start>:<end>:<step> option you can have perf_client collect data for a range of request concurrency levels. Use the --help option to see complete documentation for this and other options. For example, to see the latency and throughput of your model for request concurrency values from 1 to 4:

$ perf_client -m resnet50_netdef --concurrency-range 1:4
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 4 concurrent requests
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 804
    Throughput: 160.8 infer/sec
    Avg latency: 6207 usec (standard deviation 267 usec)
    p50 latency: 6212 usec
...
Request concurrency: 4
  Client:
    Request count: 1042
    Throughput: 208.4 infer/sec
    Avg latency: 19185 usec (standard deviation 105 usec)
    p50 latency: 19168 usec
    p90 latency: 19218 usec
    p95 latency: 19265 usec
    p99 latency: 19583 usec
    Avg HTTP time: 19156 usec (send/recv 79 usec + response wait 19077 usec)
  Server:
    Request count: 1250
    Avg request latency: 18099 usec (overhead 9 usec + queue 13314 usec + compute 4776 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, 160.8 infer/sec, latency 6207 usec
Concurrency: 2, 209.2 infer/sec, latency 9548 usec
Concurrency: 3, 207.8 infer/sec, latency 14423 usec
Concurrency: 4, 208.4 infer/sec, latency 19185 usec
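One way to read the summary above: with a fixed number of outstanding requests, throughput is roughly concurrency divided by average latency. The sketch below checks that relationship against the reported numbers. The formula is a general queueing rule of thumb, not something perf_client is documented here to compute, and `estimated_throughput` is a name chosen for this example.

```python
# (concurrency, reported infer/sec, average latency in usec), copied from the summary.
reported = [
    (1, 160.8, 6207),
    (2, 209.2, 9548),
    (3, 207.8, 14423),
    (4, 208.4, 19185),
]

def estimated_throughput(concurrency, latency_usec):
    """Requests in flight divided by the time each one takes, in infer/sec."""
    return concurrency / (latency_usec / 1_000_000)

estimates = [estimated_throughput(c, lat) for c, _, lat in reported]
for (concurrency, throughput, _), estimate in zip(reported, estimates):
    print(f"concurrency {concurrency}: reported {throughput:.1f}, estimated {estimate:.1f} infer/sec")
```

The estimates land within about 1% of the reported values. Throughput stops improving after concurrency 2 while latency keeps growing, which suggests the extra requests are simply waiting in the queue.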

Understanding The Output

For each request concurrency level, perf_client reports latency and throughput as seen from the client (that is, as seen by perf_client) and also the average request latency on the server.

The server latency measures the total time from when the request is received at the server until the response is sent from the server. Because of the HTTP and GRPC libraries used to implement the server endpoints, total server latency is typically more accurate for HTTP requests, where it measures the time from the first byte received until the last byte sent. For both HTTP and GRPC the total server latency is broken down into the following components:

queue: The average time a request spends in the inference scheduling queue waiting for an instance of the model to become available.

compute: The average time spent performing the actual inference, including any time needed to copy data to and from the GPU.

The client latency is broken down further for HTTP and GRPC as follows:

HTTP: send/recv indicates the time the client spends sending the request and receiving the response. response wait indicates the time spent waiting for the response from the server.

GRPC: (un)marshal request/response indicates the time spent marshalling the request data into the GRPC protobuf and unmarshalling the response data from the GRPC protobuf. response wait indicates the time spent writing the GRPC request to the network, waiting for the response, and reading the GRPC response from the network.

Use the verbose (-v) option to perf_client to see more output, including the stabilization passes run for each request concurrency level.
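The latency breakdowns described in this section can be checked directly against the sample output. The following sketch parses two lines copied from the concurrency 4 run and confirms that the components add up to the reported total. `parse_breakdown` is a helper written for this example, not part of perf_client.

```python
import re

def parse_breakdown(line):
    """Return (total_usec, {component: usec}) for a 'total usec (a + b + ...)' line."""
    head, _, tail = line.partition("(")
    total = int(re.search(r"(\d+) usec", head).group(1))
    parts = {name: int(usec) for name, usec in re.findall(r"(\w[\w/ ]*?) (\d+) usec", tail)}
    return total, parts

server_line = "Avg request latency: 18099 usec (overhead 9 usec + queue 13314 usec + compute 4776 usec)"
client_line = "Avg HTTP time: 19156 usec (send/recv 79 usec + response wait 19077 usec)"

for line in (server_line, client_line):
    total, parts = parse_breakdown(line)
    print(total, parts, "sum matches:", sum(parts.values()) == total)

server_total, server_parts = parse_breakdown(server_line)
queue_share = server_parts["queue"] / server_total
print(f"queue share of server latency: {queue_share:.0%}")
```

At concurrency 4, roughly three quarters of the server latency is queue time, while compute time is nearly unchanged from the concurrency 1 run.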

Visualizing Latency vs. Throughput

The perf_client provides the -f option to generate a file containing CSV output of the results:

$ perf_client -m resnet50_netdef --concurrency-range 1:4 -f perf.csv
$ cat perf.csv
Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency
1,160.8,68,1291,38,4801,7,6212,6289,6328,7407
3,207.8,70,1211,8346,4786,8,14379,14457,14536,15853
4,208.4,71,1014,13314,4776,8,19168,19218,19265,19583
2,209.2,67,1204,3511,4756,7,9545,9576,9588,9627
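Note that the rows in this sample are not ordered by concurrency (they appear to be ordered by throughput). If you would rather inspect the file in code than in a spreadsheet, a minimal sketch using only the Python standard library follows. The CSV text is embedded so the example is self-contained; in practice you would open perf.csv instead.

```python
import csv
import io

csv_text = """Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency
1,160.8,68,1291,38,4801,7,6212,6289,6328,7407
3,207.8,70,1211,8346,4786,8,14379,14457,14536,15853
4,208.4,71,1014,13314,4776,8,19168,19218,19265,19583
2,209.2,67,1204,3511,4756,7,9545,9576,9588,9627
"""

# Sort by concurrency so the rows read in the order the levels were requested.
rows = sorted(csv.DictReader(io.StringIO(csv_text)), key=lambda r: int(r["Concurrency"]))
best = max(rows, key=lambda r: float(r["Inferences/Second"]))

for row in rows:
    print(row["Concurrency"], row["Inferences/Second"], row["Server Queue"], row["p95 latency"])
print("highest throughput at concurrency", best["Concurrency"])
```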

You can import the CSV file into a spreadsheet to help visualize the latency vs. inferences/second tradeoff and to see some components of the latency. Follow these steps:

Open this spreadsheet
Make a copy from the File menu “Make a copy…”
Open the copy
Select the A1 cell on the “Raw Data” tab
From the File menu select “Import…”
Select “Upload” and upload the file
Select “Replace data at selected cell” and then select the “Import data” button

Input Data

Use the --help option to see complete documentation for all input data options. By default perf_client sends random data, shaped as each input requires, to all the inputs of your model. You can select a different input data mode with the --input-data option:

random: (default) Send random data for each input.
zero: Send zeros for each input.
directory path: A path to a directory containing a binary file for each input, named the same as the input. Each binary file must contain the data required for that input for a batch-1 request. Each file should contain the raw binary representation of the input in row-major order.
file path: A path to a JSON file containing data to be used with every inference request. See the “Real Input Data” section for further details. --input-data can be provided multiple times with different file paths to specify multiple JSON files.
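For the directory path mode, the input files can be produced with a few lines of Python. This is a sketch under stated assumptions: the input names INPUT0 and INPUT1, the INT32 type, the [4, 4] shape, and the little-endian byte order are all illustrative, so check them against your own model configuration. `write_raw_input` is a helper defined here, not a Triton API.

```python
import struct
import tempfile
from pathlib import Path

def write_raw_input(directory, input_name, flat_values, type_code="i"):
    """Write values as raw little-endian binary to <directory>/<input_name>."""
    path = Path(directory) / input_name
    path.write_bytes(struct.pack(f"<{len(flat_values)}{type_code}", *flat_values))
    return path

# Hypothetical INT32 tensor of shape [4, 4], listed row by row.
tensor = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
flat = [value for row in tensor for value in row]  # row-major flattening

data_dir = Path(tempfile.mkdtemp(prefix="perf_client_inputs_"))
for name in ("INPUT0", "INPUT1"):  # one file per input, named after the input
    write_raw_input(data_dir, name, flat)

# "mymodel" is a placeholder model name.
print(f"perf_client -m mymodel --input-data {data_dir}")
```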

For tensors with the STRING datatype there are additional options, --string-length and --string-data, that may be used in some cases (see --help for full documentation).

For models that support batching you can use the -b option to indicate the batch size of the requests that perf_client should send. For models with variable-sized inputs you must provide the --shape argument so that perf_client knows what shape tensors to use. For example, for a model that has an input called IMAGE with shape [ 3, N, M ], where N and M are variable-size dimensions, tell perf_client to send batch-size 4 requests of shape [ 3, 224, 224 ] as follows:

$ perf_client -m mymodel -b 4 --shape IMAGE:3,224,224

Real Input Data

The performance of some models is highly dependent on the data used. In such cases you can provide the data to be used with every inference request made by the client in a JSON file. The perf_client will cycle through the provided data in round-robin fashion when sending inference requests.

Each entry in the “data” array must specify all input tensors with the exact size expected by the model for a single batch. The following example describes data for a model with two inputs named INPUT0 and INPUT1, each with shape [4, 4] and data type INT32:

{
  "data" :
   [
      {
        "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      },
      {
        "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      },
      {
        "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      },
      {
        "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
      }
      .
      .
      .
    ]
}

Note that the [4, 4] tensors have been flattened in row-major order for the inputs.
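A file in this format can be generated rather than written by hand. The sketch below flattens nested lists in row-major order and builds the “data” document. The tensor values and the `flatten` helper are illustrative choices for this example.

```python
import json

def flatten(tensor):
    """Flatten a nested list in row-major order (the last index varies fastest)."""
    if not isinstance(tensor, list):
        return [tensor]
    return [value for item in tensor for value in flatten(item)]

# Two made-up [4, 4] INT32 tensors.
identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
ones = [[1] * 4 for _ in range(4)]

# One entry per request; every entry names every input of the model.
document = {"data": [{"INPUT0": flatten(identity), "INPUT1": flatten(ones)}]}
json_text = json.dumps(document, indent=2)
print(json_text)
```

Writing `json_text` to a file gives a path that can be passed to --input-data.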

Apart from specifying explicit tensors, users can also provide Base64-encoded binary data for the tensors. Each data object must list its data in row-major order. The following example shows how this can be achieved:

{
  "data" :
   [
      {
        "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="},
        "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="}
      },
      {
        "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="},
        "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="}
      },
      {
        "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="},
        "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="}
      },
      .
      .
      .
    ]
}
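The Base64 strings in the example above are placeholders: they decode to the text "base64 decoder", not to tensor data. The sketch below shows one way to produce a real payload, assuming the server expects the raw little-endian bytes of a [4, 4] INT32 tensor in row-major order. The byte order and the tensor values are assumptions.

```python
import base64
import json
import struct

# The placeholder string used in the example decodes to plain text.
placeholder = base64.b64decode("YmFzZTY0IGRlY29kZXI=")
print(placeholder)

# Encode sixteen INT32 values (a flattened [4, 4] tensor) as raw little-endian bytes.
values = [1] * 16
raw = struct.pack("<16i", *values)
encoded = base64.b64encode(raw).decode("ascii")

entry = {"INPUT0": {"b64": encoded}, "INPUT1": {"b64": encoded}}
document = {"data": [entry]}
print(json.dumps(document, indent=2))
```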

For sequence models, multiple data streams can be specified in the JSON file. Each sequence gets a data stream of its own, and the client ensures that the data from each stream is played back to the same correlation ID. The example below shows how to specify data for multiple streams for a sequence model with a single input named INPUT, shape [1] and data type STRING:

{
  "data" :
    [
      [
        {
          "INPUT" : ["1"]
        },
        {
          "INPUT" : ["2"]
        },
        {
          "INPUT" : ["3"]
        },
        {
          "INPUT" : ["4"]
        }
      ],
      [
        {
          "INPUT" : ["1"]
        },
        {
          "INPUT" : ["1"]
        },
        {
          "INPUT" : ["1"]
        }
      ],
      [
        {
          "INPUT" : ["1"]
        },
        {
          "INPUT" : ["1"]
        }
      ]
    ]
}

The above example describes three data streams with lengths 4, 3 and 2 respectively. The perf_client will therefore produce sequences of length 4, 3 and 2 in this case.
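A multi-stream file has one more level of nesting than the earlier examples: “data” holds a list of streams, and each stream is a list of per-request entries. A minimal sketch that builds such a document and reports the stream lengths follows. The string values are illustrative and differ from those in the example above.

```python
import json

stream_lengths = [4, 3, 2]

# Each stream is a list of steps; each step supplies the single STRING input "INPUT".
streams = [
    [{"INPUT": [str(step + 1)]} for step in range(length)]
    for length in stream_lengths
]
document = {"data": streams}

print(json.dumps(document["data"][2]))
print("sequence lengths:", [len(stream) for stream in document["data"]])
```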

Users can also provide an optional “shape” field for the tensors. This is especially useful when profiling models that take variable-sized tensors as input. The specified shape values are treated as an override, and the client still expects default input shapes to be provided as a command line option (see --shape) for variable-sized inputs. If the “shape” field is absent, the provided defaults are used. Below is an example JSON file for a model with a single input “INPUT”, shape [-1,-1] and data type INT32:

{
  "data" :
   [
      {
        "INPUT" :
              {
                  "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  "shape": [2,8]
              }
      },
      {
        "INPUT" :
              {
                  "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  "shape": [8,2]
              }
      },
      {
        "INPUT" :
              {
                  "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
              }
      },
      {
        "INPUT" :
              {
                  "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  "shape": [4,4]
              }
      }
      .
      .
      .
    ]
}
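Before running a long profile it can help to sanity-check such a file yourself. The sketch below resolves the shape for each entry (the “shape” override if present, otherwise the default) and verifies that the number of elements in “content” matches. The default shape [1, 16] stands in for whatever you would pass via --shape, and `resolve_shape` is a helper written for this example. Whether perf_client performs the same check is not stated in this document.

```python
from math import prod

def resolve_shape(tensor, default_shape):
    """Pick the per-entry shape override or the default, and check the element count."""
    shape = tensor.get("shape", default_shape)
    if prod(shape) != len(tensor["content"]):
        raise ValueError(f"shape {shape} does not match {len(tensor['content'])} elements")
    return shape

# The four INPUT entries from the example above.
tensors = [
    {"content": [1] * 16, "shape": [2, 8]},
    {"content": [1] * 16, "shape": [8, 2]},
    {"content": [1] * 16},  # no override: falls back to the default
    {"content": [1] * 16, "shape": [4, 4]},
]
default_shape = [1, 16]  # hypothetical default, as if given by --shape INPUT:1,16

resolved = [resolve_shape(t, default_shape) for t in tensors]
print(resolved)
```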

Shared Memory

By default perf_client sends input tensor data and receives output tensor data over the network. You can instead instruct perf_client to use system shared memory or CUDA shared memory to communicate tensor data. These options let you model the performance you could achieve by using shared memory in your own application. Use --shared-memory=system to use system (CPU) shared memory or --shared-memory=cuda to use CUDA shared memory.

Communication Protocol

By default perf_client uses HTTP to communicate with the inference server. The GRPC protocol can be specified with the -i option. If GRPC is selected, the --streaming option can also be specified to use GRPC streaming.
