SV: Yikes! Why is My SystemVerilog Still So Slooooow

 

0. Introduction

These are notes on Cummings' DVCon 2019 paper "Yikes! Why is My SystemVerilog Still So Slooooow", which is mainly about how SystemVerilog coding choices affect simulation speed.

 

1. SystemVerilog Semantics

1.1 The logic type has two semantics

Introduced in SystemVerilog, the logic type can have either wire or variable storage, and that storage type is determined from context by the simulator if it is not explicitly declared. This matters to simulation because wires can be collapsed to be the same object for higher simulation speed, whereas variables cannot. The semantic for the logic type is to default to variable storage in all cases except for the inputs or inouts of a design unit.

Wire storage is implicit on an input port, so the wire declaration is not needed there.
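A minimal sketch of these port semantics, assuming the default net type is left as wire. The module and signal names are hypothetical and do not come from the paper.

```systemverilog
// Hypothetical module showing how `logic` ports resolve to nets or variables.
module logic_kinds_demo (
    input  logic       clk,     // no port kind given: input defaults to a net (wire)
    input  var logic   en_var,  // explicit `var`: variable storage on an input
    output logic [7:0] count    // output with a data type defaults to a variable
);
    logic [7:0] next_count;     // internal `logic` declarations are variables

    // A variable may have a single continuous driver.
    assign next_count = count + 8'd1;

    always_ff @(posedge clk)
        if (en_var) count <= next_count;
endmodule
```

Writing `input var logic` is only needed when variable storage on an input is actually wanted; leaving it off keeps the port a net that the simulator is free to collapse.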

 

1.2 Vector operations are faster than bit operations

Another important semantic is that the simulator will typically operate faster on a full vector than individual bits.
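As an illustration, the two always_comb blocks below compute the same result, one bit at a time and one as a whole vector. The module and signal names are hypothetical, and how much a given simulator optimizes the loop form is tool-dependent.

```systemverilog
// Hypothetical module contrasting bit-wise and whole-vector coding styles.
module vector_vs_bits_demo (
    input  logic [31:0] a, b,
    output logic [31:0] y_bits,
    output logic [31:0] y_vec
);
    // Bit at a time: 32 single-bit AND operations per evaluation.
    always_comb
        for (int i = 0; i < 32; i++)
            y_bits[i] = a[i] & b[i];

    // Whole vector: one 32-bit AND with the same result.
    always_comb
        y_vec = a & b;
endmodule
```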

 

1.3 Pass by reference (ref) versus pass by value

Just keep in mind that all parameters in an argument list that follow the ref construct will pass by reference unless you explicitly use input, output or inout.
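The sticky behavior of ref can be sketched as follows. The function and variable names are hypothetical; the function is declared automatic because ref arguments are not allowed on subroutines with static lifetime.

```systemverilog
// Hypothetical example of how `ref` carries over to later arguments.
module ref_args_demo;
    int values[] = '{1, 2, 3};
    int backup[] = new[3];

    // `data` is ref. `scratch` gives no direction, so it inherits ref.
    // `scale` states `input` explicitly, so it is passed by value.
    function automatic void scale_all(ref int data[], int scratch[],
                                      input int scale);
        foreach (data[i]) begin
            scratch[i] = data[i];          // writes through to the caller's array
            data[i]    = data[i] * scale;  // also modifies the caller's array
        end
    endfunction

    initial begin
        scale_all(values, backup, 10);
        $display("values=%p backup=%p", values, backup);
    end
endmodule
```

If `scratch` were meant to be a by-value copy, it would need an explicit `input` in front of it.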

 

1.4 Static arrays are faster than dynamic arrays

The semantics of dynamic data structures (QDAs) are also sources of common performance issues, which is generally true of SystemVerilog and of most languages that have these types. An easy one to recognize is to use static arrays instead of dynamic arrays wherever possible (dynamic arrays carry a memory footprint and garbage collection time).

Dynamic arrays are best for look-up and random insertion/deletion operations, and queues are best for front or back operations with automatic resizing.
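A small sketch of the three storage kinds side by side. The names and the DEPTH value are hypothetical and only illustrate where run-time allocation happens.

```systemverilog
// Hypothetical example comparing fixed-size, dynamic, and queue storage.
module array_kinds_demo;
    localparam int DEPTH = 16;  // assumed size, known at compile time

    int fixed_buf [DEPTH];      // static (fixed-size) array: no run-time allocation
    int dyn_buf   [];           // dynamic array: sized at run time with new[]
    int fifo      [$];          // queue: resizes automatically at front or back

    initial begin
        foreach (fixed_buf[i]) fixed_buf[i] = i;

        dyn_buf = new[DEPTH];   // run-time allocation happens here
        foreach (dyn_buf[i]) dyn_buf[i] = i;

        fifo.push_back(1);
        fifo.push_back(2);
        void'(fifo.pop_front());
    end
endmodule
```

When the size is known up front, as with DEPTH here, the fixed-size form avoids the allocation entirely.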

 

2. Memory and Garbage Collection – Neither are Free

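2.1 Creating and freeing objects costs time

  1. Pass dynamic objects or deep-copied objects by reference (ref).

  2. Decide from the design's needs whether one object is enough or whether a new object really has to be created on every loop iteration.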

// Calls new() on every loop iteration
task run();
    forever begin
        md = new();
        @(negedge vif.clk);
        ……
    end
endtask

// Calls new() only once, before the loop
task run();
    md = new();
    forever begin
        @(negedge vif.clk);
        ……
    end
endtask

 

2.2 Reduce the simulation cost of classes

  1. If all you need is a container, use a struct instead of a class.

    Wherever possible, structs should be used instead, either inside the class or instead of the class. For example, if the main purpose of the class is to be a container of heterogeneous data types, then a struct is a better choice.

    A separate class will require heap management and potentially engage garbage collection, but the simple struct will not.

  2. Put interface-heavy functions in the interface.

    Putting interface-heavy functionality into the interface rather than in classes is also more simulation efficient, with the added benefit of being more reusable.
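For the first point, the difference between a class container and a struct container might look like the sketch below. All type and field names are hypothetical.

```systemverilog
// Hypothetical container types: the same fields as a class and as a struct.
package container_demo_pkg;
    // Class container: every instance is a heap object created with new().
    class pkt_info_c;
        bit [7:0] id;
        int       length;
        string    tag;
    endclass

    // Struct container: a plain aggregate with no handle and no constructor.
    typedef struct {
        bit [7:0] id;
        int       length;
        string    tag;
    } pkt_info_s;
endpackage

module container_demo;
    import container_demo_pkg::*;

    pkt_info_s info_s;
    pkt_info_c info_c;

    initial begin
        // Struct: filled in place with an assignment pattern.
        info_s = '{id: 8'h01, length: 64, tag: "hdr"};

        // Class: needs a heap allocation before any field can be used.
        info_c        = new();
        info_c.id     = 8'h01;
        info_c.length = 64;
        info_c.tag    = "hdr";
    end
endmodule
```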

 

3. Leave Sleeping Processes to Lie

A very common process in SystemVerilog is the always block with a single sensitive signal, such as the clock. This static process is highly optimized in all simulators, but side effects from dynamic tasks or functions such as DPI (or any external) functions, virtual class tasks/functions, and virtual interface tasks/functions may disable the optimization.

By moving the DPI call inside the conditional, the simulator might optimize the process wake up to posedge clk and txactive reducing the number of times the process executes.

// BENM9A: the DPI call runs on every posedge clk
import "DPI-C" function void dpi_tic(logic active, int count);
module BENM9A (input logic txactive, clk);
    int counter; // default value is 0
    initial $display("%m");
    always_ff @(posedge clk) begin
        // Unconditional DPI call: the process must wake up on every clock edge
        dpi_tic(txactive, counter);
        if (txactive)
            counter <= counter+1;
    end
endmodule

// BENM9B: the DPI call is moved inside the condition
import "DPI-C" function void dpi_tic(logic active, int count);
module BENM9B (input logic txactive, clk);
    int counter; // default value is 0
    initial $display("%m");
    always_ff @(posedge clk)
        if (txactive) begin
            // DPI call only matters when txactive is set, so it lives inside the condition
            dpi_tic(txactive, counter);
            counter <= counter+1;
        end
endmodule

4. UVM Best Practices

4.1 Make string processing conditional

Only process strings when they actually need to be processed.

The cost of unconditional array string processing, even when the processed string was not printed, was huge, exacting a penalty of 3,000-10,000 times slower than conditional string processing.

// Unconditional string processing
function void get_data();
    string memlayout;
    // Format the memory layout into a string
    memlayout = " {\n";
    foreach(mem[i])
        memlayout = $sformatf("%s mem[%0d]:%8h",memlayout, i, mem[i]);
    memlayout = {memlayout, " }\n"};
    `uvm_info("MEMDATA", memlayout, UVM_HIGH)
endfunction

// Conditional string processing
function void get_data();
    string memlayout;
    `ifdef FAST
    // Only do expensive string processing for >= UVM_HIGH verbosity
    if(uvm_report_enabled(UVM_HIGH, UVM_INFO, "MEMDATA")) begin
    `endif
    // Format the memory layout into a string
    memlayout = " {\n";
    foreach(mem[i])
        memlayout = $sformatf("%s mem[%0d]:%8h",memlayout, i, mem[i]);
    memlayout = {memlayout, " }\n"};
    `ifdef FAST
    end
    `endif
    `uvm_info("MEMDATA", memlayout, UVM_HIGH)
endfunction

4.2 Reduce TLM analysis port activity

Turning off unused analysis port path sampling and broadcasting can significantly improve simulation performance.

// Unconditionally broadcast UVM analysis port transactions
task run_phase(uvm_phase phase);
    forever collect();
endtask

// Conditionally broadcast UVM analysis port transactions
task run_phase(uvm_phase phase);
    if(ap.size()) forever collect();
endtask

task collect();
    trans1 tr = trans1::type_id::create("tr");
    get_txn_from_interface(tr);
    ap.write(tr);
endtask

5. Verification Best Practices

Performance improvements related to randomization, assertions, and coverage collection.

5.1 Reduce the randomization search space

The loop sets up a constraint on each array element based on its neighbor, resulting in a list of 16-256 (randomized) integers with 32-bit variables that have to be solved simultaneously. Modifying the code to use post_randomize() and an array sort() method can improve runtime performance up to 1000x.

// The solver's search space is very large.
class txn15;
    rand int addr;
    rand logic [15:0] payload[$];
    rand bit [2:0] del;
    constraint size_ct { payload.size() inside { [16:256]}; }
    constraint sort_ct {
        foreach (payload[i]) {
        // i must be greater than 0
        if(i) payload[i] >= payload[i-1];
        }
    }
endclass

// Sorting in post_randomize() with sort() greatly reduces the time spent in randomization.
class txn15;
    rand int addr;            
    rand logic [15:0] payload[$];
    rand bit [2:0] del;
    constraint size_ct { payload.size() inside { [16:256]}; }
    function void post_randomize();
        payload.sort();
    endfunction
endclass 
           

5.2 Assertions

Using single-cycle assertions wherever possible, and using single-clock assertions (even if that means splitting the assertion into two separate assertions), all result in improved performance. While local variables may be needed to manipulate data inside sequences and properties, they add overhead during simulation.
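A minimal sketch of the style described above: two separate single-cycle, single-clock assertions with no local variables. The FIFO-style signal names are hypothetical and not taken from the paper.

```systemverilog
// Hypothetical checker using only single-cycle, single-clock assertions.
module assertion_style_demo (
    input logic clk, rst_n, wr_en, full, rd_en, empty
);
    // Each attempt completes within the clock tick where it starts,
    // so no evaluation threads stay alive across cycles.
    a_no_write_when_full: assert property (
        @(posedge clk) disable iff (!rst_n) !(wr_en && full));

    a_no_read_when_empty: assert property (
        @(posedge clk) disable iff (!rst_n) !(rd_en && empty));
endmodule
```

Keeping the two checks separate also means a failure message points directly at the condition that was violated.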

 

5.3 Coverage

Fewer coverage events will deliver faster simulation.

Coverage sampling events can be further reduced by having covergroups share common expressions.

A third method to reduce sampling events is to merge sampling processes that use the same event.

`ifdef MERGED
// Sampling merged to a single event
always @(posedge valid iff collect_cov) begin
    c1.sample();
    c2.sample();
    c3.sample();
end
`else
always @(posedge valid iff collect_cov)
    c1.sample();
always @(posedge valid iff collect_cov)
    c2.sample();
always @(posedge valid iff collect_cov)
    c3.sample();
`endif

 
