Repost: Designing A Skid Buffer

from FPGA Resources by GateForge Consulting Ltd.

Networks-on-Chip (NoC) are very common, and have as a fundamental building block a point-to-point connection with a handshaking mechanism so each end can signal if they have data to send, or if they are able to receive data. When both ends agree, a data transfer occurs. The canonical modern example is the valid/ready handshake in the AXI protocol.

Aside: as specifications go, the AXI4 spec is worth your time. It gets complicated, but that's because it aims to be a flexible, all-purpose interface. However, the basics are quite clear and of broader relevance to NoC design.

Pipelining handshaking interfaces, however, is more complicated: simply adding a pipeline register to the valid, ready, and data lines will work, but now each burst of transfers takes two cycles to start, and two cycles to stop. This isn't bad in terms of bandwidth if you have block transfers to do, but now each receiving end has to be aware of how many pipeline stages are in the connection, and have sufficient buffering to absorb the data that keeps arriving after it signals it is no longer ready to receive more data.
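As a minimal sketch of that naive approach, the stage below simply registers all three groups of lines. The module name, port names, and default width are my own assumptions for illustration and are not part of the original design.

```verilog
`default_nettype none

// Hypothetical module, for illustration only: every handshake line gets
// its own register, so backpressure reaches the sender one cycle late.
module naive_pipeline_stage
#(
    parameter WORD_WIDTH = 8
)
(
    input   wire                        clock,

    // Slave interface
    input   wire                        s_valid,
    output  reg                         s_ready,
    input   wire    [WORD_WIDTH-1:0]    s_data,

    // Master interface
    output  reg                         m_valid,
    input   wire                        m_ready,
    output  reg     [WORD_WIDTH-1:0]    m_data
);

    initial begin
        s_ready = 1'b0;
        m_valid = 1'b0;
        m_data  = {WORD_WIDTH{1'b0}};
    end

    always @(posedge clock) begin
        m_valid <= s_valid; // forward path, one cycle late
        m_data  <= s_data;
        s_ready <= m_ready; // backward path, one cycle late
    end

endmodule
```

Note that nothing in this stage stores a word while the receiver stalls: when m_ready drops, the sender only sees it a cycle later, so the word already in flight has to be absorbed by buffering at the receiving end, which is exactly the burden described above.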

This is the basis of credit-based connections (which I'm not getting into here), which maximize bandwidth over long pipelines. They are overkill if you simply need to add a single pipeline stage between two ends, without having to modify them, either to meet timing or to allow each end to send off one item of data without having to wait for a response (thus overlapping communication and computation, which is desirable).

Figuring Out The Requirements

To begin designing this single pipeline stage, let's imagine a single unit which can perform a valid/ready handshake and receive an incoming item of data, then perform the same handshake with the other end to send the data. The receiving side is called (in AXI terminology) the slave interface, and the sending side is the master interface. (This is for a write transfer. The handshakes are reversed for a read transfer. See the AXI spec for details.)

Ideally, the slave and master interfaces operate concurrently for maximum bandwidth: in the same clock cycle, a new data item is received on the slave interface and put into a register, and that same register is simultaneously read out by the master interface. However, if the master interface is not transferring data on a given cycle, the slave interface must not transfer data during that cycle either, else we will overwrite the data register before it is read out. To avoid this problem, the slave interface should declare itself not ready in the same cycle as the master interface declaring itself not ready. But this forms a direct combinational connection between them, not a pipelined one. If we could connect both interfaces directly, and not affect timing or concurrency, we wouldn't need pipelining in the first place!
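To make the contradiction concrete, here is a rough sketch of such a single-register stage. The module name, ports, and default width are assumptions of mine, not part of the original design. The data and valid paths are registered, but s_ready has to be computed combinationally from m_ready:

```verilog
`default_nettype none

// Hypothetical single-register stage, for illustration only.
// Data and valid are registered, but s_ready depends directly
// (combinationally) on m_ready.
module single_register_stage
#(
    parameter WORD_WIDTH = 8
)
(
    input   wire                        clock,

    // Slave interface
    input   wire                        s_valid,
    output  reg                         s_ready,
    input   wire    [WORD_WIDTH-1:0]    s_data,

    // Master interface
    output  reg                         m_valid,
    input   wire                        m_ready,
    output  reg     [WORD_WIDTH-1:0]    m_data
);

    initial begin
        s_ready = 1'b1; // empty at start, so accept data
        m_valid = 1'b0;
        m_data  = {WORD_WIDTH{1'b0}};
    end

    // Accept new data when the register is empty, or when it is being
    // read out in this same cycle. This is the unregistered path.
    always @(*) begin
        s_ready = (m_valid == 1'b0) || (m_ready == 1'b1);
    end

    // Load (or empty) the register only when the slave side may transfer.
    always @(posedge clock) begin
        if (s_ready == 1'b1) begin
            m_valid <= s_valid;
            m_data  <= s_data;
        end
    end

endmodule
```

Chain several of these and the m_ready to s_ready path runs through every stage without a register in between, which is the timing problem the pipeline stage was supposed to solve.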

To resolve this contradiction, we need an extra buffer register to capture the incoming data during a clock cycle where the slave interface is transferring data, but the master interface isn't, and there is already data in the main register. Then, in the next cycle, the slave interface can signal it is no longer ready, and no data gets lost. We can imagine this extra buffer register as allowing the slave interface to "skid" to a stop, rather than stopping immediately, which we'd previously found contradicts our pipelining requirements.

Datapath Implementation

A good way to add this buffer register is to selectively feed the main register with data from either the incoming data stream, or the buffer register. This layout gives a neat registered output, which the CAD tools can then retime as necessary with any downstream logic, and forms the datapath of what will become our skid buffer. The Verilog implementation is straightforward.

`default_nettype none

module skid_buffer_datapath
#(
    parameter WORD_WIDTH = 0
)
(
    input   wire                        clock,

    // Data
    input   wire    [WORD_WIDTH-1:0]    data_in,
    output  reg     [WORD_WIDTH-1:0]    data_out,

    // Control
    input   wire                        data_out_wren,
    input   wire                        data_buffer_wren,
    input   wire                        use_buffered_data    
);

// --------------------------------------------------------------------------

    localparam WORD_ZERO = {WORD_WIDTH{1'b0}};

    initial begin
        data_out = WORD_ZERO;
    end

// --------------------------------------------------------------------------

    reg [WORD_WIDTH-1:0] data_buffer    = WORD_ZERO;
    reg [WORD_WIDTH-1:0] selected_data  = WORD_ZERO;

    always @(*) begin
        selected_data = (use_buffered_data == 1'b1) ? data_buffer : data_in;
    end

    always @(posedge clock) begin
        data_buffer <= (data_buffer_wren == 1'b1) ? data_in       : data_buffer;
        data_out    <= (data_out_wren    == 1'b1) ? selected_data : data_out;
    end

endmodule

Controlling The Datapath

To operate our datapath as a skid buffer, we need to understand which states we want to allow it to be in, and which state transitions we also allow. This skid buffer has three states:

It is Empty.
It is Busy, holding one item of data in the main register, either waiting or actively transferring data through that register.
It is Full, holding data in both registers, and stopped until the main register is emptied and simultaneously refilled from the buffer register, so no data is lost or reordered. (Without an available empty register, the slave interface cannot skid to a stop, so it must signal it is not ready.)

The operations which transition between these states are:

the slave interface inserting a data item into the datapath (+)
the master interface removing a data item from the datapath (-)
both interfaces inserting and removing at the same time (+-)

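We can see from the resulting state diagram that when the datapath is empty, it can only support an insertion, and when it is full, it can only support a removal. The diagram has five edges: Empty to Busy on (+), Busy to Busy on (+-), Busy to Full on (+), Full to Busy on (-), and Busy to Empty on (-). If the interfaces try to remove while Empty, or insert while Full, data will be duplicated or lost, respectively.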

Controlpath Implementation

This simple FSM description helped us clarify the problem, but it also glossed over the potential complexity of the implementation: 3 states, each affected by 2 signals (valid/ready) on each of the 2 interfaces. Those 4 signals give 2^4 = 16 input combinations, for a total of 16 possible transitions out of each state, or 48 possible state transitions total.

We don't want to have to manually enumerate all the transitions to then coalesce the equivalent ones and rule out all the impossible or illegal ones. Instead, if we express in logic the constraints on removals and insertions we determined from the state diagram, and the possible transformations on the datapath, we then get the state transition logic and datapath control signal logic almost for free.

I'll list the code in chunks here, with explanations in between.

First, the module and port definitions, and the initial values for the outputs, which match those of an Empty datapath:

`default_nettype none

module skid_buffer_fsm
// No parameters
(
    input   wire    clock,

    // Slave interface
    input   wire    s_valid,
    output  reg     s_ready,

    // Master Interface
    output  reg     m_valid,
    input   wire    m_ready,

    // Control to Datapath
    output  reg     data_out_wren,
    output  reg     data_buffer_wren,
    output  reg     use_buffered_data    
);

// --------------------------------------------------------------------------

    initial begin
        s_ready             = 1'b1; // empty at start, so accept data
        m_valid             = 1'b0;
        data_out_wren       = 1'b1; // empty at start, so accept data
        data_buffer_wren    = 1'b0;
        use_buffered_data   = 1'b0;
    end

Then, let's describe the possible states of the datapath, and initialize the state variable. This code describes a binary state encoding, but the CAD tool can re-encode and re-number the state encoding. Usually this is beneficial, but if the states+inputs fit in a single LUT, forcing binary encoding reduces area. See what works best (i.e.: reaches the highest speed) for your given FPGA.

    localparam STATE_BITS = 2;

    localparam [STATE_BITS-1:0] EMPTY = 'd0; // Output and buffer registers empty
    localparam [STATE_BITS-1:0] BUSY  = 'd1; // Output register holds data
    localparam [STATE_BITS-1:0] FULL  = 'd2; // Both output and buffer registers hold data
    // There is no case where only the buffer register would hold data.

    // No handling of erroneous and unreachable state 3.
    // We could check and raise an error flag.

    reg [STATE_BITS-1:0] state      = EMPTY;
    reg [STATE_BITS-1:0] state_next = EMPTY;
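The comments above mention that the unreachable state 3 could be checked and flagged. One possible way to do that is sketched below as a separate little module with a sticky error output. The module name and the state_error port are hypothetical and not part of the original design. It assumes the same binary encoding as above, with FULL as the highest legal state.

```verilog
`default_nettype none

// Hypothetical checker, for illustration only. It raises a sticky flag
// if the state variable ever holds a value above FULL (e.g. state 3).
module skid_buffer_state_check
#(
    parameter STATE_BITS = 2
)
(
    input   wire                        clock,
    input   wire    [STATE_BITS-1:0]    state,
    output  reg                         state_error
);

    // Assumed to match the encoding used in skid_buffer_fsm.
    localparam [STATE_BITS-1:0] FULL = 'd2;

    initial begin
        state_error = 1'b0;
    end

    // Once set, the flag stays set, so a one-cycle glitch is not missed.
    always @(posedge clock) begin
        state_error <= (state > FULL) || (state_error == 1'b1);
    end

endmodule
```

If the CAD tool re-encodes the state machine, a check like this may be optimized differently or lose its meaning, so treat it as a simulation aid first.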

Now, let's express the constraints we figured out from the state diagram:

The slave interface can only insert when the datapath is not full.
The master interface can only remove data when the datapath is not empty.

We do this by computing the allowable output ready/valid handshake signals based on the datapath state. We use state_next so we can have a nice registered output. This little bit of code prunes away a large number of invalid state transitions. If some other logic seems to be missing, first see if this code has made it unnecessary.

    always @(posedge clock) begin
        s_ready <= (state_next != FULL);
        m_valid <= (state_next != EMPTY); 
    end

After, let's describe the interface signal conditions which implement our two basic operations on the datapath: insert and remove. This also weeds out a number of possible state transitions.

    reg insert = 1'b0;
    reg remove = 1'b0;

    always @(*) begin
        insert = (s_valid == 1'b1) && (s_ready == 1'b1);
        remove = (m_valid == 1'b1) && (m_ready == 1'b1);
    end

Now that we have our datapath states and operations, let's use them to describe the possible transformations to the datapath, and in which state they can happen. You'll see that these exactly describe each of the 5 edges in the state diagram, and since we've pruned the space of possible interface conditions, we only need the minimum logic to describe them, and this logic gets re-used a lot later on, simplifying the code.

    reg load    = 1'b0; // Empty datapath inserts data into output register.
    reg flow    = 1'b0; // New inserted data into output register as the old data is removed.
    reg fill    = 1'b0; // New inserted data into buffer register. Data not removed from output register.
    reg flush   = 1'b0; // Move data from buffer register into output register. Remove old data. No new data inserted.
    reg unload  = 1'b0; // Remove data from output register, leaving the datapath empty.

    always @(*) begin
        load    = (state == EMPTY) && (insert == 1'b1);
        flow    = (state == BUSY)  && (insert == 1'b1) && (remove == 1'b1);
        fill    = (state == BUSY)  && (insert == 1'b1) && (remove == 1'b0);
        flush   = (state == FULL)  && (insert == 1'b0) && (remove == 1'b1);
        unload  = (state == BUSY)  && (insert == 1'b0) && (remove == 1'b1);
    end

And now we simply need to calculate the next state after each datapath transformation:

    always @(*) begin
        state_next = (load   == 1'b1) ? BUSY  : state;
        state_next = (flow   == 1'b1) ? BUSY  : state_next;
        state_next = (fill   == 1'b1) ? FULL  : state_next;
        state_next = (flush  == 1'b1) ? BUSY  : state_next;
        state_next = (unload == 1'b1) ? EMPTY : state_next;
    end

    always @(posedge clock) begin
        state <= state_next;
    end

Similarly, from the datapath transformations, we can compute the necessary control signals to the datapath. These are not registered here, as they end at registers in the datapath.

    always @(*) begin
        data_out_wren     = (load  == 1'b1) || (flow == 1'b1) || (flush == 1'b1);
        data_buffer_wren  = (fill  == 1'b1);
        use_buffered_data = (flush == 1'b1);
    end

endmodule

And finally, we glue the datapath and FSM together into the skid buffer module proper:

`default_nettype none

module skid_buffer
#(
    parameter WORD_WIDTH = 0
)
(
    input   wire                        clock,

    // Slave interface
    input   wire                        s_valid,
    output  wire                        s_ready,
    input   wire    [WORD_WIDTH-1:0]    s_data,

    // Master interface
    output  wire                        m_valid,
    input   wire                        m_ready,
    output  wire    [WORD_WIDTH-1:0]    m_data
);

// --------------------------------------------------------------------------
// The FSM handles the master and slave port handshakes, and provides the
// datapath control signals.

    wire data_out_wren;
    wire data_buffer_wren;
    wire use_buffered_data;

    skid_buffer_fsm
    // No parameters
    controlpath
    (
        .clock              (clock),

        .s_valid            (s_valid),
        .s_ready            (s_ready),

        .m_valid            (m_valid),
        .m_ready            (m_ready),

        .data_out_wren      (data_out_wren),
        .data_buffer_wren   (data_buffer_wren),
        .use_buffered_data  (use_buffered_data)
    );

// --------------------------------------------------------------------------
// The datapath stores and steers the data.

    skid_buffer_datapath
    #(
        .WORD_WIDTH         (WORD_WIDTH)
    )
    datapath
    (
        .clock              (clock),

        .data_in            (s_data),
        .data_out           (m_data),

        .data_out_wren      (data_out_wren), 
        .data_buffer_wren   (data_buffer_wren),
        .use_buffered_data  (use_buffered_data)
    );

endmodule
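To gain some confidence that no data is lost or reordered, a small self-checking testbench helps. The one below is only a sketch and not part of the original article: the testbench name, word count, and timeout are my own choices, and it assumes the three modules listed above are compiled alongside it as complete modules. It sends an incrementing count through the skid buffer with random stalls on both sides and checks that the words come out in order.

```verilog
`default_nettype none
`timescale 1ns / 1ps

// Hypothetical self-checking testbench, for illustration only.
module skid_buffer_tb;

    localparam WORD_WIDTH = 8;
    localparam WORD_COUNT = 200; // fewer than 2**WORD_WIDTH words

    reg                     clock   = 1'b0;
    reg                     s_valid = 1'b0;
    wire                    s_ready;
    reg  [WORD_WIDTH-1:0]   s_data  = {WORD_WIDTH{1'b0}};
    wire                    m_valid;
    reg                     m_ready = 1'b0;
    wire [WORD_WIDTH-1:0]   m_data;

    skid_buffer #(.WORD_WIDTH (WORD_WIDTH)) dut (
        .clock   (clock),
        .s_valid (s_valid), .s_ready (s_ready), .s_data (s_data),
        .m_valid (m_valid), .m_ready (m_ready), .m_data (m_data)
    );

    always #5 clock = ~clock;

    // Stimulus: incrementing data, random source stalls and backpressure.
    always @(posedge clock) begin
        // Move on to the next word only once the current one is accepted.
        if ((s_valid == 1'b1) && (s_ready == 1'b1)) begin
            s_data <= s_data + 1'b1;
        end
        // Hold s_valid high until the handshake completes.
        if ((s_valid == 1'b0) || (s_ready == 1'b1)) begin
            s_valid <= $random; // only the least-significant bit is kept
        end
        m_ready <= $random;     // only the least-significant bit is kept
    end

    // Checker: every word leaving the buffer must be the next in sequence.
    integer                 received = 0;
    integer                 errors   = 0;
    reg  [WORD_WIDTH-1:0]   expected = {WORD_WIDTH{1'b0}};

    always @(posedge clock) begin
        if ((m_valid == 1'b1) && (m_ready == 1'b1)) begin
            if (m_data !== expected) begin
                errors = errors + 1;
            end
            expected = expected + 1'b1;
            received = received + 1;
        end
    end

    initial begin
        wait (received == WORD_COUNT);
        if (errors == 0) $display("PASS: %0d words received in order", received);
        else             $display("FAIL: %0d mismatches", errors);
        $finish;
    end

    // Safety net in case the handshakes deadlock.
    initial begin
        #100000;
        $display("FAIL: timeout after %0d words", received);
        $finish;
    end

endmodule
```

Random stalls on both interfaces should exercise all five datapath transformations (load, flow, fill, flush, unload) within a few hundred cycles, though a coverage check would be needed to confirm that for a given seed.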

For a 64-bit connection, the resulting skid buffer uses 128 registers for the buffers, 4 to 9 registers (and associated LUTs) for the FSM and interface outputs, depending on the particular state encoding chosen by the CAD tool, and easily reaches a high operating speed.

fpgacpu.ca