HLS Coding Style: Functions and Loops

Unsupported C Constructs

Functions

The top-level function becomes the top level of the RTL design after synthesis.
Sub-functions are synthesized into blocks in the RTL design.

The primary impact of a coding style on functions is on the function arguments and
interface.

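If the arguments to a function are sized accurately, Vivado HLS can propagate this
information through the design. There is no need to create arbitrary precision types for
every variable. In the following example, two 32-bit integers are multiplied, but only
the bottom 24 bits of the result are returned.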

#include "ap_cint.h"

/* Inputs are plain 32-bit ints; only the return type is sized. */
int24 foo(int x, int y) {
    int tmp;
    tmp = (x * y);
    return tmp;
}

When this code is synthesized, the result is a 32-bit multiplier with the output truncated to
24 bits.
If the inputs are correctly sized to 12-bit types (int12), as shown in the following code
example, the final RTL uses a 24-bit multiplier.

#include "ap_cint.h"

typedef int12 din_t;
typedef int24 dout_t;

/* Inputs are sized to 12 bits, so the product needs at most 24 bits. */
dout_t func_sized(din_t x, din_t y) {
    int tmp;
    tmp = (x * y);
    return tmp;
}
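
The bit-width reasoning above can be checked in ordinary C without the HLS types. The following is only a sketch under stated assumptions: plain int32_t stands in for int12 and int24, and the helpers fits_signed and mul12 are hypothetical names introduced for illustration. The largest-magnitude product of two signed 12-bit values is (-2048) * (-2048) = 4194304, which is within the signed 24-bit range, so a 24-bit result loses nothing.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if v is representable as a signed two's-complement value
   of the given width (1 <= bits <= 32), otherwise 0. */
int fits_signed(int64_t v, int bits) {
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Multiplies two values that the caller promises are 12-bit signed.
   Assumption: int32_t stands in for the HLS int12/int24 types. */
int32_t mul12(int32_t x, int32_t y) {
    assert(fits_signed(x, 12) && fits_signed(y, 12));
    return x * y;
}
```

By contrast, the product of two arbitrary 32-bit inputs generally does not fit in 24 bits, which is why the first version needs the wider multiplier followed by truncation.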

Loops

RECOMMENDED: Avoid use of global variables for loop index variables, as this can inhibit some optimizations.
IMPORTANT: When a loop or function is pipelined, Vivado HLS unrolls all loops in the hierarchy below the function or loop. If there is a loop with variable bounds in this hierarchy, it prevents pipelining.
TIP: When a loop or function is pipelined, any loop in the hierarchy below the loop or function being pipelined must be unrolled.
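
A commonly used way to work around the variable-bound restriction mentioned in the notes above is to give the loop a fixed maximum bound and guard the body with a condition. The following is a minimal sketch, not a definitive implementation: the maximum trip count N, the types, and the function names sum_var and sum_fixed are assumptions made for illustration, and width is assumed to satisfy 0 <= width <= N.

```c
#define N 32          /* assumed maximum trip count */
typedef int din_t;    /* assumed stand-in types */
typedef int dout_t;

/* Variable bound: the trip count depends on the width argument,
   so the loop cannot be fully unrolled at compile time. */
dout_t sum_var(const din_t A[N], int width) {
    dout_t acc = 0;
    for (int x = 0; x < width; x++)
        acc += A[x];
    return acc;
}

/* Fixed bound: the loop always runs N times and a condition
   decides whether each iteration contributes to the sum. */
dout_t sum_fixed(const din_t A[N], int width) {
    dout_t acc = 0;
    for (int x = 0; x < N; x++) {
        if (x < width)
            acc += A[x];
    }
    return acc;
}
```

Both functions return the same value for any valid width; the second form trades a fixed, worst-case iteration count for a loop whose bound is known at compile time.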

Variable Loop Bounds

Loop Pipelining

When pipelining loops, the best balance between area and performance is typically
found by pipelining the innermost loop. This also results in the fastest run time for
the tool. The following code example demonstrates the trade-offs when pipelining loops
and functions.

#include "loop_pipeline.h"

dout_t loop_pipeline(din_t A[N]) {
    int i, j;
    static dout_t acc;

    LOOP_I: for (i = 0; i < 20; i++) {
        LOOP_J: for (j = 0; j < 20; j++) {
            acc += A[i] * j;
        }
    }
    return acc;
}

If the innermost loop (LOOP_J) is pipelined, there is one copy of LOOP_J in hardware (a
single multiplier). Vivado HLS automatically flattens the loops when possible, as in this
case, and effectively creates a new single loop of 20*20 iterations. Only 1 multiplier
operation and 1 array access need to be scheduled, then the loop iterations can be
scheduled as a single loop-body entity (20x20 loop iterations).
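
The effect of flattening can be illustrated in plain C. This is a sketch under stated assumptions rather than what the tool literally generates: N and the typedefs stand in for whatever loop_pipeline.h defines, acc is a local instead of a static so each call starts from zero, and acc_nested and acc_flat are hypothetical names. The hardware would typically keep two counters instead of computing a division and remainder.

```c
#define N 20            /* assumed: matches the loop bounds in the text */
typedef int din_t;      /* assumed stand-ins for the types that         */
typedef int dout_t;     /* loop_pipeline.h would normally define        */

/* Nested form, as in the example above. */
dout_t acc_nested(const din_t A[N]) {
    dout_t acc = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            acc += A[i] * j;
    return acc;
}

/* Flattened form: a single loop of N*N iterations. The original
   indices i and j are recovered from the single counter k. */
dout_t acc_flat(const din_t A[N]) {
    dout_t acc = 0;
    for (int k = 0; k < N * N; k++) {
        int i = k / N;
        int j = k % N;
        acc += A[i] * j;
    }
    return acc;
}
```

Both forms perform the same 400 multiply-accumulate steps in the same order, which is why a single multiplier and a single array access per iteration are enough.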

If the outer loop (LOOP_I) is pipelined, the inner loop (LOOP_J) is unrolled, creating 20
copies of the loop body: 20 multipliers and 20 array accesses must now be scheduled. Then
each iteration of LOOP_I can be scheduled as a single entity.
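
What unrolling the inner loop means can be shown by writing the copies out by hand. In this sketch the inner trip count is reduced from 20 to 4 purely to keep the listing short; N_OUTER, N_INNER, the typedefs, and the function names are assumptions for illustration, and acc is a local so each call starts from zero.

```c
#define N_OUTER 20   /* outer trip count from the example                 */
#define N_INNER 4    /* assumed: reduced from 20 to keep the sketch short */

typedef int din_t;   /* assumed stand-in types */
typedef int dout_t;

/* Rolled: the inner loop reuses one multiply over N_INNER iterations. */
dout_t acc_rolled(const din_t A[N_OUTER]) {
    dout_t acc = 0;
    for (int i = 0; i < N_OUTER; i++)
        for (int j = 0; j < N_INNER; j++)
            acc += A[i] * j;
    return acc;
}

/* Inner loop unrolled by hand: N_INNER independent multiplies per
   outer iteration, each of which could map to its own multiplier. */
dout_t acc_unrolled(const din_t A[N_OUTER]) {
    dout_t acc = 0;
    for (int i = 0; i < N_OUTER; i++) {
        dout_t p0 = A[i] * 0;
        dout_t p1 = A[i] * 1;
        dout_t p2 = A[i] * 2;
        dout_t p3 = A[i] * 3;
        acc += p0 + p1 + p2 + p3;
    }
    return acc;
}
```

The unrolled body has no loop-carried dependency between its multiplies, so they can all be scheduled in the same outer iteration at the cost of more hardware.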

If the top-level function is pipelined, both loops must be unrolled: 400 multipliers and 400
array accesses must now be scheduled. It is very unlikely that Vivado HLS will produce a
design with 400 multipliers, because in most designs data dependencies prevent maximal
parallelism. In this case, for example, even if a dual-port RAM is used for A[N], the
design can access only two values of A[N] in any clock cycle.

When choosing which level of the hierarchy to pipeline, keep in mind that pipelining the
innermost loop gives the smallest hardware with throughput that is generally acceptable
for most applications. Pipelining the upper levels of the hierarchy unrolls all sub-loops
and can create many more operations to schedule (which can increase tool run time and
memory use), but typically gives the highest-performance design in terms of throughput
and latency.

To summarize the above options:
• Pipeline LOOP_J
Latency is approximately 400 cycles (20x20) and requires less than 100 LUTs and
registers (the I/O control and FSM are always present).
• Pipeline LOOP_I
Latency is approximately 20 cycles but requires a few hundred LUTs and registers: about
20 times the logic of the first option, minus any logic optimizations that can be made.
• Pipeline function loop_pipeline
Latency is approximately 10 cycles (20 dual-port accesses) but requires thousands of LUTs
and registers (about 400 times the logic of the first option, minus any optimizations that
can be made).
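
The latency figures in the summary follow from simple arithmetic, sketched below. This is a first-order model only: it assumes an initiation interval of 1 and ignores pipeline fill and flush cycles, and the enum and the function approx_latency are hypothetical names introduced for illustration.

```c
/* Hypothetical names for the three places the pipeline directive can go. */
enum pipe_level { PIPE_LOOP_J, PIPE_LOOP_I, PIPE_FUNCTION };

/* First-order cycle estimate.
   n_outer, n_inner: loop trip counts.
   mem_ports: number of elements of A that can be read per cycle. */
int approx_latency(enum pipe_level level, int n_outer, int n_inner,
                   int mem_ports) {
    switch (level) {
    case PIPE_LOOP_J:
        /* Flattened loop: one iteration starts per cycle. */
        return n_outer * n_inner;
    case PIPE_LOOP_I:
        /* Inner loop unrolled: one outer iteration starts per cycle. */
        return n_outer;
    case PIPE_FUNCTION:
        /* Everything unrolled: limited by reading n_outer elements
           of A through mem_ports ports (rounded up). */
        return (n_outer + mem_ports - 1) / mem_ports;
    }
    return -1; /* not reached for valid enum values */
}
```

With trip counts of 20 and a dual-port RAM (mem_ports = 2), the three cases give 400, 20, and 10 cycles, matching the approximate figures listed above.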

Loop Parallelism

Loop Dependencies

Unrolling Loops in C++ Classes

Reference:

1. Xilinx UG902