Vivado HLS Study Notes 3: Loop Pipelining and Unrolling

Optimization techniques

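1 Optimization: loop unrolling

Apply Directive -> Unroll to a labeled loop.
The more completely a loop is unrolled (Directive -> Unroll -> complete), the more resources and memory ports it consumes and the higher the throughput it delivers, so this is a trade-off.
How far should a loop be unrolled? It can be unrolled fully (complete) or partially (factor: 8 creates 8 copies of the loop body).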

For practice, see Ch4 lab3 and lab4 of ug871.
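
The same directive can also be written as a pragma in the source instead of being set through the GUI. The following is a minimal sketch, not code from ug871: the function name dot_product, the length N and the label mac_loop are made up for illustration. A regular C++ compiler ignores the pragma, so the function still runs in C simulation.

```cpp
#include <cstdint>

const int N = 32;  // hypothetical array length, chosen for illustration

// Sum of element-wise products. The label gives the loop a name that
// a directive (or the report) can refer to.
int32_t dot_product(const int16_t a[N], const int16_t b[N]) {
    int32_t acc = 0;
mac_loop:
    for (int i = 0; i < N; i++) {
// Partial unroll: ask HLS for 8 copies of the loop body.
// Dropping "factor=8" requests a complete unroll.
#pragma HLS UNROLL factor=8
        acc += a[i] * b[i];
    }
    return acc;
}
```

With factor=8 the tool needs 8 elements of a and b per iteration of the unrolled loop, which is where the extra port demand mentioned above comes from.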

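2 Optimization: shift registers

If a shift register is written as a C array, it is synthesized as a memory by default.
To get individual registers instead, so that all elements can be accessed in the same cycle, apply Directive -> Array_Partition -> type: complete to the array (shift_reg).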
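
As a rough sketch of what this looks like in source form, the function below keeps a small shift register in a static array and partitions it completely. The function name shift_sum, the tap count TAPS and the moving-sum behavior are assumptions chosen for illustration; only the array name shift_reg comes from the note above.

```cpp
const int TAPS = 4;  // hypothetical shift register depth

// Returns the sum of the current sample and the previous TAPS-1 samples.
int shift_sum(int x) {
    static int shift_reg[TAPS] = {0, 0, 0, 0};
// Split the array into individual registers so that every element
// can be read and written in the same clock cycle.
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1
    int acc = 0;
shift_loop:
    for (int i = TAPS - 1; i > 0; i--) {
        shift_reg[i] = shift_reg[i - 1];  // move each sample one slot down
        acc += shift_reg[i];
    }
    shift_reg[0] = x;  // newest sample enters at index 0
    acc += x;
    return acc;
}
```

Without the partition directive, the array would map to a memory with a limited number of ports, and the shift could not happen in one cycle.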

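3 Optimization: loop pipelining

For a loop, choose the Pipeline directive. Leave II (Initiation Interval) blank to get the default target of II=1, and select enable loop rewinding, which tells HLS that the loop runs continuously, without gaps, once implemented as RTL.
Pipelining overlaps the operations of successive loop iterations.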
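
In pragma form, the settings described above could look like the sketch below. The function scale_add, the length LEN and the label scale_loop are hypothetical, and the sketch assumes scale_add is the top-level function, since rewinding is meant for a single loop directly inside the top-level function.

```cpp
const int LEN = 16;  // hypothetical frame length

// Applies a gain and an offset to every sample.
void scale_add(const int in[LEN], int out[LEN], int gain, int offset) {
scale_loop:
    for (int i = 0; i < LEN; i++) {
// II=1 asks for a new iteration every clock cycle; "rewind"
// corresponds to the "enable loop rewinding" option in the GUI.
#pragma HLS PIPELINE II=1 rewind
        out[i] = in[i] * gain + offset;
    }
}
```

II=1 is a target, not a guarantee: if the tool cannot meet it (for example because of memory port limits), the synthesis report shows the interval actually achieved.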

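4 Project file organization

top.cpp: the top-level module; it instantiates an object and calls a series of its methods.
top.h: uses macro definitions to configure the object in top.cpp.
class.h (.cpp): class definition and implementation; templates are recommended.
test.c: the testbench should include the input stimulus, the reference results, the test results, and a comparison that scores one against the other.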
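
The four testbench ingredients listed for test.c can be sketched as follows. Everything here is hypothetical: top stands in for the real design under test, and run_testbench holds what would normally be the body of main() in test.c. Vivado HLS treats a main() return value of 0 as a pass, so the mismatch count is a convenient value to return.

```cpp
#include <cstdio>

const int TB_LEN = 8;  // hypothetical number of test samples

// Hypothetical design under test: doubles every sample.
void top(const int in[TB_LEN], int out[TB_LEN]) {
    for (int i = 0; i < TB_LEN; i++) out[i] = 2 * in[i];
}

// Returns the number of mismatches; 0 means the test passed.
int run_testbench() {
    int stimulus[TB_LEN], expected[TB_LEN], actual[TB_LEN];

    // Input stimulus and reference results.
    for (int i = 0; i < TB_LEN; i++) {
        stimulus[i] = i - 3;
        expected[i] = 2 * (i - 3);
    }

    // Test results produced by the design.
    top(stimulus, actual);

    // Comparison and scoring.
    int errors = 0;
    for (int i = 0; i < TB_LEN; i++) {
        if (actual[i] != expected[i]) {
            std::printf("mismatch at %d: got %d, expected %d\n",
                        i, actual[i], expected[i]);
            errors++;
        }
    }
    std::printf("%s: %d mismatches\n", errors ? "FAIL" : "PASS", errors);
    return errors;
}
```

In a real project the reference results often come from a golden data file rather than being computed inside the testbench.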

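5 Loop merging

Loops with the same constant iteration count can be merged with loop merging if they can run in parallel (no data dependencies between them).
Loops whose iteration count is a variable can also be merged if the count is the same for all of them and they can run in parallel (no data dependencies between them).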
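
A minimal sketch of the constant-bound case, using the LOOP_MERGE pragma at function scope. The function add_and_sub, the length M and the loop labels are made up for illustration. The two loops have the same trip count and do not depend on each other's results, so they are candidates for merging.

```cpp
const int M = 8;  // hypothetical array length

void add_and_sub(const int a[M], const int b[M], int sum[M], int diff[M]) {
// Ask HLS to merge the consecutive loops in this function into one.
#pragma HLS LOOP_MERGE
add_loop:
    for (int i = 0; i < M; i++) {
        sum[i] = a[i] + b[i];
    }
sub_loop:
    for (int i = 0; i < M; i++) {
        diff[i] = a[i] - b[i];
    }
}
```

After merging, the design pays the loop entry and exit overhead once instead of twice, and both operations can be scheduled in the same iteration.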

ROM synthesis

1 To have an array synthesized as a MEMORY, define it with the static keyword and always initialize it with {}.

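The static keyword ensures that the array is initialized only once, rather than on every function call, so no time is spent re-initializing it at each call.
2 The const keyword: an array that is only ever read is synthesized as a ROM, for example `const coef_t coef[N] = {1,2,4,5,7,8};`
When the ROM contents are large, they can be pulled in with #include. Note that it only works in the following form (the #include "coef.h" statement must be on a line of its own)!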

const coef_t coef[N] = {
                        #include "coef.h"
                        };

In this case, the contents of "coef.h" are

1,
2,
3,
4  // note: the last line has no trailing comma
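
Items 1 and 2 can be combined in one self-contained sketch. The coefficient values are the ones from the example in item 2; the function name rom_lookup, the constant ROM_N and the typedef of coef_t to int are assumptions made so that the snippet compiles on its own.

```cpp
typedef int coef_t;   // assumed element type, for illustration only
const int ROM_N = 6;  // number of coefficients in the table

// Read-only lookup table: static so that it is initialized only once,
// const so that it can only be read, and initialized with {}.
coef_t rom_lookup(unsigned addr) {
    static const coef_t coef[ROM_N] = {1, 2, 4, 5, 7, 8};
    return coef[addr % ROM_N];  // wrap the address to stay in range
}
```

The table is never written and its contents are known at compile time, which is the situation described above for ROM inference.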

3 For more complex cases, refer to the following code

// This template function selects the coefficient equation specific to the
// window function type chosen at class object instantiation.
template<int SZ, win_fn_t FT>
double coef_calc(int i)
{
   double coef_val;

   switch(FT) {
   case RECT:
      coef_val = 1.0;
      break;
   case HANN:
      coef_val = 0.5 * (1.0 - cos(2.0 * M_PI * i / double(SZ)));
      break;
   case HAMMING:
      coef_val = 0.54 - 0.46 * cos(2.0 * M_PI * i / double(SZ));
      break;
   case GAUSSIAN: {
      // Braces give the local variables their own scope inside the switch.
      const double gaussian_sigma = 0.5;
      double x = (i - SZ / 2) / (gaussian_sigma * (SZ / 2));
      coef_val = exp(-0.5 * x * x);
      break;
   }
   }
   return coef_val;
}

// This template function is used to initialize the contents of the 
// coefficient table.  Currently, in order for the table to be mapped to
// a ROM it must be defined at file (global) scope, i.e. it cannot be
// a class method (member function).
template<class TC, int SZ, win_fn_t FT>
void init_coef_tab(TC *coeff)
{
   for (int i = 0; i < SZ; i++) {
      coeff[i] = coef_calc<SZ,FT>(i);
   }
}

template<class TI, class TO, class TC, int SZ, win_fn_t FT>
void window_fn<TI,TO,TC,SZ,FT>::apply(TO *outdata, TI *indata)
{
   TC coeff_tab[SZ];
   // To ensure coeff_tab is implemented as a ROM on the FPGA, it must
   // be initialized by a separate function. No hardware is synthesized
   // for ROM initialization.
   init_coef_tab<TC,SZ,FT>(coeff_tab);
winfn_loop:
   for (unsigned i = 0; i < SZ; i++) {
//#pragma AP PIPELINE // implemented as TCL directive in this example
      outdata[i] = coeff_tab[i] * indata[i];
   }
}