A First Look at CAPI and a Summary of Its Use (1)

 

Author's note:

  Given the limits of my ability and time, this article surely contains errors; corrections are welcome at yixiangrong@hotmail.com, and I look forward to the discussion. Most of the content is original, and copied material is credited to its source (please point out any omissions), so please cite http://www.cnblogs.com/e-shannon/ when reposting.

http://www.cnblogs.com/e-shannon/p/7495618.html

Related material: http://bbs.eetop.cn/thread-636542-1-1.html

Table of Contents

1 Preface
1.1 Purpose
1.2 References
1.3 Glossary of Terms
2 CAPI Overview
2.1 Background
2.1.1 Industry background
2.1.2 Technical background and open bus interfaces
2.2 Cache
2.2.1 A brief look at caches
2.2.2 Cache access methods
2.2.3 Cache mapping schemes and cache lines
2.3 Cache Coherency
2.4 The Power CPU cache coherency system
3 CAPI detailed structure and flow
3.1 CAPI hardware structure
3.1.1 CAPP
3.1.2 PSL
3.1.3 AFU
3.2 PSL acceleration interface
3.3 How CAPI works
3.3.1 CAPI flow
3.3.2 CAPI application flow
3.4 Building the CAPI simulation platform
3.4.1 Simulation principles and model
3.4.2 Simulation steps
3.5 CAPI advantages
3.5.1 Advantages over PCIe I/O acceleration
3.5.2 Advantages over CPU+GPU
3.5.3 Disadvantages
4 Open coherent acceleration interfaces
4.1 OpenCAPI
4.1.1 DL
4.1.2 TL (to be continued)
4.2 Comparison of OpenCAPI and CAPI
4.3 Self Q&A
4.4 Further reading (optional)

 


1 Preface

1.1 Purpose

   This article takes a first look at CAPI's acceleration principles, explains cache coherency, and compares the advantages and disadvantages of CAPI against ordinary PCIe acceleration devices. It partially summarizes the use of CAPI 1.0, briefly surveys the current state of CAPI and the relevant websites, and compares it with 2.0. It also gives a short introduction to the three new open high-speed coherent CPU interfaces (CCIX, Gen-Z, OpenCAPI).

   The treatment of CAPI's principles covers the CAPI 2.0 bus interface, the work flow, and the simulation steps (including some of the history and the detours I took along the way).

   To meet the needs of accelerators, the industry is defining open standards for high-performance coherent CPU interfaces. In 2016 three such open standards appeared, OpenCAPI, Gen-Z, and CCIX, and this article touches on them briefly.

   I call this a first look because it lacks a software-level analysis of CAPI, for example exactly how it reduces I/O overhead, and there is no performance comparison quantifying the advantage over I/O-based acceleration. In particular, I have no measurements of my own for the benefit of cache coherence, although I do quote IBM's own Power figures.

CAPI, the Coherent Accelerator Processor Interface, is an important acceleration feature of the Power processor architecture. It offers users a customizable, efficient, easy-to-use hardware acceleration solution that offloads work from the CPU, and it is implemented on an FPGA. In the Power8 era, CAPI's PSL (an encrypted IP core) was implemented on Altera FPGAs; after Altera was acquired by Intel, it moved to an IP core on Xilinx FPGAs. You will need to check the PSL resource usage yourself; the figures I have are for CAPI 1.0 on Altera. Because CAPI 2.0 and 1.0 share the same basic principles, and my own experience is mainly with 1.0, "CAPI" in this article means 1.0 unless otherwise noted [dream1].

  Due to limited time and resources, I have not studied OpenCAPI in depth either.

  Given the limits of my ability and time, this article surely contains errors; corrections are welcome at yixiangrong@hotmail.com, and I look forward to the discussion. Most of the content is original, and copied material is credited to its source, so please cite the following address when reposting:

http://www.cnblogs.com/e-shannon/p/7495618.html

1.2 References

1) <OpenPOWER_CAPI_Education_Intro_Latest.ppt>

2) <CCIX,Gen-Z,OpenCAPI_Overview&Comparison.pdf>

3) <OpenPOWER and the Roadmap Ahead.pdf>

4) Web sources

https://openpowerfoundation.org/?resource_lib=psl-afu-interface-capi-2-0

http://www.csdn.net/article/2015-06-17/2824990

http://www.openhw.org/module/forum/thread-597651-1-1.html

www-304.ibm.com/webapp/set2/sas/f/capi/CAPI_POWER8.pdf

5) <POWER9-VUG.pdf>

https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b779-61ff0266d32a/page/1cb956e8-4160-4bea-a956-e51490c2b920/attachment/56cea2a9-a574-4fbb-8b2c-675432367250/media/POWER9-VUG.pdf

 

1.3 Glossary of Terms

CAPI :   Coherent Accelerator Processor Interface

POWER: Performance Optimization With Enhanced RISC

HDK:   Hardware development kit

SDK:   Software development kit

CCIX:   Cache Coherent Interconnect for Accelerators.   www.ccixconsortium.com

OpenCAPI: Open Coherent Accelerator Processor Interface.  opencapi.org

Gen-Z:  genzconsortium.org

LRU:   Least Recently Used

HPC:   High Performance Computing

DMI:   Durable Memory Interface   (see <OpenPOWER and the Roadmap Ahead.pdf>)

QPI:   The Intel QuickPath Interconnect (QPI) is a point-to-point processor interconnect developed by Intel, which replaced the front-side bus (FSB) in Xeon, Itanium, and certain desktop platforms starting in 2008 (Wikipedia). It competes with AMD's HyperTransport (HT).

https://jingyan.baidu.com/article/6525d4b11f2c2bac7d2e943e.html

SMP:  Symmetric Multi-Processor. A UMA structure in which multiple CPU cores share all resources; the POWER architecture uses SMP [dream2]

NUMA: Non-Uniform Memory Access. In contrast with SMP, the CPUs are divided into several groups, and local memory accesses are faster than remote ones, hence "non-uniform". The trend in hardware has been towards more than one system bus, each serving a small set of processors. Each group of processors has its own memory and possibly its own I/O channels. However, each CPU can access memory associated with the other groups in a coherent way. Each group is called a NUMA node. The number of CPUs within a NUMA node depends on the hardware vendor. It is faster to access local memory than the memory associated with other NUMA nodes. This is the reason for the name, non-uniform memory access architecture.

     http://www.cnblogs.com/yubo/archive/2010/04/23/1718810.html

     https://technet.microsoft.com/en-us/library/ms178144(v=sql.105).aspx

MPP:  Massively Parallel Processing. Multiple groups of SMP CPUs; memory cannot be accessed across groups, the groups are interconnected through network nodes, and the system can scale out almost without limit [dream3]

Differences between NUMA and MPP

http://www.cnblogs.com/yubo/archive/2010/04/23/1718810.html

  Architecturally, NUMA and MPP have much in common: both consist of multiple nodes, each node has its own CPU, memory, and I/O, and the nodes exchange information over a node interconnect. So where do they differ? Analyzing the internal architecture and operation of NUMA and MPP servers makes the differences clear.

  First, the node interconnect mechanism differs. In NUMA, the interconnect is implemented inside a single physical server; when a CPU needs to access remote memory, it must wait, which is the main reason NUMA servers cannot scale performance linearly as CPUs are added. In MPP, the interconnect is implemented externally, over I/O, between separate SMP servers; each node accesses only its local memory and storage, and inter-node communication proceeds in parallel with each node's own processing. MPP performance therefore scales essentially linearly as nodes are added.

  Second, the memory access mechanism differs. Inside a NUMA server, any CPU can access all of the system's memory, but remote access performs far worse than local access, so applications should avoid remote memory access as much as possible. In an MPP server, each node accesses only its local memory, so remote memory access is not an issue.
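The practical takeaway from the NUMA point above is that software should keep its data on the node it runs on. Below is a minimal sketch of that idea using the Linux libnuma API; libnuma is my own example here, not something the cited articles use, and the node number and buffer size are arbitrary assumptions.

```c
/* NUMA-aware allocation sketch (libnuma): keep memory local to the node.
 * Build (assumption): gcc numa_demo.c -lnuma
 */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA is not supported on this system\n");
        return 1;
    }

    int node = 0;                      /* pin the buffer to NUMA node 0 */
    size_t len = 64 * 1024 * 1024;     /* 64 MiB working buffer */

    /* Allocate memory that physically resides on the chosen node, so a
     * thread running on that node's CPUs gets local (fast) accesses. */
    char *buf = numa_alloc_onnode(len, node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    /* Bind the current thread to the same node's CPUs, so its accesses
     * to buf stay local instead of crossing the node interconnect. */
    numa_run_on_node(node);

    for (size_t i = 0; i < len; i += 4096)
        buf[i] = (char)i;              /* touch pages: local memory accesses */

    numa_free(buf, len);
    return 0;
}
```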

ISA:   Instruction Set Architecture

CAIA : Coherent Accelerator Interface Architecture. Defines a coherent accelerator interface structure for coherently attaching accelerators to POWER systems using a standard PCIe bus. The intent is to allow implementation of a wide range of accelerators in order to optimally address many different market segments.

CAPP : Coherent Accelerator Processor Proxy

Design unit that snoops the PowerBus commands and provides coherency responses reflecting the state of the caches in PSL.  Issues commands to PSL so that it can provide data responses.

PSL :  Power Service Layer

The PSL provides the address translation and system memory cache for the AFUs. In addition, the PSL provides miscellaneous facilities for the host processor to manage the virtualization of the AFUs, interrupts, and memory management.   

AFU :  Accelerator Function Unit

Effective Address (EA) / Real Address (RA): see Power ISA Book III.
The AFU uses effective addressing, i.e. the process's address space (what industry calls "virtual"). The PSL translates the effective address into a real address (what industry calls "physical") for accessing memory within the PowerPC system.

MMIO: Memory-mapped input/output.

WED: Work Element Descriptor. When an application requests use of an AFU, a process element is added to the process-element linked list that describes the application's process state. The process element also contains a work element descriptor (WED) provided by the application. The WED can contain the full description of the job to be performed, or a pointer to other main-memory structures in the application's memory space. Several programming models are defined, allowing an AFU either to be shared by any application or to be dedicated to a single application.
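To make the EA and WED entries above more concrete, here is a minimal host-side sketch of handing a WED to an AFU through IBM's libcxl user-space library. Only cxl_afu_open_dev, cxl_afu_attach, and cxl_afu_free are actual libcxl calls; the WED layout, the device path, and the "done" flag convention are illustrative assumptions, since every real AFU defines its own WED format.

```c
/* Host-side WED sketch for a CAPI 1.0 AFU using libcxl.
 * Build (assumption): gcc wed_demo.c -lcxl
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <libcxl.h>

/* Hypothetical WED: a cache-line-aligned descriptor holding effective
 * addresses (ordinary process pointers) that the AFU accesses through
 * PSL address translation. */
struct my_wed {
    uint64_t src;             /* EA of the input buffer  */
    uint64_t dst;             /* EA of the output buffer */
    uint64_t size;            /* bytes to process        */
    volatile uint64_t done;   /* AFU sets this when finished (assumed convention) */
} __attribute__((aligned(128)));

int main(void)
{
    struct my_wed *wed;
    char *src = malloc(4096), *dst = malloc(4096);

    if (!src || !dst || posix_memalign((void **)&wed, 128, sizeof(*wed)))
        return 1;
    wed->src  = (uint64_t)(uintptr_t)src;   /* plain effective addresses */
    wed->dst  = (uint64_t)(uintptr_t)dst;
    wed->size = 4096;
    wed->done = 0;

    /* Open the AFU device (a typical CAPI device path; adjust per system). */
    struct cxl_afu_h *afu = cxl_afu_open_dev("/dev/cxl/afu0.0d");
    if (!afu) {
        perror("cxl_afu_open_dev");
        return 1;
    }

    /* Attach this process's context and pass the WED pointer (an EA) to the AFU. */
    if (cxl_afu_attach(afu, (uint64_t)(uintptr_t)wed)) {
        perror("cxl_afu_attach");
        return 1;
    }

    while (!wed->done)        /* poll the flag the AFU is assumed to set */
        ;

    cxl_afu_free(afu);
    return 0;
}
```

Note that the buffers and the WED itself are passed as plain process pointers (effective addresses); the PSL performs the translation to real addresses on behalf of the AFU, which is exactly the EA/RA relationship described above.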


 [dream1] Other CAPI 2.0 features are covered later (for example, PCIe Gen4 support at 16 GT/s per lane, available with Power9).

 [dream2] http://www.cnblogs.com/yubo/archive/2010/04/23/1718810.html points out that SMP's weakness is the shared memory: as CPUs are added, memory access contention grows sharply, wasting CPU resources and degrading performance, so 2 to 4 CPUs is a reasonable range.

An open question: is POWER9 an SMP structure? It has 8 cores, so how does it improve efficiency? And posts on Zhihu say SMP scales well; how does that fit together?

 [dream3] Are today's supercomputers, such as the Yinhe (Galaxy) machines, MPP architectures?

Original article: https://www.cnblogs.com/e-shannon/p/7495618.html