Integer Division with Neon: A Closer Look at vrecpe_u32

Please credit the source when reposting: http://www.cnblogs.com/pepetang/p/7777243.html

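Following up on the previous post: while learning the Neon intrinsics that ARM provides, I could solve nearly every implementation problem with a Google search. Integer division was the exception, and I struggled with it for a long time. On one hand, I tried taking the divisor's reciprocal from a lookup table, but Neon's table-lookup instructions are so restrictive that I gave up on that route. On the other hand, I could never figure out how to use the approximate-reciprocal instruction for unsigned integers, so to finish the code quickly I settled for the floating-point version of that instruction plus type conversions. When I came back to do performance optimization, I wanted to get to the bottom of it. I remembered a discussion about implementing division that I had seen on the ARM forum and decided to try one of its answers, excerpted below: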

vrecpe.u32 takes normalized inputs, similar to how floating point significant data is usually stored. What that means is that the input has no leading zeroes past the first bit that's always 0. So the top two bits will always be 01.

Another way to look at it is that vrecpe.u32 works on values between 0.5 and 1.0 (non-inclusive), where the format is 0.1.31. That means no sign bits, 1 whole bit, and 31 fraction bits. Due to the input constraints the top bit will always be 0.

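According to the Neon assembly instruction documentation, this instruction treats integers <= 0x7fffffff as invalid input. When I fed it values in 0x80000000~0xffffffff, the outputs also fell in 0x80000000~0xffffffff. The following passage showed me how to obtain "normalized inputs": use the vclz instruction to get the number of bits by which an unsigned integer must be shifted left to land in the valid input range 0x80000000~0xffffffff.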

You can find the normalization shift with a count leading zeroes instruction. In your case you'll want to use vclz.u16. But you need to leave that integer bit, so you want to set shift equal to clz(x) - 1.
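As a rough scalar illustration of that normalization step, the sketch below shifts a nonzero 32-bit value left by its leading-zero count so that it lands in 0x80000000~0xffffffff, the range used in this post (the quoted answer instead keeps one integer bit and shifts by clz(x) - 1). The helper names clz32 and normalize_u32 are made up for this example, and the loop-based count is only a portable stand-in for vclz.

```c
#include <stdint.h>

/* Hypothetical helper: portable count-leading-zeros, returns 32 for x == 0. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Hypothetical helper: shift x left until bit 31 is set.
 * Stores the shift count in *shift; x == 0 yields 0 with a count of 32. */
static uint32_t normalize_u32(uint32_t x, unsigned *shift)
{
    *shift = clz32(x);
    return (*shift < 32) ? (x << *shift) : 0;
}
```

For example, 7 (binary 111) has 29 leading zeros, so it normalizes to 0xE0000000 with a shift of 29.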

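Next, I took the vrecpe result (in practice it should be shifted right first, to keep the multiplication from overflowing), ran division tests with it, and worked backward. The conclusion is that the output of vrecpe_u32 is a Q31 integer. The division routine is implemented as follows: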

#include <arm_neon.h>

/* Approximate y / x per lane with Neon intrinsics (about 8 bits of precision).
 * y must stay below 2^23 so that y * x_recp fits in 32 bits. */
uint32x4_t divTest(uint32x4_t y, uint32x4_t x)
{
    uint32x4_t res;
    uint32x4_t flagZero;
    int32x4_t shiftL, shiftR; /* vshlq_u32 and vrshlq_u32 take signed shift counts */
    uint32x4_t x_norm, x_recp;

    flagZero = vceqq_u32(x, vmovq_n_u32(0)); /* lanes where the divisor is 0 */
    shiftL = vreinterpretq_s32_u32(vclzq_u32(x));
    shiftR = vsubq_s32(shiftL, vmovq_n_s32(40)); /* -[(32 - shiftL) + (31 - 23)] */

    x_norm = vshlq_u32(x, shiftL);    /* input: Q32 */
    x_recp = vrecpeq_u32(x_norm);     /* output: Q31 */
    x_recp = vshrq_n_u32(x_recp, 23); /* keep the top 9 bits to avoid overflow */
    x_recp = vbslq_u32(flagZero, vmovq_n_u32(0), x_recp); /* zero divisor gives 0 */

    res = vmulq_u32(y, x_recp);
    res = vrshlq_u32(res, shiftR);    /* negative count: rounding right shift */

    return res;
}
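To see where the constant 40 comes from, write s for the leading-zero count. The normalized divisor is x * 2^s, and reading the estimate as Q31 means x_recp is roughly 2^63 / (x * 2^s) = 2^(63-s) / x. After the right shift by 23 it is about 2^(40-s) / x, so y * x_recp is about (y / x) * 2^(40-s), and a final right shift by 40 - s recovers y / x. The scalar sketch below mirrors those steps so the arithmetic can be checked without Neon hardware. It is only a model: recip9 is a hypothetical stand-in that computes the 9-bit reciprocal by exact division rather than reproducing the real vrecpe estimate, so its results may differ slightly from what the hardware returns. The names clz32, recip9, and div_model are made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: portable count-leading-zeros, returns 32 for x == 0. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Hypothetical stand-in for (vrecpe(x_norm) >> 23): floor(2^40 / x_norm),
 * clamped to 9 bits. This is NOT the real hardware estimate.
 * Requires x_norm >= 0x80000000. */
static uint32_t recip9(uint32_t x_norm)
{
    uint64_t r = (UINT64_C(1) << 40) / x_norm;
    return (uint32_t)(r > 511 ? 511 : r);
}

/* Scalar model of one lane of divTest; assumes y < 2^23. */
static uint32_t div_model(uint32_t y, uint32_t x)
{
    unsigned s, n;
    uint64_t prod;

    if (x == 0) return 0;                /* same convention as flagZero */
    s = clz32(x);
    prod = (uint64_t)y * recip9(x << s); /* about (y / x) * 2^(40 - s) */
    n = 40 - s;                          /* right shift of 9..40 bits */
    return (uint32_t)((prod + (UINT64_C(1) << (n - 1))) >> n); /* rounding shift */
}
```

In this model 1000 / 7 comes out as 143 and 1000 / 1 as 998, which gives a feel for how coarse a 9-bit reciprocal is.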

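Testing confirms what the answer's author said: the reciprocal obtained this way has only 8 bits of precision. If your precision requirements are low, it can be used as is; the drawback is that performance is mediocre. If any reader knows a better implementation, please leave a comment and let me know.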
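One way to see why the precision tops out near 8 bits: after the shift by 23, the reciprocal keeps only 9 significant bits (256 to 511), and the top one is always set. The sketch below sweeps normalized divisors and measures the worst relative error of such a 9-bit reciprocal. As an assumption for illustration, it truncates the exact quotient 2^40 / x_norm instead of calling vrecpe, so it bounds the effect of the truncation only and says nothing exact about the hardware estimate. The function name is made up.

```c
#include <stdint.h>

/* Worst relative error of a truncated 9-bit reciprocal over normalized inputs.
 * Hypothetical model: floor(2^40 / x_norm) clamped to 511, not the real vrecpe. */
static double worst_recip9_error(void)
{
    double worst = 0.0;
    uint64_t x_norm;

    /* Sample the valid range 0x80000000..0xffffffff with a coarse stride. */
    for (x_norm = 0x80000000u; x_norm <= 0xffffffffu; x_norm += 0x10001u) {
        uint64_t r = (UINT64_C(1) << 40) / x_norm;
        double err;

        if (r > 511) r = 511;
        /* r * x_norm / 2^40 would be exactly 1.0 for a perfect reciprocal. */
        err = 1.0 - (double)(r * x_norm) / 1099511627776.0; /* 2^40 */
        if (err > worst) worst = err;
    }
    return worst;
}
```

In this model the worst error stays below 1/256, which is consistent with the roughly 8-bit precision observed above.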