How to improve software performance with NEON


The following content is excerpted from xapp1206-boost-sw-performance-zynq7soc-w-neon.pdf and Neon_Introduction_for_Avnet_training.pdf.




NEON Basics

From a software perspective, NEON technology is based on single instruction, multiple data (SIMD) operations in ARMv7 processors, which implement the advanced SIMD architecture extensions. This demands a new set of instructions with new functions, and also a new development methodology. From a hardware perspective, NEON is a separate hardware unit on Cortex-A series processors, together with a vector floating point (VFP) unit. If an algorithm can be designed to exploit dedicated hardware, performance can be maximized.

SIMD Introduction

On 32-bit microprocessors, such as the Cortex-A series processors, it is relatively inefficient to run large numbers of 8-bit or 16-bit operations. The processor ALU, registers, and datapath are designed for 32-bit calculations. If they are used for 8/16-bit operations, additional instructions are needed to handle overflow. SIMD enables a single instruction to treat a register value as multiple data elements and to perform multiple, identical operations on those elements.
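
For example, a single NEON instruction can treat a 128-bit Q register as eight 16-bit lanes and perform eight additions at once (the register choices below are illustrative):

VADD.I16 q0, q1, q2   @ eight 16-bit additions in a single instruction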

Registers

NEON architecture allows for 64-bit or 128-bit parallelism. Its register bank can be viewed either as sixteen 128-bit registers (Q0-Q15) or as thirty-two 64-bit registers (D0-D31). Each of the Q0-Q15 registers maps to a pair of D registers (for example, Q0 is the concatenation of D0 and D1).
NEON and VFP share the thirty-two 64-bit registers in hardware. This means that VFP is present in VFPv3-D32 form, which has 32 double-precision floating-point registers. This makes support for context switching simpler. Code that saves and restores VFP contexts also saves and restores NEON contexts.

Data Types

Data type specifiers in NEON instructions consist of a letter that indicates the type of data and a number that indicates the width. They are separated from the instruction mnemonic by a period, as in VADD.I16. The following options are available:

  • Unsigned integer U8 U16 U32 U64
  • Signed integer S8 S16 S32 S64
  • Integer of unspecified type I8 I16 I32 I64
  • Floating-point number F16 F32
  • Polynomial over {0,1} P8

Methods

Using NEON Optimized Libraries

Given the widespread use of the ARM Cortex-A9 processor, a large user community has developed, providing a rich ecosystem of NEON-optimized software libraries for software algorithm designers.

Using Compiler Automatic Vectorization

Introduction

The easiest way to optimize for NEON is through the use of compilers. GCC has several
optimization levels, along with a wide range of individual options to enable or disable particular
optimizations.
Compiler optimization levels are set using the command line option -On, as follows:

  • -O0. (default). No optimization is performed. Each line of source code is mapped directly to the corresponding instructions in the executable file. This provides the clearest view for source level debugging but the lowest level of performance.
  • -O1. Enables the most common forms of optimization that do not require decisions regarding size or speed. It can often produce a faster executable than -O0.
  • -O2. Enables further optimizations, such as instruction scheduling. Again, optimizations with potential speed-versus-size implications are not employed at this level.
  • -O3. Enables more aggressive optimizations, such as aggressive function inlining, and it typically increases speed at the expense of image size. Moreover, this option enables -ftree-vectorize, causing the compiler to attempt to automatically generate NEON code from standard C or C++. However, in practice, this optimization level cannot always produce binaries faster than -O2. Check the software performance case-by-case.
  • -Os. Selects optimizations that attempt to minimize the size of the image, even at the expense of speed. (This is not a point of focus in this document.)
  • -Ofast. Disregards strict standards compliance. -Ofast enables all -O3 optimizations, along with optimizations that are not valid for all standards-compliant programs. It turns on -ffast-math.

In addition to the optimization levels, you must set other compiler options to tell the compiler to generate NEON instructions:

  • -std=c99. The C99 standard introduces some new features that can be used for NEON optimization.

  • -mcpu=cortex-a9. Specifies the name of the target ARM processor. GCC uses this name to determine what kind of instructions it can issue when generating assembly code.

  • -mfpu=neon. Specifies which floating-point hardware (or hardware emulation) is available on the target. Because the Zynq-7000 device has an integrated NEON hardware unit, and because you plan to use it to accelerate software, you must specify your intention to the compiler clearly, using the name neon.

  • -ftree-vectorize. Performs loop vectorization on trees. By default, this is enabled at -O3.

  • -mvectorize-with-neon-quad. By default, GCC 4.4 vectorizes for double-word only. In most cases, using quad-word improves code performance and density, at the cost of fewer usable registers.

  • -mfloat-abi=name. Specifies which floating-point ABI is used. Permitted values are: soft, softfp, and hard.

    • soft causes the GCC to generate output containing library calls for floating-point operations. This is used when there are no hardware floating-point units in the system.
    • softfp allows the generation of instructions using a hardware floating-point unit, but still uses the soft-float calling conventions. This results in better compatibility.
    • hard allows generation of floating-point instructions and uses FPU-specific calling conventions. If using the option hard, you must compile and link the entire source code with the same setting.
  • -ffast-math. This option is not turned on by any -O option except -Ofast, because it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It might, however, yield faster code for programs that do not require the guarantees of these specifications.
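
Putting these options together, a typical auto-vectorizing compile line might look like the following (the source file name is illustrative):

gcc -O3 -std=c99 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad -c my_algorithm.c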

Note: The compiler might not always vectorize C language code as expected, so you must verify that the compiler generates the appropriate instructions:

  • Read the disassembly. This is the most straightforward method, but it requires a full understanding of NEON instructions.
  • Use the compiler option -ftree-vectorizer-verbose=n. This option controls the amount of debugging output the vectorizer prints. This information is written to standard error, unless -fdump-tree-all or -fdump-tree-vect are specified, in which case it is output to the usual dump listing file, .vect. For n=0, no diagnostic information is reported. For n=9, all the information the vectorizer generated during its analysis and transformation is reported. This is the same verbosity level that -fdump-tree-vect-details uses.

C Code Modifications

Because the C and C++ standards do not provide syntax that specifies parallel behavior, it is difficult for compilers to determine when it is safe to generate NEON code. Without enough proof, compilers do not vectorize the code, so you must modify code to provide additional hints to the compiler. Such source code modifications are within the standard language specifications, so they do not affect code portability across platforms and architectures.
The following are recommended techniques for modifying code for NEON:

Loops can be modified for better vectorizing
  • Short, simple loops work the best (even if it means multiple loops in your code)
  • Avoid breaks, loop-carried dependencies (a loop in which the result of one iteration is affected by the result of previous iterations), and conditions inside loops (conditions can sometimes be replaced by bitwise operations)
  • Try to make sure the number of iterations is known to the compiler, and that the iteration count can be fixed at coding time as a multiple of N (register length/data type size)
  • Functions called inside a loop should be inlined
Pointer issues
  • Using arrays with indexing vectorizes better than using pointers
  • Indirect addressing (multiple indexing or de-referencing) does not vectorize
  • Use the restrict keyword (made for parallel optimization), which informs the compiler that the location accessed through a specific pointer is not to be accessed through any other pointer within the current scope; see the sketch after this list
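
A minimal sketch of a loop written with these hints in mind (the function and array names are illustrative, n is assumed to be a multiple of the vector width, and restrict requires -std=c99):

/* Vectorization-friendly: restrict pointers, array indexing, no
   loop-carried dependency, and a trip count the compiler can reason about. */
void add_arrays(int * restrict dst, const int * restrict a,
                const int * restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}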
Use suitable data types

For best performance, always use the smallest data type that can hold the required values.
When optimizing algorithms operating on 16-bit or 8-bit data without SIMD, treating the data as 32-bit variables can sometimes yield better performance. This is because, for half-word or byte variables, the compiler must generate additional instructions to prevent the result from overflowing.
However, when targeting automatic vectorization with NEON, using the smallest data type that can hold the required values is always the best choice. In a given time period, the NEON engine can process twice as many 8-bit values as 16-bit values. Also, some NEON instructions do not support some data types, and some only support certain operations. For example, NEON does not support double-precision floating-point data types, so using a double-precision where a single-precision float is adequate can prevent the compiler from vectorizing code. NEON supports 64-bit integers only for certain operations, so avoid use of long variables where possible.
NEON includes a group of instructions that can perform structured load and store operations. These instructions can only be used for vectorized access to data structures where all members are of the same size. Accessing 2/3/4-channel interleaved data with these instructions can also accelerate NEON memory access performance.

Deviation of NEON Computation Results

For integers, the order of computation does not matter. For example, summing an array of integers forward or backward always produces the same result. However, this is not true for floating-point numbers, because of their finite precision. Thus, NEON-optimized code might produce a result different from non-NEON-optimized code for floating-point numbers; typically, however, the difference is not significant. When you need to validate code by comparing computational results, be aware that for a data type of float or double, "equal" should mean within an acceptable tolerance rather than bit-for-bit identical.
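
A minimal sketch of such a tolerance-based comparison (the helper name and tolerance policy are illustrative assumptions to tune per algorithm):

#include <math.h>

/* Returns nonzero when a and b agree within a mixed relative/absolute
   tolerance; suitable for comparing a NEON result against a scalar reference. */
int nearly_equal(float a, float b, float tol)
{
    float scale = fmaxf(1.0f, fmaxf(fabsf(a), fabsf(b)));
    return fabsf(a - b) <= tol * scale;
}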

Using NEON Intrinsics[1]

NEON C/C++ intrinsics are available for armcc, GCC/g++, and LLVM, and they use the same syntax.
Essentially, NEON intrinsics are a C function wrapper of NEON assembler instructions. There are new data type definitions that correspond to NEON registers (both D-registers and Q-registers) containing different sized elements, allowing C variables to be created that map directly onto NEON registers. These variables can be passed to NEON intrinsic functions directly. The compiler then generates NEON instructions instead of incurring an actual subroutine call.
NEON intrinsics provide low-level access to NEON instructions but with the compiler doing some of the hard work normally associated with writing assembly language, such as:

  • Register allocation.
  • Code scheduling or re-ordering instructions to achieve the highest performance. The C compilers can be told which processor is being targeted, and they can reorder code to ensure the CPU pipeline is running in an optimized way.

The main disadvantage with intrinsics is that you cannot force the compiler to output exactly the code you want. So in some cases, there is still a possibility of further improvement by using NEON assembler code.

NEON Types in C

The ARM C Language Extensions[2] contains a full list of NEON types. The format is:

<basic type>x<number of elements>_t

To use NEON types and intrinsics, a header file, arm_neon.h, must be included.
Table 6 gives developers a basic idea of the NEON types.

Table 6: NEON Type Definitions

64-bit type (D-register)    128-bit type (Q-register)
int8x8_t                    int8x16_t
int16x4_t                   int16x8_t
int32x2_t                   int32x4_t
int64x1_t                   int64x2_t
uint8x8_t                   uint8x16_t
uint16x4_t                  uint16x8_t
uint32x2_t                  uint32x4_t
uint64x1_t                  uint64x2_t
float16x4_t                 float16x8_t
float32x2_t                 float32x4_t
poly8x8_t                   poly8x16_t
poly16x4_t                  poly16x8_t

There are also combination types, which include two, three, or four of each of the above in a larger struct type. These types are used to map the registers accessed by NEON load/store operations, which can load/store up to four registers in a single instruction. For example:

struct int16x4x2_t
{
    int16x4_t val[2];
} <var_name>;

These types are only used by loads, stores, transpose, interleave, and de-interleave instructions. To perform operations on the actual data, select the individual registers using the syntax shown below:

<var_name>.val[0] and <var_name>.val[1]

Techniques Specific to NEON Intrinsics

Declaring a Variable

Example:

uint32x2_t vec64a, vec64b; // create two D-register variables

Using Constants

The following code replicates a constant into each element of a vector:

uint8x8_t start_value = vdup_n_u8(0);

To load a general 64-bit constant into a vector:

uint8x8_t start_value = vreinterpret_u8_u64(vcreate_u64(0x123456789ABCDEFULL));

Moving Results Back to Normal C Variables

To access a result from a NEON register, you can store it to memory using VST or move it back to an ARM general-purpose register using a get-lane type operation:

result = vget_lane_u32(vec64a, 0); // extract lane 0

Accessing Two D-registers of a Q-register

This can be done using vget_low and vget_high, as shown below:

vec64a = vget_low_u32(vec128); // split 128 bit vector
vec64b = vget_high_u32(vec128); // into 2x 64 bit vectors

Casting NEON Variables Between Different Types

NEON intrinsics are strongly typed, and you cannot perform type casts as freely as you can in C language. If there must be casts between vectors of different types, use vreinterpret, which does not actually generate any code but does enable you to cast the NEON types:

uint8x8_t byteval;
uint32x2_t wordval;
byteval = vreinterpret_u8_u32(wordval);

Note that the destination type u8 is listed first after vreinterpret.

To give you a broader perspective on how NEON intrinsics can be used, the following is an example of calculating the dot product of two vectors, with moderate complexity:

float dot_product_intrinsic(float * __restrict vec1,
                            float * __restrict vec2, int n)
{
    float32x4_t vec1_q, vec2_q;
    float32x4_t sum_q = {0.0, 0.0, 0.0, 0.0};
    float32x2_t tmp[2];
    float result;
    // Process four floats per iteration; tail elements (n % 4) are ignored.
    for (int i = 0; i < (n & ~3); i += 4)
    {
        vec1_q = vld1q_f32(&vec1[i]);              // load four floats from each input
        vec2_q = vld1q_f32(&vec2[i]);
        sum_q  = vmlaq_f32(sum_q, vec1_q, vec2_q); // sum_q += vec1_q * vec2_q, per lane
    }
    // Reduce the four partial sums in sum_q to a single float.
    tmp[0] = vget_high_f32(sum_q);
    tmp[1] = vget_low_f32(sum_q);
    tmp[0] = vpadd_f32(tmp[0], tmp[1]);  // pairwise add: four lanes -> two
    tmp[0] = vpadd_f32(tmp[0], tmp[0]);  // pairwise add: two lanes -> one
    result = vget_lane_f32(tmp[0], 0);
    return result;
}

Note: As stated above, to use NEON types and intrinsics, a header file, arm_neon.h, must be included.

Compiling NEON Intrinsics with GCC

Unlike the complex options for compiling C code with automatic vectorization, compiling NEON intrinsics is fairly simple, and only a few compiler options are needed:

  • -On. Sets the optimization level.
  • -mcpu=cortex-a9. Sets the processor type for the Zynq-7000 AP SoC as cortex-a9.
  • -mfpu=neon. Tells the compiler to generate NEON instructions for the Zynq-7000 AP SoC.
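
For example, a compile line for the dot product function above might look like this (file names are illustrative):

gcc -O2 -mcpu=cortex-a9 -mfpu=neon -o dot_product dot_product.c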

Optimizing NEON Assembler Code

Sometimes NEON assembler code is the only way to achieve optimal performance. Carefully hand-written assembler code can yield the best results from NEON, especially for performance-critical applications.
The disadvantage is obvious. First, it is difficult to maintain assembler code. Even though all Cortex-A series processors support NEON instructions, the hardware implementations are different, so instruction timing and movement in the pipeline are different. This means that NEON optimization is processor-dependent. Code running faster on one Cortex-A series processor might not work as well on another Cortex-A series processor. Second, it is difficult to write assembler code. To be successful, you must know the details of the underlying hardware features, such as pipelining, scheduling issues, memory access behavior, and scheduling hazards. These factors are briefly described below.

Memory Access Optimizations

Typically, NEON is used to process large amounts of data. One crucial optimization is to ensure that the algorithm uses cache in the most efficient way possible. It is also important to consider the number of active memory locations. A typical optimization is one in which you design the algorithm to process small memory regions called tiles, one by one, to maximize the cache and translation lookaside buffer (TLB) hit rate, and to minimize memory access to external Dynamic RAM.

Alignment

Even though the NEON architecture provides full unaligned support for NEON data access, the instruction opcodes contain an alignment hint, which permits implementations to be faster when the address is aligned and the hint is specified.
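
For example, in GNU assembler syntax the alignment qualifier follows the base register (the registers below are illustrative):

VLD1.8 {d0, d1}, [r0:128]   @ hint: the address in r0 is 128-bit aligned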

Instruction Scheduling

To write faster code for NEON, you must be aware of how to schedule code for the specific ARM processor. For the Zynq-7000 AP SoC, this would be the Cortex-A9.
Result-use scheduling is the main performance optimization when writing NEON code. NEON instructions typically issue in one or two cycles, but the result is not always ready in the next cycle (except when the simplest NEON instructions are issued, such as VADD and VMOV). Some instructions have considerable latency, for example the VMLA multiply-accumulate instruction (five cycles for an integer; seven cycles for a floating-point). To prevent a stall, take into consideration the amount of time between the current instruction and the next one using its result. Despite having a few cycles result latency, these instructions are fully pipelined, so several operations can be in flight at once.
Another typical scheduling issue is interlock. Without adequate hardware knowledge, it is possible to load data from memory into registers and then process it immediately. If the memory access hits in the cache, there is no problem. However, on a cache miss, the CPU must wait tens of cycles to load data from external memory into the cache before proceeding. Thus, you usually need to place instructions that do not depend on the VLD instruction between the VLD and the instruction using its result. Using the Cortex-A9 preload engine can improve the cache hit rate. This is discussed later.
Also be aware that external memory is slow and has a long latency compared to on-chip memory. The CPU uses cache and write buffers to alleviate this issue. Sometimes, if there are long bursts of memory write, the write buffer fills up, and the next VST instruction stalls. Therefore, when writing assembly instructions, it is best to distribute memory access instructions with data processing instructions.
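
A minimal sketch of the interlock-avoidance idea above (the registers and operations are illustrative, not taken from a specific algorithm):

VLD1.32 {q0}, [r1]!      @ start loading the next block of data
VMLA.F32 q8, q2, q3      @ independent multiply-accumulates execute
VMLA.F32 q9, q4, q5      @ while the load completes
VADD.F32 q10, q0, q6     @ q0 is consumed only after the gap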


Improving Memory Access Efficiency

To reduce the gap between processor and memory subsystems, engineers introduced cache into modern SoCs. Preloading data into cache before actually using it can improve the cache hit rate, thus improving system performance. The ARM Cortex-A9 implements a preload engine and provides instructions to do this.
The following sections introduce some techniques for improving memory efficiency:

  • Loading and Storing Multiple Data in a Burst
  • Using the Preload Engine to Improve the Cache Hit Rate
  • Using Tiles to Prevent Cache Thrashing

Loading and Storing Multiple Data in a Burst

Load and store multiple instructions allow successive words to be read from or written to memory. These are extremely useful for stack push/pop and for memory copying. Generally, load and store multiple instructions can yield better performance than the equivalent sequence of single load and store instructions, especially when cache is not enabled or a memory region is marked as non-cacheable in the translation table.
Only word values at a word-aligned address can be operated on in this way. The operands are a base register (with an optional ! denoting write-back of the base register) followed by a list of registers between braces.
NEON supports loading and storing multiple instructions. For example:

VLDMmode{cond} Rn{!}, Registers
VSTMmode{cond} Rn{!}, Registers

The Mode should be one of the following:

  • IA, Increment address after each transfer. This is the default, and can be omitted.
  • DB, Decrement address before each transfer.
  • EA, Empty ascending stack operation. This is the same as DB for loads and IA for saves.
  • FD, Full descending stack operation. This is the same as IA for loads, and DB for saves.
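
For example, a 64-byte burst copy might look like this (the registers are illustrative):

VLDMIA r0!, {d0-d7}   @ load eight D registers (64 bytes) from [r0], then update r0
VSTMIA r1!, {d0-d7}   @ store the eight D registers to [r1], then update r1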

Note that NEON has some special instructions for interleaving and de-interleaving:

  • VLDn (Vector load multiple n-element structures) loads multiple n-element structures from memory into one or more NEON registers, with de-interleaving (unless n == 1). Every element of each register is loaded.
  • VSTn (Vector store multiple n-element structures) writes multiple n-element structures to memory from one or more NEON registers, with interleaving (unless n == 1). Every element of each register is stored.

VLD2 loads two or four registers, de-interleaving even and odd elements. This could be used, for example, to split left and right channel stereo audio data. Similarly, VLD3 could be used to split RGB pixels or XYZ coordinates into separate channels. Correspondingly, VLD4 could be used with ARGB or CMYK images.

Note: These special NEON instructions cannot be expressed by pure C language. You must use NEON intrinsics or assembler code to have the compiler generate machine instructions.
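
A minimal intrinsics sketch of VLD3-based de-interleaving (the function name and the fixed 16-pixel width are illustrative assumptions):

#include <arm_neon.h>

/* Split 16 interleaved RGB pixels (48 bytes) into separate R, G, B arrays. */
void split_rgb16(const uint8_t *rgb, uint8_t *r, uint8_t *g, uint8_t *b)
{
    uint8x16x3_t v = vld3q_u8(rgb);   // VLD3: de-interleaves the three channels
    vst1q_u8(r, v.val[0]);            // all 16 R bytes
    vst1q_u8(g, v.val[1]);            // all 16 G bytes
    vst1q_u8(b, v.val[2]);            // all 16 B bytes
}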

Using the Preload Engine to Improve the Cache Hit Rate

Accesses to the external memory system are usually slow. If you can pre-fetch instructions or data into the cache before you need them, you can minimize CPU stall time and maximize CPU performance.
From a hardware perspective, all preload instructions are handled by a dedicated unit in the Cortex-A9 processor with dedicated resources. This avoids using resources in the integer core or the load store unit.
From a software perspective, cache preloading means three instructions: PLD (data cache preload), PLI (instruction cache preload), and PLDW (preload data with intent to write). The PLD instruction might generate a cache line-fill on a data cache miss, while the processor continues to execute other instructions. If used correctly, PLD can significantly improve performance by hiding memory access latencies. Similarly, the PLI instruction enables you to give the processor hints that an instruction load from a particular address is likely to happen soon, which can cause the processor to preload the instructions into its instruction cache.
You can also try optimizing a C implementation of memcpy() with data preloading, as sketched below. The performance boost is around 25%. This is not as significant as the gains described above because there is no computation to hide the data preload latency.
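
A minimal sketch of such a copy loop using GCC's __builtin_prefetch, which emits PLD on ARM (the prefetch distance and the assumption that n is a multiple of 8 are illustrative choices to tune for a real platform):

#include <stdint.h>

void copy_with_preload(uint32_t * restrict dst,
                       const uint32_t * restrict src, int n)
{
    for (int i = 0; i < n; i += 8) {          /* one 32-byte chunk per pass */
        __builtin_prefetch(&src[i + 32]);     /* preload several lines ahead (PLD) */
        for (int j = 0; j < 8; j++)
            dst[i + j] = src[i + j];
    }
}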

Using Tiles to Prevent Cache Thrashing

For Zynq-7000 devices, each of the two Cortex-A9 processors has separate 32 KB level-1 instruction and data caches, and both caches are 4-way set-associative. The L2 cache is designed as an 8-way set-associative 512 KB cache for dual Cortex-A9 cores. These parameters are critical to predicting when cache thrashing will occur.
A typical scenario for cache thrashing is accessing elements in the same column sequentially within a two-dimensional array. The document Cortex-A Series Programmer's Guide[3] provides an example of matrix multiplication. Below is a straightforward code example for matrix multiplication:

for(i=0;i<N;i++)
    for(j=0;j<N;j++)
        for(k=0;k<N;k++)
            result[i][j] += a[i][k]*b[k][j];

In this case, the contents of matrix a are accessed sequentially, but accessing matrix b advances row by row, with a stride of N elements between successive accesses. Therefore, cache misses are likely while accessing each element of matrix b when the matrices are so large that the cache cannot contain them.
To solve this issue, divide the large matrix into smaller partitions and confine the computations within these smaller partitions, also known as tiles. Here, assume the matrix data type is int. For Cortex-A9 processors, the L1 cache line is 32 bytes, so one L1 cache line can hold eight int elements. In this case you can rewrite the code using 8x8 tiles to improve cache hit rates. Below is an example of optimized code:

for (io = 0; io < N; io += 8)                      /* tile rows of result      */
    for (jo = 0; jo < N; jo += 8)                  /* tile columns of result   */
        for (ko = 0; ko < N; ko += 8)              /* tile the inner dimension */
            for (ii = 0, rresult = &result[io][jo], ra = &a[io][ko];
                 ii < 8;
                 ii++, rresult += N, ra += N)
                for (ki = 0, rb = &b[ko][jo]; ki < 8; ki++, rb += N)
                    for (ji = 0; ji < 8; ji++)
                        rresult[ji] += ra[ki] * rb[ji];

Because NEON is often used to process large data sets, properly changing an algorithm with the tiling technique can produce higher cache hit rates, thus much better performance. You can also try using compiler automatic vectorization in your projects to achieve additional (more modest) improvements. As demonstrated in lab1, compilers are not good at automatic vectorization on complex loops. If more performance gain is needed, you must perform manual optimization on computations within tiles.


  1. GCC documentation, available at gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html.

  2. RealView Compilation Tools Compiler Reference Guide, available at infocenter.arm.com.

  3. Cortex-A Series Programmer's Guide, available at silver.arm.com/download/download.tm?pv=1296010.

Original source: https://www.cnblogs.com/batianhu/p/11169649.html