
A simple NEON optimization method

June

This post is a translation of “Méthode simple d’optimisation de code NEON”.

Since I discovered NEON, I have had the opportunity to write several small programs to test the beast.
NEON really is one of the highlights of the Cortex. It is genuinely simple to use: the large number of registers and the three-operand instructions make it easier to program than on other processors.
Because it is deeply pipelined, the order of instructions matters a great deal.
Whatever program you write with NEON, you rarely hit the optimal implementation on the first try. And once you have pipelined everything as well as possible, there is one last small optimization that almost always gains a few more cycles…

Anatomy of a NEON code

A “correct” NEON program typically looks like this:

	@ r0 number of iterations
.loop:
	@ Loading data

	@ Calculations

	@ Recording computed data

	subs		r0, r0, #1
	bne		.loop

First we load the NEON registers with the source data.
Then we perform the calculations.
Finally, we store the results.

Because of the NEON pipeline, some instructions have to wait. This is usually the case for the instructions that store the results.

It is often (always) a good idea to rearrange your code.
A smart reorganization is to insert the load instructions between the calculation and the store.
This gives the following kind of algorithm.

	@ r0 number of iterations
	sub		r0, r0, #1					@ n - 1 iterations
	@ Loading data
.loop:
	@ Calculations

	subs		r0, r0, #1
	beq		.exit_loop

	@ Loading data of the next iteration

	@ Recording computed data
	b			.loop
.exit_loop:

	@ Recording last computed data

The idea is to separate the end of the calculation from the data store by inserting between them the load of the next iteration’s data.
As a bonus, we have also separated the load from the calculation, which saves a few more precious cycles.

It is important to note that this reorganization is only possible if the “source” registers (used by the loads) are different from the “destination” registers (used by the stores).

Example

When I started programming NEON, I discovered the site hilbert-space.de and in particular its RGB-to-grayscale conversion code. I dissected, analyzed, modified and optimized it as much as I could for hours…
The program is extremely simple, which makes it a good subject of study.

Sample NEON assembly code

	@ build the three constants:
	mov         r3, #77
	mov         r4, #151
	mov         r5, #28
	vdup.8      d3, r3
	vdup.8      d4, r4
	vdup.8      d5, r5

.loop:
	@ load 8 pixels:
	vld3.8      {d0-d2}, [r1]!

	@ do the weight average:
	vmull.u8    q3, d0, d3
	vmlal.u8    q3, d1, d4
	vmlal.u8    q3, d2, d5
	vshrn.u16   d8, q3, #8

	@ save result
	vst1.8      {d8}, [r0]!

	subs        r2, r2, #1
	bne         .loop

I have just changed the destination register, which is now d8.
This code takes 16 cycles per iteration, as the cycle counter shows.

Note: to find the average number of cycles taken by one iteration of a loop, just read the execution cycle associated with the loop’s branch instruction.

This code has two instructions that produce stall cycles:

  • the VSHRN instruction, which we will not try to optimize here;
  • the VST1 instruction, whose performance we will try to improve.

So I apply the rule presented above:
we separate the end of the calculation from the data store by inserting between them the load of the next iteration’s data.
The result is:

Optimized NEON assembly code

	@ build the three constants:
	mov         r3, #77
	mov         r4, #151
	mov         r5, #28
	vdup.8      d3, r3
	vdup.8      d4, r4
	vdup.8      d5, r5
	sub         r2, r2, #1					@ n - 1 iterations

	@ load 8 pixels:
	vld3.8      {d0-d2}, [r1]!

.loop:
	@ do the weight average:
	vmull.u8    q3, d0, d3
	vmlal.u8    q3, d1, d4
	vmlal.u8    q3, d2, d5
	vshrn.u16   d8, q3, #8

	subs        r2, r2, #1
	beq         .exit_loop

	@ load 8 pixels:
	vld3.8      {d0-d2}, [r1]!

	@ save result
	vst1.8      {d8}, [r0]!
	b           .loop
.exit_loop:
	vst1.8      {d8}, [r0]!

Now the code takes only 13 cycles per iteration, as the cycle counter shows.

Note: it is tempting to move the load up before VSHRN, since that is the instruction that generates the most latency cycles. It is indeed a good idea: the program then runs in 12 cycles per iteration.
But that optimization is beyond the scope of this post, whose point is simply that a large majority of NEON source code (maybe all of it) can be optimized by inserting the next iteration’s load before the store of the data computed by the current iteration.

What about the data loading?

Judging by the cycle counter output, one might believe that moving the load gains nothing in terms of cycles (since there is no wait cycle on the first calculation instruction)…
In fact, that is not quite the case… This optimization has a real benefit.

The information returned by the cycle counter assumes that all memory accesses hit the cache (and that branches are always correctly predicted).
But when the data to be read are not in the cache, there is a real benefit in issuing the read as early as possible (well before the data are used). Keep in mind that, because of the NEON queue, NEON instructions are generally executed long after they are decoded by the ARM pipeline (this is why, in the cycle counter, the execution cycles of NEON instructions usually show much higher numbers than those of ARM instructions).

Note: what follows is only speculation, even if it appears to be correct!

The Cortex also has a memory-access queue.
Basically, the Cortex starts reading the data, when possible, not when the VLD3 instruction is executed by the NEON unit but when the instruction passes through the ARM pipeline.
This means that the sooner the ARM knows what you want to read, the better!

The NEON memory-access queue is limited to 8 entries. One could therefore imagine loading data more than one iteration in advance. But the program would then quickly become complicated.

Conclusion

As far as I can tell, this simple method works with every NEON program and always yields a gain of a few cycles per iteration.
Obviously, this alone is not enough to get the best performance out of your code.
Yet in the end, once you have pipelined everything that can be pipelined, this final trick can save you a little more time.

Feel free to send me a better translation of this post at pulsar[at]webshaker.net.
You can also send me a translation into another language if you want.
