*This post is a translation of “Agrandissement bilinéaire avec NEON”.*

Between two versions of the cycle counter, I felt like having a little fun with a somewhat more playful exercise.

So today, a short stop in the field of image processing… an activity where NEON is normally expected to excel.

#### Introduction

Bilinear interpolation allows you to resize an image while keeping a “correct” quality.

The algorithm is relatively simple and can easily (at least in part) exploit the computing capabilities of NEON (which is what I am talking about here!).

I will not go into the details of the algorithm. That is not the purpose here; the point is rather to see how far we can optimize this type of algorithm using NEON.

I’ll just say:

- Bilinear interpolation is just a linear interpolation performed on the two axes, X and Y.
- The algorithm is not quite the same depending on whether you want to enlarge or reduce an image.
- The algorithm can be performed in one pass or in two passes (once vertically and once horizontally).

The version we will discuss here is the version that allows you to enlarge an image.
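Before diving in, the one-dimensional building block can be sketched in plain C (a hypothetical helper, not code from the original post; it uses the same 7-bit fixed-point weights as the code that follows):

```c
/* Linear interpolation of one 8-bit component between two pixels.
   coef is a 7-bit fixed-point weight in [0, 128]:
   coef == 0 gives p1, coef == 128 gives p2. */
static inline unsigned int lerp_u8(unsigned int p1, unsigned int p2,
                                   unsigned int coef)
{
    return (p1 * (128 - coef) + p2 * coef) >> 7;
}
```

Bilinear interpolation simply applies this twice, once per axis.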

#### C Source Code

I first developed the algorithm in C, in order to have a point of comparison!

I wanted to be fair!

So I tried to optimize the C algorithm:

- I did not use floating-point numbers
- I resized the image in one pass (where NEON will probably need 2)
- I used 32-bit memory accesses and then extracted the R, G, B and A components with shifts and masks, rather than accessing memory for each component

```c
void strech_c(unsigned int *bSrc, unsigned int *bDst,
              int wSrc, int hSrc, int wDst, int hDst)
{
    unsigned int wStepFixed16b, hStepFixed16b, wCoef, hCoef, x, y;
    unsigned int pixel1, pixel2, pixel3, pixel4;
    unsigned int hc1, hc2, wc1, wc2, offsetX, offsetY;
    unsigned int r, g, b, a;

    wStepFixed16b = ((wSrc - 1) << 16) / (wDst - 1);
    hStepFixed16b = ((hSrc - 1) << 16) / (hDst - 1);

    hCoef = 0;
    for (y = 0; y < hDst; y++) {
        offsetY = (hCoef >> 16);
        hc2 = (hCoef >> 9) & 127;
        hc1 = 128 - hc2;

        wCoef = 0;
        for (x = 0; x < wDst; x++) {
            offsetX = (wCoef >> 16);
            wc2 = (wCoef >> 9) & 127;
            wc1 = 128 - wc2;

            pixel1 = *(bSrc + offsetY * wSrc + offsetX);
            pixel2 = *(bSrc + (offsetY + 1) * wSrc + offsetX);
            pixel3 = *(bSrc + offsetY * wSrc + offsetX + 1);
            pixel4 = *(bSrc + (offsetY + 1) * wSrc + offsetX + 1);

            r = ((((pixel1 >> 24) & 255) * hc1 + ((pixel2 >> 24) & 255) * hc2) * wc1
               + (((pixel3 >> 24) & 255) * hc1 + ((pixel4 >> 24) & 255) * hc2) * wc2) >> 14;
            g = ((((pixel1 >> 16) & 255) * hc1 + ((pixel2 >> 16) & 255) * hc2) * wc1
               + (((pixel3 >> 16) & 255) * hc1 + ((pixel4 >> 16) & 255) * hc2) * wc2) >> 14;
            b = ((((pixel1 >>  8) & 255) * hc1 + ((pixel2 >>  8) & 255) * hc2) * wc1
               + (((pixel3 >>  8) & 255) * hc1 + ((pixel4 >>  8) & 255) * hc2) * wc2) >> 14;
            a = ((((pixel1 >>  0) & 255) * hc1 + ((pixel2 >>  0) & 255) * hc2) * wc1
               + (((pixel3 >>  0) & 255) * hc1 + ((pixel4 >>  0) & 255) * hc2) * wc2) >> 14;

            *bDst++ = (r << 24) + (g << 16) + (b << 8) + a;

            wCoef += wStepFixed16b;
        }
        hCoef += hStepFixed16b;
    }
}
```

Bilinear filtering finds the 4 “source” pixels that are closest to the pixel to be drawn.

It calculates coefficients (here named hc1, hc2, wc1 and wc2) depending on the proximity of these pixels.

The destination pixel color is finally a weighted sum of the 4 closest pixels.

Again, this is the algorithm for enlarging an image. If you try to reduce an image with this algorithm, it should work, but it will not give a very satisfactory result.
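The `>> 14` at the end of each component computation deserves a word: since hc1 + hc2 = 128 and wc1 + wc2 = 128, the four weights always sum to 128 × 128 = 2^14, so shifting right by 14 renormalizes the weighted sum back to 8 bits. A minimal single-component sketch (a hypothetical helper, for illustration only):

```c
/* Weighted sum of one 8-bit component from the 4 neighbouring pixels.
   Requires hc1 + hc2 == 128 and wc1 + wc2 == 128, so the four weights
   sum to 1 << 14; shifting right by 14 renormalizes to 8 bits. */
static unsigned int bilinear_component(unsigned int c1, unsigned int c2,
                                       unsigned int c3, unsigned int c4,
                                       unsigned int hc1, unsigned int hc2,
                                       unsigned int wc1, unsigned int wc2)
{
    return ((c1 * hc1 + c2 * hc2) * wc1
          + (c3 * hc1 + c4 * hc2) * wc2) >> 14;
}
```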

Here is the result. (The “nearest” enlarged version was made with Photoshop!)

*(Images: source 128×100 | nearest enlarge 320×248 | bilinear filtering 320×248)*

The C version resizes the image in 8.97 ms.

#### Interpolation with NEON

NEON makes it possible to perform calculations on several pixels (or components) at a time. However, this requires being able to load the data we want to manipulate into NEON registers (and, in particular, into the right slots of those registers).

To put it simply, it is easy to process data stored consecutively in memory. On the other hand, it is much more complicated to read scattered data and bring it into NEON registers.

All that to say that the vertical interpolation will be quite easy. On the other hand, the horizontal interpolation may be quite complicated. It could even end up slower than the C version!

As it is easier (and faster) to interpolate in the vertical direction, and as the bilinear interpolation can be decomposed into two linear interpolations, the algorithm I will implement performs the following 4 steps:

- Interpolation of the image in the vertical direction
- Transposition (rotation!) of the resulting image, in order to swap the X and Y data
- Interpolation of the image in the vertical direction (again)
- Reverse transposition of the resulting image to restore its original orientation
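The four steps above can be sketched in plain C (hypothetical helper names `stretch_y` and `transpose4`; this is a scalar model of the plan, not the NEON code, and it guards the last-row read that the C version above leaves out of bounds):

```c
#include <stdlib.h>

/* Vertically stretch a w x h 32-bit image to height hDst,
   using the same 16.16 step and 7-bit weights as the C version. */
static void stretch_y(const unsigned int *src, unsigned int *dst,
                      int w, int h, int hDst)
{
    unsigned int step = ((unsigned int)(h - 1) << 16) / (hDst - 1);
    unsigned int coef = 0;
    for (int y = 0; y < hDst; y++, coef += step) {
        int offsetY = coef >> 16;
        unsigned int c2 = (coef >> 9) & 127, c1 = 128 - c2;
        const unsigned int *l1 = src + offsetY * w;
        const unsigned int *l2 = l1 + (offsetY + 1 < h ? w : 0); /* guard last row */
        for (int x = 0; x < w; x++) {
            unsigned int p = 0;
            for (int s = 0; s < 32; s += 8) {   /* each 8-bit component */
                unsigned int v = (((l1[x] >> s) & 255) * c1
                                + ((l2[x] >> s) & 255) * c2) >> 7;
                p |= v << s;
            }
            dst[y * w + x] = p;
        }
    }
}

/* Swap X and Y: src is w x h, dst becomes h x w. */
static void transpose4(const unsigned int *src, unsigned int *dst, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[x * h + y] = src[y * w + x];
}

/* Full plan: stretch Y, transpose, stretch Y again, transpose back. */
void stretch_bilinear(const unsigned int *src, unsigned int *dst,
                      int wSrc, int hSrc, int wDst, int hDst)
{
    unsigned int *tmp1 = malloc(sizeof(unsigned int) * wSrc * hDst);
    unsigned int *tmp2 = malloc(sizeof(unsigned int) * wSrc * hDst);
    unsigned int *tmp3 = malloc(sizeof(unsigned int) * hDst * wDst);
    stretch_y(src, tmp1, wSrc, hSrc, hDst);   /* wSrc wide, hDst tall */
    transpose4(tmp1, tmp2, wSrc, hDst);       /* hDst wide, wSrc tall */
    stretch_y(tmp2, tmp3, hDst, wSrc, wDst);  /* hDst wide, wDst tall */
    transpose4(tmp3, dst, hDst, wDst);        /* wDst wide, hDst tall */
    free(tmp1); free(tmp2); free(tmp3);
}
```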

Let's say it right away: “I have not tested it yet.” I might get unpleasant surprises. The algorithm does require a temporary buffer to store the transposed image, and memory accesses are clearly a weakness of our low-power processors.

#### First version

Here is a first version of a linear interpolation algorithm operating only on the Y axis:

```
strech:
    @ r0 = *src
    @ r1 = *dst
    @ r2 = width src
    @ r3 = height src
    @ r4 = width dst
    @ r5 = height dst -> r12

    push    {r4-r11, lr}
    ldr     r4, [sp, #36]
    ldr     r5, [sp, #40]

    mov     r10, r0
    mov     r11, r1
    mov     r12, r5
    mov     r9, r2
    mov     r8, r3

    lsl     r0, r8, #16          @ height src * 65536
    sub     r0, r0, #0x10000     @ (height src - 1) * 65536
    sub     r1, r5, #1           @ height dst - 1
    bl      udiv
    mov     r1, r0               @ r1 = hStepFixed16b (fixed 16 bit)
    mov     r0, #0               @ r0 = hCoef
    b       .strech_loop

    .align 4
.strech_loop:
    lsr     r2, r0, #16          @ r2 = offsetY = (hCoef >> 16)
    ubfx    r3, r0, #9, #7       @ hc2 = (hCoef >> 9) & 127
    mul     r2, r2, r9
    rsb     r4, r3, #128         @ hc1 = 128 - hc2
    vdup.u8 d31, r3
    vdup.u8 d30, r4
    add     r3, r10, r2, lsl #2  @ ptr Pixel Start
    add     r4, r3, r9, lsl #2   @ ptr Pixel Start + 1 line
    lsr     r5, r9, #2           @ nb of 4-pixel blocks in one line

.strech_copy_line:
    vld1.32 {q0}, [r3]!          @ load 4 pixels 1
    vld1.32 {q2}, [r4]!          @ load 4 pixels 2
    vmull.u8 q4, d0, d30         @ pixel1 * (1 - coef)
    vmlal.u8 q4, d4, d31         @ + pixel2 * coef
    vmull.u8 q5, d1, d30         @ pixel1 * (1 - coef)
    vmlal.u8 q5, d5, d31         @ + pixel2 * coef
    vqrshrn.u16 d16, q4, #7      @ reduce to 8 bit
    vqrshrn.u16 d17, q5, #7      @ reduce to 8 bit
    vst1.32 {q8}, [r11]!         @ write result
    subs    r5, r5, #1
    bne     .strech_copy_line

    add     r0, r0, r1           @ hCoef += hStepFixed16b
    subs    r12, r12, #1
    bne     .strech_loop

    pop     {r4-r11, pc}
```

Some comments about this implementation:

- We need a division (the code is given just after these notes)
- The ARM code between “strech_loop” and “strech_copy_line” will run at the same time as the NEON code; its running time should be zero (or, more exactly, not significant)
- For each block of 4 pixels, I must do two reads and one write
- No optimization has been done on this code yet (as it will change greatly)

Here is the division function (taken from the ARM website)…

```
udiv:
    mov     r3, r1               @ put divisor in r3
    cmp     r3, r0, lsr #1       @ double it until
.udiv_j1:
    movls   r3, r3, lsl #1       @ 2 * r3 > r0
    cmp     r3, r0, lsr #1
    bls     .udiv_j1             @ the b means search backwards
    mov     r2, #0               @ initialize quotient
.udiv_j2:
    cmp     r0, r3               @ can we subtract r3?
    subcs   r0, r0, r3           @ if we can, do so
    adc     r2, r2, r2           @ double r2
    mov     r3, r3, lsr #1       @ halve r3,
    cmp     r3, r1               @ and loop until
    bhs     .udiv_j2             @ less than divisor
    mov     r1, r0
    mov     r0, r2
    mov     pc, lr
```
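For readers more comfortable with C, here is the same shift-and-subtract scheme as a C sketch (a hypothetical mirror of the routine above, which returns the quotient in r0 and the remainder in r1):

```c
/* Unsigned shift-and-subtract division, mirroring the udiv routine:
   align the divisor upward, then subtract and shift back down.
   den must be non-zero. Returns the quotient; *rem gets the remainder. */
unsigned int udiv_c(unsigned int num, unsigned int den, unsigned int *rem)
{
    unsigned int q = 0, d = den;
    while (d <= (num >> 1))      /* double the divisor while 2*d <= num */
        d <<= 1;
    while (d >= den) {           /* subtract and halve back down */
        q <<= 1;
        if (num >= d) { num -= d; q |= 1; }
        d >>= 1;
    }
    *rem = num;
    return q;
}
```

In this post the division is only used once per resize, to compute the 16.16 fixed-point step `((hSrc - 1) << 16) / (hDst - 1)`, so its cost is negligible.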

Here is the result:

*(Images: source 128×100 | bilinear filtering, Y axis only, 128×248)*

This first unoptimized NEON version (which is far from doing everything the C version given above does) scales the image on the **Y axis** in 0.18 ms.

The NEON program therefore runs 49 times faster. But we have yet to see the cost of the two transpositions and of the second interpolation.

On the other hand, we can hope to optimize this first NEON code. Let's say that a complete algorithm running 10 to 15 times faster than the C code would be a correct result, and that is the goal I set for myself.

#### Transpose

The quickest way to transpose a 4×4 matrix of 32-bit data (RGBA pixels here) contained in registers q0, q1, q2 and q3 is this:

```
vtrn.32 q0, q1
vtrn.32 q2, q3
vswp    d1, d4
vswp    d3, d6
```

This code runs in 6 cycles… which is perfectly acceptable.
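For readers without the NEON reference handy, here is a plain-C model of what that sequence does (an illustration, not code from the post): `vtrn.32` exchanges the off-diagonal 32-bit elements within each register pair, and each `vswp` exchanges a 64-bit half of one q register with a half of another.

```c
/* Plain-C model of the NEON 4x4 transpose sequence.
   m[i] plays the role of register qi (4 x 32-bit lanes). */
static void swap_u32(unsigned int *a, unsigned int *b)
{
    unsigned int t = *a; *a = *b; *b = t;
}

/* vtrn.32 qa, qb: transpose each 2x2 sub-block of 32-bit lanes. */
static void vtrn32(unsigned int *qa, unsigned int *qb)
{
    swap_u32(&qa[1], &qb[0]);
    swap_u32(&qa[3], &qb[2]);
}

static void transpose4x4(unsigned int m[4][4])
{
    vtrn32(m[0], m[1]);            /* vtrn.32 q0, q1 */
    vtrn32(m[2], m[3]);            /* vtrn.32 q2, q3 */
    swap_u32(&m[0][2], &m[2][0]);  /* vswp d1, d4 (a 64-bit half swap, */
    swap_u32(&m[0][3], &m[2][1]);  /*  done lane by lane here)         */
    swap_u32(&m[1][2], &m[3][0]);  /* vswp d3, d6 */
    swap_u32(&m[1][3], &m[3][1]);
}
```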

We're not going to make 4 separate passes over the image; it would take too much time (although it should be tested). We will instead try to perform the Y-axis interpolation and the transposition in the same stride.

This will not be simple!

We will not have enough ARM registers available. We'll have to see how to save these precious resources without resorting to long and greedy memory accesses.

#### Interpolation and transposition in the same pass

To reduce the number of ARM registers needed, we will use post-incremented addressing. The two read accesses are always performed on two consecutive lines, so we can use the same pointer for both reads. We therefore need 4 pointers + 2 pointer offsets.

For writing, it's even simpler: one offset register is sufficient (in addition to the register containing the destination buffer pointer)!

We have 15 usable registers (yes, we're not going to use r15!).

That still leaves us 8!

With 8 registers, we can manage…

The registers will be used like this:

```
@ r0  : hCoef
@ r1  : hStepFixed16b
@ r2  : working Src Ptr1
@ r3  : working Src Ptr2
@ r4  : working Src Ptr3
@ r5  : working Src Ptr4
@ r6  : trash register // nb blocks of 4 pixels (used in strech_copy_line loop)
@ r7  : trash register // working ptr Dest Image
@ r8  : hSrc // offset Next DstLine
@ r9  : wSrc // offset Next SrcLine
@ r10 : ptr Source Image
@ r11 : ptr Dest Image
@ r12 : offset Previous SrcLine + 16
@ r13 : preload Ptr
@ r14 : hDst (used for strech_loop loop)
```

As we will need r13 (used here to preload the cache), we cannot use the stack to save registers. This is not very serious; we will use a data area instead.

To achieve acceptable performance, we will have to impose some constraints:

- Our memory buffers are aligned on addresses that are multiples of 8. (Personally, I use addresses aligned on multiples of 16!)
- The image sizes are multiples of 4 in both width and height. A picture may therefore be contained in a memory area a little larger than necessary. (Personally, I use multiples of 8: it is the size of a JPEG DCT block.)
- Our pixels are 32 bits, even if the alpha component is absent. Again, the extra bandwidth required should be compensated by the simplification of the processing.

So here is the complete algorithm, which performs interpolation and transposition in a single pass:

```
strech:
    @ r0  : hCoef
    @ r1  : hStepFixed16b
    @ r2  : working Src Ptr1
    @ r3  : working Src Ptr2
    @ r4  : working Src Ptr3
    @ r5  : working Src Ptr4
    @ r6  : trash register // nb blocks of 4 pixels (used in strech_copy_line loop)
    @ r7  : trash register // working ptr Dest Image
    @ r8  : hSrc // offset Next DstLine
    @ r9  : wSrc // offset Next SrcLine
    @ r10 : ptr Source Image
    @ r11 : ptr Dest Image
    @ r12 : offset Previous SrcLine + 16
    @ r13 : preload Ptr
    @ r14 : hDst (used for strech_loop loop)

    movw    r12, #:lower16:.strech_save_buffer
    movt    r12, #:upper16:.strech_save_buffer
    stmia   r12, {r4-r11, r13, lr}  @ save all registers

    mov     r10, r0              @ r10 : ptr Source Image
    mov     r11, r1              @ r11 : ptr Dest Image
    ldr     r12, [sp, #4]        @ r12 : hDst
    mov     r9, r2               @ r9  : wSrc
    mov     r8, r3               @ r8  : hSrc

    sub     r0, r8, #1           @ hSrc - 1
    sub     r1, r12, #1          @ hDst - 1
    lsl     r0, r0, #16          @ (hSrc - 1) * 65536
    bl      udiv
    mov     r1, r0               @ r1 : hStepFixed16b (fixed 16 bit)
    mov     r0, #0               @ r0 : hCoef
    mov     r14, r12             @ r14 : hDst

    lsl     r9, r9, #2           @ r9 : wSrc * 4 (bytes per src line). Used to jump to the next src line.
    lsl     r8, r14, #2          @ r8 : hDst * 4 (bytes per dst line after transpose). Used to jump to the next dst line.
    rsb     r12, r9, #16         @ r12 : offset previous line + 16 bytes. Used to go back to the previous line and skip 16 bytes.

.strech_loop:
    lsr     r2, r0, #16          @ r2 = offsetY = (hCoef >> 16)
    ubfx    r6, r0, #9, #7       @ hc2 = (hCoef >> 9) & 127
    mla     r2, r2, r9, r10      @ r2 : ptr Source 1
    rsb     r7, r6, #128         @ hc1 = 128 - hc2
    add     r0, r0, r1
    vdup.u8 d31, r6
    vdup.u8 d30, r7

    lsr     r3, r0, #16          @ r3 = offsetY = (hCoef >> 16)
    ubfx    r6, r0, #9, #7       @ hc2 = (hCoef >> 9) & 127
    mla     r3, r3, r9, r10      @ r3 : ptr Source 2
    rsb     r7, r6, #128         @ hc1 = 128 - hc2
    add     r0, r0, r1
    vdup.u8 d29, r6
    vdup.u8 d28, r7

    lsr     r4, r0, #16          @ r4 = offsetY = (hCoef >> 16)
    ubfx    r6, r0, #9, #7       @ hc2 = (hCoef >> 9) & 127
    mla     r4, r4, r9, r10      @ r4 : ptr Source 3
    rsb     r7, r6, #128         @ hc1 = 128 - hc2
    add     r0, r0, r1
    vdup.u8 d27, r6
    vdup.u8 d26, r7

    lsr     r5, r0, #16          @ r5 = offsetY = (hCoef >> 16)
    ubfx    r6, r0, #9, #7       @ hc2 = (hCoef >> 9) & 127
    mla     r5, r5, r9, r10      @ r5 : ptr Source 4
    rsb     r7, r6, #128         @ hc1 = 128 - hc2
    add     r0, r0, r1
    vdup.u8 d25, r6
    vdup.u8 d24, r7

    add     r13, r2, r9, lsl #2  @ r13 : ptr on the next 4 lines of pixels. Used for preload.
    lsr     r6, r9, #4           @ nb blocks of 4 pixels
    mov     r7, r11              @ r7 : working Dest Ptr

.strech_copy_line:
    vld1.32 {q2}, [r2:128], r9   @ load 4 pixels (line 1 src 1) + jump to next line
    vld1.32 {q3}, [r2:128], r12  @ load 4 pixels (line 1 src 2) + jump to previous line + 16
    vld1.32 {q4}, [r3:128], r9   @ load 4 pixels (line 2 src 1) + jump to next line
    vld1.32 {q5}, [r3:128], r12  @ load 4 pixels (line 2 src 2) + jump to previous line + 16
    vld1.32 {q6}, [r4:128], r9   @ load 4 pixels (line 3 src 1) + jump to next line
    vld1.32 {q7}, [r4:128], r12  @ load 4 pixels (line 3 src 2) + jump to previous line + 16
    vld1.32 {q8}, [r5:128], r9   @ load 4 pixels (line 4 src 1) + jump to next line
    vld1.32 {q9}, [r5:128], r12  @ load 4 pixels (line 4 src 2) + jump to previous line + 16

    pld     [r13]
    add     r13, r13, #128

    vmull.u8 q0, d4, d30         @ pixel1 * (1 - coef)
    vmlal.u8 q0, d6, d31         @ + pixel2 * coef
    vmull.u8 q1, d5, d30         @ pixel1 * (1 - coef)
    vmlal.u8 q1, d7, d31         @ + pixel2 * coef
    vmull.u8 q2, d8, d28         @ pixel1 * (1 - coef)
    vmlal.u8 q2, d10, d29        @ + pixel2 * coef
    vmull.u8 q3, d9, d28         @ pixel1 * (1 - coef)
    vmlal.u8 q3, d11, d29        @ + pixel2 * coef
    vmull.u8 q4, d12, d26        @ pixel1 * (1 - coef)
    vmlal.u8 q4, d14, d27        @ + pixel2 * coef
    vmull.u8 q5, d13, d26        @ pixel1 * (1 - coef)
    vmlal.u8 q5, d15, d27        @ + pixel2 * coef
    vmull.u8 q6, d16, d24        @ pixel1 * (1 - coef)
    vmlal.u8 q6, d18, d25        @ + pixel2 * coef
    vmull.u8 q7, d17, d24        @ pixel1 * (1 - coef)
    vmlal.u8 q7, d19, d25        @ + pixel2 * coef

    vqrshrn.u16 d16, q0, #7      @ reduce to 8 bit
    vqrshrn.u16 d17, q4, #7
    vqrshrn.u16 d18, q2, #7
    vqrshrn.u16 d19, q6, #7
    vqrshrn.u16 d20, q1, #7
    vqrshrn.u16 d21, q5, #7
    vqrshrn.u16 d22, q3, #7
    vqrshrn.u16 d23, q7, #7

    vtrn.32 q8, q9               @ transpose
    vtrn.32 q10, q11

    vst1.32 {q8}, [r7:128], r8   @ write result
    vst1.32 {q9}, [r7:128], r8
    vst1.32 {q10}, [r7:128], r8
    vst1.32 {q11}, [r7:128], r8

    subs    r6, r6, #1
    bne     .strech_copy_line

    add     r11, r11, #16        @ r11 : translate for next lane
    subs    r14, r14, #4
    bne     .strech_loop

    movw    r12, #:lower16:.strech_save_buffer
    movt    r12, #:upper16:.strech_save_buffer
    ldmia   r12, {r4-r11, r13, pc} @ reload all registers and return
```

*(Images: source 128×100 | enlarge & transpose 248×128 | NEON bilinear filtering 320×248)*

When you copy the code into the cycle counter, you may believe you will get interesting performance. In fact, this is not really the case!

To resize the image, you must call the previous code twice…

The resizing finally runs in 0.83 ms – much slower than I had hoped!

The reason is probably “cache misses”! Reading several distinct lines of data simultaneously has been fatal to the Cortex.

Moreover, our source picture is very small. The bigger it gets, the more significantly the performance will drop.

#### Conclusion

For the given sample, the assembly version is 10.8 times faster than the C version. This is not bad! But we were entitled to expect better!

Well, I do not admit defeat yet! It should be possible to optimize this code a little more. I will continue my investigations (or even test other approaches. Why not try the single-pass algorithm?).

*Update of 27/05/2011: Well! There is not much left to gain. By reorganizing the instructions, I managed to reduce the execution time to 0.74 ms (12.1 times faster than the C version).
The single-pass version was extremely complex to write, and it runs much slower than the version proposed here.
I also tried to process two 4×4 blocks of pixels per iteration, but the result was slightly worse!!!*

I'll publish another post on bilinear interpolation used in the context of image reduction. Reduction requires more computation and fewer random memory accesses, all of which makes NEON much more efficient. I do not know if image reduction is faster than image enlargement, but I'm pretty sure the performance gap between C and assembly will be much greater.

*Feel free to send me a better translation of this post at pulsar[at]webshaker.net
You can also send me a translation to another language if you want.*

Hi there,

I just found your webpage while searching for people who code NEON stuff. Your bilinear zooming algorithm is really nice. I did something like this without NEON many many years ago on ARM. I also did some experiments with NEON on the BeagleBoard regarding Mandelbrot fractals to utilize the full SIMD capability for that iteration algorithm. That small Application is there: http://www.mikusite.de/riscos/FracNEONVFP.zip

I just have to say that I don't use GCC at the moment, as I'm running my BeagleBoard on the RiscOS operating system. Quite rare but fun. So you might not be able to compile it directly, as I use a special assembler, but the source code should be readable anyway. So just e-mail me if you are interested in exchanging some experience… it's rare to find assembler coders on ARM nowadays.

Yours,

Michael

Hi.

I'll have a look at your code for sure.

I had an Archimedes 20 years ago! So I know RiscOS. When I bought my BeagleBoard, I'd have liked to install RiscOS too, but I needed a RiscOS machine to format the SD card, and I do not have one…

I'll contact you because… yes, I'm still looking for assembly coders.

See You soon Michael !

Hello, do you have any example of how to call the assembly algorithm from the Android NDK using C?

Thanks

Hello, I’ve searched your blog but I cannot find the algorithm to reduce an image. Can you help me?

Alex