Bilinear enlarge with NEON

May 25, 2011

This post is a translation of “Agrandissement bilinéaire avec NEON”.

Between two versions of the cycle counter, I felt like having a little fun with a slightly more playful exercise.
So today, a small stop in the field of image processing… an activity where NEON is normally expected to excel.

Introduction

Bilinear interpolation allows you to resize an image while keeping a “correct” quality.
The algorithm is relatively simple and can easily (at least in part) exploit the computing capabilities of NEON (which is what I am talking about here!).
I will not go into the details of the algorithm; that is not the purpose here. The purpose is rather to see how far we can optimize this type of algorithm using NEON.

I’ll just say:

  • Bilinear interpolation is just a linear interpolation performed on the two axes X and Y.
  • The algorithm is not quite the same if you want to enlarge or reduce an image.
  • The algorithm can be done in one pass or in two passes (once vertically and once horizontally).

The version we will discuss here is the version that allows you to enlarge an image.
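
For reference, writing wx and wy for the fractional offsets along X and Y (values between 0 and 1), and p1, p2, p3, p4 for the four neighbouring source pixels with the same numbering as in the C code below (p1 top-left, p2 bottom-left, p3 top-right, p4 bottom-right), the interpolated value is:

	p = (1 - wx) * ((1 - wy) * p1 + wy * p2) + wx * ((1 - wy) * p3 + wy * p4)

The formula is applied independently to the R, G, B and A components.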

C Source Code

I first developed the algorithm in C in order to have a point of comparison!

I wanted to be fair!
I’ve tried to optimize the C algorithm!

  • I did not use floating point numbers
  • I resized the image in one pass (where NEON will probably need 2)
  • I used 32-bit memory accesses and then extracted the R, G, B and A components with shifts and masks, rather than accessing memory once for each component
void strech_c(unsigned int *bSrc, unsigned int *bDst, int wSrc, int hSrc, int wDst, int hDst)
{
	unsigned int wStepFixed16b, hStepFixed16b, wCoef, hCoef, x, y;
	unsigned int pixel1, pixel2, pixel3, pixel4;
	unsigned int hc1, hc2, wc1, wc2, offsetX, offsetY;
	unsigned int r, g, b, a;

	wStepFixed16b = ((wSrc - 1) << 16) / (wDst - 1);
	hStepFixed16b = ((hSrc - 1) << 16) / (hDst - 1);

	hCoef = 0;

	for (y = 0 ; y < hDst ; y++)
	{
		offsetY = (hCoef >> 16);
		hc2 = (hCoef >> 9) & 127;
		hc1 = 128 - hc2;

		wCoef = 0;
		for (x = 0 ; x < wDst ; x++)
		{
			offsetX = (wCoef >> 16);
			wc2 = (wCoef >> 9) & 127;
			wc1 = 128 - wc2;

			pixel1 = *(bSrc + offsetY * wSrc + offsetX);
			pixel2 = *(bSrc + (offsetY + 1) * wSrc + offsetX);
			pixel3 = *(bSrc + offsetY * wSrc + offsetX + 1);
			pixel4 = *(bSrc + (offsetY + 1) * wSrc + offsetX + 1);

			r = ((((pixel1 >> 24) & 255) * hc1 + ((pixel2 >> 24) & 255) * hc2) * wc1 +
				(((pixel3 >> 24) & 255) * hc1 + ((pixel4 >> 24) & 255) * hc2) * wc2) >> 14;
			g = ((((pixel1 >> 16) & 255) * hc1 + ((pixel2 >> 16) & 255) * hc2) * wc1 +
				(((pixel3 >> 16) & 255) * hc1 + ((pixel4 >> 16) & 255) * hc2) * wc2) >> 14;
			b = ((((pixel1 >> 8) & 255) * hc1 + ((pixel2 >> 8) & 255) * hc2) * wc1 +
				(((pixel3 >> 8) & 255) * hc1 + ((pixel4 >> 8) & 255) * hc2) * wc2) >> 14;
			a = ((((pixel1 >> 0) & 255) * hc1 + ((pixel2 >> 0) & 255) * hc2) * wc1 +
				(((pixel3 >> 0) & 255) * hc1 + ((pixel4 >> 0) & 255) * hc2) * wc2) >> 14;

			*bDst++ = (r << 24) + (g << 16) + (b << 8) + (a);

			wCoef += wStepFixed16b;
		}
		hCoef += hStepFixed16b;
	}
}

Bilinear filtering finds the 4 "source" pixels that are closest to the pixel to be drawn.
It computes coefficients (here named hc1, hc2, wc1 and wc2) depending on the proximity of these pixels.
The destination pixel color is finally a weighted sum of the 4 closest pixels.
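
A quick worked example of the fixed-point arithmetic (my own numbers, not taken from the original post): take one color channel with p1 = 100, p2 = 200, p3 = 50, p4 = 250 and coefficients hc2 = 96, hc1 = 32, wc2 = 32, wc1 = 96 (i.e. 75% of the way down and 25% of the way right):

	vertical left  = 100*32 + 200*96 = 22400
	vertical right =  50*32 + 250*96 = 25600
	result = (22400*96 + 25600*32) >> 14 = 2969600 >> 14 = 181

The floating-point result would be 181.25. Since hc1 + hc2 = wc1 + wc2 = 128, the products carry a scale factor of 128 * 128 = 16384 = 2^14, hence the final shift by 14.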

Again, this is the algorithm for enlarging an image. If you try to reduce an image with this algorithm, it should work but will not produce a very satisfactory result.

Here is the result. (The "nearest" enlarged version was made with Photoshop!)

[Images: source image 128×100 / nearest-neighbor enlargement 320×248 / bilinear filtering 320×248]

The C version resizes the image in 8.97 ms.

Interpolation with NEON

NEON makes it possible to perform calculations on several pixels (or components) at a time. However, this requires that we can load the data we want to manipulate into NEON registers (and into the right slots of those registers).
To put it simply, it is easy to process data stored consecutively in memory. On the other hand, it is much more complicated to read scattered data and bring it into NEON registers.
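
To illustrate the difference, here is a minimal sketch using NEON intrinsics (my own illustration, not code from the post): consecutive pixels can be loaded with a single instruction, while scattered pixels must be gathered one lane at a time.

#include <arm_neon.h>
#include <stdint.h>

/* 4 adjacent 32-bit pixels: one load fills a whole q register */
static uint32x4_t load_consecutive(const uint32_t *src)
{
	return vld1q_u32(src);
}

/* 4 scattered pixels: one memory access per lane, much less efficient */
static uint32x4_t load_scattered(const uint32_t *src, int i0, int i1, int i2, int i3)
{
	uint32x4_t v = vdupq_n_u32(0);
	v = vld1q_lane_u32(src + i0, v, 0);
	v = vld1q_lane_u32(src + i1, v, 1);
	v = vld1q_lane_u32(src + i2, v, 2);
	v = vld1q_lane_u32(src + i3, v, 3);
	return v;
}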

All that to say that the vertical interpolation will be quite easy to do. On the other hand, the horizontal interpolation may be quite complicated. It could even turn out slower than the C version!

Since it is easier (and faster) to interpolate in the vertical direction, and since bilinear interpolation can be decomposed into two linear interpolations, the algorithm I will implement performs the following 4 steps:

  • Interpolation of the image in the vertical direction
  • Transposition (rotation!) of the resulting image in order to swap the X and Y data
  • Interpolation of the image in the vertical direction (again)
  • Reverse transposition of the resulting image to restore the original orientation.

Let me say it right now: “I have not tested this yet”. I might be in for unpleasant surprises. The algorithm does require a temporary buffer for storing the transposed image, and memory accesses are clearly a weakness of our low-power processors.
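
To make the two-pass flow concrete, here is a rough C sketch of the driver (my own illustration; stretch_y_transpose is a hypothetical wrapper around the NEON routine developed below, not a function from the post):

#include <stdlib.h>

/* Hypothetical prototype: enlarges src vertically from hSrc to hDst lines
   (width wSrc unchanged) and writes the result transposed. */
void stretch_y_transpose(const unsigned int *src, unsigned int *dst,
                         int wSrc, int hSrc, int hDst);

void stretch_bilinear(const unsigned int *src, unsigned int *dst,
                      int wSrc, int hSrc, int wDst, int hDst)
{
	/* temporary buffer for the transposed intermediate image (hDst wide, wSrc high) */
	unsigned int *tmp = malloc((size_t)wSrc * hDst * 4);

	stretch_y_transpose(src, tmp, wSrc, hSrc, hDst);	/* pass 1: Y scaling, then transpose */
	stretch_y_transpose(tmp, dst, hDst, wSrc, wDst);	/* pass 2: former X axis, now vertical */

	free(tmp);
}

In a real implementation the buffers would also have to respect the alignment and size constraints discussed later in this post.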

First version

Here is a first version of the linear interpolation algorithm, operating only on the Y axis.

strech:
@ r0 = *src
@ r1 = *dst
@ r2 = width src
@ r3 = height src
@ r4 = width dst
@ r5 = height dst -> r12

	push			{r4-r11, lr}
	ldr			r4, [sp, #36]
	ldr			r5, [sp, #40]

	mov			r10, r0
	mov			r11, r1
	mov			r12, r5
	mov			r9, r2
	mov			r8, r3

	lsl			r0, r8, #16							@ height src * 65536
	sub			r0, r0, #0x10000					@ height src - 1
	sub			r1, r5, #1							@ height dst - 1

	bl				udiv
	mov			r1, r0								@ r1 = hStepFixed16b (fixed 16bit)
	mov			r0, #0								@ r0 = hCoef 

	b				.strech_loop
	.align		4
.strech_loop:
	lsr			r2, r0, #16							@ r2 = offsetY = (hCoef >> 16);
	ubfx			r3, r0, #9, #7						@ hc2 = (hCoef >> 9) & 127;
	mul			r2, r2, r9
	rsb			r4, r3, #128						@ hc1 = 128 - hc2; 

	vdup.u8		d31, r3
	vdup.u8		d30, r4

	add			r3, r10, r2, lsl #2				@ ptr Pixel Start
	add			r4, r3, r9, lsl #2				@ ptr Pixel Start + 1 line
	lsr			r5, r9, #2							@ nb of 4-pixel blocks in one line
.strech_copy_line:
	vld1.32		{q0}, [r3]!							@ load 4 pixel 1
	vld1.32		{q2}, [r4]!							@ load 4 pixel 2
	vmull.u8		q4, d0, d30							@ pixel1 * (1 - coef)
	vmlal.u8		q4, d4, d31							@ + pixel2 * coef
	vmull.u8		q5, d1, d30							@ pixel1 * (1 - coef)
	vmlal.u8		q5, d5, d31							@ + pixel2 * coef

	vqrshrn.u16 d16, q4, #7							@ reduce to 8 bit
	vqrshrn.u16 d17, q5, #7							@ reduce to 8 bit

	vst1.32		{q8}, [r11]!						@ write result
	subs			r5, r5, #1
	bne			.strech_copy_line 

	add			r0, r0, r1							@ hCoef += hStepFixed16b;
	subs			r12, r12, #1
	bne			.strech_loop 

	pop			{r4-r11, pc}

Some comments about this implementation:

  • We need a division routine (it is given just after these notes)
  • The ARM code between “strech_loop” and “strech_copy_line” will run at the same time as the NEON code, so its running time should be zero (or, more exactly, not significant)
  • For each block of 4 pixels, I must do two reads and 1 write
  • No optimization has been done on this code (as it will greatly change)
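
For readers more comfortable with C, here is a rough intrinsics sketch of what the inner loop (strech_copy_line) computes, blending two source lines 4 RGBA pixels at a time. This is my own illustration under the assumption that the width is a multiple of 4; it is not the author's code.

#include <arm_neon.h>
#include <stdint.h>

/* Blend line0 and line1 with 7-bit fixed-point coefficients hc1 + hc2 = 128 */
static void blend_lines_neon(const uint8_t *line0, const uint8_t *line1,
                             uint8_t *dst, int widthPixels,
                             unsigned int hc1, unsigned int hc2)
{
	uint8x8_t c1 = vdup_n_u8((uint8_t)hc1);		/* d30 in the assembly */
	uint8x8_t c2 = vdup_n_u8((uint8_t)hc2);		/* d31 in the assembly */

	for (int x = 0; x < widthPixels; x += 4) {
		uint8x16_t p0 = vld1q_u8(line0 + 4 * x);	/* 4 pixels from line y     */
		uint8x16_t p1 = vld1q_u8(line1 + 4 * x);	/* 4 pixels from line y + 1 */

		uint16x8_t lo = vmull_u8(vget_low_u8(p0), c1);	/* pixel1 * (1 - coef) */
		lo = vmlal_u8(lo, vget_low_u8(p1), c2);			/* + pixel2 * coef     */
		uint16x8_t hi = vmull_u8(vget_high_u8(p0), c1);
		hi = vmlal_u8(hi, vget_high_u8(p1), c2);

		/* divide by 128 with rounding and narrow back to 8 bits */
		uint8x16_t out = vcombine_u8(vqrshrn_n_u16(lo, 7), vqrshrn_n_u16(hi, 7));
		vst1q_u8(dst + 4 * x, out);
	}
}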

Here is the division function (taken from the ARM website)…

udiv:
	mov			r3, r1								@ Put divisor in r3
	cmp			r3, r0, lsr #1						@ double it until
.udiv_j1:
	movls			r3, r3, lsl #1						@ 2 * r3 > r0
	cmp			r3, r0, lsr #1
	bls			.udiv_j1								@ the b means search backwards
	mov			r2, #0								@ initialize quotient
.udiv_j2:
	cmp			r0, r3								@ can we subtract r3?
	subcs			r0, r0, r3							@ if we can, do so
	adc			r2, r2, r2							@ double r2
	mov			r3, r3, lsr #1						@ halve r3,
	cmp			r3, r1								@ and loop until
	bhs			.udiv_j2								@ less than divisor
	mov			r1, r0
	mov			r0, r2
	mov			pc, lr
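
For reference, here is a rough C equivalent of what this routine computes (a sketch of mine, not the author's code): an unsigned shift-and-subtract division. On return, the assembly version leaves the quotient in r0 and the remainder in r1.

/* unsigned division by repeated shift-and-subtract (divisor must be non-zero) */
static void udiv_c(unsigned int dividend, unsigned int divisor,
                   unsigned int *quotient, unsigned int *remainder)
{
	unsigned int d = divisor;
	unsigned int q = 0;

	/* double the divisor while twice its value would not exceed the dividend */
	while (d <= (dividend >> 1))
		d <<= 1;

	/* shift back down, producing one quotient bit per step */
	do {
		q <<= 1;
		if (dividend >= d) {
			dividend -= d;
			q |= 1;
		}
		d >>= 1;
	} while (d >= divisor);

	*quotient = q;
	*remainder = dividend;
}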

Here is the result

[Images: source image 128×100 / bilinear filtering (Y axis only) 128×248]

This first unoptimized NEON version (which is far from doing the same thing as the C version given above) scales the image on the Y axis in 0.18 ms.
The NEON program therefore runs 49 times faster. But we have yet to see the cost of the two transpositions and the second interpolation.
On the other hand, we can hope to optimize this first NEON code. Let us say that a complete algorithm running 10 to 15 times faster than the C code would be a decent result, and that is the goal I set for myself.

Transpose

The quickest way to transpose a 4×4 matrix of 32-bit data (RGBA pixels here) contained in the registers q0, q1, q2 and q3 is the following.

vtrn.32		q0, q1
vtrn.32		q2, q3
vswp			d1, d4
vswp			d3, d6

This code runs in 6 cycles… which is perfectly acceptable.
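
If you prefer intrinsics, here is a sketch of the same 4×4 transpose (my own illustration, not taken from the post); it reproduces the effect of the two vtrn.32 and the two vswp above.

#include <arm_neon.h>

static void transpose_4x4_u32(uint32x4_t *r0, uint32x4_t *r1,
                              uint32x4_t *r2, uint32x4_t *r3)
{
	/* vtrn.32 q0, q1 and vtrn.32 q2, q3 */
	uint32x4x2_t t01 = vtrnq_u32(*r0, *r1);
	uint32x4x2_t t23 = vtrnq_u32(*r2, *r3);

	/* vswp d1, d4 and vswp d3, d6 : exchange the 64-bit halves */
	*r0 = vcombine_u32(vget_low_u32(t01.val[0]), vget_low_u32(t23.val[0]));
	*r1 = vcombine_u32(vget_low_u32(t01.val[1]), vget_low_u32(t23.val[1]));
	*r2 = vcombine_u32(vget_high_u32(t01.val[0]), vget_high_u32(t23.val[0]));
	*r3 = vcombine_u32(vget_high_u32(t01.val[1]), vget_high_u32(t23.val[1]));
}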

We are not going to make 4 separate passes over our image; it would take too much time (although it should be tested). We will try to perform the Y-axis interpolation and the transposition in a single pass.
This will not be simple!
We will not have enough ARM registers available. We will have to see how to spare these precious resources without resorting to long and greedy memory accesses.

Interpolation and transposition in the same pass

To reduce the number of ARM registers needed, we will use post-increment addressing. The two read accesses are always performed on two consecutive lines, so we can use the same pointer for both readings. We therefore need 4 pointers plus 2 line-offset registers.
For writing it is even simpler: one register is sufficient (in addition to the register containing the pointer to the destination buffer)!
We have 15 usable registers (no, we are not going to use r15!).
That leaves us 8!
With 8 registers, we can manage…

The registers will be used like this

@r0 : hCoef
@r1 : hStepFixed16b
@r2 : working Src Ptr1
@r3 : working Src Ptr2
@r4 : working Src Ptr3
@r5 : working Src Ptr4
@r6 : trash register // nb Bloc of 4 pixel (used in strech_copy_line loop)
@r7 : trash register // working ptr Dest Image
@r8 : hSrc // offset Next DstLine
@r9 : wSrc // offset Next SrcLine
@r10 : ptr Source Image
@r11 : ptr Dest Image
@r12 : offset Previous SrcLine  + 16
@r13 : preload Ptr.
@r14 : hDst (used for strech_loop loop)

As we will need r13 (used here to hold a cache-preload pointer), we cannot use the stack to save the registers. This is not a big problem: we will use a data area instead (the .strech_save_buffer label referenced in the listing below; its definition, a buffer of at least 10 words, is not shown here).

To achieve acceptable performance, we will have to impose a few constraints.

  • Our memory buffers are aligned on addresses that are multiples of 8 (personally, I use addresses aligned on multiples of 16; see the allocation sketch after this list!)
  • The image sizes are multiples of 4 in both width and height. A picture may then occupy a memory area a little larger than strictly necessary. (Personally, I use a multiple of 8: it is the size of a JPEG DCT block.)
  • Our pixels are 32 bits, even if the alpha component is absent. Again, the extra bandwidth required should be compensated for by the simpler processing.
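
Here is a minimal sketch of how a caller could satisfy those constraints (my own illustration, not code from the post): a 16-byte aligned buffer with dimensions rounded up to multiples of 4, pixels always stored as 32-bit values.

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

static unsigned int *alloc_image(int width, int height)
{
	int paddedW = (width  + 3) & ~3;	/* round up to a multiple of 4 */
	int paddedH = (height + 3) & ~3;
	void *buf = NULL;

	/* posix_memalign guarantees the requested alignment (16 bytes here) */
	if (posix_memalign(&buf, 16, (size_t)paddedW * paddedH * 4) != 0)
		return NULL;
	return buf;
}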

Here is the complete algorithm, which performs the interpolation and the transposition in a single pass.

strech:
@r0 : hCoef
@r1 : hStepFixed16b
@r2 : working Src Ptr1
@r3 : working Src Ptr2
@r4 : working Src Ptr3
@r5 : working Src Ptr4
@r6 : trash register // nb Bloc of 4 pixel (used in strech_copy_line loop)
@r7 : trash register // working ptr Dest Image
@r8 : hSrc // offset Next DstLine
@r9 : wSrc // offset Next SrcLine
@r10 : ptr Source Image
@r11 : ptr Dest Image
@r12 : offset Previous SrcLine  + 16
@r13 : preload Ptr.
@r14 : hDst (used for strech_loop loop)

	movw        r12, #:lower16:.strech_save_buffer
	movt        r12, #:upper16:.strech_save_buffer
	stmia			r12, {r4-r11, r13, lr}			@ saving all registers

	mov         r10, r0								@ r10 : ptr Source Image
	mov         r11, r1								@ r11 : ptr Dest Image
	ldr         r12, [sp, #4]						@ r12 : hDst
	mov			r9, r2								@ r9 : wSrc
	mov			r8, r3								@ r8 : hSrc

	sub			r0, r8, #1							@ hSrc - 1
	sub         r1, r12, #1							@ hDst - 1
	lsl         r0, r0, #16							@ (hSrc - 1) * 65536

	bl          udiv
	mov         r1, r0                        @ r1 : hStepFixed16b (fixed 16bit)
	mov         r0, #0                        @ r0 : hCoef 

	mov         r14, r12								@ r14 : hDst
	lsl			r9, r9, #2							@ r9 : wSrc * 4 (nb byte wSrcLine). Will be used to jump to next Src line.
	lsl			r8, r14, #2							@ r8 : hDst * 4 (nb byte wDstLine after transpose). Will be used to jump to next Dst line
	rsb			r12, r9, #16						@ r12 : offset previous line + 16 bytes. will be used to go back to the previous and skip 16 bytes.

.strech_loop:
	lsr			r2, r0, #16							@ r2 = offsetY = (hCoef >> 16);
	ubfx			r6, r0, #9, #7						@ hc2 = (hCoef >> 9) & 127;
	mla			r2, r2, r9, r10					@ r2 : ptr Source 1
	rsb			r7, r6, #128						@ hc1 = 128 - hc2;
	add			r0, r0, r1
	vdup.u8		d31, r6
	vdup.u8		d30, r7

	lsr			r3, r0, #16							@ r3 = offsetY = (hCoef >> 16);
	ubfx			r6, r0, #9, #7						@ hc2 = (hCoef >> 9) & 127;
	mla			r3, r3, r9, r10					@ r3 : ptr Source 2
	rsb			r7, r6, #128						@ hc1 = 128 - hc2;
	add			r0, r0, r1
	vdup.u8		d29, r6
	vdup.u8		d28, r7

	lsr			r4, r0, #16							@ r4 = offsetY = (hCoef >> 16);
	ubfx			r6, r0, #9, #7						@ hc2 = (hCoef >> 9) & 127;
	mla			r4, r4, r9, r10					@ r4 : ptr Source 3
	rsb			r7, r6, #128						@ hc1 = 128 - hc2;
	add			r0, r0, r1
	vdup.u8		d27, r6
	vdup.u8		d26, r7

	lsr			r5, r0, #16							@ r5 = offsetY = (hCoef >> 16);
	ubfx			r6, r0, #9, #7						@ hc2 = (hCoef >> 9) & 127;
	mla			r5, r5, r9, r10					@ r5 : ptr Source 4
	rsb			r7, r6, #128						@ hc1 = 128 - hc2;
	add			r0, r0, r1
	vdup.u8		d25, r6
	vdup.u8		d24, r7

	add			r13, r2, r9, lsl #2				@ r13 : ptr on next 4 lines of pixel. Used for preload
	lsr			r6, r9, #4							@ nb bloc of 4 pixels
	mov			r7, r11								@ r7 : work Dest Ptr

.strech_copy_line:
	vld1.32		{q2}, [r2:128], r9				@ load 4 pixel (line 1 src 1) + jump to next line
	vld1.32		{q3}, [r2:128], r12				@ load 4 pixel (line 1 src 2) + jump to previous line + 16
	vld1.32		{q4}, [r3:128], r9				@ load 4 pixel (line 2 src 1) + jump to next line
	vld1.32		{q5}, [r3:128], r12				@ load 4 pixel (line 2 src 2) + jump to previous line + 16
	vld1.32		{q6}, [r4:128], r9				@ load 4 pixel (line 3 src 1) + jump to next line
	vld1.32		{q7}, [r4:128], r12				@ load 4 pixel (line 3 src 2) + jump to previous line + 16
	vld1.32		{q8}, [r5:128], r9				@ load 4 pixel (line 4 src 1) + jump to next line
	vld1.32		{q9}, [r5:128], r12				@ load 4 pixel (line 4 src 2) + jump to previous line + 16

	pld			[r13]
	add			r13, r13, #128

	vmull.u8		q0, d4, d30							@ pixel1 * (1 - coef)
	vmlal.u8		q0, d6, d31							@ + pixel2 * coef
	vmull.u8		q1, d5, d30							@ pixel1 * (1 - coef)
	vmlal.u8		q1, d7, d31							@ + pixel2 * coef

	vmull.u8		q2, d8, d28							@ pixel1 * (1 - coef)
	vmlal.u8		q2, d10, d29						@ + pixel2 * coef
	vmull.u8		q3, d9, d28							@ pixel1 * (1 - coef)
	vmlal.u8		q3, d11, d29						@ + pixel2 * coef

	vmull.u8		q4, d12, d26						@ pixel1 * (1 - coef)
	vmlal.u8		q4, d14, d27						@ + pixel2 * coef
	vmull.u8		q5, d13, d26						@ pixel1 * (1 - coef)
	vmlal.u8		q5, d15, d27						@ + pixel2 * coef

	vmull.u8		q6, d16, d24						@ pixel1 * (1 - coef)
	vmlal.u8		q6, d18, d25						@ + pixel2 * coef
	vmull.u8		q7, d17, d24						@ pixel1 * (1 - coef)
	vmlal.u8		q7, d19, d25						@ + pixel2 * coef

	vqrshrn.u16 d16, q0, #7							@ reduce to 8 bit
	vqrshrn.u16 d17, q4, #7							@ reduce to 8 bit
	vqrshrn.u16 d18, q2, #7							@ reduce to 8 bit
	vqrshrn.u16 d19, q6, #7							@ reduce to 8 bit
	vqrshrn.u16 d20, q1, #7							@ reduce to 8 bit
	vqrshrn.u16 d21, q5, #7							@ reduce to 8 bit
	vqrshrn.u16 d22, q3, #7							@ reduce to 8 bit
	vqrshrn.u16 d23, q7, #7							@ reduce to 8 bit

	vtrn.32		q8, q9								@ transpose
	vtrn.32		q10, q11

	vst1.32		{q8}, [r7:128], r8				@ write result
	vst1.32		{q9}, [r7:128], r8				@ write result
	vst1.32		{q10}, [r7:128], r8				@ write result
	vst1.32		{q11}, [r7:128], r8				@ write result

	subs			r6, r6, #1
	bne			.strech_copy_line

	add			r11, r11, #16						@ r11 : advance dest base ptr for the next 4 output lines
	subs			r14, r14, #4
	bne			.strech_loop

	movw        r12, #:lower16:.strech_save_buffer
	movt        r12, #:upper16:.strech_save_buffer
	ldmia			r12, {r4-r11, r13, pc}			@ reload all registers and return

[Images: source image 128×100 / enlarge & transpose 248×128 / NEON bilinear filtering 320×248]

When you paste the code into the cycle counter, you might believe you are going to get interesting performance. In practice, that is not really the case!

To resize the image, you must call the previous code twice…
The resize finally runs in 0.83 ms, much slower than I had hoped!
The reason is probably cache misses! Reading several distinct lines of data simultaneously has been fatal to the Cortex.
Moreover, our source picture is very small. The bigger it gets, the more the performance will drop.

Conclusion

For the given sample, the assembly version is 10.8 times faster than the C version. This is not bad! But we were entitled to expect better!
Well, I do not admit defeat! It should be possible to optimize this code a little more. I will continue my investigations (or even test other approaches: why not try the single-pass algorithm?).

Update of 27/05/2011: Well! There is not much more to be done. By reorganizing the instructions I managed to reduce the execution time to 0.74 ms (12.1 times faster than the C version).
The single-pass version was extremely complex to write and it runs much slower than the version proposed here.
I also tried to process two 4×4 blocks of pixels at each iteration, but the result was slightly worse!!!

I will publish another post on bilinear interpolation used for image reduction. Reduction requires more computation and fewer random memory accesses, all of which makes NEON much more efficient. I do not know whether image reduction will be faster than image enlargement, but I am pretty sure that the performance gap between C and assembly will be much greater.

Feel free to send me a better translation of this post at pulsar[at]webshaker.net
You can also send me a translation to another language if you want.


4 Responses to “Bilinear enlarge with NEON”

  1. Hi there,

    I just found your webpage while searching for people who code NEON stuff. Your bilinear zooming algorithm is really nice. I did something like this without NEON many many years ago on ARM. I also did some experiments with NEON on the BeagleBoard regarding Mandelbrot fractals, to utilize the full SIMD capability for that iteration algorithm. That small application is here: http://www.mikusite.de/riscos/FracNEONVFP.zip

    I just have to say that I don't use GCC at the moment as I'm running my BeagleBoard on the RiscOS operating system. Quite rare but fun. So you might not be able to compile it directly as I use a special assembler, but the source code should be readable anyway. So just e-mail me if you are interested in exchanging some experience… it's rare to find assembler coders on ARM nowadays ;-)

    Yours,

    Michael

  2. Etienne SOBOLE says:

    Hi.

    I'll have a look at your code for sure ;)
    I had an Archimedes 20 years ago! So I know RiscOS. When I bought my BeagleBoard, I would have liked to install RiscOS too, but I needed a RiscOS machine to format the SD card and I do not have one…
    I'll contact you because… yes, I'm still looking for assembly coders.

    See you soon, Michael!

  3. Fernando says:

    Hello, do you have any example of how to call the assembly routine from the Android NDK using C?

    Thanks

  4. Alessandro says:

    Hello, I’ve searched your blog but I cannot find the algorithm to reduce an image. Can you help me?

    Alex

