Really unoptimized ARM (and Thumb) code generated by GCC 4.1.1

"David Lamy-Charrier" <david.lamy@xxxxxxxxx> · Thu, 14 Dec 2006 19:12:59 +0100

Hi,

We were using GCC 3.4.0 to generate Thumb code for an ARM processor.
Switching to GCC 4.1.1 has improved our code size (we always use the -Os
switch), but has severely degraded the execution speed.

After further investigation, we isolated one of the problems in the
following example:

Source code:
void foo(int *a)
{
	int i;

	for (i = 0; i < 1000000; i++)
		a[0] += a[1];
}

The result with GCC 3.4.0 with -mthumb -Os was:
00000000 <foo>:
  0:	b500      	push	{lr}
  2:	6803      	ldr	r3, [r0, #0]
  4:	4a03      	ldr	r2, [pc, #12]	(14 <.text+0x14>)
  6:	6841      	ldr	r1, [r0, #4]
  8:	3a01      	sub	r2, #1
  a:	185b      	add	r3, r3, r1
  c:	2a00      	cmp	r2, #0
  e:	d1fb      	bne	8 <foo+0x8>
 10:	6003      	str	r3, [r0, #0]
 12:	bd00      	pop	{pc}
 14:	4240      	neg	r0, r0
 16:	000f      	lsl	r7, r1, #0

When compiled for ARM with GCC 4.1.1 (and mainline too) with -mthumb
-O1, we get:
00000000 <foo>:
  0:	b510      	push	{r4, lr}
  2:	1c04      	adds	r4, r0, #0
  4:	2200      	movs	r2, #0
  6:	6841      	ldr	r1, [r0, #4]
  8:	4803      	ldr	r0, [pc, #12]	(18 <.text+0x18>)
  a:	6823      	ldr	r3, [r4, #0]
  c:	185b      	adds	r3, r3, r1
  e:	3201      	adds	r2, #1
 10:	4282      	cmp	r2, r0
 12:	d1fb      	bne.n	c <foo+0xc>
 14:	6023      	str	r3, [r4, #0]
 16:	bd10      	pop	{r4, pc}
 18:	4240      	negs	r0, r0
 1a:	000f      	lsls	r7, r1, #0

-> Not so bad, but slower than 3.4.0

When compiled with -mthumb -Os, we get:
00000000 <foo>:
  0:	b510      	push	{r4, lr}
  2:	6802      	ldr	r2, [r0, #0]
  4:	6844      	ldr	r4, [r0, #4]
  6:	2100      	movs	r1, #0
  8:	4b03      	ldr	r3, [pc, #12]	(18 <.text+0x18>)
  a:	3101      	adds	r1, #1
  c:	1912      	adds	r2, r2, r4
  e:	4299      	cmp	r1, r3
 10:	d1fa      	bne.n	8 <foo+0x8>
 12:	6002      	str	r2, [r0, #0]
 14:	bd10      	pop	{r4, pc}
 16:	0000      	lsls	r0, r0, #0
 18:	4240      	negs	r0, r0
 1a:	000f      	lsls	r7, r1, #0

 -> The load of the loop end value is performed inside the loop!
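
A possible source-level workaround for the -Os case is to count down to
zero, so that the exit test compares against 0 and no end value has to be
kept in (or reloaded into) a register. This is only a sketch:
foo_countdown is a hypothetical name, and the code GCC 4.1.1 generates
for it has not been checked here.

```c
/* Hypothetical variant of foo(): same result, but the counter runs
   down to zero, so the exit test compares against 0 instead of a
   constant that has to be loaded from the literal pool. */
void foo_countdown(int *a)
{
	int i;

	for (i = 1000000; i != 0; i--)
		a[0] += a[1];
}
```

The iteration count is unchanged, so the result in a[0] is the same as
with foo().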

When compiled with -mthumb -O3, we get:
00000000 <foo>:
  0:	b530      	push	{r4, r5, lr}
  2:	6802      	ldr	r2, [r0, #0]
  4:	4d05      	ldr	r5, [pc, #20]	(1c <.text+0x1c>)
  6:	1d04      	adds	r4, r0, #4
  8:	2100      	movs	r1, #0
  a:	6823      	ldr	r3, [r4, #0]
  c:	3101      	adds	r1, #1
  e:	18d3      	adds	r3, r2, r3
 10:	1c1a      	adds	r2, r3, #0
 12:	6003      	str	r3, [r0, #0]
 14:	42a9      	cmp	r1, r5
 16:	d1f8      	bne.n	a <foo+0xa>
 18:	bd30      	pop	{r4, r5, pc}
 1a:	0000      	lsls	r0, r0, #0
 1c:	4240      	negs	r0, r0
 1e:	000f      	lsls	r7, r1, #0

 -> Amazingly slow: a[1] is reloaded and a[0] is stored on every iteration!
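
In the -O3 listing above, the loop body contains both a load from a[1]
and a store to a[0]. One way to keep memory accesses out of the loop at
the source level would be to work on local copies. This is a sketch
under that assumption: foo_locals is a hypothetical name, and the
resulting GCC 4.1.1 output has not been checked here.

```c
/* Hypothetical variant of foo(): a[1] is read once before the loop
   and a[0] is written once after it, so the loop body only touches
   local variables. */
void foo_locals(int *a)
{
	int sum = a[0];
	const int step = a[1];
	int i;

	for (i = 0; i < 1000000; i++)
		sum += step;
	a[0] = sum;
}
```

The result is the same as with foo(), since a[0] and a[1] are distinct
elements. Note that a compiler is free to simplify such a loop further,
which matters if the loop is meant as a timing benchmark.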

 Does anybody have a magic set of options to generate code as efficient
and small as 3.4.0 did?
 Thanks in advance for any hints on this problem.

 David