MIPS GCC always generates out-of-line memcpy when optimizing for size?

Anders Montonen <Anders.Montonen@xxxxxx> · Sun, 2 Feb 2014 14:19:52 +0200

Hi,

It seems that GCC configured for MIPS always generates a call to memcpy when optimizing for size, even for a small, constant-size copy. Is this expected behaviour? I have encountered this both with a self-built GCC 4.8.1 (configured for mipsel-sde-elf) and with Microchip's XC32 compiler, which is based on GCC 4.5.2.

Here's what I get when building the following code with GCC 4.8.1 and -march=m4k:

#include <stdint.h>

uint32_t foo(const uint8_t *pA)
{
    uint32_t sum = 0;
    uint32_t tmp, ii;

    for (ii = 0; ii < 256; ii++)
    {
        __builtin_memcpy(&tmp, &pA[ii*sizeof(tmp)], sizeof(tmp));
        sum += tmp;
    }

    return sum;
}

With -O1, the following is generated:

00000000 <foo>:
   0:	27bdfff8 	addiu	sp,sp,-8
   4:	00002821 	move	a1,zero
   8:	00001021 	move	v0,zero
   c:	24070400 	li	a3,1024
  10:	00851821 	addu	v1,a0,a1
  14:	88660003 	lwl	a2,3(v1)
  18:	98660000 	lwr	a2,0(v1)
  1c:	afa60000 	sw	a2,0(sp)
  20:	24a50004 	addiu	a1,a1,4
  24:	14a7fffa 	bne	a1,a3,10 <foo+0x10>
  28:	00461021 	addu	v0,v0,a2
  2c:	03e00008 	jr	ra
  30:	27bd0008 	addiu	sp,sp,8

But with -Os, I get this:

00000000 <foo>:
   0:	27bdffd0 	addiu	sp,sp,-48
   4:	afb30028 	sw	s3,40(sp)
   8:	afb20024 	sw	s2,36(sp)
   c:	afb10020 	sw	s1,32(sp)
  10:	afb0001c 	sw	s0,28(sp)
  14:	afbf002c 	sw	ra,44(sp)
  18:	00809821 	move	s3,a0
  1c:	00008021 	move	s0,zero
  20:	00008821 	move	s1,zero
  24:	24120400 	li	s2,1024
  28:	02702821 	addu	a1,s3,s0
  2c:	27a40010 	addiu	a0,sp,16
  30:	0c000000 	jal	0 <foo>
  34:	24060004 	li	a2,4
  38:	8fa20010 	lw	v0,16(sp)
  3c:	26100004 	addiu	s0,s0,4
  40:	1612fff9 	bne	s0,s2,28 <foo+0x28>
  44:	02228821 	addu	s1,s1,v0
  48:	8fbf002c 	lw	ra,44(sp)
  4c:	02201021 	move	v0,s1
  50:	8fb30028 	lw	s3,40(sp)
  54:	8fb20024 	lw	s2,36(sp)
  58:	8fb10020 	lw	s1,32(sp)
  5c:	8fb0001c 	lw	s0,28(sp)
  60:	03e00008 	jr	ra
  64:	27bd0030 	addiu	sp,sp,48

-O2 produces the same code as -O1, modulo register allocation and scheduling. As a side note, the store of tmp to the stack (the sw at offset 0x1c in the -O1 output) is unnecessary and could be optimized away.
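
One possible way to sidestep the memcpy call is to express the unaligned load through a packed struct, sketched below. This is only an assumption-laden sketch: the names unaligned_u32, load_u32 and foo_packed are made up for illustration, the packed and may_alias attributes are GCC extensions, and whether GCC 4.8.1 or XC32 expands this into lwl/lwr at -Os has not been checked here.

```c
#include <stdint.h>

/* Hypothetical helper type: a 32-bit word with no alignment guarantee.
   packed drops the alignment requirement; may_alias lets the struct
   overlay a byte buffer without violating strict-aliasing rules.
   Both attributes are GCC extensions. */
struct unaligned_u32 {
    uint32_t value;
} __attribute__((packed, may_alias));

/* Load a 32-bit word from a possibly unaligned address. */
static inline uint32_t load_u32(const uint8_t *p)
{
    return ((const struct unaligned_u32 *)p)->value;
}

/* Same computation as foo(), with the load written without memcpy. */
uint32_t foo_packed(const uint8_t *pA)
{
    uint32_t sum = 0;
    uint32_t ii;

    for (ii = 0; ii < 256; ii++)
    {
        sum += load_u32(&pA[ii * sizeof(uint32_t)]);
    }

    return sum;
}
```

This should return the same value as foo() for any input; it only changes how the load is spelled, not what is computed.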

Regards,
Anders Montonen
(I am not subscribed to the list, so please cc me)