Surprisingly bad code generated near char*

Avi Kivity <avi@xxxxxxxxxxxx> · Thu, 18 Aug 2016 11:45:55 +0300

I wanted to test how restrict helps in code generation.  I started with 
this example:

  struct s { int a; int b; };

  inline
  void encode(int a, char* p) {
    for (unsigned i = 0; i < sizeof(a); ++i) {
      p[i] = reinterpret_cast<const char*>(&a)[i];
    }
  }

  void f(s* x, char* p) {
    encode(x->a, p + 0);
    encode(x->b, p + 4);
  }

simulating serialization code.  My expectations were that without 
__restrict, I'd have four instructions:

   mov (%rdi), %eax
   mov %eax, (%rsi)
   mov 4(%rdi), %eax
   mov %eax, 4(%rsi)

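While x and p can alias, a and p cannot, because a is a local variable, 
so each encode() call can collapse into one 32-bit load and one 32-bit 
store.  The second load still has to follow the first store, since that 
store may have modified x->b.  I further hoped that adding __restrict 
would remove two instructions: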

   mov (%rdi), %rax
   mov %rax, (%rsi)

since the compiler now knows that x and p do not alias.

However, the generated code is much poorer than this (-O2):

   0:    8b 07                    mov    (%rdi),%eax
   2:    89 c1                    mov    %eax,%ecx
   4:    88 06                    mov    %al,(%rsi)
   6:    66 c1 e9 08              shr    $0x8,%cx
   a:    88 4e 01                 mov    %cl,0x1(%rsi)
   d:    89 c1                    mov    %eax,%ecx
   f:    c1 e8 18                 shr    $0x18,%eax
  12:    c1 e9 10                 shr    $0x10,%ecx
  15:    88 46 03                 mov    %al,0x3(%rsi)
  18:    88 4e 02                 mov    %cl,0x2(%rsi)
  1b:    8b 47 04                 mov    0x4(%rdi),%eax
  1e:    89 c7                    mov    %eax,%edi
  20:    89 c1                    mov    %eax,%ecx
  22:    88 46 04                 mov    %al,0x4(%rsi)
  25:    66 c1 ef 08              shr    $0x8,%di
  29:    c1 e9 10                 shr    $0x10,%ecx
  2c:    c1 e8 18                 shr    $0x18,%eax
  2f:    40 88 7e 05              mov    %dil,0x5(%rsi)
  33:    88 4e 06                 mov    %cl,0x6(%rsi)
  36:    88 46 07                 mov    %al,0x7(%rsi)

gcc doesn't even recognize the idiom of writing a word's four bytes 
sequentially.  With -O3, there is some improvement:

   0:    8b 07                    mov    (%rdi),%eax
   2:    89 06                    mov    %eax,(%rsi)
   4:    8b 47 04                 mov    0x4(%rdi),%eax
   7:    89 c1                    mov    %eax,%ecx
   9:    88 46 04                 mov    %al,0x4(%rsi)
   c:    66 c1 e9 08              shr    $0x8,%cx
  10:    88 4e 05                 mov    %cl,0x5(%rsi)
  13:    89 c1                    mov    %eax,%ecx
  15:    c1 e8 18                 shr    $0x18,%eax
  18:    c1 e9 10                 shr    $0x10,%ecx
  1b:    88 46 07                 mov    %al,0x7(%rsi)
  1e:    88 4e 06                 mov    %cl,0x6(%rsi)

The copy of the first word is now a single 32-bit load and store, but the 
copy of the second word is not, even though the two are exactly the same.

Adding __restrict did not help.

Is this a problem in gcc, or are my expectations incorrect? I'm 
particularly worried that gcc recognized the copy idiom for the first 
word only, and that even this required -O3.

Using std::copy_n() instead of the hand-written loop helped, but adding 
__restrict on top of that did not:

   0:    8b 07                    mov    (%rdi),%eax
   2:    89 06                    mov    %eax,(%rsi)
   4:    8b 47 04                 mov    0x4(%rdi),%eax
   7:    89 46 04                 mov    %eax,0x4(%rsi)

so the opportunity to merge the two 32-bit copies into one 64-bit copy 
is still missed.

gcc (GCC) 5.3.1 20160406 (Red Hat 5.3.1-6)