__builtin_popcount

Zeev Tarantov <zeev.tarantov@xxxxxxxxx> · Sat, 21 May 2011 12:29:51 +0300

Function "int __builtin_popcount (unsigned int x)" documented at:
http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fpopcount-3160

on x86-64 with -O2, both using version 4.5.2 and a recent 4.6 branch,
I get this code emitted:

0000000000402230 <__popcountdi2>:
  402230:       48 8b 15 a9 0d 20 00    mov    0x200da9(%rip),%rdx
   # 602fe0 <_DYNAMIC+0x1a8>
  402237:       31 c0                   xor    %eax,%eax
  402239:       31 c9                   xor    %ecx,%ecx
  40223b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  402240:       48 89 fe                mov    %rdi,%rsi
  402243:       48 d3 ee                shr    %cl,%rsi
  402246:       83 c1 08                add    $0x8,%ecx
  402249:       81 e6 ff 00 00 00       and    $0xff,%esi
  40224f:       0f b6 34 32             movzbl (%rdx,%rsi,1),%esi
  402253:       01 f0                   add    %esi,%eax
  402255:       83 f9 40                cmp    $0x40,%ecx
  402258:       75 e6                   jne    402240 <__popcountdi2+0x10>
  40225a:       f3 c3                   repz retq

This computes the population count using an 8-bit look up table by
iterating over the 8 bytes of the input and summing the looked-up
values.
This is the right code for "int popcount(unsigned long x)", not for
"int popcount (unsigned int x)".
It performs twice the amount of work needed.
The only way a 64bit function would be ok to use when a 32bit is
required is if it would short-circuit when input reaches zero, like
this:

        .cfi_startproc
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movzbl  %dil, %edx
        shrq    $8, %rdi
        movzbl  bitsums(%rdx), %edx
        addl    %edx, %eax
        testq   %rdi, %rdi
        jne     .L2
        rep
        ret
        .cfi_endproc

My question is whether this behavior is intentional, and whether gcc
could expose a built-in 64 bit popcount function instead of a 32-bit
one.

-Z.T.