Fwd: optimizing 128bit integer arithmetic on ia32

Mathieu Lacage <mathieu.lacage@xxxxxxxxx> · Tue, 24 Aug 2010 08:34:39 +0200

I can't get used to the lack of reply-to field for gcc-help.

On Tue, Aug 24, 2010 at 00:33, Segher Boessenkool
<segher@xxxxxxxxxxxxxxxxxxx> wrote:
>> inline uint64x64_t &operator += (uint64x64_t &lhs, const uint64x64_t &rhs)
>> {
>>  lhs._v.hi += rhs._v.hi;
>>  lhs._v.lo += rhs._v.lo;
>>  if (lhs._v.hi < rhs._v.lo)
>
> if (lhs._v.lo  < rhs._v.lo)

gah :/

>
>>    {
>>      lhs._v.hi++;
>>    }
>>  return lhs;
>> }
>
> Does the generated code look any better with that correction?  If not, you
> want to tell us the exact command line and GCC version you used.

[mlacage@diese simulator]$ gcc --version
gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)

Yes, it looks much better, I get:

 804b850:       8b 55 e4                mov    -0x1c(%ebp),%edx
 804b853:       8b 7d 08                mov    0x8(%ebp),%edi
 804b856:       8b 44 0a 08             mov    0x8(%edx,%ecx,1),%eax
 804b85a:       8b 54 0a 0c             mov    0xc(%edx,%ecx,1),%edx
 804b85e:       01 47 08                add    %eax,0x8(%edi)
 804b861:       8b 45 e4                mov    -0x1c(%ebp),%eax
 804b864:       11 57 0c                adc    %edx,0xc(%edi)
 804b867:       03 1c 08                add    (%eax,%ecx,1),%ebx
 804b86a:       13 74 08 04             adc    0x4(%eax,%ecx,1),%esi
 804b86e:       89 1f                   mov    %ebx,(%edi)
 804b870:       89 77 04                mov    %esi,0x4(%edi)
 804b873:       3b 74 08 04             cmp    0x4(%eax,%ecx,1),%esi
 804b877:       77 12                   ja     804b88b
<ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t,
ns3::uint64x64_t, long)+0xfb>

But the above is not as good as the following quick hack, which is most
likely not correct but should be close to the minimal code:
 asm ("mov 0(%0),%%eax\n\t"
      "add 0(%1),%%eax\n\t"
      "mov %%eax,0(%0)\n\t"
      "mov 4(%0),%%eax\n\t"
      "adc 4(%1),%%eax\n\t"
      "mov %%eax,4(%0)\n\t"
      "mov 8(%0),%%eax\n\t"
      "adc 8(%1),%%eax\n\t"
      "mov %%eax,8(%0)\n\t"
      "mov 12(%0),%%eax\n\t"
      "adc 12(%1),%%eax\n\t"
      "mov %%eax,12(%0)\n\t"
       :
      : "r" (&lhs._v), "r" (&rhs._v)
      : "%eax", "cc");
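
One likely reason the hack above is "not correct": it reads and writes
memory through the two pointers, but nothing in the constraints tells the
compiler so. A minimal sketch of the same sequence with a "memory" clobber
added is below. The limbs128 type and the add128 name are hypothetical
stand-ins (not from the ns-3 code), and the non-x86 branch is only a
portable fallback so that the sketch compiles elsewhere.

```cpp
#include <cstdint>

// Hypothetical 128-bit value stored as four little-endian 32-bit limbs,
// matching the 0/4/8/12 byte offsets used by the asm in the mail.
struct limbs128
{
  uint32_t w[4];
};

inline void add128 (limbs128 &lhs, const limbs128 &rhs)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
  // Same instruction sequence as the hack; mov does not touch the flags,
  // so the carry survives between the add and the following adc's.
  // The "memory" clobber tells the compiler that *lhs is modified and
  // that *rhs must be up to date in memory before the asm runs.
  asm ("mov 0(%0),%%eax\n\t"
       "add 0(%1),%%eax\n\t"
       "mov %%eax,0(%0)\n\t"
       "mov 4(%0),%%eax\n\t"
       "adc 4(%1),%%eax\n\t"
       "mov %%eax,4(%0)\n\t"
       "mov 8(%0),%%eax\n\t"
       "adc 8(%1),%%eax\n\t"
       "mov %%eax,8(%0)\n\t"
       "mov 12(%0),%%eax\n\t"
       "adc 12(%1),%%eax\n\t"
       "mov %%eax,12(%0)\n\t"
       :
       : "r" (&lhs), "r" (&rhs)
       : "%eax", "cc", "memory");
#else
  // Portable fallback: propagate the carry limb by limb.
  uint32_t carry = 0;
  for (int i = 0; i < 4; ++i)
    {
      uint64_t sum = (uint64_t) lhs.w[i] + rhs.w[i] + carry;
      lhs.w[i] = (uint32_t) sum;
      carry = (uint32_t) (sum >> 32);
    }
#endif
}
```

The clobber may cost a little speed because it forces the compiler to
reload cached values, so timings with it could differ from those of the
original hack.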

I get around 3.5ns for the handcoded assembly while I get 5.1ns for
the compiler-generated one.
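
For reference, the corrected C++ operator discussed above can be sketched
in a self-contained form as follows. The u64x64 struct is a hypothetical
stand-in for uint64x64_t with its two 64-bit halves; the only change
beyond Segher's fix is that rhs.lo is copied first, an extra precaution of
mine so that x += x also works.

```cpp
#include <cstdint>

// Hypothetical stand-in for the uint64x64_t of the thread.
struct u64x64
{
  uint64_t hi;
  uint64_t lo;
};

inline u64x64 &operator += (u64x64 &lhs, const u64x64 &rhs)
{
  // Copy the low addend first: if lhs and rhs are the same object,
  // rhs.lo changes as soon as lhs.lo is updated.
  const uint64_t rlo = rhs.lo;
  lhs.hi += rhs.hi;
  lhs.lo += rlo;
  // Unsigned addition wraps around, so the low sum is smaller than
  // the addend exactly when a carry out of the low half occurred.
  if (lhs.lo < rlo)
    {
      lhs.hi++;
    }
  return lhs;
}
```

The comparison is between the low halves, which is the fix pointed out at
the top of this mail; comparing hi against lo detects nothing useful.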

Mathieu
--
Mathieu Lacage <mathieu.lacage@xxxxxxxxx>
