Hi,

Here is the code I have:

  struct uint64x64_t
  {
    struct
    {
      uint64_t hi;
      uint64_t lo;
    } _v;
  };

  inline uint64x64_t &operator += (uint64x64_t &lhs, const uint64x64_t &rhs)
  {
    lhs._v.hi += rhs._v.hi;
    lhs._v.lo += rhs._v.lo;
    if (lhs._v.lo < rhs._v.lo)  /* carry out of the low-word addition */
      {
        lhs._v.hi++;
      }
    return lhs;
  }

One of the benchmarks I run is a loop that calls the += above a couple
billion times, compiled with -O3:

  - 64x64 class above: 5.5 ns per addition
  - 64-bit addition:   1.3 ns per addition
  - 32-bit addition:   1.3 ns per addition

The inner loop of my benchmark is:

  #define SIZE 128
  uint64x64_t result = a;
  for (long int i = 0; i < n; i++)
    {
      result += array[i % SIZE];
    }

and a quick look at the assembly shows this for the += code:

 804b6c0: 89 f3                 mov %esi,%ebx
 804b6c2: 83 e3 7f              and $0x7f,%ebx
 804b6c5: c1 e3 04              shl $0x4,%ebx
 804b6c8: 8b 44 0b 08           mov 0x8(%ebx,%ecx,1),%eax
 804b6cc: 8b 54 0b 0c           mov 0xc(%ebx,%ecx,1),%edx
 804b6d0: 01 47 08              add %eax,0x8(%edi)
 804b6d3: 8b 04 0b              mov (%ebx,%ecx,1),%eax
 804b6d6: 11 57 0c              adc %edx,0xc(%edi)
 804b6d9: 8b 54 0b 04           mov 0x4(%ebx,%ecx,1),%edx
 804b6dd: 01 07                 add %eax,(%edi)
 804b6df: 11 57 04              adc %edx,0x4(%edi)
 804b6e2: 8b 57 0c              mov 0xc(%edi),%edx
 804b6e5: 3b 54 0b 04           cmp 0x4(%ebx,%ecx,1),%edx
 804b6e9: 8b 47 08              mov 0x8(%edi),%eax
 804b6ec: 77 13                 ja 804b701 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xe1>
 804b6ee: 72 05                 jb 804b6f5 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xd5>
 804b6f0: 3b 04 0b              cmp (%ebx,%ecx,1),%eax
 804b6f3: 73 0c                 jae 804b701 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xe1>
 804b6f5: 83 c0 01              add $0x1,%eax
 804b6f8: 83 d2 00              adc $0x0,%edx
 804b6fb: 89 47 08              mov %eax,0x8(%edi)
 804b6fe: 89 57 0c              mov %edx,0xc(%edi)
 804b701: 83 c6 01              add $0x1,%esi
 804b704: 39 75 14              cmp %esi,0x14(%ebp)
 804b707: 7f b7                 jg 804b6c0 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xa0>

That actually looks pretty good, except that instead of generating a
single add+adc+adc+adc sequence, we generate three distinct add+adc
pairs separated by compares and branches (which is not surprising,
given the C++ code I wrote).

So, the question is: is there a way to improve on the generated code
to get the 'perfect' add+adc+adc+adc sequence, short of adding
__uint128_t support for ia32 to gcc or writing inline assembly?

Mathieu
-- 
Mathieu Lacage <mathieu.lacage@xxxxxxxxx>