Hi,

Here is the code I have:

  struct uint64x64_t
  {
    struct
    {
      uint64_t hi;
      uint64_t lo;
    } _v;
  };

  inline uint64x64_t &operator += (uint64x64_t &lhs, const uint64x64_t &rhs)
  {
    lhs._v.hi += rhs._v.hi;
    lhs._v.lo += rhs._v.lo;
    if (lhs._v.lo < rhs._v.lo)  /* carry out of the low-word addition */
      {
        lhs._v.hi++;
      }
    return lhs;
  }

One of the benchmarks I run is a loop that calls the += above a couple
billion times, compiled with -O3:

  - 64x64 class above: 5.5 ns per addition
  - 64-bit addition:   1.3 ns per addition
  - 32-bit addition:   1.3 ns per addition

The inner loop of my benchmark is:

  #define SIZE 128
  uint64x64_t result = a;
  for (long int i = 0; i < n; i++)
    {
      result += array[i % SIZE];
    }

and a quick look at the assembly shows this for the += code:

 804b6c0: 89 f3                 mov %esi,%ebx
 804b6c2: 83 e3 7f              and $0x7f,%ebx
 804b6c5: c1 e3 04              shl $0x4,%ebx
 804b6c8: 8b 44 0b 08           mov 0x8(%ebx,%ecx,1),%eax
 804b6cc: 8b 54 0b 0c           mov 0xc(%ebx,%ecx,1),%edx
 804b6d0: 01 47 08              add %eax,0x8(%edi)
 804b6d3: 8b 04 0b              mov (%ebx,%ecx,1),%eax
 804b6d6: 11 57 0c              adc %edx,0xc(%edi)
 804b6d9: 8b 54 0b 04           mov 0x4(%ebx,%ecx,1),%edx
 804b6dd: 01 07                 add %eax,(%edi)
 804b6df: 11 57 04              adc %edx,0x4(%edi)
 804b6e2: 8b 57 0c              mov 0xc(%edi),%edx
 804b6e5: 3b 54 0b 04           cmp 0x4(%ebx,%ecx,1),%edx
 804b6e9: 8b 47 08              mov 0x8(%edi),%eax
 804b6ec: 77 13                 ja 804b701 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xe1>
 804b6ee: 72 05                 jb 804b6f5 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xd5>
 804b6f0: 3b 04 0b              cmp (%ebx,%ecx,1),%eax
 804b6f3: 73 0c                 jae 804b701 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xe1>
 804b6f5: 83 c0 01              add $0x1,%eax
 804b6f8: 83 d2 00              adc $0x0,%edx
 804b6fb: 89 47 08              mov %eax,0x8(%edi)
 804b6fe: 89 57 0c              mov %edx,0xc(%edi)
 804b701: 83 c6 01              add $0x1,%esi
 804b704: 39 75 14              cmp %esi,0x14(%ebp)
 804b707: 7f b7                 jg 804b6c0 <ns3::uint64x64_t run_add<ns3::uint64x64_t>(ns3::uint64x64_t, ns3::uint64x64_t, long)+0xa0>

That actually looks pretty good, except that instead of generating a
single add+adc+adc+adc sequence, we generate three distinct add+adc
pairs separated by compares and branches (which is not surprising,
given the C++ code I wrote).

So, the question is: is there a way to improve on the generated code
to get the 'perfect' add+adc+adc+adc sequence, short of adding
__uint128_t support for ia32 to gcc or writing inline assembly?

Mathieu
-- 
Mathieu Lacage <mathieu.lacage@xxxxxxxxx>