From: Robin Murphy
> Sent: 22 April 2020 12:02
..
> Sure - I have a nagging feeling that it could still do better WRT
> pipelining the loads anyway, so I'm happy to come back and reconsider
> the local codegen later. It certainly doesn't deserve to stand in the
> way of cross-arch rework.

How fast does that loop actually run?
To my mind it seems to do a lot of operations on each 64-bit value.
I'd have thought that a loop based on:
	sum64 = *ptr;
	sum64_high = *ptr++ >> 32;
and then fixing up the result would be faster.

The x86-64 code is also bad!
On Intel CPUs prior to Haswell a simple:
	sum_64 += *ptr32++;
is faster than the current code.
(Although you can do a lot better even on Ivy Bridge.)

	David
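
For reference, here is a minimal C sketch of the two loop shapes mentioned above. It is not the kernel's code and says nothing about speed. It assumes the two-line fragment is meant to accumulate (`+=`) on every iteration, and it shows one possible "fix up" step. The names fold64, csum_words32 and csum_words64 are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit total down to 32 bits with end-around carry. */
static uint32_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 33 bits */
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* now fits in 32 bits */
	return (uint32_t)sum;
}

/* The "sum_64 += *ptr32++" shape: 32-bit loads, 64-bit accumulator. */
static uint32_t csum_words32(const uint32_t *ptr32, size_t n)
{
	uint64_t sum_64 = 0;

	assert(n <= UINT32_MAX);	/* keeps the accumulator from overflowing */
	while (n--)
		sum_64 += *ptr32++;
	return fold64(sum_64);
}

/* The "sum64 / sum64_high" shape, assuming both lines accumulate. */
static uint32_t csum_words64(const uint64_t *ptr, size_t n)
{
	uint64_t sum64 = 0, sum64_high = 0;

	assert(n <= ((size_t)1 << 31));	/* keeps low + high below 2^64 */
	while (n--) {
		sum64 += *ptr;			/* may wrap modulo 2^64 */
		sum64_high += *ptr++ >> 32;	/* exact sum of the high halves */
	}
	/*
	 * Assumed fix-up: sum64 == (low + (high << 32)) mod 2^64, and the
	 * high-half sum is known exactly, so the low-half sum can be
	 * recovered by subtraction.
	 */
	uint64_t low = sum64 - (sum64_high << 32);

	return fold64(low + sum64_high);
}
```

Both functions return the same folded 32-bit sum when given the same data, since each ends up folding the exact sum of all the 32-bit halves.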