Re: [PATCH 1/2] bswap.h: drop unaligned loads

On 24.09.20 at 21:21, Jeff King wrote:
> Our put_be32() routine and its variants (get_be32(), put_be64(), etc.)
> have two implementations: on some platforms we cast memory in place and
> use ntohl()/htonl(), which can cause unaligned memory access. And on
> others, we pick out the individual bytes using bitshifts.
>
> This introduces extra complexity, and sometimes causes compilers to
> generate warnings about type-punning. And it's not clear there's any
> performance advantage.
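
(For readers without compat/bswap.h in front of them, the two variants
look roughly like this. It's a simplified sketch, not the exact code;
the _cast/_shift suffixes exist only for this illustration, while
bswap.h provides one of the two under the single names get_be32() and
put_be32(), selected by an #ifdef.)

  #include <stdint.h>
  #include <arpa/inet.h> /* ntohl()/htonl() */

  /* Variant 1: cast in place and byte-swap; the load/store may be
   * unaligned, which not every architecture tolerates. */
  static inline uint32_t get_be32_cast(const void *p)
  {
          return ntohl(*(const uint32_t *)p);
  }

  static inline void put_be32_cast(void *p, uint32_t v)
  {
          *(uint32_t *)p = htonl(v);
  }

  /* Variant 2: pick out individual bytes; alignment-safe. */
  static inline uint32_t get_be32_shift(const unsigned char *p)
  {
          return (uint32_t)p[0] << 24 |
                 (uint32_t)p[1] << 16 |
                 (uint32_t)p[2] <<  8 |
                 (uint32_t)p[3];
  }

  static inline void put_be32_shift(unsigned char *p, uint32_t v)
  {
          p[0] = v >> 24;
          p[1] = v >> 16;
          p[2] = v >>  8;
          p[3] = v;
  }
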
>
> This split goes back to 660231aa97 (block-sha1: support for
> architectures with memory alignment restrictions, 2009-08-12). The
> unaligned versions were part of the original block-sha1 code in
> d7c208a92e (Add new optimized C 'block-sha1' routines, 2009-08-05),
> which says it is:
>
>    Based on the mozilla SHA1 routine, but doing the input data accesses a
>    word at a time and with 'htonl()' instead of loading bytes and shifting.
>
> Back then, Linus provided timings versus the mozilla code which showed a
> 27% improvement:
>
>   https://lore.kernel.org/git/alpine.LFD.2.01.0908051545000.3390@localhost.localdomain/
>
> However, the unaligned loads were either not the useful part of that
> speedup, or perhaps compilers and processors have changed since then.
> Here are times for computing the sha1 of 4GB of random data, with and
> without -DNO_UNALIGNED_LOADS (and BLK_SHA1=1, of course). This is with
> gcc 10, -O2, and the processor is a Core i9-9880H.
>
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
>     Range (min … max):    6.550 s …  6.841 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
>     Range (min … max):    6.394 s …  6.447 s    10 runs
>
> And here's the same test run on an AMD A8-7600, using gcc 8.
>
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):     11.721 s ±  0.113 s    [User: 10.761 s, System: 0.951 s]
>     Range (min … max):   11.509 s … 11.861 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):     11.744 s ±  0.066 s    [User: 10.807 s, System: 0.928 s]
>     Range (min … max):   11.637 s … 11.863 s    10 runs

Yay, benchmarks!  GCC 10.2 with -O2 on an i5-9600K without NO_UNALIGNED_LOADS:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.547 s ±  0.015 s    [User: 6.127 s, System: 0.395 s]
    Range (min … max):    6.531 s …  6.583 s    10 runs

... and with NO_UNALIGNED_LOADS set:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.496 s ±  0.011 s    [User: 6.135 s, System: 0.360 s]
    Range (min … max):    6.486 s …  6.519 s    10 runs

clang 10 without NO_UNALIGNED_LOADS:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.697 s ±  0.028 s    [User: 6.343 s, System: 0.354 s]
    Range (min … max):    6.675 s …  6.754 s    10 runs

... and with NO_UNALIGNED_LOADS set:

  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.714 s ±  0.049 s    [User: 6.320 s, System: 0.375 s]
    Range (min … max):    6.651 s …  6.791 s    10 runs

> +cc René because I know he is going to feed the two of them into
>     godbolt; I could do that, too, but he will provide much better analysis
>     on top ;)

Weeell, I don't know about that, but I couldn't resist taking a quick
look at what some compilers do with the 32-bit functions, which are the
ones used in block-sha1: https://www.godbolt.org/z/rhKMTM.

Older versions of gcc and clang didn't see through the shifting
put_be32() implementation.  If you go back further there are also
versions that didn't optimize the shifting get_be32().  And the latest
icc still can't do that.

gcc 10.2 just optimizes all functions to a bswap and a mov.  Can't do
any better than that, can you?

But why, then, do we see a difference in our benchmark results?  Not sure,
but https://www.godbolt.org/z/7xh8ao shows that gcc is shuffling some
instructions around depending on the implementation.  Switch to clang if
you want to see more vigorous shuffling.

The performance of bigger pieces of code seems to be a matter of luck
to some extent. :-/

René



