On Thu, Sep 24, 2020 at 03:21:51AM -0400, Jeff King wrote:

> > I originally had
> >
> > +void put_be64(uint8_t *out, uint64_t v)
> > +{
> > +	int i = sizeof(uint64_t);
> > +	while (i--) {
> > +		out[i] = (uint8_t)(v & 0xff);
> > +		v >>= 8;
> > +	}
> > +}
> >
> > in my reftable library, which is portable. Is there a reason for the
> > magic with htonll and friends?
>
> Presumably it was thought to be faster. This comes originally from the
> block-sha1 code in 660231aa97 (block-sha1: support for architectures
> with memory alignment restrictions, 2009-08-12). I don't know how it
> compares in practice, and especially these days.
>
> Our fallback routines are similar to an unrolled version of what you
> wrote above.

We should be able to measure it pretty easily, since block-sha1 uses a
lot of get_be32/put_be32.

I generated a 4GB random file, built with BLK_SHA1=Yes and -O2, and
timed:

  t/helper/test-tool sha1 <foo.rand

Then I did the same, but building with -DNO_UNALIGNED_LOADS. The latter
actually ran faster, by a small margin. Here are the hyperfine results:

  [stock]
  Time (mean ± σ):     6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
  Range (min … max):   6.550 s …  6.841 s    10 runs

  [-DNO_UNALIGNED_LOADS]
  Time (mean ± σ):     6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
  Range (min … max):   6.394 s …  6.447 s    10 runs

For casual use as in reftables I doubt the difference is even
measurable. But this result implies that perhaps we ought to just be
using the fallback version all the time.

-Peff
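
For anyone reading along without git.git checked out, here is a rough
sketch of the two get_be32 flavors being compared. These are not the
verbatim compat/bswap.h definitions, and the helper names below are
made up purely for illustration:

  #include <stdint.h>

  /* Byte-by-byte fallback: portable, no alignment assumptions. */
  static inline uint32_t fallback_get_be32(const void *p)
  {
  	const uint8_t *b = p;
  	return ((uint32_t)b[0] << 24) |
  	       ((uint32_t)b[1] << 16) |
  	       ((uint32_t)b[2] <<  8) |
  	       ((uint32_t)b[3] <<  0);
  }

  /*
   * Word-at-a-time flavor: one 32-bit load followed by a byte swap
   * (assuming a little-endian host such as x86). The cast relies on
   * the platform tolerating unaligned word loads, which is exactly
   * what NO_UNALIGNED_LOADS turns off.
   */
  static inline uint32_t unaligned_get_be32(const void *p)
  {
  	return __builtin_bswap32(*(const uint32_t *)p);
  }

On x86 the word-at-a-time version compiles down to a load plus a bswap,
which is presumably why it was expected to be faster; the numbers above
suggest the byte-by-byte loop is at least competitive there.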