On Thu, Sep 24, 2020 at 03:21:51AM -0400, Jeff King wrote:

> > I originally had
> >
> > +void put_be64(uint8_t *out, uint64_t v)
> > +{
> > +	int i = sizeof(uint64_t);
> > +	while (i--) {
> > +		out[i] = (uint8_t)(v & 0xff);
> > +		v >>= 8;
> > +	}
> > +}
> >
> > in my reftable library, which is portable. Is there a reason for the
> > magic with htonll and friends?
>
> Presumably it was thought to be faster. This comes originally from the
> block-sha1 code in 660231aa97 (block-sha1: support for architectures
> with memory alignment restrictions, 2009-08-12). I don't know how it
> compares in practice, and especially these days.
>
> Our fallback routines are similar to an unrolled version of what you
> wrote above.

We should be able to measure it pretty easily, since block-sha1 uses a
lot of get_be32/put_be32.

I generated a 4GB random file, built with BLK_SHA1=Yes and -O2, and
timed:

  t/helper/test-tool sha1 <foo.rand

Then I did the same, but building with -DNO_UNALIGNED_LOADS. The latter
actually ran faster, by a small margin. Here are the hyperfine results:

  [stock]
  Time (mean ± σ):     6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
  Range (min … max):   6.550 s …  6.841 s    10 runs

  [-DNO_UNALIGNED_LOADS]
  Time (mean ± σ):     6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
  Range (min … max):   6.394 s …  6.447 s    10 runs

For casual use as in reftables I doubt the difference is even
measurable. But this result implies that perhaps we ought to just be
using the fallback version all the time.

-Peff
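
For anyone reading along without git.git checked out, here is a rough
sketch of the two get_be32 flavors being compared. These are not the
verbatim compat/bswap.h definitions, and the helper names below are
made up purely for illustration:

  #include <stdint.h>

  /* Byte-by-byte fallback: portable, no alignment assumptions. */
  static inline uint32_t fallback_get_be32(const void *p)
  {
  	const uint8_t *b = p;
  	return ((uint32_t)b[0] << 24) |
  	       ((uint32_t)b[1] << 16) |
  	       ((uint32_t)b[2] <<  8) |
  	       ((uint32_t)b[3] <<  0);
  }

  /*
   * Word-at-a-time flavor: one 32-bit load followed by a byte swap
   * (assuming a little-endian host such as x86). The cast relies on
   * the platform tolerating unaligned word loads, which is exactly
   * what NO_UNALIGNED_LOADS turns off.
   */
  static inline uint32_t unaligned_get_be32(const void *p)
  {
  	return __builtin_bswap32(*(const uint32_t *)p);
  }

On x86 the word-at-a-time version compiles down to a load plus a bswap,
which is presumably why it was expected to be faster; the numbers above
suggest the byte-by-byte loop is at least competitive there.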