On Wed, Aug 22, 2018 at 8:20 AM Jeff King <peff@xxxxxxxx> wrote: > > On Wed, Aug 22, 2018 at 05:36:26AM +0000, brian m. carlson wrote: > > > On Tue, Aug 21, 2018 at 11:03:44PM -0400, Jeff King wrote: > > > So I wonder if there's some other way to tell the compiler that we'll > > > only have a few values. An enum comes to mind, though I don't think the > > > enum rules are strict enough to make this guarantee (after all, it's OK > > > to bitwise-OR enums, so they clearly don't specify all possible values). > > > > I was thinking about this: > > > > diff --git a/cache.h b/cache.h > > index 1398b2a4e4..1f5c6e9319 100644 > > --- a/cache.h > > +++ b/cache.h > > @@ -1033,7 +1033,14 @@ extern const struct object_id null_oid; > > > > static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2) > > { > > - return memcmp(sha1, sha2, the_hash_algo->rawsz); > > + switch (the_hash_algo->rawsz) { > > + case 20: > > + return memcmp(sha1, sha2, 20); > > + case 32: > > + return memcmp(sha1, sha2, 32); > > + default: > > + assert(0); > > + } > > } > > Unfortunately this version doesn't seem to be any faster than the status > quo. And looking at the generated asm, it still looks to be calling > memcpy(). Removing the "case 32" branch switches it back to fast > assembly (this is all using gcc 8.2.0, btw). So I think we're deep into > guessing what the optimizer is going to do, and there's a good chance > that other versions are going to optimize it differently. > > We might be better off just writing it out manually. Unfortunately, it's > a bit hard because the neg/0/pos return is more expensive to compute > than pure equality. And only the compiler knows at each inlined site > whether we actually want equality. So now we're back to switching every > caller to use hasheq() if that's what they want. > > But _if_ we're OK with that, and _if_ we don't mind some ifdefs for > portability, then this seems as fast as the original (memcmp+constant) > code on my machine: > > diff --git a/cache.h b/cache.h > index b1fd3d58ab..c406105f3c 100644 > --- a/cache.h > +++ b/cache.h > @@ -1023,7 +1023,16 @@ extern const struct object_id null_oid; > > static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2) > { > - return memcmp(sha1, sha2, the_hash_algo->rawsz); > + switch (the_hash_algo->rawsz) { > + case 20: > + if (*(uint32_t *)sha1 == *(uint32_t *)sha2 && > + *(unsigned __int128 *)(sha1+4) == *(unsigned __int128 *)(sha2+4)) > + return 0; > + case 32: > + return memcmp(sha1, sha2, 32); > + default: > + assert(0); > + } > } > > static inline int oidcmp(const struct object_id *oid1, const struct object_id *oid2) > > Which is really no surprise, because the generated asm looks about the > same. There are obviously alignment questions there. It's possible it > could even be written portably as a simple loop. Or maybe not. We used > to do that, but modern compilers were able to optimize the memcmp > better. Maybe that's changed. Or maybe they were simply unwilling to > unroll a 20-length loop to find out that it could be turned into a few > quad-word compares. > > > That would make it obvious that there are at most two options. > > Unfortunately, gcc for me determines that the buffer in walker.c is 20 > > bytes in size and steadfastly refuses to compile because it doesn't know > > that the value will never be 32 in our codebase currently. I'd need to > > send in more patches before it would compile. > > Yeah, I see that warning all over the place (everywhere that calls > is_null_oid(), which is passing in a 20-byte buffer). > > > I don't know if something like this is an improvement or now, but this > > seems to at least compile: > > > > diff --git a/cache.h b/cache.h > > index 1398b2a4e4..3207f74771 100644 > > --- a/cache.h > > +++ b/cache.h > > @@ -1033,7 +1033,13 @@ extern const struct object_id null_oid; > > > > static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2) > > { > > - return memcmp(sha1, sha2, the_hash_algo->rawsz); > > + switch (the_hash_algo->rawsz) { > > + case 20: > > + case 32: > > + return memcmp(sha1, sha2, the_hash_algo->rawsz); > > + default: > > + assert(0); > > + } > > I think that would end up with the same slow code, as gcc would rather > call memcmp than expand out the two sets of asm. > > > I won't have time to sit down and test this out until tomorrow afternoon > > at the earliest. If you want to send in something in the mean time, > > even if that limits things to just 20 for now, that's fine. > > I don't have a good option. The assert() thing works until I add in the > "32" branch, but that's just punting the issue off until you add support > for the new hash. > > Hand-rolling our own asm or C is a portability headache, and we need to > change all of the callsites to use a new hasheq(). > > Hiding it behind a per-hash function is conceptually cleanest, but not > quite as fast. And it also requires hasheq(). > > So all of the solutions seem non-trivial. Again, I'm starting to wonder > if it's worth chasing this few percent. Did you try __builtin_expect? It's a GCC builtin for these sorts of situations, and sometimes helps: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html I.e. you'd tell GCC we expect to have the 20 there with: if (__builtin_expect(the_hash_algo->rawsz == 20, 1)) { ... } The perl codebase has LIKELY() and UNLIKELY() macros for this which if the feature isn't available fall back on just plain C code: https://github.com/Perl/perl5/blob/v5.27.7/perl.h#L3335-L3344