Re: [PATCH] git gc: Speed it up by 18% via faster hash comparisons

On Thu, Apr 28, 2011 at 5:14 PM, Ingo Molnar <mingo@xxxxxxx> wrote:
>
> * Ingo Molnar <mingo@xxxxxxx> wrote:
>
>> * Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
>>
>> > Hi,
>> >
>> > A side note for amusement.
>> >
>> > Erik Faye-Lund wrote:
>> >
>> > > --- a/cache.h
>> > > +++ b/cache.h
>> > > @@ -681,13 +681,17 @@ extern char *sha1_pack_name(const unsigned char *sha1);
>> > >  extern char *sha1_pack_index_name(const unsigned char *sha1);
>> > >  extern const char *find_unique_abbrev(const unsigned char *sha1, int);
>> > >  extern const unsigned char null_sha1[20];
>> > > -static inline int is_null_sha1(const unsigned char *sha1)
>> > > +static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2)
>> > >  {
>> > > - return !memcmp(sha1, null_sha1, 20);
>> > > + /* early out for fast mis-match */
>> > > + if (*sha1 != *sha2)
>> > > +         return *sha1 - *sha2;
>> > > +
>> > > + return memcmp(sha1 + 1, sha2 + 1, 19);
>> > >  }
>> >
>> > On the off-chance that sha1 and sha2 are nicely aligned, a more
>> > redundant
>> >
>> >     if (*sha1 != *sha2)
>> >             return *sha1 - *sha2;
>> >
>> >     return memcmp(sha1, sha2, 20);
>> >
>> > would take advantage of that (yes, this is just superstition, but it
>> > somehow seems comforting anyway).
>>
>> Your variant also makes the code slightly more compact as the sha1+1 and sha2+1
>> addresses do not have to be computed. I'll re-test and resend this variant.
>
> Seems to perform measurably worse:
>
>  #
>  # Open-coded loop:
>  #
>  Performance counter stats for './git gc' (10 runs):
>
>       2358.560100 task-clock               #    0.763 CPUs utilized            ( +-  0.06% )
>             1,870 context-switches         #    0.001 M/sec                    ( +-  3.09% )
>               170 CPU-migrations           #    0.000 M/sec                    ( +-  3.54% )
>            38,230 page-faults              #    0.016 M/sec                    ( +-  0.03% )
>     7,513,529,543 cycles                   #    3.186 GHz                      ( +-  0.06% )
>     1,634,103,128 stalled-cycles           #   21.75% of all cycles are idle   ( +-  0.28% )
>    11,068,971,207 instructions             #    1.47  insns per cycle
>                                            #    0.15  stalled cycles per insn  ( +-  0.04% )
>     2,487,656,519 branches                 # 1054.735 M/sec                    ( +-  0.03% )
>        59,233,604 branch-misses            #    2.38% of all branches          ( +-  0.09% )
>
>        3.092183093  seconds time elapsed  ( +-  3.49% )
>
>  #
>  # Front test + memcmp:
>  #
>  Performance counter stats for './git gc' (10 runs):
>
>       2723.468639 task-clock               #    0.833 CPUs utilized            ( +-  0.22% )
>             1,751 context-switches         #    0.001 M/sec                    ( +-  2.02% )
>               167 CPU-migrations           #    0.000 M/sec                    ( +-  1.23% )
>            38,230 page-faults              #    0.014 M/sec                    ( +-  0.03% )
>     8,684,682,538 cycles                   #    3.189 GHz                      ( +-  0.21% )
>     2,062,906,208 stalled-cycles           #   23.75% of all cycles are idle   ( +-  0.60% )
>     9,019,624,641 instructions             #    1.04  insns per cycle
>                                            #    0.23  stalled cycles per insn  ( +-  0.04% )
>     1,771,179,402 branches                 #  650.340 M/sec                    ( +-  0.04% )
>        75,026,810 branch-misses            #    4.24% of all branches          ( +-  0.04% )
>
>        3.271415104  seconds time elapsed  ( +-  1.97% )
>
> So i think the open-coded loop variant i posted is faster.
>
> The key observation is that there's two cases that matter to performance:
>
>  - the hashes are different: in this case the front test catches 99% of the cases
>  - the hashes are *equal*: in this case the open-coded loop performs better than the memcmp
>
> My patch addresses both cases.
>
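For reference, the open-coded loop variant benchmarked above is along these lines. This is a hedged reconstruction in the spirit of the patch under discussion, not necessarily the exact code that was posted:

```c
#include <string.h>

/* Sketch of an open-coded, early-out 20-byte hash compare.
 * Reconstruction based on the thread; not the verbatim patch. */
static inline int hashcmp_loop(const unsigned char *sha1,
			       const unsigned char *sha2)
{
	int i;

	/* Byte-by-byte loop: for mismatching hashes the first byte
	 * differs the vast majority of the time, so the loop usually
	 * exits on the first iteration, with no out-of-line memcmp
	 * call at all.  For equal hashes it runs all 20 iterations. */
	for (i = 0; i < 20; i++) {
		if (sha1[i] != sha2[i])
			return sha1[i] - sha2[i];
	}
	return 0;
}
```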

Thanks. I also timed this on my end (on Windows), and I came to the same
conclusion, although the improvement from your original patch was somewhat
smaller for me; that could be due to the test case. It seems the early-out
wasn't the only reason your original patch performed faster. It could be
that memcmp (probably) didn't get inlined, and the overhead of the extra
function call outweighs whatever the library version gains. Or there's
something else going on that affects both glibc and msvcrt.dll.
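The inlining hypothesis is straightforward to probe: calling memcmp through a volatile function pointer keeps the compiler from expanding it as a builtin, so timing that against the open-coded loop isolates the cost of an out-of-line call. A minimal sketch (the helper names here are mine, not from the thread):

```c
#include <stddef.h>
#include <string.h>

/* Open-coded loop, as in the patch under discussion. */
static int loop_cmp(const unsigned char *a, const unsigned char *b)
{
	int i;

	for (i = 0; i < 20; i++)
		if (a[i] != b[i])
			return a[i] - b[i];
	return 0;
}

/* A volatile function pointer prevents the compiler from expanding
 * memcmp inline, approximating a genuine out-of-line libc call. */
static int (*volatile memcmp_fn)(const void *, const void *, size_t) = memcmp;

static int call_cmp(const unsigned char *a, const unsigned char *b)
{
	return memcmp_fn(a, b, 20);
}
```

Timing both helpers over a workload of mostly-unequal hashes (the git gc case) would show whether call overhead alone explains the gap, or whether the library memcmp's setup cost for short buffers also plays a part.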
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html