Re: [PATCH 4/4] core.abbrev: raise the default abbreviation to 12 hexdigits

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 29 Sep 2016 12:40:43 -0700

On Thu, Sep 29, 2016 at 12:16 PM, Jeff King <peff@xxxxxxxx> wrote:
>
> Hmm. So at length 7, we expect collisions at 2^14, which is 16384. That
> seems really low. I mean, by the birthday paradox that's where expect
> a 50% chance of a collision. But that's a single collision. We
> definitely don't expect them to be common at that size.
>
> So I suspect this could be a bit looser.

So I have to admit that I was surprised by how quickly it actually
decided that 7 isn't enough. In fact, the reason I initially said that
git used 8 digits was that I didn't count very closely, and just
verified that it was more than the default 7.

But quite frankly, I think the math is correct, and part of that is
that the logic is all about not just the current state, but the
"reasonably near future".

So it is indeed fairly aggressive, and the moment you have more
objects than the "we'd expect to probably see  _one_ collision" it
grows the size. But looking at the kernel situation, that really is
what we'd want, because the whole problem with the existing code is
that it only takes the *current* situation into account. That's what
we want to get away from. We want git to pick a number that is sane
from a standpoint of "this project is still growing".

And git _already_ has commits that are ambiguous in 8 hex digits and
need 9. Yes, it's rare today, but the reason I'm telling kernel
developers to use 12 is because while a size-11 collision is very rare
today, it does actually happen, and we want o pick a value where it is
rare enough that even in the near future it's not going to be a big
deal.

Don't get me wrong: collisions aren't fatal. So it's not like we have
to absolutely avoid them, and I really like your patch series exactly
because it makes collisions even less of a deal (particularly since I
expect people will not upgrade immediately, so we'll continue to see
even new 7-hex-digit short forms even in the kernel). So it's a
balance of making the hex string long enough that it's simply not a
big worry.

So I'm sure it *could* be looser, but I actually also really suspect
that git truly *should* use a 9-digit abbreviation rather than 8 (and
7 is definitely starting to be borderline, I think).

> As far as the implementation, I was surprised to see it touch
> get_short_sha1() at all. That's, after all, for lookups, and we would
> never want to require more characters on the reading side.

Heh. The implementation is crap. It was literally a "how can I make
the smallest possible patch" implementation. I was finishing it off
while at a talk by Nicolas Pitre at Linaro Connect where I am right
now.

So I agree - it does extra work just because that's where it all
slotted in with minimal effort.

At a minimum, once it finds a good new default, it should just memoize
that. So a minimal fix to the "it's stupldly recalculating things over
rand over again" would be to just set "default_abbrev" to the value it
finds acceptable after the first time it finds something, so that it
doesn't end up looping _again_ in the future.

But you could easily also just instead have it do something like

      if (default_abbrev < 0)
            default_abbrev = initialize_abbrev();

at startup time if "abbrev_commit" is set, and just do it once and for
all rather rthan the odd loping behavior.

I really just wanted to see how well the concept worked, and I was
happy to see that it gave what I thought were the "correct" numbers.
And the loop was salready there ...

            Linus