On Thu, Sep 29, 2016 at 12:16 PM, Jeff King <peff@xxxxxxxx> wrote: > > Hmm. So at length 7, we expect collisions at 2^14, which is 16384. That > seems really low. I mean, by the birthday paradox that's where expect > a 50% chance of a collision. But that's a single collision. We > definitely don't expect them to be common at that size. > > So I suspect this could be a bit looser. So I have to admit that I was surprised by how quickly it actually decided that 7 isn't enough. In fact, the reason I initially said that git used 8 digits was that I didn't count very closely, and just verified that it was more than the default 7. But quite frankly, I think the math is correct, and part of that is that the logic is all about not just the current state, but the "reasonably near future". So it is indeed fairly aggressive, and the moment you have more objects than the "we'd expect to probably see _one_ collision" it grows the size. But looking at the kernel situation, that really is what we'd want, because the whole problem with the existing code is that it only takes the *current* situation into account. That's what we want to get away from. We want git to pick a number that is sane from a standpoint of "this project is still growing". And git _already_ has commits that are ambiguous in 8 hex digits and need 9. Yes, it's rare today, but the reason I'm telling kernel developers to use 12 is because while a size-11 collision is very rare today, it does actually happen, and we want o pick a value where it is rare enough that even in the near future it's not going to be a big deal. Don't get me wrong: collisions aren't fatal. So it's not like we have to absolutely avoid them, and I really like your patch series exactly because it makes collisions even less of a deal (particularly since I expect people will not upgrade immediately, so we'll continue to see even new 7-hex-digit short forms even in the kernel). So it's a balance of making the hex string long enough that it's simply not a big worry. So I'm sure it *could* be looser, but I actually also really suspect that git truly *should* use a 9-digit abbreviation rather than 8 (and 7 is definitely starting to be borderline, I think). > As far as the implementation, I was surprised to see it touch > get_short_sha1() at all. That's, after all, for lookups, and we would > never want to require more characters on the reading side. Heh. The implementation is crap. It was literally a "how can I make the smallest possible patch" implementation. I was finishing it off while at a talk by Nicolas Pitre at Linaro Connect where I am right now. So I agree - it does extra work just because that's where it all slotted in with minimal effort. At a minimum, once it finds a good new default, it should just memoize that. So a minimal fix to the "it's stupldly recalculating things over rand over again" would be to just set "default_abbrev" to the value it finds acceptable after the first time it finds something, so that it doesn't end up looping _again_ in the future. But you could easily also just instead have it do something like if (default_abbrev < 0) default_abbrev = initialize_abbrev(); at startup time if "abbrev_commit" is set, and just do it once and for all rather rthan the odd loping behavior. I really just wanted to see how well the concept worked, and I was happy to see that it gave what I thought were the "correct" numbers. And the loop was salready there ... Linus