On Thu, Mar 10, 2011 at 5:33 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> That 979f79 one already have enough other objects with similar names, so
> compared to 83c3c that doesn't, it is natural that you would need more
> digits to protect its uniqueness, no? The result shouldn't be affected by
> the value of "short" as long as it is not long enough, as that is merely
> specifying "at least this many letters"

Yes, uniqueness in that sense is sane and has a good definition.

But that's _not_ the case when you then randomly add an extra <n> digits to
it. Why? Because that <n> is meaningless: what <n> means depends on what
the base number was, and the base number is different for different
objects. The case of n=0 is special, because it is the "current state".
But what does "n=1" mean?

Let me make it more explicit with an extreme example. It's extreme only
because I'm going to assume that the shortest abbreviation is 1 (in
reality it's 4), but that doesn't really change the math, it just makes
the numbers smaller and easier.

So let's say we have a repository with just 100 objects. What does that
mean? In practical terms, it means it is not impossible that some object
will be unique in a single digit (yes, that may be a bit unlikely, but
it's not unreasonable), while other objects will need three digits. And
most will be unique in two.

Shortening the numbers that way has a _meaning_: the notion of "unique"
is clearly meaningful. Sure, different objects get different lengths, but
there is a very real reason for the different lengths.

So then (again, to keep the numbers small and the math simple), let's
assume abbrevguard=1. What does that MEAN? I claim it means something
totally _different_ for the different objects, and that's the crazy thing.
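To illustrate, here's a quick sketch (random 160-bit names standing in for
object hashes; nothing git-specific, and the seed is arbitrary) of how the
shortest-unique lengths actually distribute over 100 objects:

```python
import random
from collections import Counter

random.seed(7)  # arbitrary, just to make the sketch reproducible

# Model of the 100-object repository: 100 random 40-hex-digit names.
names = ["%040x" % random.getrandbits(160) for _ in range(100)]

def shortest_unique(name, all_names):
    """Smallest prefix length of `name` that no other object shares."""
    for k in range(1, 41):
        prefix = name[:k]
        if sum(1 for n in all_names if n.startswith(prefix)) == 1:
            return k
    return 40

lengths = [shortest_unique(n, names) for n in names]
print(Counter(lengths))  # how many objects need 1, 2, 3... digits
```

With these numbers you should typically see most objects unique in two
digits, the rest mostly needing three, and only very rarely anything
unique in a single digit.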
Because now we're talking about possible _future_ objects, and the likely
_future_ uniqueness of "unique in one digit" is TOTALLY DIFFERENT from the
future uniqueness of "unique in three digits"! The single-digit uniqueness
is going to be gone _long long_ before the three-digit uniqueness is.

Adding a single digit to the object that currently happens to need only a
single digit will _not_ do a whole lot of future-proofing: if you add
another one hundred objects, that object may well need three digits to be
unique. But if you add a single digit to the one that currently already
needs three digits, you'd likely have to add an order of magnitude more
objects before those three digits need to become four.

See what I'm trying to say? This is why I think abbrevguard is a broken
concept when it is relative to "how unique is the object now". If the
abbrevguard were relative to the maximum number of digits required for
_any_ object in the current repository, it would be meaningful: it would
actually be about the _size_ of the current repository, and thus
indirectly about the size of a future one. But it isn't. It's always
relative to the "local uniqueness", which is only valid for the *current*
state and has very little to do with future growth.

Now, to put things in terms of a real repository ("git" itself), take two
extreme cases from it:

 - commit 1dae(0d38b8119de2b67f87e800c860ed958bbed6):
   currently unique in four digits

 - commit 979f7929(51913d75f992f87022b75610303a614f):
   currently unique in eight digits

and think about what "abbrevguard" means for those two commits. Let's pick
an abbrev-guard of two digits. For the first one, that means you use six
digits total; for the second one, you'd use ten digits total.

What does that mean for future work? How many objects do we need to add
before clashes start happening? For the first commit, it's _almost
certain_ that if you double the size of the repository, those six digits
will no longer be unique.
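The asymmetry between those two commits can be put in numbers. A sketch,
assuming only that new object names are uniformly random: every extra hex
digit cuts the chance of a given new object matching an abbreviation by a
factor of 16.

```python
def clash_probability(k):
    """Chance that one new, uniformly random object name starts
    with a given k-hex-digit abbreviation."""
    return 16.0 ** -k

# With the two-digit abbrev-guard above: six digits vs ten digits.
print(clash_probability(6), clash_probability(10))

# Relative exposure of the bare four-digit vs eight-digit abbreviations:
print(clash_probability(4) / clash_probability(8))
```

The second print is 16**4 = 65536: per new object, the four-digit
abbreviation is that many times more exposed than the eight-digit one.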
For the second case? I can pretty much guarantee that EVEN IF you didn't
have any abbrev-guard at all, and you doubled the size of the git
repository, the thing would still be unique in eight digits. Why? It's
simply *much*much* less likely that new objects will clash in eight
digits. The likelihood of a clash with the currently-unique four-digit
object is 16^4 = 65536 times higher than a clash with the
currently-unique eight-digit shortening.

So it's senseless to add an equal number of digits to the two objects.
They simply don't have the same likelihood of future collisions.

So what is mathematically the sensible thing? It's actually to extend
both objects to the _same_ number of digits. It's _more_ sensible to
extend the current four-digit number to eight digits than it is to extend
the one that is currently unique in eight digits by even a single digit.
That would future-proof things a fair amount, exactly because the
likelihoods of future objects clashing with the two objects are totally
different.

That's why I said it would be sensible to make the abbrevguard relative
to the current worst-case uniqueness. Because THAT actually is what is
probable. If we currently have a maximum uniqueness requirement of 8
characters, it is _probable_ that by the time the project has grown by a
factor of 4, we'll need 9 characters (I think; I may have gotten the math
wrong).

But it is somewhat expensive to calculate "max current uniqueness", so I
would suggest ditching the whole notion of "how many extra digits do I
need for future-proofing", and going for just setting the absolute value
of DEFAULT_ABBREV.

			Linus