On Thu, Mar 10, 2011 at 5:33 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> That 979f79 one already have enough other objects with similar names, so
> compared to 83c3c that doesn't, it is natural that you would need more
> digits to protect its uniqueness, no? The result shouldn't be affected by
> the value of "short" as long as it is not long enough, as that is merely
> specifying "at least this many letters"

Yes, uniqueness in that sense is sane and has a good definition.

But that's _not_ the case when you then randomly add an extra <n> digits to
it. Why? Because that <n> is meaningless: what <n> means depends on what
the base number was, and the base number is different for different
objects. The case of n=0 is special, because it is the "current state".
But what does "n=1" mean?

Let me make it more explicit with an extreme example. It's extreme only
because I'm going to assume that the shortest abbreviation is 1 (in
reality it's 4), but that doesn't really change the math, it just makes
the numbers smaller and easier.

So let's say we have a repository with just 100 objects. What does that
mean? In practical terms, it means it is not impossible that some object
will be unique in a single digit (yes, that may be a bit unlikely, but
it's not unreasonable), while other objects will need three digits. And
most will be unique in two.

Shortening the numbers that way has a _meaning_: the notion of "unique"
is clearly meaningful. Sure, different objects get different lengths, but
there is a very real reason for the different lengths.

So then (again, to keep the numbers small and the math simple), let's
assume abbrevguard=1. What does that MEAN? I claim it means something
totally _different_ for the different objects, and that's the crazy thing.
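To illustrate, here's a quick sketch (random 160-bit names standing in for
object hashes; nothing git-specific, and the seed is arbitrary) of how the
shortest-unique lengths actually distribute over 100 objects:

```python
import random
from collections import Counter

random.seed(7)  # arbitrary, just to make the sketch reproducible

# Model of the 100-object repository: 100 random 40-hex-digit names.
names = ["%040x" % random.getrandbits(160) for _ in range(100)]

def shortest_unique(name, all_names):
    """Smallest prefix length of `name` that no other object shares."""
    for k in range(1, 41):
        prefix = name[:k]
        if sum(1 for n in all_names if n.startswith(prefix)) == 1:
            return k
    return 40

lengths = [shortest_unique(n, names) for n in names]
print(Counter(lengths))  # how many objects need 1, 2, 3... digits
```

With these numbers you should typically see most objects unique in two
digits, the rest mostly needing three, and only very rarely anything
unique in a single digit.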
Because now we're talking about possible _future_ objects, and the likely
_future_ uniqueness of "unique in one digit" is TOTALLY DIFFERENT from the
future uniqueness of "unique in three digits"! The single-digit uniqueness
is going to be gone _long long_ before the three-digit uniqueness is.

Adding a single digit to the object that currently happens to need only a
single digit will _not_ do a whole lot of future-proofing: if you add
another one hundred objects, that object may well need three digits to be
unique. But if you add a single digit to the one that currently already
needs three digits, you'd likely have to add an order of magnitude more
objects before those three digits need to become four.

See what I'm trying to say? This is why I think abbrevguard is a broken
concept when it is relative to "how unique is the object now". If the
abbrevguard were relative to the maximum number of digits required for
_any_ object in the current repository, it would be meaningful: it would
actually be about the _size_ of the current repository, and thus
indirectly about the size of a future one. But it isn't. It's always
relative to the "local uniqueness", which is only valid for the *current*
state and has very little to do with future growth.

Now, to put things in terms of a real repository ("git" itself), take two
extreme cases from it:

 - commit 1dae(0d38b8119de2b67f87e800c860ed958bbed6):
   currently unique in four digits

 - commit 979f7929(51913d75f992f87022b75610303a614f):
   currently unique in eight digits

and think about what "abbrevguard" means for those two commits. Let's pick
an abbrev-guard of two digits. For the first one, that means you use six
digits total; for the second one, you'd use ten digits total.

What does that mean for future work? How many objects do we need to add
before clashes start happening? For the first commit, it's _almost
certain_ that if you double the size of the repository, those six digits
will no longer be unique.
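The asymmetry between those two commits can be put in numbers. A sketch,
assuming only that new object names are uniformly random: every extra hex
digit cuts the chance of a given new object matching an abbreviation by a
factor of 16.

```python
def clash_probability(k):
    """Chance that one new, uniformly random object name starts
    with a given k-hex-digit abbreviation."""
    return 16.0 ** -k

# With the two-digit abbrev-guard above: six digits vs ten digits.
print(clash_probability(6), clash_probability(10))

# Relative exposure of the bare four-digit vs eight-digit abbreviations:
print(clash_probability(4) / clash_probability(8))
```

The second print is 16**4 = 65536: per new object, the four-digit
abbreviation is that many times more exposed than the eight-digit one.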
For the second case? I can pretty much guarantee that EVEN IF you didn't
have any abbrev-guard at all, and you doubled the size of the git
repository, the thing would still be unique in eight digits. Why? It's
simply *much*much* less likely that new objects will clash in eight
digits. The likelihood of a clash with the currently-unique four-digit
object is 16^4 = 65536 times higher than a clash with the
currently-unique eight-digit shortening.

So it's senseless to add an equal number of digits to the two objects.
They simply don't have the same likelihood of future collisions.

So what is mathematically the sensible thing? It's actually to extend
both objects to the _same_ number of digits. It's _more_ sensible to
extend the current four-digit number to eight digits than it is to extend
the one that is currently unique in eight digits by even a single digit.
That would future-proof things a fair amount, exactly because the
likelihoods of future objects clashing with the two objects are totally
different.

That's why I said it would be sensible to make the abbrevguard relative
to the current worst-case uniqueness. Because THAT actually is what is
probable. If we currently have a maximum uniqueness requirement of 8
characters, it is _probable_ that by the time the project has grown by a
factor of 4, we'll need 9 characters (I think; I may have gotten the math
wrong).

But it is somewhat expensive to calculate "max current uniqueness", so I
would suggest ditching the whole notion of "how many extra digits do I
need for future-proofing", and going for just setting the absolute value
of DEFAULT_ABBREV.

			Linus