On Thu, 5 Dec 2024 at 10:16, Geert Uytterhoeven <geert+renesas@xxxxxxxxx> wrote: > > Hence according to the Birthday Paradox, collisions of 12-chararacter > git commit IDs are imminent, or already happening. Note that ambiguous commit IDs are not even remotely as scary as this implies. Yes, the current kernel tree has over ten million objects, and when you look at stable trees etc, you can easily see more. But commits are only a fraction (about 1/8th) of the total objects. My tree is at about 1.3M commits, so we're basically an order of magnitude off the point where collisions start being an issue wrt commit IDs. Can you find collisions by looking at all objects? Yes. Git will do that for you, and tell you their types. But to take one recent example, let's do the 6.12 commit: adc218676eef25575469234709c2d87185ca223a. To get an ambiguous ID, you have to go down to 6 characters, and even then git will tell you there's only one object that is a commit, ie $ git show adc218 results in error: short object ID adc218 is ambiguous hint: The candidates are: hint: adc218676eef commit 2024-11-17 - Linux 6.12 hint: adc2184009c5 blob so right now you have a collision in six digits for that commit, but even then it's actually still entirely unambiguous once you know you're talking about a commit. Are there worse cases? Yup. With just 7 characters, you get commits like 95b861a that actually have three ambiguous commit IDs. And you still get ambiguous results with 9 characters. With 10 characters, there are no collisions. So the "we're an order of magnitude off" seems about right - you get slightly more than one order of magnitude for each two digits. And remember: we're an order of magnitude off *AFTER 20 YEARS OF GIT HISTORY*. Furthermore, the "in the future" argument is bogus. Yes, there will be more commits in the future, but it's not going to suddenly make old SHA ID's somehow more ambiguous, since you can also take history into account - and when quoting the short format it should always be accompanied by the first line of the commit message too. Why do I care? Because long git commit IDs are actually detrimental to legibility. I try to make commit messages legible, and that very much is the *point* of the short format. It's for people, not machinery. Yes, the basic git machinery doesn't do object type disambiguation (and if you do "git show", you can give it blob IDs etc, so git itself may not know about the proper type to use disambiguate at all). And git also doesn't know about the whole "we also put the first line of the commit message" thing. But honestly, I'm claiming that something like Fixes: 48bcda684823 ("tracing: Remove definition of trace_*_rcuidle()") (to pick a random recent commit) is completely unambiguous for the intended audience, and will remain so forever within the context that it is in. And I think the "intended audience" here is important. 12 characters is already line noise, and causes occasional odd line wrapping (you don't see that in things like the "Fixes:" tags, but you do see it in the better commit messages that refer to the commits they fix). I think we should accept that it's not the full SHA1, and also accept what that really means. Final note: personally, I find that the SHA1 - shortened or not - is often *less* descriptive than the shortlog, for the simple reason that rebasing happens, and people refer to other commits with stale commit IDs. That's an issue that I personally hit regularly, and it has a fairly simple solution in the form of git log --grep="..one-liner goes here.." and my point here is that if you rely too much on the SHA1, your workflow is *ALREADY* broken, and it has nothing to do with the shortening. Put another way: if you have particular tooling that you worry about, I think you should look at the tooling. You can find real examples of much shorted commit IDs in the kernel, and real examples of the MUCH MORE REAL issue of wrong commit ID's right now. See for example: 0a1336c8c935 ("IB/ipath: Fix IRQ for PCI Express HCAs") which refers to commit 51f65ebc ("IB/ipath - program intconfig register using new HT irq hook"), which is still perfectly unique, but then look at 2e61c646edfa ("mlx4_core: Use mmiowb() to avoid firmware commands getting jumbled up") which refers to commit 66547550 ("IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up"). That commit doesn't exist at all - it's not ambiguous due to being short, it's ambiguous due to being *wrong* (presumably due to a rebase)(. The real commit ID? 76d7cc0345a0. Easily found using the human-readable shortlog, So here's the meat of the argument: you are barking up the wrong tree. We have real and present issues that have been going on since at least 2007, and they have *nothing* to do with the short SHA1s. I don't want to make the short SHA1's worse, when the real and present problems are elsewhere. Make the tools deal with the cases we already have, and you'll find that the shortening is a complete non-issue. Linus