Re: [PATCH v2 2/2] Increase minimum git commit ID abbreviation to 16 characters

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 5 Dec 2024 11:19:18 -0800

On Thu, 5 Dec 2024 at 10:16, Geert Uytterhoeven <geert+renesas@xxxxxxxxx> wrote:
>
> Hence according to the Birthday Paradox, collisions of 12-chararacter
> git commit IDs are imminent, or already happening.

Note that ambiguous commit IDs are not even remotely as scary as this implies.

Yes, the current kernel tree has over ten million objects, and when
you look at stable trees etc, you can easily see more.

But commits are only a fraction (about 1/8th) of the total objects. My
tree is at about 1.3M commits, so we're basically an order of
magnitude off the point where collisions start being an issue wrt
commit IDs.

Can you find collisions by looking at all objects? Yes. Git will do
that for you, and tell you their types. But to take one recent
example, let's do the 6.12 commit:
adc218676eef25575469234709c2d87185ca223a. To get an ambiguous ID, you
have to go down to 6 characters, and even then git will tell you
there's only one object that is a commit, ie

   $ git show adc218

results in

  error: short object ID adc218 is ambiguous
  hint: The candidates are:
  hint:   adc218676eef commit 2024-11-17 - Linux 6.12
  hint:   adc2184009c5 blob

so right now you have a collision in six digits for that commit, but
even then it's actually still entirely unambiguous once you know
you're talking about a commit.

Are there worse cases? Yup. With just 7 characters, you get commits
like 95b861a that actually have three ambiguous commit IDs. And you
still get ambiguous results with 9 characters.

With 10 characters, there are no collisions. So the "we're an order of
magnitude off" seems about right - you get slightly more than one
order of magnitude for each two digits.

And remember: we're an order of magnitude off *AFTER 20 YEARS OF GIT HISTORY*.

Furthermore, the "in the future" argument is bogus. Yes, there will be
more commits in the future, but it's not going to suddenly make old
SHA ID's somehow more ambiguous, since you can also take history into
account - and when quoting the short format it should always be
accompanied by the first line of the commit message too.

Why do I care? Because long git commit IDs are actually detrimental to
legibility. I try to make commit messages legible, and that very much
is the *point* of the short format. It's for people, not machinery.

Yes, the basic git machinery doesn't do object type disambiguation
(and if you do "git show", you can give it blob IDs etc, so git itself
may not know about the proper type to use disambiguate at all). And
git also doesn't know about the whole "we also put the first line of
the commit message" thing.

But honestly, I'm claiming that something like

    Fixes: 48bcda684823 ("tracing: Remove definition of trace_*_rcuidle()")

(to pick a random recent commit) is completely unambiguous for the
intended audience, and will remain so forever within the context that
it is in.

And I think the "intended audience" here is important. 12 characters
is already line noise, and causes occasional odd line wrapping (you
don't see that in things like the "Fixes:" tags, but you do see it in
the better commit messages that refer to the commits they fix).

I think we should accept that it's not the full SHA1, and also accept
what that really means.

Final note: personally, I find that the SHA1 - shortened or not - is
often *less* descriptive than the shortlog, for the simple reason that
rebasing happens, and people refer to other commits with stale commit
IDs. That's an issue that I personally hit regularly, and it has a
fairly simple solution in the form of

    git log --grep="..one-liner goes here.."

and my point here is that if you rely too much on the SHA1, your
workflow is *ALREADY* broken, and it has nothing to do with the
shortening.

Put another way: if you have particular tooling that you worry about,
I think you should look at the tooling. You can find real examples of
much shorted commit IDs in the kernel, and real examples of the MUCH
MORE REAL issue of wrong commit ID's right now.

See for example:

   0a1336c8c935 ("IB/ipath: Fix IRQ for PCI Express HCAs")

which refers to commit 51f65ebc ("IB/ipath - program intconfig
register using new HT irq hook"), which is still perfectly unique, but
then look at

   2e61c646edfa ("mlx4_core: Use mmiowb() to avoid firmware commands
getting jumbled up")

which refers to commit 66547550 ("IB/mthca: Use mmiowb() to avoid
firmware commands getting jumbled up"). That commit doesn't exist at
all - it's not ambiguous due to being short, it's ambiguous due to
being *wrong* (presumably due to a rebase)(.

The real commit ID? 76d7cc0345a0. Easily found using the
human-readable shortlog,

So here's the meat of the argument: you are barking up the wrong tree.
We have real and present issues that have been going on since at least
2007, and they have *nothing* to do with the short SHA1s.

I don't want to make the short SHA1's worse, when the real and present
problems are elsewhere.

Make the tools deal with the cases we already have, and you'll find
that the shortening is a complete non-issue.

                Linus