Re: [PATCH] git-disambiguate: disambiguate shorthand git ids

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 26 Dec 2024 15:33:36 -0800

On Thu, 26 Dec 2024 at 14:33, Sasha Levin <sashal@xxxxxxxxxx> wrote:
>
> Which means that folks should be able to use a fairly short abbreviated
> commit IDs in messages, specially for commits with a long subject line.

So I don't think we should take this as a way to make *shorter* IDs,
just as a way to accept historical short IDs.

Also, I absolutely detest how you made this be all about "short IDs".

As mentioned in my very original email on this matter, the actual REAL
AND PRESENT issue isn't ambiguous IDs. We don't really have them.

What we *do* have is "wrong IDs". We have a ton of those.

Look here, you can get a list of suspiciously wrong SHA1s with
something like this:

    git log |
        grep -E '[0-9a-f]{9,40} \(".*"\)' |
        sed 's/.*[^0-9a-f]\([0-9a-f]\{9,40\}\)[^0-9a-f].*/\1/' |
        sort -u > hexes

which generates a list of things that look like commit IDs (ok,
there's probably smarter ways) in our git logs.

Now, *some* of those won't actually be commit IDs at all, they'll just
be random hex numbers the above finds.

But most of them will indeed be references to other commits.
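
A variant of that extraction can keep the quoted subject next to each
ID, since the subject is what lets you find the right commit when the
ID itself is wrong. This is only a sketch: the function name
`extract_refs` is made up for illustration, and it assumes the ID and
the whole quoted subject sit on one line, so references wrapped across
lines are missed.

```shell
# Hypothetical helper, not part of the original pipeline.
# Reads commit-message text on stdin and prints "ID<TAB>SUBJECT" for
# every line that contains something shaped like: <9-40 hex> ("subject")
extract_refs() {
    tab=$(printf '\t')
    # The first expression prepends a space so an ID at the very start
    # of a line still has a non-hex character in front of it.
    sed -n -e 's/^/ /' \
        -e "s/.*[^0-9a-f]\([0-9a-f]\{9,40\}\) (\"\(.*\)\").*/\1${tab}\2/p"
}
```

It would be used as `git log | extract_refs | sort -u > refs`.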

Then you try to find the bogus ones by doing something like

    cat hexes |
        while read i; do git show "$i" > /dev/null 2>&1 || echo "No $i SHA"; done

and you will get a lot of hits.

A *LOT*.
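
One caveat with the check above is that `git show` succeeds for any
object, so a random hex string that happens to abbreviate a blob or a
tree counts as "found". A slightly stricter sketch asks specifically
for a commit. The function name `missing_commits` is made up for
illustration; `git cat-file -e` and the `^{commit}` peel syntax are
standard git.

```shell
# Hypothetical helper, not from the original mail.
# Reads candidate IDs on stdin, one per line, and reports the ones that
# do not resolve to a commit in the current repository.
missing_commits() {
    while read -r id; do
        # "^{commit}" only resolves if the object is (or peels to) a
        # commit, so unknown IDs and plain blobs or trees both fail.
        git cat-file -e "${id}^{commit}" 2>/dev/null || echo "No $id SHA"
    done
}
```

Run it from inside the kernel tree as `missing_commits < hexes`.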

Look, I didn't check very many of them. Mainly because it gets *so*
many hits, and I get bored very easily.

But I did check a handful, just to get a feel for things.

And yes, some of them were random hex numbers unrelated to actual git
IDs, but most were really supposed to be git IDs. Except they weren't
- or at least not from the mainline tree.

For example, look at commit daa07cbc9ae3 ("KVM: x86: fix L1TF's MMIO
GFN calculation") which references one of those nonexistent commit
IDs:

    Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")

and I have no idea where that bogus commit ID comes from. Maybe it's a
rebase. Maybe it's from a stable tree. But it sure doesn't exist in
mainline.

What *does* exist is commit 28a1f3ac1d0c ("kvm: x86: Set highest
physical address bits in non-present/reserved SPTEs"), which I found
by just doing that

    git log --grep='kvm: x86: Set highest physical address bits in non-present/reserved SPTEs'

and my point is that this is really not about "disambiguating short
SHA1 IDs". Because those "ambiguous" SHA1's to a very close
approximation simply DO NOT EXIST.

But the completely wrong ones? They are plentiful.
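
The `git log --grep` lookup above can be wrapped up roughly like this.
It is a sketch under assumptions, not what any existing tool does: the
function name `find_by_subject` is made up, and it treats an exact
match on the subject line as "the same commit". The extra filter is
there because `--grep` searches the whole message, so the commit that
carries the bogus "Fixes:" line would otherwise match too.

```shell
# Hypothetical helper, not from the original mail.
# Prints the 12-character ID of every commit reachable from HEAD whose
# subject line is exactly "$1".
find_by_subject() {
    subject=$1
    # --fixed-strings stops git from treating "[", "(" and friends in
    # the subject as regex syntax; %x09 is a tab between ID and subject.
    git log --fixed-strings --grep="$subject" \
            --abbrev=12 --format='%h%x09%s' |
        want="$subject" awk -F '\t' '$2 == ENVIRON["want"] { print $1 }'
}
```

If that prints exactly one ID, it is a decent candidate for what the
broken reference meant; zero or several hits still need a human.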

For a completely different example, we have

    ec680c1990e7 ("ide: remove BUG_ON(in_interrupt() || irqs_disabled()) from ide_unregister()")

which has this in the log message:

    Both BUG_ON()s in ide-probe.c were introduced in commit
       4015c949fb465 ("[PATCH] update ide core")

and it turns out that that commit ID doesn't exist in the kernel tree.

It is actually a real commit ID, and it *does* actually exist in the
historical BK import by Thomas:

     https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=4015c949fb465

so that's an example of a cross-tree ID, and yeah, it might be really
cool to "disambiguate" *those* too.

But again, the problem wasn't actually that the SHA1 was _short_.

            Linus