On Thu, Dec 26, 2024 at 03:33:36PM -0800, Linus Torvalds wrote:
On Thu, 26 Dec 2024 at 14:33, Sasha Levin <sashal@xxxxxxxxxx> wrote:
Which means that folks should be able to use a fairly short abbreviated
commit IDs in messages, specially for commits with a long subject line.
So I don't think we should take this as a way to make *shorter* IDs,
just as a way to accept historical short IDs.
Also, I absolutely detest how you made this be all about "short IDs".
As mentioned in my very original email on this matter, the actual REAL
AND PRESENT issue isn't ambiguous IDs. We don't really have them.
What got me worried originally is the example Kees provided which
creates a collision of a 12-character abbreviated commit ID:
$ git log 1da177e4c3f4
error: short object ID 1da177e4c3 is ambiguous
hint: The candidates are:
hint: 1da177e4c3f41 commit 2005-04-16 - Linux-2.6.12-rc2
hint: 1da177e4c3f47 commit 2024-12-14 - docs: git SHA prefixes are for humans
When I tested it locally, my scripts were broken, our CVE scripts were
broken, and it is quite the PITA to address after the fact (think of all
the "Fixes: 1da177e4c3f4 [...]" lines we have in the log).
So sure, we don't have a collision right now, but going from 0 to 1 is
going to be painful.
Are we going to be actively watching for someone trying to sneak in a
commit like that, or should we just handle that case properly?
What we *do* have is "wrong IDs". We have a ton of those.
Look here, you can get a list of suspiciously wrong SHA1s with
something like this:
git log |
egrep '[0-9a-f]{9,40} \(".*"\)' |
sed 's/.*[^0-9a-f]\([0-9a-f]\{9,40\}\)[^0-9a-f].*/\1/' |
sort -u > hexes
which generates a list of things that look like commit IDs (ok,
there's probably smarter ways) in our git logs.
Now, *some* of those won't actually be commit IDs at all, they'll just
be random hex numbers the above finds.
But most of them will indeed be references to other commits.
Then you try to find the bogus ones by doing something like
cat hexes |
while read i; do git show $i >& /dev/null || echo "No $i SHA"; done
and you will get a lot ot hits.
A *LOT*.
I ended up with this fun thing:
git log --pretty=format:'%B' | grep -E '^Fixes:[[:space:]][0-9a-fA-F]{8,40}' | while read -r line; do
[[ $line =~ ^Fixes:[[:space:]]([0-9a-fA-F]{8,40})(.*) ]] && \
! git rev-parse --quiet --verify "${BASH_REMATCH[1]}^{commit}" >/dev/null 2>&1 && \
{ r=$(./scripts/git-disambiguate.sh --force "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"); \
[[ -n $r ]] && echo "Bad: $line" && echo "Good: Fixes: $(git ol "$r")"; }
Which looks for these wrong IDs in "Fixes:" tags and tries to run them
through git-disambiguate.sh:
Bad: Fixes: d71e629bed5b ("ARC: build: disallow invalid PAE40 + 4K page config")
Good: Fixes: 8871331b1769 ("ARC: build: disallow invalid PAE40 + 4K page config")
Bad: Fixes: 7e654ab7da03 ("cifs: during remount, make sure passwords are in sync")
Good: Fixes: 0f0e35790295 ("cifs: during remount, make sure passwords are in sync")
Bad: Fixes: ee650b3820f3 ("apparmor: properly handle cx/px lookup failure for complain")
Good: Fixes: db93ca15e5ae ("apparmor: properly handle cx/px lookup failure for complain")
[...]
Look, I didn't check very many of them. Mainly because it gets *so*
many hits, and I get bored very easily.
But I did check a handful, just to get a feel for things.
And yes, some of them were random hex numbers unrelated to actual git
IDs, but most were really supposed to be git IDs. Except they weren't
- or at least not from the mainline tree.
For example, look at commit daa07cbc9ae3 ("KVM: x86: fix L1TF's MMIO
GFN calculation") which references one of those nonexistent commit
IDs:
Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits
in non-present/reserved SPTEs")
and I have no idea where that bogus commit ID comes from. Maybe it's a
rebase. Maybe it's from a stable tree. But it sure doesn't exist in
mainline.
This one is indeed in the 4.18 stable tree.
What *does* exist is commit 28a1f3ac1d0c ("kvm: x86: Set highest
physical address bits in non-present/reserved SPTEs"), which I found
by just doing that
git log --grep='kvm: x86: Set highest physical address bits in
non-present/reserved SPTEs'
and my point is that this is really not about "disambiguating short
SHA1 IDs". Because those "ambiguous" SHA1's to a very close
approximation simply DO NOT EXIST.
But the completely wrong ones? They are plentiful.
Is the concern mostly around the term "disambiguate"? Would
git-sanitize.sh work better?
--
Thanks,
Sasha