Re: [PATCH] git-disambiguate: disambiguate shorthand git ids

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Dec 26, 2024 at 05:50:22PM -0800, Linus Torvalds wrote:
On Thu, 26 Dec 2024 at 16:55, Sasha Levin <sashal@xxxxxxxxxx> wrote:

What got me worried originally is the example Kees provided which
creates a collision of a 12-character abbreviated commit ID:

Note that you can always create short ambiguous cases.

And in fact, since the way you create them is to just put noise in the
right place and brute-force things, you can also always make sure that
the one-liner short commit name will match the target commit too.

In other words: just accept the fact that a short hash is simply not a
secure hash. That's very very fundamental. It's just inherent in the
whole "we have shortened things to be legible".

And once you just accept that it's fundamental to any short hash, you
can relax and just live with it as just a fact of life.

So sure, we don't have a collision right now, but going from 0 to 1 is
going to be painful.

I disagree. You need to just own up to how it is, and more importantly
that you WILL NEVER EVER BE ABLE TO FIX IT.

There is no way to disambiguate the active malicious case, because the
active malicious case will just be able to handle any disambiguation
you throw at it too.

So the fix is to not *rely* on disambiguation, and to just accept it.
Don't fight reality. You'll just lose.

Are we going to be actively watching for someone trying to sneak in a
commit like that, or should we just handle that case properly?

See above. There is no "properly". There is only reality.

Git internally uses the full hashes. And any reasonably usefully
shortened hashes *cannot* be disambiguated in the face of an active
malicious person.

If you have some tool that you rely on absolutely to give you the "One
Correct Answer", then you are already broken. If thats' the goal, then
all you can do is give up and pray to some random $deity.

Or, as my argument is, DON'T DO THAT THEN.

Don't rely on some absolute thing. Accept the fact that a short hash
will always possibly refer to multiple things, and that you will
*always* need to have a human that actually *THINKS* about it, not
mindless automation.

The best the tools can do is say "there are multiple options here"
(or, more commonly, "I found no options at all, but maybe it meant
XYZ").

Literally.

To summarize: If you are aiming for anything else, you are bound to be
disappointed.

The script in the original mail will handle two important cases for me:

1. The "wrong ID" case, where the the provided commit ID doesn't exist
and we need to look at the subject line to figure out what the actual ID
is.

2. The case where we either get a too-short commit ID that has a
collision, or we just got unlucky and finally ended up with a 12-char
collision, but the subject line is enough to automatically resolve to
the correct ID.

We're disagreeing over the remaining <0.0001% of cases.

So aim for that simple "let me know about multiple choices", with the
knowledge that they are going to be EXCEEDINGLY rare.

And then if we have somebody who actively acts badly and works to make
that "it's going to be EXCEEDINGLY rare" be much more common than it
is supposed to be, we deal with that *person* by just not working with
them any more.

See? The solution is not some kind of "this cannot happen". The
solution is "stop accepting shit from evil people".

Not necessarily "evil", but consider something similar to the UMN saga
where we get "researchers" trying to sneak in commits that create a
collision. Once these make it in, it will be difficult walking them back
out.

Which, btw, is not a new thing. This is how open source works. It's
actually way more work to create collisions in short hashes, when you
can much more easily just send a bad patch that wastes peoples time
without any hash collision at all - just by virtue of writing bad
code.

No, but it makes for a "good" undergrad research paper...

So thinking that the short hash is some kind of special problem is wrong.

It's in fact a particularly *easy* problem to deal with, because at
least the whole "somebody did something malicious" is something git
will balk at, simply because the basic git tools will refuse to touch
the ambiguous hash.

So *your* tooling should concentrate on one thing, and one thing only:
making it easier for a *THINKING*HUMAN* to decide what a bad hash
means.

But you need to do it with the up-front understanding that it requires
that thinking human, and is not mindless automation.

And as mentioned, the most common case of bad hashes BY FAR is the
"oh, that doesn't match anything I know about at all", as opposed to
"oh, that matches two or more different commits".

>and my point is that this is really not about "disambiguating short
>SHA1 IDs". Because those "ambiguous" SHA1's to a very close
>approximation simply DO NOT EXIST.
>
>But the completely wrong ones? They are plentiful.

Is the concern mostly around the term "disambiguate"? Would
git-sanitize.sh work better?

No, my main worry is that you talk about short hashes (and talked
about shortening our existing ones even more).

When I really think that the size isn't the main problem at all.

No, it's not a problem. In my mind, I figured we could use shorter
hashes in mail message to make it easier to communicate.

It doesn't have to be formal, but for example if we exchange mails about
an issue, and I end up referring to '1d1a ("arm64/sme: ...")' it both
makes the mail more readable, but still someone who doesn't have context
can still easily get to the commit I was referring to.

To me, it's a nice (minor) improvement.

"disambiguate" is a fine word, but only if you use it in the context
of "I have a bad hash". Not just a short one. Because the hash being
too short is simply not the main case worth worrying about.

And hey, just to really hit this same argument home: I can absolutely
guarantee that you have a much more insidious problem of "wrong hash",
namely the "oh, it's actually a valid commit hash, but the Fixes: line
simply got the wrong commit:".

Because that's a mistake I see regularly: somebody simply blames the
wrong commit. I've most definitely done that very thing myself.
Sometimes it's people reading the history wrong, and not realizing
that the bug actually happened before the blamed change too. Or the
bug ends up being a combination of factors, and people didn't realize
that the commit they blamed was just one part of it.

So my complaint really ends up being that anybody who trusts those
hashes mindlessly is going to have mistakes, and the actual hash
collision case MUST NOT be the primary worry.

Because as long as that hash size is all you care about, you're
barking up the wrong tree.

              Linus

--
Thanks,
Sasha




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux