On Thu, 26 Dec 2024 at 16:55, Sasha Levin <sashal@xxxxxxxxxx> wrote: > > What got me worried originally is the example Kees provided which > creates a collision of a 12-character abbreviated commit ID: Note that you can always create short ambiguous cases. And in fact, since the way you create them is to just put noise in the right place and brute-force things, you can also always make sure that the one-liner short commit name will match the target commit too. In other words: just accept the fact that a short hash is simply not a secure hash. That's very very fundamental. It's just inherent in the whole "we have shortened things to be legible". And once you just accept that it's fundamental to any short hash, you can relax and just live with it as just a fact of life. > So sure, we don't have a collision right now, but going from 0 to 1 is > going to be painful. I disagree. You need to just own up to how it is, and more importantly that you WILL NEVER EVER BE ABLE TO FIX IT. There is no way to disambiguate the active malicious case, because the active malicious case will just be able to handle any disambiguation you throw at it too. So the fix is to not *rely* on disambiguation, and to just accept it. Don't fight reality. You'll just lose. > Are we going to be actively watching for someone trying to sneak in a > commit like that, or should we just handle that case properly? See above. There is no "properly". There is only reality. Git internally uses the full hashes. And any reasonably usefully shortened hashes *cannot* be disambiguated in the face of an active malicious person. If you have some tool that you rely on absolutely to give you the "One Correct Answer", then you are already broken. If thats' the goal, then all you can do is give up and pray to some random $deity. Or, as my argument is, DON'T DO THAT THEN. Don't rely on some absolute thing. Accept the fact that a short hash will always possibly refer to multiple things, and that you will *always* need to have a human that actually *THINKS* about it, not mindless automation. The best the tools can do is say "there are multiple options here" (or, more commonly, "I found no options at all, but maybe it meant XYZ"). Literally. To summarize: If you are aiming for anything else, you are bound to be disappointed. So aim for that simple "let me know about multiple choices", with the knowledge that they are going to be EXCEEDINGLY rare. And then if we have somebody who actively acts badly and works to make that "it's going to be EXCEEDINGLY rare" be much more common than it is supposed to be, we deal with that *person* by just not working with them any more. See? The solution is not some kind of "this cannot happen". The solution is "stop accepting shit from evil people". Which, btw, is not a new thing. This is how open source works. It's actually way more work to create collisions in short hashes, when you can much more easily just send a bad patch that wastes peoples time without any hash collision at all - just by virtue of writing bad code. So thinking that the short hash is some kind of special problem is wrong. It's in fact a particularly *easy* problem to deal with, because at least the whole "somebody did something malicious" is something git will balk at, simply because the basic git tools will refuse to touch the ambiguous hash. So *your* tooling should concentrate on one thing, and one thing only: making it easier for a *THINKING*HUMAN* to decide what a bad hash means. But you need to do it with the up-front understanding that it requires that thinking human, and is not mindless automation. And as mentioned, the most common case of bad hashes BY FAR is the "oh, that doesn't match anything I know about at all", as opposed to "oh, that matches two or more different commits". > >and my point is that this is really not about "disambiguating short > >SHA1 IDs". Because those "ambiguous" SHA1's to a very close > >approximation simply DO NOT EXIST. > > > >But the completely wrong ones? They are plentiful. > > Is the concern mostly around the term "disambiguate"? Would > git-sanitize.sh work better? No, my main worry is that you talk about short hashes (and talked about shortening our existing ones even more). When I really think that the size isn't the main problem at all. "disambiguate" is a fine word, but only if you use it in the context of "I have a bad hash". Not just a short one. Because the hash being too short is simply not the main case worth worrying about. And hey, just to really hit this same argument home: I can absolutely guarantee that you have a much more insidious problem of "wrong hash", namely the "oh, it's actually a valid commit hash, but the Fixes: line simply got the wrong commit:". Because that's a mistake I see regularly: somebody simply blames the wrong commit. I've most definitely done that very thing myself. Sometimes it's people reading the history wrong, and not realizing that the bug actually happened before the blamed change too. Or the bug ends up being a combination of factors, and people didn't realize that the commit they blamed was just one part of it. So my complaint really ends up being that anybody who trusts those hashes mindlessly is going to have mistakes, and the actual hash collision case MUST NOT be the primary worry. Because as long as that hash size is all you care about, you're barking up the wrong tree. Linus