Elijah Newren <newren@xxxxxxxxx> wrote: ... > What does "validate" mean in this context? Junio asked a bit about this. > > An alternative he didn't bring up is perhaps you are trying to protect > against supply chain attacks, with a scenario such as someone gains > remote access to some computers on your system, but not to any gpg > keys (because developers are using gpgs on yubikey or something like > that). In such a case, you might ask the question of whether you can > determine if such an attacker has inserted additional commits to your > history. You can use gpg-signed commits by known gpg-keys to rule out > most commits as being from a potential attacker, but odds are your git > hosting service does merges but doesn't sign them. You want a way to > differentiate between merges it makes and ones an attacker might sneak > in. And you are trying to equate "unsigned empty merge commit" with > "was a clean merge anyway, and we're not worried about any permutation > of clean merges of signed code". > > Or maybe you mean something completely different by "validate". No, you describe it perfectly :) > > Junio brushed over whether you could assume "unsigned empty merge > commit" is equivalent to "was a clean merge", so I'll focus on that. > You often can make such an assumption, but it's not valid in general. > > > A simple naive example of this would look something like this: ... > > Or is this a faulty strategy? > > It's faulty, but it'd only rarely trip you up. > > First off, I would use --remerge-diff over --cc. --remerge-diff was > designed to show whether the user made changes relative to what a > current automatic remerge of the parents would give (so it shows > whether and how they resolved conflicts and if and what other changes > they added in). --cc comes close when you only want to know if the > merge was clean, but still isn't quite the same. One example I can > think of is that if there is a modify/delete conflict that the user > resolves in favor of keeping the file, then the merge obviously wasn't > clean. But $(git show --pretty=format: --cc $i) will come back empty > (leading you to think the merge was clean) while $(git show > --pretty=format: --remerge-diff $i) will show there was a > modify/delete conflict. OK, I was not aware of --remerge-diff (we're currently on git 2.35), but from your description it seems to be much preferable. --cc makes it even easier to sneak in old commits cleanly than expected. > Second, even with remerge-diff, it is only checking whether a merge with > today's merge algorithm and config settings would be clean. Changing > diff.algorithm could in rare cases affect whether commits can be merged > cleanly -- and the default in the past was "myers" whereas nowadays for merge > specifically it is "histogram". Also, there have been changes to both of > these algorithms in the past (one notable example being the introduction of > diff.indentHeuristic and later turning it on by default), and there may be > more such changes in the future. Similarly, merge.directoryRenames was > essentially "false" before it was introduced, then was "true" for a while, > then defaulted to "conflict". Someone could have made a merge with an old > version of git that used either "false" or "true" for merge.directoryRenames > and have it be clean, but when you go to remerge the same commits today you > get a conflict due to that variable defaulting to "conflict". Further, the > person running this script may have various `diff` or `merge` config settings > globally that differ from those used by whoever or whatever did the past > merges. And, you have to account for the fact that the merge might have been > made with something other than git. GitHub and GitLab used to use libgit2, > which had an entirely different merge algorithm. Gerrit uses jgit, which has > its own merge algorithm. Realistically, there isn't that much difference > between all these algorithms. For nearly all merges, any of these merge > algorithms with any of these options will give the exact same answer...but > "nearly all" != "all". So, there is no guarantee. And we can't and won't > even guarantee it going forward even if you stick with Git and somehow ensure > using the same config settings either, because we may make further changes to > diff and merge algorithms in the future, which may affect the "clean > remergeability" (or whatever you call it) of merges you make today. Good points. I think we could live with false conflicts. We can acknowledge false negatives with a signed tag for only this purpose, right when they occur. We could also try to control the local git version and its settings. False positives however are worse of course. > All that said, the odds that > you discover a particular merge in the wild where any of this matters > to your usecase is probably small. So you might get away with > it...but if you try it, you should probably add some good comments to > the code for whoever comes along and investigates a "bug" in the > future, to let them know that false negatives and false positives are > possible with your checks, but that you just assumed they would be > rare enough that you decided to ignore them. > > Finally, just a couple preference questions: I'd have used [ -z > "$diff" ] rather than [ ! "$diff" ], and a simple "exit 1" rather than > "rc=1" to exit early, but maybe you really want all the output. But > if you do want the output, would it make more sense to have the -z > "$diff" before the git verify-commit "$i"? Also, I'd be tempted to > use "git log --format='%h %G? %s' $RANGE" rather than call git > verify-commit N times, and then post-process all the "N" cases, but > what you have works too. Yeah I wanted to show an as simple example as possible. The real version use explicit fifos, parallel xargs and shows an error for unsigned commits while minimising the amount of calls to git verify commit, for performance reasons. It just wouldn't have been a suitable example ;) I appreciate the feedback though, thank you! So for a smart enough attacker and a large enough commit history, depending on the nature of commits and certain time windows, they may be able to introduce previously signed commits that both go undetected and cause damage. This validation strategy is also brittle, since it depends on the internals of the git implementation and the chosen diff algorithm. The proper way, and the only way to fully validate the repository would still be to require all merge requests to be signed. We'll see, maybe this can serve as an interim, imperfect solution until we have changed our developer workflow. It looks like introducing signed commits and a validation strategy (although somewhat brittle) to our current developer workflow, would still be an improvement since it is both hopefully a bit tricky to reintroduce old signed commits and we currently have no way at all to really authenticate the commits. Thank you (and Junio) very much for the thorough responses, highly appreciated.