On Wed, Mar 22, 2023 at 5:14 AM Lundkvist Per <per.lundkvist@xxxxxxxxxxxxx> wrote: > > Elijah Newren <newren@xxxxxxxxx> wrote: > > ... > > What does "validate" mean in this context? Junio asked a bit about this. > > > > An alternative he didn't bring up is perhaps you are trying to protect > > against supply chain attacks, with a scenario such as someone gains > > remote access to some computers on your system, but not to any gpg > > keys (because developers are using gpgs on yubikey or something like > > that). In such a case, you might ask the question of whether you can > > determine if such an attacker has inserted additional commits to your > > history. You can use gpg-signed commits by known gpg-keys to rule out > > most commits as being from a potential attacker, but odds are your git > > hosting service does merges but doesn't sign them. You want a way to > > differentiate between merges it makes and ones an attacker might sneak > > in. And you are trying to equate "unsigned empty merge commit" with > > "was a clean merge anyway, and we're not worried about any permutation > > of clean merges of signed code". > > > > Or maybe you mean something completely different by "validate". > > No, you describe it perfectly :) :-) > > Junio brushed over whether you could assume "unsigned empty merge > > commit" is equivalent to "was a clean merge", so I'll focus on that. > > You often can make such an assumption, but it's not valid in general. > > [...] > > > > First off, I would use --remerge-diff over --cc. --remerge-diff was > > designed to show whether the user made changes relative to what a > > current automatic remerge of the parents would give (so it shows > > whether and how they resolved conflicts and if and what other changes > > they added in). --cc comes close when you only want to know if the > > merge was clean, but still isn't quite the same. One example I can > > think of is that if there is a modify/delete conflict that the user > > resolves in favor of keeping the file, then the merge obviously wasn't > > clean. But $(git show --pretty=format: --cc $i) will come back empty > > (leading you to think the merge was clean) while $(git show > > --pretty=format: --remerge-diff $i) will show there was a > > modify/delete conflict. > > OK, I was not aware of --remerge-diff (we're currently on git 2.35), but from > your description it seems to be much preferable. --cc makes it even easier to > sneak in old commits cleanly than expected. > > > Second, even with remerge-diff, it is only checking whether a merge with > > today's merge algorithm and config settings would be clean. Changing > > diff.algorithm could in rare cases affect whether commits can be merged > > cleanly -- and the default in the past was "myers" whereas nowadays for merge > > specifically it is "histogram". Also, there have been changes to both of > > these algorithms in the past (one notable example being the introduction of > > diff.indentHeuristic and later turning it on by default), and there may be > > more such changes in the future. Similarly, merge.directoryRenames was > > essentially "false" before it was introduced, then was "true" for a while, > > then defaulted to "conflict". Someone could have made a merge with an old > > version of git that used either "false" or "true" for merge.directoryRenames > > and have it be clean, but when you go to remerge the same commits today you > > get a conflict due to that variable defaulting to "conflict". Further, the > > person running this script may have various `diff` or `merge` config settings > > globally that differ from those used by whoever or whatever did the past > > merges. And, you have to account for the fact that the merge might have been > > made with something other than git. GitHub and GitLab used to use libgit2, > > which had an entirely different merge algorithm. Gerrit uses jgit, which has > > its own merge algorithm. Realistically, there isn't that much difference > > between all these algorithms. For nearly all merges, any of these merge > > algorithms with any of these options will give the exact same answer...but > > "nearly all" != "all". So, there is no guarantee. And we can't and won't > > even guarantee it going forward even if you stick with Git and somehow ensure > > using the same config settings either, because we may make further changes to > > diff and merge algorithms in the future, which may affect the "clean > > remergeability" (or whatever you call it) of merges you make today. > > Good points. > > I think we could live with false conflicts. We can acknowledge false negatives > with a signed tag for only this purpose, right when they occur. We could also > try to control the local git version and its settings. False positives however > are worse of course. So long as you use --remerge-diff rather than --cc, you should be able to rely on "empty output means the merge would be clean -- with the current merge algorithm and config options -- and match the results recorded in the merge commit". I designed it to behave exactly that way, so if anyone ever discovers a case where --remerge-diff gives empty output despite that not being true, then that would be a bug we would need to investigate and fix. And even if someone did make a merge commit using some other algorithm where it conflicted, and they tweaked it to fix the conflicts, show --remerge-diff would only come back empty if the tweaks made by that individual with the other merge algorithm happen to match what you'd get with a clean merge with Git's current merge algorithm and config options. So, you'd be pretty safe from false positives. You might have to wade through a some false negatives when merge algorithms or config options change, but as long as you're prepared for that, it shouldn't be a big deal. So, if you did that, the major vectors left for an attacker to fool the "was this code from one of our employees, or from a merge from our CI system" that I can think of would be: * trawl through your code review system to look for historical signed commits that passed CI with review comments like "We shouldn't do this; it'd open a security hole.", because they could then merge those bad changes. * find systems that sign commits automatically (in a big enough organization there always seem to be exceptions to human-signed stuff. Maybe a system that autofixes obvious code issues and pushes up an amended-and-signed commit, or something similar, and perhaps the autofixing doesn't check the input was signed, or the nature of the autofixing can be controlled somehow.) * figure out how to access the system that records which gpg keys are associated with employees, and add extra attacker-controlled gpg-keys to existing employees. > > All that said, the odds that > > you discover a particular merge in the wild where any of this matters > > to your usecase is probably small. So you might get away with > > it...but if you try it, you should probably add some good comments to > > the code for whoever comes along and investigates a "bug" in the > > future, to let them know that false negatives and false positives are > > possible with your checks, but that you just assumed they would be > > rare enough that you decided to ignore them. > > > > Finally, just a couple preference questions: I'd have used [ -z > > "$diff" ] rather than [ ! "$diff" ], and a simple "exit 1" rather than > > "rc=1" to exit early, but maybe you really want all the output. But > > if you do want the output, would it make more sense to have the -z > > "$diff" before the git verify-commit "$i"? Also, I'd be tempted to > > use "git log --format='%h %G? %s' $RANGE" rather than call git > > verify-commit N times, and then post-process all the "N" cases, but > > what you have works too. > > Yeah I wanted to show an as simple example as possible. The real version use > explicit fifos, parallel xargs and shows an error for unsigned commits while > minimising the amount of calls to git verify commit, for performance reasons. It > just wouldn't have been a suitable example ;) I appreciate the feedback though, > thank you! > > So for a smart enough attacker and a large enough commit history, depending on > the nature of commits and certain time windows, they may be able to introduce > previously signed commits that both go undetected and cause damage. This > validation strategy is also brittle, since it depends on the internals of the > git implementation and the chosen diff algorithm. > > The proper way, and the only way to fully validate the repository would still be > to require all merge requests to be signed. > > We'll see, maybe this can serve as an interim, imperfect solution until we have > changed our developer workflow. It looks like introducing signed commits and a > validation strategy (although somewhat brittle) to our current developer > workflow, would still be an improvement since it is both hopefully a bit tricky > to reintroduce old signed commits and we currently have no way at all to really > authenticate the commits. > > Thank you (and Junio) very much for the thorough responses, highly appreciated. Glad it helped.