Re: [EXTERNAL] Re: Soundness of signature verification excluding unsigned empty merges

Elijah Newren <newren@xxxxxxxxx> · Wed, 22 Mar 2023 18:49:00 -0700

On Wed, Mar 22, 2023 at 5:14 AM Lundkvist Per
<per.lundkvist@xxxxxxxxxxxxx> wrote:
>
> Elijah Newren <newren@xxxxxxxxx> wrote:
>
> ...
> > What does "validate" mean in this context?  Junio asked a bit about this.
> >
> > An alternative he didn't bring up is perhaps you are trying to protect
> > against supply chain attacks, with a scenario such as someone gains
> > remote access to some computers on your system, but not to any gpg
> > keys (because developers are using gpgs on yubikey or something like
> > that).  In such a case, you might ask the question of whether you can
> > determine if such an attacker has inserted additional commits to your
> > history.  You can use gpg-signed commits by known gpg-keys to rule out
> > most commits as being from a potential attacker, but odds are your git
> > hosting service does merges but doesn't sign them.  You want a way to
> > differentiate between merges it makes and ones an attacker might sneak
> > in.  And you are trying to equate "unsigned empty merge commit" with
> > "was a clean merge anyway, and we're not worried about any permutation
> > of clean merges of signed code".
> >
> > Or maybe you mean something completely different by "validate".
>
> No, you describe it perfectly :)

:-)

> > Junio brushed over whether you could assume "unsigned empty merge
> > commit" is equivalent to "was a clean merge", so I'll focus on that.
> > You often can make such an assumption, but it's not valid in general.
> >
[...]
> >
> > First off, I would use --remerge-diff over --cc.  --remerge-diff was
> > designed to show whether the user made changes relative to what a
> > current automatic remerge of the parents would give (so it shows
> > whether and how they resolved conflicts and if and what other changes
> > they added in).  --cc comes close when you only want to know if the
> > merge was clean, but still isn't quite the same.  One example I can
> > think of is that if there is a modify/delete conflict that the user
> > resolves in favor of keeping the file, then the merge obviously wasn't
> > clean.  But $(git show --pretty=format: --cc $i) will come back empty
> > (leading you to think the merge was clean) while $(git show
> > --pretty=format: --remerge-diff $i) will show there was a
> > modify/delete conflict.
>
> OK, I was not aware of --remerge-diff (we're currently on git 2.35), but from
> your description it seems to be much preferable. --cc makes it even easier to
> sneak in old commits cleanly than expected.
>
> > Second, even with remerge-diff, it is only checking whether a merge with
> > today's merge algorithm and config settings would be clean.  Changing
> > diff.algorithm could in rare cases affect whether commits can be merged
> > cleanly -- and the default in the past was "myers" whereas nowadays for merge
> > specifically it is "histogram".  Also, there have been changes to both of
> > these algorithms in the past (one notable example being the introduction of
> > diff.indentHeuristic and later turning it on by default), and there may be
> > more such changes in the future.  Similarly, merge.directoryRenames was
> > essentially "false" before it was introduced, then was "true" for a while,
> > then defaulted to "conflict".  Someone could have made a merge with an old
> > version of git that used either "false" or "true" for merge.directoryRenames
> > and have it be clean, but when you go to remerge the same commits today you
> > get a conflict due to that variable defaulting to "conflict".  Further, the
> > person running this script may have various `diff` or `merge` config settings
> > globally that differ from those used by whoever or whatever did the past
> > merges.  And, you have to account for the fact that the merge might have been
> > made with something other than git.  GitHub and GitLab used to use libgit2,
> > which had an entirely different merge algorithm.  Gerrit uses jgit, which has
> > its own merge algorithm.  Realistically, there isn't that much difference
> > between all these algorithms.  For nearly all merges, any of these merge
> > algorithms with any of these options will give the exact same answer...but
> > "nearly all" != "all".  So, there is no guarantee.  And we can't and won't
> > even guarantee it going forward even if you stick with Git and somehow ensure
> > using the same config settings either, because we may make further changes to
> > diff and merge algorithms in the future, which may affect the "clean
> > remergeability" (or whatever you call it) of merges you make today.
>
> Good points.
>
> I think we could live with false conflicts. We can acknowledge false negatives
> with a signed tag for only this purpose, right when they occur. We could also
> try to control the local git version and its settings. False positives however
> are worse of course.

So long as you use --remerge-diff rather than --cc, you should be able
to rely on "empty output means the merge would be clean -- with the
current merge algorithm and config options -- and match the results
recorded in the merge commit".  I designed it to behave exactly that
way, so if anyone ever discovers a case where --remerge-diff gives
empty output despite that not being true, then that would be a bug we
would need to investigate and fix.  And even if someone did make a
merge commit using some other algorithm where it conflicted, and they
tweaked it to fix the conflicts, show --remerge-diff would only come
back empty if the tweaks made by that individual with the other merge
algorithm happen to match what you'd get with a clean merge with Git's
current merge algorithm and config options.  So, you'd be pretty safe
from false positives.  You might have to wade through a some false
negatives when merge algorithms or config options change, but as long
as you're prepared for that, it shouldn't be a big deal.

So, if you did that, the major vectors left for an attacker to fool
the "was this code from one of our employees, or from a merge from our
CI system" that I can think of would be:
  * trawl through your code review system to look for historical
signed commits that passed CI with review comments like "We shouldn't
do this; it'd open a security hole.", because they could then merge
those bad changes.
  * find systems that sign commits automatically (in a big enough
organization there always seem to be exceptions to human-signed stuff.
Maybe a system that autofixes obvious code issues and pushes up an
amended-and-signed commit, or something similar, and perhaps the
autofixing doesn't check the input was signed, or the nature of the
autofixing can be controlled somehow.)
  * figure out how to access the system that records which gpg keys
are associated with employees, and add extra attacker-controlled
gpg-keys to existing employees.

> > All that said, the odds that
> > you discover a particular merge in the wild where any of this matters
> > to your usecase is probably small.  So you might get away with
> > it...but if you try it, you should probably add some good comments to
> > the code for whoever comes along and investigates a "bug" in the
> > future, to let them know that false negatives and false positives are
> > possible with your checks, but that you just assumed they would be
> > rare enough that you decided to ignore them.
> >
> > Finally, just a couple preference questions: I'd have used [ -z
> > "$diff" ] rather than [ ! "$diff" ], and a simple "exit 1" rather than
> > "rc=1" to exit early, but maybe you really want all the output.  But
> > if you do want the output, would it make more sense to have the  -z
> > "$diff" before the git verify-commit "$i"?  Also, I'd be tempted to
> > use "git log --format='%h %G? %s' $RANGE" rather than call git
> > verify-commit N times, and then post-process all the "N" cases, but
> > what you have works too.
>
> Yeah I wanted to show an as simple example as possible. The real version use
> explicit fifos, parallel xargs and shows an error for unsigned commits while
> minimising the amount of calls to git verify commit, for performance reasons. It
> just wouldn't have been a suitable example ;) I appreciate the feedback though,
> thank you!
>
> So for a smart enough attacker and a large enough commit history, depending on
> the nature of commits and certain time windows, they may be able to introduce
> previously signed commits that both go undetected and cause damage. This
> validation strategy is also brittle, since it depends on the internals of the
> git implementation and the chosen diff algorithm.
>
> The proper way, and the only way to fully validate the repository would still be
> to require all merge requests to be signed.
>
> We'll see, maybe this can serve as an interim, imperfect solution until we have
> changed our developer workflow. It looks like introducing signed commits and a
> validation strategy (although somewhat brittle) to our current developer
> workflow, would still be an improvement since it is both hopefully a bit tricky
> to reintroduce old signed commits and we currently have no way at all to really
> authenticate the commits.
>
> Thank you (and Junio) very much for the thorough responses, highly appreciated.

Glad it helped.