Re: Soundness of signature verification excluding unsigned empty merges

Elijah Newren <newren@xxxxxxxxx> · Tue, 21 Mar 2023 18:41:00 -0700

Hi,

On Tue, Mar 21, 2023 at 4:21 AM Lundkvist Per
<per.lundkvist@xxxxxxxxxxxxx> wrote:
>
> Hi,
>
> We are investigating adding commit and tag signatures into our existing
> repositories. We currently use the common workflow of developers merging commits
> to master using an internal git hosting service after having passed code
> review. Non-local merges like this would then be unsigned.
>
> But it seems like if we allow unsigned empty merge commits, i.e. those that
> themselves do not introduce any any other change than what its parents
> introduce, and require all other commits to be properly validated, then we can
> safely validate the whole repository?

What does "validate" mean in this context?  Junio asked a bit about this.

An alternative he didn't bring up is perhaps you are trying to protect
against supply chain attacks, with a scenario such as someone gains
remote access to some computers on your system, but not to any gpg
keys (because developers are using gpgs on yubikey or something like
that).  In such a case, you might ask the question of whether you can
determine if such an attacker has inserted additional commits to your
history.  You can use gpg-signed commits by known gpg-keys to rule out
most commits as being from a potential attacker, but odds are your git
hosting service does merges but doesn't sign them.  You want a way to
differentiate between merges it makes and ones an attacker might sneak
in.  And you are trying to equate "unsigned empty merge commit" with
"was a clean merge anyway, and we're not worried about any permutation
of clean merges of signed code".

Or maybe you mean something completely different by "validate".

Junio brushed over whether you could assume "unsigned empty merge
commit" is equivalent to "was a clean merge", so I'll focus on that.
You often can make such an assumption, but it's not valid in general.

> A simple naive example of this would look something like this:
>
>     rc=0
>     tags=$(git for-each-ref --format '%(objectname)' refs/tags)
>     tags_verified=$(for i in $tags; do git verify-tag --format='%(objectname)' "$i"; done)
>
>     for i in $(git rev-list HEAD --no-merges --not $tags_verified); do
>         git verify-commit "$i" || rc=1
>     done
>
>     for i in $(git rev-list HEAD --merges --not $tags_verified); do
>         diff=$(git show --text --pretty=format: --diff-merges=cc "$i")
>         git verify-commit "$i" || [ ! "$diff" ] || rc=1
>     done
>
>     exit $rc
>
> Or is this a faulty strategy?

It's faulty, but it'd only rarely trip you up.

First off, I would use --remerge-diff over --cc.  --remerge-diff was
designed to show whether the user made changes relative to what a
current automatic remerge of the parents would give (so it shows
whether and how they resolved conflicts and if and what other changes
they added in).  --cc comes close when you only want to know if the
merge was clean, but still isn't quite the same.  One example I can
think of is that if there is a modify/delete conflict that the user
resolves in favor of keeping the file, then the merge obviously wasn't
clean.  But $(git show --pretty=format: --cc $i) will come back empty
(leading you to think the merge was clean) while $(git show
--pretty=format: --remerge-diff $i) will show there was a
modify/delete conflict.

Second, even with remerge-diff, it is only checking whether a merge
with today's merge algorithm and config settings would be clean.
Changing diff.algorithm could in rare cases affect whether commits can
be merged cleanly -- and the default in the past was "myers" whereas
nowadays for merge specifically it is "histogram".  Also, there have
been changes to both of these algorithms in the past (one notable
example being the introduction of diff.indentHeuristic and later
turning it on by default), and there may be more such changes in the
future.  Similarly, merge.directoryRenames was essentially "false"
before it was introduced, then was "true" for a while, then defaulted
to "conflict".  Someone could have made a merge with an old version of
git that used either "false" or "true" for merge.directoryRenames and
have it be clean, but when you go to remerge the same commits today
you get a conflict due to that variable defaulting to "conflict".
Further, the person running this script may have various `diff` or
`merge` config settings globally that differ from those used by
whoever or whatever did the past merges.  And, you have to account for
the fact that the merge might have been made with something other than
git.  GitHub and GitLab used to use libgit2, which had an entirely
different merge algorithm.  Gerrit uses jgit, which has its own merge
algorithm.  Realistically, there isn't that much difference between
all these algorithms.  For nearly all merges, any of these merge
algorithms with any of these options will give the exact same
answer...but "nearly all" != "all".  So, there is no guarantee.  And
we can't and won't even guarantee it going forward even if you stick
with Git and somehow ensure using the same config settings either,
because we may make further changes to diff and merge algorithms in
the future, which may affect the "clean remergeability" (or whatever
you call it) of merges you make today.  All that said, the odds that
you discover a particular merge in the wild where any of this matters
to your usecase is probably small.  So you might get away with
it...but if you try it, you should probably add some good comments to
the code for whoever comes along and investigates a "bug" in the
future, to let them know that false negatives and false positives are
possible with your checks, but that you just assumed they would be
rare enough that you decided to ignore them.

Finally, just a couple preference questions: I'd have used [ -z
"$diff" ] rather than [ ! "$diff" ], and a simple "exit 1" rather than
"rc=1" to exit early, but maybe you really want all the output.  But
if you do want the output, would it make more sense to have the [[ -z
"$diff"]] before the git verify-commit "$i"?  Also, I'd be tempted to
use "git log --format='%h %G? %s' $RANGE" rather than call git
verify-commit N times, and then post-process all the "N" cases, but
what you have works too.