Re: Determining if a merge was produced automatically

Elijah Newren <newren@xxxxxxxxx> · Mon, 1 Jul 2024 08:11:03 -0700

Hi,

On Sun, Jun 30, 2024 at 5:45 PM Martin von Zweigbergk
<martinvonz@xxxxxxxxx> wrote:
>
> Forwarding to the list without HTML so others can correct me if I was wrong.
>
> On Sun, Jun 30, 2024 at 3:32 PM Martin von Zweigbergk
> <martinvonz@xxxxxxxxx> wrote:
> >
> >
> >
> > On Sun, Jun 30, 2024, 11:06 Pavel Rappo <pavel.rappo@xxxxxxxxx> wrote:
> >>
> >> Hello,
> >>
> >> I'm looking for a robust way to determine if a given merge commit
> >> could've been produced automatically by `git merge`, without any
> >> manual intervention or tampering, such as:
> >>
> >>   - resolving conflicts,
> >>   - stopping (`--no-commit`) and modifying,
> >>   - amending the commit.
> >>
> >> My initial idea was to re-enact the merge. If the merge failed, I
> >> would conclude that the original merge couldn't have been produced
> >> automatically. If the merge succeeded, I would compare it with the
> >> original merge. Any differences would indicate that the original merge
> >> couldn't have been produced automatically. Otherwise, I would conclude
> >> that it could've been. This approach is simple, but involves multiple
> >> steps and requires clean-up.

Further, your strategy has some blind spots.  What if the original
person creating the merge used special flags, such as changing the
rename threshold, ignoring space changes, or a different underlying
diff algorithm?  It may be that the merge was clean for the original
merger, but if you don't use the same options it doesn't look clean to
you -- or vice versa.  (In short, this method has both false positives
and false negatives.)

(The odds that someone used specialized options and then had the merge
succeed, when it wouldn't have otherwise, are pretty low.  So your
strategy is probably good enough, but it's good to be aware of other
possibilities.)
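
To make the blind spot concrete, here is a rough Python sketch of the
re-enactment with the merge options passed explicitly.  It assumes a Git
new enough to have `git merge-tree --write-tree` and, for the options,
its `-X<option>` support; the function name, the `strategy_options`
parameter, and the injectable `run` argument are illustrative only, and
the option values are guesses at what the original merger might have used.

```python
import subprocess


def could_be_automatic(merge, strategy_options=(), repo=".", run=subprocess.run):
    """Re-enact a two-parent merge in memory and compare the resulting tree.

    `strategy_options` are passed to git as -X<option>; they are guesses at
    what the original merger used, e.g. "find-renames=30".
    """

    def git(*args, ok=(0,)):
        res = run(["git", "-C", repo, *args], capture_output=True, text=True)
        if res.returncode not in ok:
            raise RuntimeError(f"git {args[0]} failed: {res.stderr.strip()}")
        return res

    # Tree recorded in the merge commit being checked.
    recorded = git("rev-parse", f"{merge}^{{tree}}").stdout.strip()

    x_flags = [f"-X{opt}" for opt in strategy_options]
    # Exit status 0 means a clean merge, 1 means conflicts.
    res = git("merge-tree", "--write-tree", *x_flags,
              f"{merge}^1", f"{merge}^2", ok=(0, 1))
    if res.returncode == 1:
        return False

    # For a clean merge the first line of output is the resulting tree's OID.
    lines = res.stdout.splitlines()
    return bool(lines) and lines[0].strip() == recorded
```

Calling `could_be_automatic("HEAD")` and
`could_be_automatic("HEAD", ["ignore-space-change"])` on the same merge
may give different answers, which is exactly the false positive / false
negative risk described above.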

> >> My second idea was to use `git show --diff-merges=dense-combined`,
> >> which only prints hunks that come from neither parent. If nothing is
> >> printed, I would conclude that the merge could've been produced
> >> automatically. This approach is simple, single-step, but seems to have
> >> an issue. In my experiments, I found that if some hunks from different
> >> parents were located closely enough, output was produced. So, checking
> >> if nothing is output could lead to false negatives: a merge that
> >> could've been produced automatically might look like it was tampered
> >> with.
> >>
> >> My third idea was to use a recently added feature, `git show
> >> --remerge-diff`, which seemingly embodies my first idea and is immune
> >> to the issue of the second. It is also single-step and requires no
> >> clean-up:
> >>
> >> > Remerge two-parent merge commits to create a temporary tree object—potentially containing files with conflict markers and such. A diff is then shown between that temporary tree and the actual merge commit.

Yes, this is exactly your strategy, except that instead of checking
whether it "completes" you are checking whether the output is empty,
and it avoids messing up the working tree and index and thus also
avoids the need for clean-up.
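
A minimal sketch of that emptiness check in Python, assuming the commit
is a two-parent merge and that Git is new enough to know
`--remerge-diff`.  The helper name and the injectable `run` argument are
made up for illustration; `--format=` is used only to suppress the
commit header so that any remaining output is the diff itself.

```python
import subprocess


def remerge_diff_is_empty(commit, repo=".", run=subprocess.run):
    """Return True if `git show --remerge-diff` prints nothing for `commit`.

    Assumes `commit` is a two-parent merge; for other commits the option
    does not apply and the result says nothing about merges.
    """
    result = run(
        ["git", "-C", repo, "show", "--remerge-diff", "--format=", commit],
        capture_output=True,
        text=True,
        check=True,  # raise if git itself fails (e.g. unknown commit)
    )
    # Only emptiness is relied on here, never the format of the output.
    return result.stdout.strip() == ""
```

The sketch deliberately looks only at whether anything was printed and
never parses the diff.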

> >> However, this bit means that I shouldn't entirely trust its output:
> >>
> >> > The output emitted when this option is used is subject to change, and so is its interaction with other options (unless explicitly documented).

--remerge-diff output uses diff headers a bit inventively.  And, being
a somewhat new option, I didn't want a repeat of issues like we had
with --cc (where the output format is documented in detail and was for
a few years before we realized that showing diff headers for exactly
two files when there are at least three files is kinda dumb when
renames or mode changes are involved).  So, we needed the flexibility
to change the output in the future.

However, this wording was not intended to detract from the main point
that "empty output means clean merge and non-empty output means
conflicted merge" (it never even occurred to me that someone might
read that part of the documentation and assume that it presents a
problem for checking-if-diff-is-empty).  I use it for the same
purpose, and that's absolutely a guarantee we want to provide.  If you
want to guarantee output format beyond that, though, then I object.

Anyway, in summary...

> > There's basically only one way to display an empty diff, so I suspect that checking that the diff is empty is still going to be enough for your purposes.
> >
> > Note that you can specify e.g. the rename detection threshold to use while merging, and the person doing the merge might have used a different threshold than you're using when you're trying to check if they added other changes. There are also different merge strategies and diff algorithms to choose. That means that you might get false positives and false negatives. Maybe that's still good enough for you.

Yes, Martin has it exactly right.  And --remerge-diff has possible false
positives and false negatives precisely because your original strategy
did too.