Re: [PATCH v2 8/8] merge-tree: provide an easy way to access which files have conflicts

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Fri, 7 Jan 2022 20:36:18 +0100 (CET)

Hi Elijah,

On Wed, 5 Jan 2022, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@xxxxxxxxx>
>
> Callers of `git merge-tree --real` might want an easy way to determine
> which files conflicted.  While they could potentially use the --messages
> option and parse the resulting messages written to that file, those
> messages are not meant to be machine readable.  Provide a simpler
> mechanism of having the user specify --unmerged-list=$FILENAME, and then
> write a NUL-separated list of unmerged filenames to the specified file.

This patch does what the commit message says, and it looks quite
plausible. However, in practice it seems that you need either a tree (if
the merge succeeded) or the list of conflicted files (if the merge
succeeded).

So while it looks relatively clean from the implementation's point of
view, the design itself could probably withstand a bit of consideration.

As I hinted earlier (to be precise, in
https://lore.kernel.org/git/nycvar.QRO.7.76.6.2201071602110.339@xxxxxxxxxxxxxxxxx/),
I had the chance in December to work on the server-side, using `merge-ort`
for a bit. In the following, I will talk about this a bit more than about
this particular patch, but I think it is highly relevant (not a tangent).

One of the things that became clear to me is that we really have an
either/or situation here. Either the merge succeeds, and we _need_ that
tree, or it fails, and we could not care less about the tree at all.

In fact, if the merge fails, we completely ignore the tree, and it would
be better if we would not even write out any Git objects in that case at
all: even just writing the objects would be quite costly at the
server-side scale.

So my (somewhat hacky) patches for a proof-of-concept produced _either_
the hash of the tree on `stdout`, _or_ a header saying that there were
conflicts followed by a NUL-separated list of file names.

Mind you, I did not even get to the point of analyzing things even more
deeply. My partner in crime and I only got to comparing the `merge-ort`
way to the libgit2-based way, trying to get them to compare as much
apples-to-apples as possible [*1*], and we found that even the time to
spawn the Git process (~1-3ms, with all overhead counted in) is _quite_
noticeable, at server-side scale.

Of course, the `merge-ort` performance was _really_ nice when doing
anything remotely complex, then `merge-ort` really blew the libgit2-based
merge out of the water. But that's not the common case. The common case
are merges that involve very few modified files, a single merge base, and
they don't conflict. And those can be processed by the libgit2-based
method within a fraction of the time it takes to even only so much as
spawn `git` (libgit2-based merges can complete in less than a fifth
millisecond, that's at most a fifth of the time it takes to merely run
`git merge-tree`).

The difference between 0.2-0.5ms for libgit2-based merges on the one hand,
and 1-3ms for `merge-ort`-based merges on the other hand, might not seem
like much, but you have to multiply it by the times such a merge is
performed on the server. Which is a _lot_. Way more often than I thought.

In this particular instance, there is a silver-lining: the libgit2-based
merge is not actually recursive. It is a three-way merge. Which means that
we first have to determine a merge base. In our case, this is done by
spawning a Git process anyway, so one of my ideas to move forward is to fold
that merge-base logic into `git merge-tree`, too.

Anyway, the short short is: whenever we can avoid unnecessary work, we
should do so. In the context of this patch, I would say that we should
avoid writing out a tree (and avoid printing its hash to `stdout`) if
there are merge conflicts. And we should avoid writing (and later reading)
a file, if we can get away with avoiding it.

At least in the default case, that is. We still might need a flag to
produce some more information about those merge conflicts. But even in
that case, it would be better to have a list of file names with the three
associated stages than to output the hash of a tree that contains
conflicts (and tons of files _without_ conflicts). The UI needs to
re-generate those conflicts anyway. And remember: a tree can contain
millions of files even if there is but a single conflict. It makes more
sense for `merge-tree --real` to output a concrete list of files that
conflicted, rather than expecting the caller to discern between conflicts
and non-conflicts by processing a tree object.

Maybe you agree with this rationale and re-design the `--real` mode to try
to avoid writing out files in the common case?

About the form of the patch itself: I was tempted to go with the
nitpicking spirit I see on the Git mailing list these days, especially
about the shell script code in the test scripts. But then I realized that
I find such nitpicking pretty unhelpful, myself. The code is good as-is,
even if I would write it differently. It is clear, and it does exactly
what it is supposed to do.

Thank you,
Dscho

Footnote *1*: I did not _quite_ get to the point of comparing the
`merge-ort` merges to the libgit2 ones, unfortunately. I was on my way to
add code to respect `merge.renames = false` so that we could _truly_
compare the `merge-ort` merges to the libgit2 merges (we really will want
to verify that the output is identical, before even considering to enable
recursive merges on the server side, and then only after studying the time
difference), and then had to take off due to the holidays. If you already
have that need to be able to turn off rename-detection on your radar, even
if only for a transitional period, I would be _so_ delighted.