Re: [PATCH v2 8/8] merge-tree: provide an easy way to access which files have conflicts

Elijah Newren <newren@xxxxxxxxx> · Fri, 7 Jan 2022 14:12:56 -0800

On Fri, Jan 7, 2022 at 11:36 AM Johannes Schindelin
<Johannes.Schindelin@xxxxxx> wrote:
>
> Hi Elijah,
>
> On Wed, 5 Jan 2022, Elijah Newren via GitGitGadget wrote:
>
> > From: Elijah Newren <newren@xxxxxxxxx>
> >
> > Callers of `git merge-tree --real` might want an easy way to determine
> > which files conflicted.  While they could potentially use the --messages
> > option and parse the resulting messages written to that file, those
> > messages are not meant to be machine readable.  Provide a simpler
> > mechanism of having the user specify --unmerged-list=$FILENAME, and then
> > write a NUL-separated list of unmerged filenames to the specified file.
>
> This patch does what the commit message says, and it looks quite
> plausible. However, in practice it seems that you need either a tree (if
> the merge succeeded) or the list of conflicted files (if the merge
> succeeded).
>
> So while it looks relatively clean from the implementation's point of
> view, the design itself could probably withstand a bit of consideration.
>
> As I hinted earlier (to be precise, in
> https://lore.kernel.org/git/nycvar.QRO.7.76.6.2201071602110.339@xxxxxxxxxxxxxxxxx/),
> I had the chance in December to work on the server-side, using `merge-ort`
> for a bit. In the following, I will talk about this a bit more than about
> this particular patch, but I think it is highly relevant (not a tangent).
>
> One of the things that became clear to me is that we really have an
> either/or situation here. Either the merge succeeds, and we _need_ that
> tree, or it fails, and we could not care less about the tree at all.
>
> In fact, if the merge fails, we completely ignore the tree, and it would
> be better if we would not even write out any Git objects in that case at
> all: even just writing the objects would be quite costly at the
> server-side scale.
>
> So my (somewhat hacky) patches for a proof-of-concept produced _either_
> the hash of the tree on `stdout`, _or_ a header saying that there were
> conflicts followed by a NUL-separated list of file names.

Did you really check that it only produced one of these?  If you were
using ort, you wrote blob and tree objects to disk, even if you didn't
print their hash on stdout.

> Mind you, I did not even get to the point of analyzing things even more
> deeply. My partner in crime and I only got to comparing the `merge-ort`
> way to the libgit2-based way, trying to get them to compare as much
> apples-to-apples as possible [*1*], and we found that even the time to
> spawn the Git process (~1-3ms, with all overhead counted in) is _quite_
> noticeable, at server-side scale.

1-3ms?  I thought it was a good bit more than that.

> Of course, the `merge-ort` performance was _really_ nice when doing
> anything remotely complex, then `merge-ort` really blew the libgit2-based
> merge out of the water. But that's not the common case. The common case
> are merges that involve very few modified files, a single merge base, and
> they don't conflict. And those can be processed by the libgit2-based
> method within a fraction of the time it takes to even only so much as
> spawn `git` (libgit2-based merges can complete in less than a fifth
> millisecond, that's at most a fifth of the time it takes to merely run
> `git merge-tree`).
>
> The difference between 0.2-0.5ms for libgit2-based merges on the one hand,
> and 1-3ms for `merge-ort`-based merges on the other hand, might not seem
> like much, but you have to multiply it by the times such a merge is
> performed on the server. Which is a _lot_. Way more often than I thought.

Ah, I had been wondering a bit about process overhead.  Having
something that avoids that is definitely helpful.  I tried to design
ort such that its API is relatively clean and easy-to-use, and I
carefully tested and fixed it until it ran memory-leak free.  If
process execution overhead is such a big problem, perhaps you could
use these new functions via libgit.a instead of invoking a git process
and get the best of both worlds?

Unfortunately, in-process would run into problems with finding merge
bases, because the revision walking machinery is definitely not
leak-free -- an annoyance I had to deal with while attempting to clean
up merge-ort since the code I was using always did the merge-base
finding as well as the calls into merge-ort.  (I hear Ævar has some
not-yet-submitted patches that might help with memory leaks in the
revision walking machinery.)

> In this particular instance, there is a silver-lining: the libgit2-based
> merge is not actually recursive. It is a three-way merge. Which means that
> we first have to determine a merge base. In our case, this is done by
> spawning a Git process anyway, so one of my ideas to move forward is to fold
> that merge-base logic into `git merge-tree`, too.

The merge-base logic is already part of merge-tree in my patches.
What exactly did you do with merge-tree?  Were you using the existing
one and feeding it with a merge-base as an input, or did you write
your own that was more like Christian's that expected a merge base?

> Anyway, the short short is: whenever we can avoid unnecessary work, we
> should do so. In the context of this patch, I would say that we should
> avoid writing out a tree (and avoid printing its hash to `stdout`) if
> there are merge conflicts.

We can avoid printing its hash to `stdout`, but it'd take significant
work to avoid writing the tree to an object store, and it cannot be
done at the merge-tree level, it'd require replumbing some bits of
merge-ort (and making some already complex codepaths a bit more
complex, but that's a price I'm willing to pay for significant
performance wins).

Can I first suggest a simpler alternative that may give some
performance wins despite keeping the object writes:

We could make use of the tmp_objdir API that recently merged (see
b3cecf49ea ("tmp-objdir: new API for creating temporary writable
databases", 2021-12-06)).  We could put that tmp-objdir on /dev/shm or
other ramdisk, and if the merge is clean, migrate the contents into
the real object store.  Perhaps we could even pack those objects first
if there are a large number of them, but If it's not clean, we can
just discard the tmp-objdir. Also, as a further variant on this
alternative... packing these objects before migrating if there are a
sufficient number of them.  Now, this is rather unlikely to be needed
in general by merge-tree, because you only need to write new objects
(thus representing files modified on both sides, or whatever leading
trees are needed to reference the updated paths).  However, it might
matter for big enough repos with large enough numbers of changes on
both sides.  And it'd align nicely with my idea for server-side
rebases (where implementing this is on my TODO list), because
server-side rebases are much more likely to generate a large number of
objects.

But if you really want to learn about avoiding object writes...

If you really want to only write tree and blob objects when the merge
is clean, then as far as I can tell you have two options in regards to
the blobs: (1) you'll need to keep all files from three-way content
merges simultaneously in memory until you've determined if the result
is clean, so that you can then write the merged contents out as blobs
at the end.  Or (2) doing all the three-way content merges and keeping
track of whether the result for each is clean, and if they all turn
out to be clean, then redo every single one of those three-way content
merges afterwards so that you can actually write out the merged-result
to disk that time.

I think (2) would cost you a lot more work than you'd save, and I
worry that (1) might risk using large amounts of memory in the big
repositories if there are lots of changes on both sides.  While that
may be uncommon, I've seen folks try to merge things with lots of
changes on both sides, and you do have the server side to worry about
after all.

There are similar issues with the fact that trees are written as they
are processed as well.  Those would also require re-running afterwards
to re-generate the trees from the list of relevant-files-and-trees we
operate on.

However, if you are really curious about trying this out despite the
fact that I think you might be causing more work than you're avoiding
(or potentially requiring a lot more RAM), look for calls to
write_tree() (there are precisely two in merge-ort.c, one for
intermediate trees and one for the toplevel tree) and
write_object_file() (there are precisely two in merge-ort.c, one
within write_tree() for writing tree objects, and one in
handle_content_merge() for writing blob objects).

> And we should avoid writing (and later reading)
> a file, if we can get away with avoiding it.

Sure, we can do that.  Christian had a suggestion that if the
--conflicted-list didn't have an associated filename, then we just
print those to stdout.  Would that be to your liking?  And would you
prefer NUL-separated, or ls-tree style escaping of filenames?

> At least in the default case, that is. We still might need a flag to
> produce some more information about those merge conflicts. But even in
> that case, it would be better to have a list of file names with the three
> associated stages than to output the hash of a tree that contains
> conflicts (and tons of files _without_ conflicts). The UI needs to
> re-generate those conflicts anyway. And remember: a tree can contain
> millions of files even if there is but a single conflict. It makes more
> sense for `merge-tree --real` to output a concrete list of files that
> conflicted, rather than expecting the caller to discern between conflicts
> and non-conflicts by processing a tree object.

???

Where did I suggest that we discern between conflicts and
non-conflicts by processing a tree object?  That's not merely
something with atrocious performance, it's also utterly crazy from a
UI perspective, and is downright *impossible* to achieve in general
anyway.  (Failure to merge binary files.  modify/delete conflicts.
mode conflicts.  file/directory conflicts.  Various rename
permutatations.  There's all kinds of non-content conflicts that are
not representable in a tree, and which folks will miss if they attempt
to parse a tree and surmise what conflicts there were.)  Whatever I
wrote that might have suggested such a course of action needs some
serious rewording or clarification; I consider attempting that to be a
horrible idea.

The tree exists so that if people want to get extra information (and
presumably in a format similar to what they would find in their
working directory if they had asked `git merge` to merge those same
two branches on their laptop), then they can do so.  For example,
perhaps in the list of --conflicted-files they notice a file of
interest and want to ask, "What's found in this particular file?".
They can use the tree together with the filename to get that kind of
info.

> Maybe you agree with this rationale and re-design the `--real` mode to try
> to avoid writing out files in the common case?

I'm totally open to something like having --conflicted-list be given
without a filename (or maybe a '-') and write to stdout instead.

I'm amenable to various tradeoffs to improve performance, but I'm
worried that "not-writing-tree-and-blob object files" sounds like a
goal that might hurt performance rather than help it (either worse
performance in general due to the need to do things twice when the
merge is clean, or else a peak memory usage that is unmanageable and
causes problems).  Perhaps there's a smarter solution I'm not seeing,
or my worries about maximum memory usage are overblown.  However, if
you still want to pursue such core changes, I would also like to see
more performance measuring work to justify them -- especially since in
the common case merge-ort won't recurse into most directories and will
end up only writing a few object files (since only new blob or tree
objects need to be written).  In particular, I think we'd need to do
the following first:

  * Do a real comparison; libgit2 + the separate find-merge-base git
process, vs. the single `git merge-tree --real` call that handles
both.  Excluding the merge-base computation makes it apples to oranges
in my opinion.

  * While measuring the performance of the above, specifically use
trace2 to measure time spent in write_loose_object() in object-file.c.
See how much time object writing actually takes vs. everything else.

  * Try the tmp-objdir changes I suggested (examples of using the API
can be found by searching for "tmp_objdir" at
https://lore.kernel.org/git/d57ae218cf9eaee0b66db299ee1bba9b488b69b1.1640907369.git.gitgitgadget@xxxxxxxxx/
and https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@xxxxxxxxx/),
and then re-measure the performance particular in respect to
write_loose_object() as a percentage of overall time.

  * Get some kind of measure of the maximum number of three-way
content merges needed in merges and the overall size of holding all
the results in memory simultaneously.  If repositories have a million
files and there are merges involving a high enough percentage of files
changes on both sides, then that could certainly theoretically be
quite large.

I also think that doing something in-process (via linking against
libgit.a?) rather than forking `git merge-tree --real` would probably
net you _much_ bigger wins than avoiding writing these object files,
though it'd be unsafe without memory leak fixes for all the merge-base
computations.  That, of course, might not be the only reason you're
leery of such an approach.

> About the form of the patch itself: I was tempted to go with the
> nitpicking spirit I see on the Git mailing list these days, especially
> about the shell script code in the test scripts. But then I realized that
> I find such nitpicking pretty unhelpful, myself. The code is good as-is,
> even if I would write it differently. It is clear, and it does exactly
> what it is supposed to do.
>
> Thank you,
> Dscho
>
> Footnote *1*: I did not _quite_ get to the point of comparing the
> `merge-ort` merges to the libgit2 ones, unfortunately. I was on my way to
> add code to respect `merge.renames = false` so that we could _truly_
> compare the `merge-ort` merges to the libgit2 merges (we really will want
> to verify that the output is identical, before even considering to enable
> recursive merges on the server side, and then only after studying the time
> difference), and then had to take off due to the holidays. If you already
> have that need to be able to turn off rename-detection on your radar, even
> if only for a transitional period, I would be _so_ delighted.

Well, I had no intention of submitting it (and still don't), but I did
implement it a while back for folks who have needs for it in a
transitional period.  It's a pretty simple change.

https://github.com/newren/git/commit/5330a9de77f56f20e546acc65c924fc783f092e6
https://github.com/newren/git/commit/2e7e8d79b0995d352558608b6308060fbc055fd1

:-)