Re: [PATCH v2 8/8] merge-tree: provide an easy way to access which files have conflicts

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Tue, 22 Feb 2022 14:03:46 +0100 (CET)

Hi Elijah,

I meant to answer this mail much earlier, oh well. Sorry. I will only
answer the still open questions below, clipping the quoted text to save
every reader some time.

On Fri, 7 Jan 2022, Elijah Newren wrote:

> On Fri, Jan 7, 2022 at 11:36 AM Johannes Schindelin
> <Johannes.Schindelin@xxxxxx> wrote:
>
> > So my (somewhat hacky) patches for a proof-of-concept produced
> > _either_ the hash of the tree on `stdout`, _or_ a header saying that
> > there were conflicts followed by a NUL-separated list of file names.
>
> Did you really check that it only produced one of these?  If you were
> using ort, you wrote blob and tree objects to disk, even if you didn't
> print their hash on stdout.

D'oh. No, I had not checked that.

> > Mind you, I did not even get to the point of analyzing things even more
> > deeply. My partner in crime and I only got to comparing the `merge-ort`
> > way to the libgit2-based way, trying to get them to compare as much
> > apples-to-apples as possible [*1*], and we found that even the time to
> > spawn the Git process (~1-3ms, with all overhead counted in) is _quite_
> > noticeable, at server-side scale.
>
> 1-3ms?  I thought it was a good bit more than that.

The absolute number really depends on a lot of factors, i.e. what is
relevant is the relative difference between in-process vs spawning a new
process.

> If process execution overhead is such a big problem, perhaps you could
> use these new functions via libgit.a instead of invoking a git process
> and get the best of both worlds?

For now, I solved this by combining multiple Git process invocations into
a single one, as described here:

> > In this particular instance, there is a silver-lining: the
> > libgit2-based merge is not actually recursive. It is a three-way
> > merge. Which means that we first have to determine a merge base. In
> > our case, this is done by spawning a Git process anyway, so one of my
> > ideas to move forward is to fold that merge-base logic into `git
> > merge-tree`, too.
>
> The merge-base logic is already part of merge-tree in my patches.
> What exactly did you do with merge-tree?  Were you using the existing
> one and feeding it with a merge-base as an input, or did you write
> your own that was more like Christian's that expected a merge base?

For an apples-to-apples comparison, I had to stay with the very same merge
base logic as before, i.e. originally I added a new option to `merge-tree`
that would take a merge base and then take a non-recursive path. In my
current version, it is merely an option that even determines said merge
base in the same process (and yes, it leaks memory, but that does not
matter because the process is short-lived anyway).

> > Anyway, the short short is: whenever we can avoid unnecessary work, we
> > should do so. In the context of this patch, I would say that we should
> > avoid writing out a tree (and avoid printing its hash to `stdout`) if
> > there are merge conflicts.
>
> We can avoid printing its hash to `stdout`, but it'd take significant
> work to avoid writing the tree to an object store, and it cannot be
> done at the merge-tree level, it'd require replumbing some bits of
> merge-ort (and making some already complex codepaths a bit more
> complex, but that's a price I'm willing to pay for significant
> performance wins).

Yes, let's leave things as-are. It is not worth the trouble.

> We could make use of the tmp_objdir API that recently merged (see
> b3cecf49ea ("tmp-objdir: new API for creating temporary writable
> databases", 2021-12-06)).  We could put that tmp-objdir on /dev/shm or
> other ramdisk, and if the merge is clean, migrate the contents into
> the real object store.  Perhaps we could even pack those objects first
> if there are a large number of them, but If it's not clean, we can
> just discard the tmp-objdir. Also, as a further variant on this
> alternative... packing these objects before migrating if there are a
> sufficient number of them.  Now, this is rather unlikely to be needed
> in general by merge-tree, because you only need to write new objects
> (thus representing files modified on both sides, or whatever leading
> trees are needed to reference the updated paths).  However, it might
> matter for big enough repos with large enough numbers of changes on
> both sides.  And it'd align nicely with my idea for server-side
> rebases (where implementing this is on my TODO list), because
> server-side rebases are much more likely to generate a large number of
> objects.
>
> But if you really want to learn about avoiding object writes...
>
> If you really want to only write tree and blob objects when the merge
> is clean, then as far as I can tell you have two options in regards to
> the blobs: (1) you'll need to keep all files from three-way content
> merges simultaneously in memory until you've determined if the result
> is clean, so that you can then write the merged contents out as blobs
> at the end.  Or (2) doing all the three-way content merges and keeping
> track of whether the result for each is clean, and if they all turn
> out to be clean, then redo every single one of those three-way content
> merges afterwards so that you can actually write out the merged-result
> to disk that time.
>
> I think (2) would cost you a lot more work than you'd save, and I
> worry that (1) might risk using large amounts of memory in the big
> repositories if there are lots of changes on both sides.  While that
> may be uncommon, I've seen folks try to merge things with lots of
> changes on both sides, and you do have the server side to worry about
> after all.
>
> There are similar issues with the fact that trees are written as they
> are processed as well.  Those would also require re-running afterwards
> to re-generate the trees from the list of relevant-files-and-trees we
> operate on.
>
> However, if you are really curious about trying this out despite the
> fact that I think you might be causing more work than you're avoiding
> (or potentially requiring a lot more RAM), look for calls to
> write_tree() (there are precisely two in merge-ort.c, one for
> intermediate trees and one for the toplevel tree) and
> write_object_file() (there are precisely two in merge-ort.c, one
> within write_tree() for writing tree objects, and one in
> handle_content_merge() for writing blob objects).

Thank you for this thorough analysis.

I am a big fan of crossing bridges when they are reached, and not miles
before that. So _iff_ it turns out that the speed, or the potential
cluttering with objects, should present a problem in the future, I am
inclined to follow the tmp-objdir route you described above (thank you for
pointing it out, I had not made the connection to this here scenario).

> > Footnote *1*: I did not _quite_ get to the point of comparing the
> > `merge-ort` merges to the libgit2 ones, unfortunately. I was on my way to
> > add code to respect `merge.renames = false` so that we could _truly_
> > compare the `merge-ort` merges to the libgit2 merges (we really will want
> > to verify that the output is identical, before even considering to enable
> > recursive merges on the server side, and then only after studying the time
> > difference), and then had to take off due to the holidays. If you already
> > have that need to be able to turn off rename-detection on your radar, even
> > if only for a transitional period, I would be _so_ delighted.
>
> Well, I had no intention of submitting it (and still don't), but I did
> implement it a while back for folks who have needs for it in a
> transitional period.  It's a pretty simple change.
>
> https://github.com/newren/git/commit/5330a9de77f56f20e546acc65c924fc783f092e6
> https://github.com/newren/git/commit/2e7e8d79b0995d352558608b6308060fbc055fd1

Thank you _so_ much for that. It was super helpful, and it allowed me to
gather enough evidence to justify continuing to work on this code.

Ciao,
Dscho