Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`

Taylor Blau <me@xxxxxxxxxxxx> · Mon, 13 Nov 2023 17:34:42 -0500

On Mon, Nov 13, 2023 at 05:02:54PM -0500, Jeff King wrote:
> On Fri, Nov 10, 2023 at 03:51:18PM -0800, Elijah Newren wrote:
>
> > This is unsafe; the object may need to be read later within the same
> > merge. [...]
> >
> > I think (untested) that you could fix this by creating two packs
> > instead of just one.  In particular, add a call to
> > flush_odb_transcation() after the "redo_after_renames" block in
> > merge_ort_nonrecursive_internal().
>
> It might not be too hard to just let in-process callers access the
> objects we've written. A quick and dirty patch is below, which works
> with the test you suggested (the test still fails because it finds a
> conflict, but it gets past the "woah, I can't find that sha1" part).

That's a very slick idea, and I think that this series has some legs to
stand on now as a result.

There are a few of interesting conclusions we can draw here:

  1. This solves the recursive merge problem that Elijah mentioned
     earlier where we could generate up to 2^N packs (where N is the
     maximum depth of the recursive merge).

  2. This also solves the case where the merge-ort code needs to read an
     object that it wrote earlier on in the same process without having
     to flush out intermediate packs. So in the best case we need only a
     single pack (instead of two).

  3. This also solves the 'replay' issue, *and* allows us to avoid the
     tmp-objdir thing there entirely, since we can simulate object reads
     in the bulk-checkin pack.

So I think that this is a direction worth pursuing.

In terms of making those lookups faster, I think that what you'd want is
an oidmap. The overhead is slightly unfortunate, but I think that any
other solution which requires keeping the written array in sorted order
would in practice be more expensive as you have to constantly reallocate
and copy portions of the array around to maintain its invariant.

> I don't know if that is sufficient, though. Would other spawned
> processes (hooks, external merge drivers, and so on) need to be able to
> see these objects, too?

Interesting point. In theory those processes could ask about newly
created objects, and if they were spawned before the bulk-checkin pack
was flushed, those lookups would fail.

But that requires that, e.g. for hooks, that we know a-priori the object
ID of some newly-written objects. If you wanted to make those lookups
succeed, I think there are a couple of options:

  - teach child-processes about the bulk-checkin pack, and let them
    perform the same fake lookup

  - flush (but do not close) the bulk-checkin transaction

In any event, I think that this is a sufficiently rare and niche case
that we'd be OK to declare that you should not expect the above
scenarios to work when using `--write-pack`. If someone does ask for
that feature in the future, we could implement it relatively painlessly
using one of the above options.

Thanks,
Taylor