Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`

Elijah Newren <newren@xxxxxxxxx> · Mon, 13 Nov 2023 18:50:08 -0800

On Mon, Nov 13, 2023 at 2:34 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
>
> On Mon, Nov 13, 2023 at 05:02:54PM -0500, Jeff King wrote:
> > On Fri, Nov 10, 2023 at 03:51:18PM -0800, Elijah Newren wrote:
> >
> > > This is unsafe; the object may need to be read later within the same
> > > merge. [...]
> > >
> > > I think (untested) that you could fix this by creating two packs
> > > instead of just one.  In particular, add a call to
> > > flush_odb_transcation() after the "redo_after_renames" block in
> > > merge_ort_nonrecursive_internal().
> >
> > It might not be too hard to just let in-process callers access the
> > objects we've written. A quick and dirty patch is below, which works
> > with the test you suggested (the test still fails because it finds a
> > conflict, but it gets past the "woah, I can't find that sha1" part).
>
> That's a very slick idea, and I think that this series has some legs to
> stand on now as a result.
>
> There are a few of interesting conclusions we can draw here:
>
>   1. This solves the recursive merge problem that Elijah mentioned
>      earlier where we could generate up to 2^N packs (where N is the
>      maximum depth of the recursive merge).
>
>   2. This also solves the case where the merge-ort code needs to read an
>      object that it wrote earlier on in the same process without having
>      to flush out intermediate packs. So in the best case we need only a
>      single pack (instead of two).
>
>   3. This also solves the 'replay' issue, *and* allows us to avoid the
>      tmp-objdir thing there entirely, since we can simulate object reads
>      in the bulk-checkin pack.
>
> So I think that this is a direction worth pursuing.

Agreed; this looks great.  It's basically bringing fast-import-like
functionality -- writing objects to a single new packfile while making
previous objects accessible to subsequent ones -- to additional
callers.

> In terms of making those lookups faster, I think that what you'd want is
> an oidmap. The overhead is slightly unfortunate, but I think that any
> other solution which requires keeping the written array in sorted order
> would in practice be more expensive as you have to constantly reallocate
> and copy portions of the array around to maintain its invariant.

When comparing the overhead of an oidmap to a bunch of inodes,
however, it seems relatively cheap.  :-)

> > I don't know if that is sufficient, though. Would other spawned
> > processes (hooks, external merge drivers, and so on) need to be able to
> > see these objects, too?
>
> Interesting point. In theory those processes could ask about newly
> created objects, and if they were spawned before the bulk-checkin pack
> was flushed, those lookups would fail.

One of the big design differences that I was pushing really hard with
git-replay was performance and things that came with it -- no
worktree, no per-commit hooks (which are nearly ruled out by no
worktree, but it's still worth calling out separately), etc.  A
post-operation hook could be fine, but would also not get to assume a
worktree.

merge-tree is the same as far as hooks; I'd rather just not have them,
but if we did, they'd be a post-operation hook.

In both cases, that makes hooks not much of a sticking point.

External merge drivers, however...

> But that requires that, e.g. for hooks, that we know a-priori the object
> ID of some newly-written objects. If you wanted to make those lookups
> succeed, I think there are a couple of options:
>
>   - teach child-processes about the bulk-checkin pack, and let them
>     perform the same fake lookup
>
>   - flush (but do not close) the bulk-checkin transaction
>
> In any event, I think that this is a sufficiently rare and niche case
> that we'd be OK to declare that you should not expect the above
> scenarios to work when using `--write-pack`. If someone does ask for
> that feature in the future, we could implement it relatively painlessly
> using one of the above options.

Seems reasonable to me.