On Sun, Oct 08, 2023 at 12:02:27AM -0700, Elijah Newren wrote:
> On Fri, Oct 6, 2023 at 4:02 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
> >
> > On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> > > Taylor Blau <me@xxxxxxxxxxxx> writes:
> > >
> > > > When using merge-tree often within a repository[^1], it is possible to
> > > > generate a relatively large number of loose objects, which can result in
> > > > degraded performance, and inode exhaustion in extreme cases.
> > >
> > > Well, be it "git merge-tree" or "git merge", new loose objects tend
> > > to accumulate until "gc" kicks in, so it is not a new problem for
> > > mere mortals, is it?
> >
> > Yeah, I would definitely suspect that this is more of an issue for
> > forges than individual Git users.
>
> It may still be nice to also do this optimization for plain "git
> merge" as well. I had it in my list of ideas somewhere to do a
> "fast-import-like" thing to avoid writing loose objects, as I
> suspected that'd actually be a performance impediment.

I think that would be worth doing, definitely. I do worry a little bit
about locking in low-quality deltas (or lack thereof), but more on that
below...

> Oh, at the contributor summit, Johannes said he only needed pass/fail,
> not the actual commits, which is why I suggested this route. If you
> need to keep the actual commits, then this won't help.

Yep, agreed. Like I said earlier, I think there are some niche
scenarios where we only care about "would this merge cleanly?", but in
most other cases we want to keep the actual tree around.

> I was interested in the same question as Junio, but from a different
> angle. fast-import documentation points out that the packs it creates
> are suboptimal with poorer delta choices. Are the packs created by
> bulk-checkin prone to the same issues? When I was thinking in terms
> of having "git merge" use fast-import for pack creation instead of
> writing loose objects (an idea I never investigated very far), I was
> wondering if I'd need to mark those packs as "less optimal" and do
> something to make sure they were more likely to be repacked.
>
> I believe geometric repacking didn't exist back when I was thinking
> about this, and perhaps geometric repacking automatically handles
> things nicely for us. Does it, or are we risking retaining
> sub-optimal deltas from the bulk-checkin code?
>
> (I've never really cracked open the pack code, so I have absolutely no
> idea; I'm just curious.)

Yes, the bulk-checkin mechanism suffers from an even worse problem,
which is that the pack it creates will contain no deltas whatsoever.
The contents of the pack are just written as-is, so there is no fancy
deltification going on. I think Michael Haggerty (?) suggested to me
off-list that it might be interesting to have a flag with which we
could mark packs that have bad/no deltas, so that we don't implicitly
trust their contents as having high-quality deltas.
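To make the repacking side of that concrete, the knobs I'd reach for
look something like this (on a hypothetical repository that has
accumulated one of these no-delta packs; the flags are real, but when
to run each is up to the operator):

    # Roll small packs up into larger ones. Note that pack-objects
    # reuses existing on-disk deltas from the packs being rolled up
    # by default.
    $ git repack -d --geometric=2

    # Repack everything and recompute deltas from scratch; '-f'
    # passes '--no-reuse-delta' down to pack-objects, which is what
    # would actually repair a pack written with no deltas at all.
    $ git repack -adf

A "this pack has bad/no deltas" marker like the one mentioned above
could then serve as a cheap trigger for deciding when the more
expensive '-f' repack is warranted.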
> > I think that like anything, this is a trade-off. Having lots of packs
> > can be a performance hindrance just like having lots of loose objects.
> > But since we can represent more objects with fewer inodes when packed,
> > storing those objects together in a pack is preferable when (a) you're
> > doing lots of test-merges, and (b) you want to keep those objects
> > around, e.g., because they are reachable.
>
> A couple of the comments earlier in the series suggested this was
> about streaming blobs to a pack in the bulk checkin code. Are tree
> and commit objects also put in the pack, or will those continue to be
> written loosely?

This covers both blobs and trees, since IIUC that is all we need for
merge-tree to be able to write any objects it creates into a pack.
AFAIK merge-tree never generates any commit objects. But teaching
'merge' to perform the same bulk-checkin trick would just require us to
implement index_bulk_commit_checkin_in_core() or similar, which is
straightforward to do on top of the existing code.

Thanks,
Taylor
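P.S.: To illustrate the "no commit objects" point (with made-up branch
names, and assuming a clean merge so that merge-tree prints just the
resulting tree's OID):

    $ tree=$(git merge-tree --write-tree side-a side-b)
    $ git cat-file -t "$tree"
    tree

merge-tree only ever writes the merged tree (plus any newly-created
blobs), which is why a commit counterpart to the bulk-checkin code only
becomes necessary once 'merge' itself learns this trick.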