On Fri, Oct 6, 2023 at 4:02 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
>
> On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@xxxxxxxxxxxx> writes:
> >
> > > When using merge-tree often within a repository[^1], it is possible to
> > > generate a relatively large number of loose objects, which can result in
> > > degraded performance, and inode exhaustion in extreme cases.
> >
> > Well, be it "git merge-tree" or "git merge", new loose objects tend
> > to accumulate until "gc" kicks in, so it is not a new problem for
> > mere mortals, is it?
>
> Yeah, I would definitely suspect that this is more of an issue for
> forges than individual Git users.

It may still be nice to do this optimization for plain "git merge" as
well.  I had it in my list of ideas somewhere to do a "fast-import-like"
thing to avoid writing loose objects, as I suspected they'd actually be
a performance impediment.

> > As one "interesting" use case of "merge-tree" is for a Git hosting
> > site with bare repositories to offer trial merges, without which
> > majority of the object their repositories acquire would have been in
> > packs pushed by their users, "Gee, loose objects consume many inodes
> > in exchange for easier selective pruning" becomes an issue, right?
>
> Right.
> > Just like it hurts performance to have too many loose object files,
> > presumably it would also hurt performance to keep too many packs,
> > each came from such a trial merge.  Do we have a "gc" story offered
> > for these packs created by the new feature?  E.g., "once merge-tree
> > is done creating a trial merge, we can discard the objects created
> > in the pack, because we never expose new objects in the pack to the
> > outside, processes running simultaneously, so instead closing the
> > new packfile by calling flush_bulk_checkin_packfile(), we can safely
> > unlink the temporary pack.  We do not even need to spend cycles to
> > run a gc that requires cycles to enumerate what is still reachable",
> > or something like that?
>
> I know Johannes worked on something like this recently. IIRC, it
> effectively does something like:
>
>     struct tmp_objdir *tmp_objdir = tmp_objdir_create(...);
>     tmp_objdir_replace_primary_odb(tmp_objdir, 1);
>
> at the beginning of a merge operation, and:
>
>     tmp_objdir_discard_objects(tmp_objdir);
>
> at the end. I haven't followed that work off-list very closely, but it
> is only possible for GitHub to discard certain niche kinds of
> merges/rebases, since in general we make the objects created during test
> merges available via refs/pull/N/{merge,rebase}.

Oh, at the contributor summit, Johannes said he only needed pass/fail,
not the actual commits, which is why I suggested this route.  If you
need to keep the actual commits, then this won't help.

I was interested in the same question as Junio, but from a different
angle.  The fast-import documentation points out that the packs it
creates are suboptimal, with poorer delta choices.  Are the packs
created by bulk-checkin prone to the same issues?  When I was thinking
in terms of having "git merge" use fast-import for pack creation
instead of writing loose objects (an idea I never investigated very
far), I was wondering if I'd need to mark those packs as "less optimal"
and do something to make sure they were more likely to be repacked.

I believe geometric repacking didn't exist back when I was thinking
about this, and perhaps geometric repacking automatically handles
things nicely for us.
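(For anyone following along: by geometric repacking I mean the
"git repack --geometric=<factor>" mode, which rolls the smaller packs in
a repository up together whenever their object counts fall outside a
geometric progression.  On the forge side I'd imagine the maintenance
step would be something like

    git repack --geometric=2 -d

but that invocation is only a sketch from memory on my part; I haven't
tried it in the trial-merge scenario being discussed here.)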
Does it, or are we risking retaining sub-optimal deltas from the
bulk-checkin code?  (I've never really cracked open the pack code, so I
have absolutely no idea; I'm just curious.)

> I think that like anything, this is a trade-off. Having lots of packs
> can be a performance hindrance just like having lots of loose objects.
> But since we can represent more objects with fewer inodes when packed,
> storing those objects together in a pack is preferable when (a) you're
> doing lots of test-merges, and (b) you want to keep those objects
> around, e.g., because they are reachable.

A couple of the comments earlier in the series suggested this was about
streaming blobs to a pack in the bulk checkin code.  Are tree and
commit objects also put in the pack, or will those continue to be
written loosely?
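(Tangentially: a crude way to watch where the new objects land on a
throwaway clone would be something like

    git count-objects -v    # note "count" (loose) and "in-pack"
    git merge-tree --write-tree side1 side2
    git count-objects -v    # see which of the two numbers grew

with "side1"/"side2" standing in for whatever branches you're
test-merging.  That only shows loose-vs-packed counts, not object
types, and I haven't actually run it against this series, so treat it
as a sketch rather than a result.)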