On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:

> Taylor Blau <me@xxxxxxxxxxxx> writes:
>
> > When using merge-tree often within a repository[^1], it is possible to
> > generate a relatively large number of loose objects, which can result in
> > degraded performance, and inode exhaustion in extreme cases.
>
> Well, be it "git merge-tree" or "git merge", new loose objects tend
> to accumulate until "gc" kicks in, so it is not a new problem for
> mere mortals, is it?

Yeah, I would definitely suspect that this is more of an issue for
forges than individual Git users.

> As one "interesting" use case of "merge-tree" is for a Git hosting
> site with bare repositories to offer trial merges, without which
> majority of the object their repositories acquire would have been in
> packs pushed by their users, "Gee, loose objects consume many inodes
> in exchange for easier selective pruning" becomes an issue, right?

Right.

> Just like it hurts performance to have too many loose object files,
> presumably it would also hurt performance to keep too many packs,
> each came from such a trial merge.  Do we have a "gc" story offered
> for these packs created by the new feature?  E.g., "once merge-tree
> is done creating a trial merge, we can discard the objects created
> in the pack, because we never expose new objects in the pack to the
> outside, processes running simultaneously, so instead closing the
> new packfile by calling flush_bulk_checkin_packfile(), we can safely
> unlink the temporary pack.  We do not even need to spend cycles to
> run a gc that requires cycles to enumerate what is still reachable",
> or something like that?

I know Johannes worked on something like this recently. IIRC, it
effectively does something like:

    struct tmp_objdir *tmp_objdir = tmp_objdir_create(...);
    tmp_objdir_replace_primary_odb(tmp_objdir, 1);

at the beginning of a merge operation, and:

    tmp_objdir_discard_objects(tmp_objdir);

at the end.
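(For context, a minimal sketch of the accumulation problem being
discussed. The repository name and file names are made up, and "git
add"/"git commit" stand in for the objects a trial merge would write;
the point is only that each object written outside a pack costs one
loose file, i.e. one inode, until gc consolidates them:)

```shell
# Hypothetical throwaway repo; each new object becomes one loose
# file (one inode) under .git/objects/xx/ until gc runs.
git init -q demo && cd demo

# Write 100 distinct blobs and commit them.
for i in $(seq 1 100); do echo "content $i" > "f$i"; done
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm demo

# Loose objects live in the two-hex-digit fan-out directories.
loose_before=$(find .git/objects -type f -path '*/??/*' | wc -l)
echo "loose objects before gc: $loose_before"   # 100 blobs + tree + commit

# gc consolidates reachable objects into a single pack and deletes
# the loose copies it packed.
git gc -q --prune=now
loose_after=$(find .git/objects -type f -path '*/??/*' | wc -l)
packs=$(find .git/objects/pack -name '*.pack' | wc -l)
echo "loose objects after gc: $loose_after, packs: $packs"
```

Until that gc actually runs, though, every trial merge keeps paying
the per-object inode cost, which is the forge-scale concern above.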
I haven't followed that work off-list very closely, but it is only
possible for GitHub to discard certain niche kinds of merges/rebases,
since in general we make the objects created during test merges
available via refs/pull/N/{merge,rebase}.

I think that like anything, this is a trade-off. Having lots of packs
can be a performance hindrance just like having lots of loose objects.
But since we can represent more objects with fewer inodes when packed,
storing those objects together in a pack is preferable when (a) you're
doing lots of test-merges, and (b) you want to keep those objects
around, e.g., because they are reachable.

Thanks,
Taylor
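(A small illustration of the inode side of that trade-off, using a
made-up repository; `git count-objects -v` reports both halves:
"count" is loose objects, one file each, while "in-pack" objects share
a single pack file's inodes:)

```shell
# Hypothetical repo with 50 files committed, so ~52 objects in total
# (50 blobs + 1 tree + 1 commit), all initially loose.
git init -q tradeoff && cd tradeoff
for i in $(seq 1 50); do echo "v$i" > "f$i"; done
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm demo

# Repack everything reachable into one pack and drop the loose copies.
git repack -a -d -q

# count = remaining loose objects; in-pack = objects stored in packs.
loose=$(git count-objects -v | awk '/^count:/ {print $2}')
inpack=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
npacks=$(git count-objects -v | awk '/^packs:/ {print $2}')
echo "loose: $loose, in-pack: $inpack, packs: $npacks"
```

So when the objects have to stay around anyway (as with the
refs/pull/N/{merge,rebase} case above), one pack holding them all is
far cheaper in inodes than the same objects stored loose.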