Re: [PATCH 7/7] builtin/merge-tree.c: implement support for `--write-pack`

Patrick Steinhardt <ps@xxxxxx> · Mon, 9 Oct 2023 12:54:12 +0200

On Fri, Oct 06, 2023 at 07:02:05PM -0400, Taylor Blau wrote:
> On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@xxxxxxxxxxxx> writes:
> >
> > > When using merge-tree often within a repository[^1], it is possible to
> > > generate a relatively large number of loose objects, which can result in
> > > degraded performance, and inode exhaustion in extreme cases.
> >
> > Well, be it "git merge-tree" or "git merge", new loose objects tend
> > to accumulate until "gc" kicks in, so it is not a new problem for
> > mere mortals, is it?
> 
> Yeah, I would definitely suspect that this is more of an issue for
> forges than individual Git users.
> 
> > As one "interesting" use case of "merge-tree" is for a Git hosting
> > site with bare repositories to offer trial merges, without which
> > majority of the object their repositories acquire would have been in
> > packs pushed by their users, "Gee, loose objects consume many inodes
> > in exchange for easier selective pruning" becomes an issue, right?
> 
> Right.
> 
> > Just like it hurts performance to have too many loose object files,
> > presumably it would also hurt performance to keep too many packs,
> > each came from such a trial merge.  Do we have a "gc" story offered
> > for these packs created by the new feature?  E.g., "once merge-tree
> > is done creating a trial merge, we can discard the objects created
> > in the pack, because we never expose new objects in the pack to the
> > outside, processes running simultaneously, so instead closing the
> > new packfile by calling flush_bulk_checkin_packfile(), we can safely
> > unlink the temporary pack.  We do not even need to spend cycles to
> > run a gc that requires cycles to enumerate what is still reachable",
> > or something like that?
> 
> I know Johannes worked on something like this recently. IIRC, it
> effectively does something like:
> 
>     struct tmp_objdir *tmp_objdir = tmp_objdir_create(...);
>     tmp_objdir_replace_primary_odb(tmp_objdir, 1);
> 
> at the beginning of a merge operation, and:
> 
>     tmp_objdir_discard_objects(tmp_objdir);
> 
> at the end. I haven't followed that work off-list very closely, but it
> is only possible for GitHub to discard certain niche kinds of
> merges/rebases, since in general we make the objects created during test
> merges available via refs/pull/N/{merge,rebase}.
> 
> I think that like anything, this is a trade-off. Having lots of packs
> can be a performance hindrance just like having lots of loose objects.
> But since we can represent more objects with fewer inodes when packed,
> storing those objects together in a pack is preferable when (a) you're
> doing lots of test-merges, and (b) you want to keep those objects
> around, e.g., because they are reachable.

In Gitaly, we usually set up quarantine directories for all operations
that create objects. This allows us to discard any newly written objects
in case either the RPC call gets cancelled or in case our access checks
determine that the change should not be allowed. The logic is rather
simple:

    1. Create a new temporary directory.

    2. Set up the new temporary directory as main object database via
       the `GIT_OBJECT_DIRECTORY` environment variable.

    3. Set up the main repository's object database via the
       `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variable.

    4. Execute Git commands that write objects with these environment
       variables set up. The new objects will end up neatly contained in
       the temporary directory.

    5. Once done, either discard the temporary object database or
       migrate objects into the main object daatabase.

I wonder whether this would be a viable approach for you, as well.

Patrick
Attachment:
signature.asc

Description: PGP signature