Re: [PATCH 7/7] builtin/merge-tree.c: implement support for `--write-pack`

Patrick Steinhardt <ps@xxxxxx> · Tue, 10 Oct 2023 08:36:03 +0200

On Mon, Oct 09, 2023 at 12:08:32PM -0400, Taylor Blau wrote:
> On Mon, Oct 09, 2023 at 12:54:12PM +0200, Patrick Steinhardt wrote:
> > In Gitaly, we usually set up quarantine directories for all operations
> > that create objects. This allows us to discard any newly written objects
> > in case either the RPC call gets cancelled or in case our access checks
> > determine that the change should not be allowed. The logic is rather
> > simple:
> >
> >     1. Create a new temporary directory.
> >
> >     2. Set up the new temporary directory as main object database via
> >        the `GIT_OBJECT_DIRECTORY` environment variable.
> >
> >     3. Set up the main repository's object database via the
> >        `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variable.
> 
> Is there a reason not to run Git in the quarantine environment and list
> the main repository as an alternate via $GIT_DIR/objects/info/alternates
> instead of the GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable?

The quarantine environment as we use it is really only a single object
database, so you cannot run Git inside of it directly.

> >     4. Execute Git commands that write objects with these environment
> >        variables set up. The new objects will end up neatly contained in
> >        the temporary directory.
> >
> >     5. Once done, either discard the temporary object database or
> >        migrate objects into the main object daatabase.
> 
> Interesting. I'm curious why you don't use the builtin tmp_objdir
> mechanism in Git itself. Do you need to run more than one command in the
> quarantine environment? If so, that makes sense that you'd want to have
> a scratch repository that lasts beyond the lifetime of a single process.

It's a mixture of things:

    - Many commands simply don't set up a temporary object directory.

    - We want to check the result after the objects have been generated.
      Many of the commands don't provide hooks to do so in a reasonable
      way. So we want to check the result _after_ the command has exited
      already, and objects should not yet have been migrated into the
      target object database at that point.

    - Sometimes we indeed want to run multiple Git commands. We use this
      e.g. for worktreeless rebases, where we run a succession of
      commands to rebase every single commit.

So ultimately, our own quarantine directory sits at a conceptually
higher level than what Git commands would be able to provide.

> > I wonder whether this would be a viable approach for you, as well.
> 
> I think that the main problem that we are trying to solve with this
> series is creating a potentially large number of loose objects. I think
> that you could do something like what you propose above, with a 'git
> repacks -adk' before moving its objects over back to the main repository.

Yeah, that's certainly possible. I had the feeling that there are two
different problems that we're trying to solve in this thread:

    - The problem that the objects are of temporary nature. Regardless
      of whether they are in a packfile or loose, it doesn't make much
      sense to put them into the main object database as they would be
      unreachable anyway and thus get pruned after some time.

    - The problem that writing many loose objects can be slow.

My proposal only addresses the first problem.

> But since we're working in a single process only when doing a merge-tree
> operation, I think it is probably more expedient to write the pack's
> bytes directly.

If it's about efficiency and not about how we discard those objects
after the command has ran to completion then yes, I agree.

Patrick
Attachment:
signature.asc

Description: PGP signature