On Mon, Oct 09, 2023 at 12:08:32PM -0400, Taylor Blau wrote: > On Mon, Oct 09, 2023 at 12:54:12PM +0200, Patrick Steinhardt wrote: > > In Gitaly, we usually set up quarantine directories for all operations > > that create objects. This allows us to discard any newly written objects > > in case either the RPC call gets cancelled or in case our access checks > > determine that the change should not be allowed. The logic is rather > > simple: > > > > 1. Create a new temporary directory. > > > > 2. Set up the new temporary directory as main object database via > > the `GIT_OBJECT_DIRECTORY` environment variable. > > > > 3. Set up the main repository's object database via the > > `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variable. > > Is there a reason not to run Git in the quarantine environment and list > the main repository as an alternate via $GIT_DIR/objects/info/alternates > instead of the GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable? The quarantine environment as we use it is really only a single object database, so you cannot run Git inside of it directly. > > 4. Execute Git commands that write objects with these environment > > variables set up. The new objects will end up neatly contained in > > the temporary directory. > > > > 5. Once done, either discard the temporary object database or > > migrate objects into the main object daatabase. > > Interesting. I'm curious why you don't use the builtin tmp_objdir > mechanism in Git itself. Do you need to run more than one command in the > quarantine environment? If so, that makes sense that you'd want to have > a scratch repository that lasts beyond the lifetime of a single process. It's a mixture of things: - Many commands simply don't set up a temporary object directory. - We want to check the result after the objects have been generated. Many of the commands don't provide hooks to do so in a reasonable way. So we want to check the result _after_ the command has exited already, and objects should not yet have been migrated into the target object database at that point. - Sometimes we indeed want to run multiple Git commands. We use this e.g. for worktreeless rebases, where we run a succession of commands to rebase every single commit. So ultimately, our own quarantine directory sits at a conceptually higher level than what Git commands would be able to provide. > > I wonder whether this would be a viable approach for you, as well. > > I think that the main problem that we are trying to solve with this > series is creating a potentially large number of loose objects. I think > that you could do something like what you propose above, with a 'git > repacks -adk' before moving its objects over back to the main repository. Yeah, that's certainly possible. I had the feeling that there are two different problems that we're trying to solve in this thread: - The problem that the objects are of temporary nature. Regardless of whether they are in a packfile or loose, it doesn't make much sense to put them into the main object database as they would be unreachable anyway and thus get pruned after some time. - The problem that writing many loose objects can be slow. My proposal only addresses the first problem. > But since we're working in a single process only when doing a merge-tree > operation, I think it is probably more expedient to write the pack's > bytes directly. If it's about efficiency and not about how we discard those objects after the command has ran to completion then yes, I agree. Patrick
Attachment:
signature.asc
Description: PGP signature