On Mon, Nov 06, 2023 at 04:46:32PM +0100, Johannes Schindelin wrote:

> I wonder whether a more generic approach would be more desirable, an
> approach that would work for `git replay`, too, for example (where
> streaming objects does not work because they need to be made available
> immediately because subsequent `merge_incore_nonrecursive()` might
> expect the created objects to be present)?
>
> What I have in mind is more along Elijah's suggestion at the
> Contributor Summit to use the `tmp_objdir*()` machinery. But instead of
> discarding the temporary object database, the contained objects would
> be repacked and the `.pack`, (maybe `.rev`) and the `.idx` file would
> then be moved (in that order) before discarding the temporary object
> database.
>
> This would probably need to be implemented as a new
> `tmp_objdir_pack_and_migrate()` function that basically spawns
> `pack-objects` and feeds it the list of generated objects, writing
> directly into the non-temporary object directory, then discarding the
> `tmp_objdir`.

Perhaps I'm missing some piece of the puzzle, but I'm not sure what
you're trying to accomplish with that approach.

If the goal is to increase performance by avoiding the loose object
writes, then we haven't really helped much. We're still writing them,
and then writing them again for the repack.

If the goal is just to end up with a single nice pack for the long
term, then why do we need to use tmp_objdir at all? The point of that
API is to avoid letting other simultaneous processes see the
intermediate state before we're committed to keeping the objects
around. That makes sense for receiving a fetch or push, since we want
to do some quality checks on the objects before agreeing to keep them.
But does it make sense for a merge? Sure, in some workflows (like
GitHub's test merges) we might end up throwing away the merge result if
it's not clean. But there is no real downside to other processes seeing
those objects. They can be cleaned up at the next pruning repack.

I guess if your scenario requirements include "and we are never allowed
to run a pruning repack", then that could make sense. And I know that
has been a historical issue for GitHub. But I'm not sure it's
necessarily a good driver for an upstream feature.

As an alternative, though, I wonder if you need to have access to the
objects outside of the merge process at all. If not, then rather than
an alternate object store, what if that single process wrote to a
streaming pack _and_ used its running in-core index of the objects to
allow access via the usual object retrieval? Then you'd get a single,
clean pack as the outcome _and_ you'd get the performance boost over
just "write loose objects, repack, and prune".

-Peff
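
[For reference, a rough sketch of the kind of helper the quoted mail
proposes. The name `tmp_objdir_pack_and_migrate()` comes from that
mail; the `written` object list and all of the wiring below are
assumptions layered on git's existing tmp-objdir and run-command APIs,
not an implementation: spawn `pack-objects` with the temporary object
directory exported in its environment, feed it the newly written
objects on stdin, let it write the pack straight into the real
objects/pack directory, then drop the quarantine.]

  #include "git-compat-util.h"
  #include "hex.h"
  #include "oid-array.h"
  #include "path.h"
  #include "run-command.h"
  #include "strvec.h"
  #include "tmp-objdir.h"

  /* hypothetical helper; "written" lists the objects created in "t" */
  static int tmp_objdir_pack_and_migrate(struct tmp_objdir *t,
					 const struct oid_array *written)
  {
	struct child_process pack = CHILD_PROCESS_INIT;
	FILE *in;
	size_t i;
	int ret;

	/* write pack-<hash>.{pack,idx} directly into the real pack dir */
	strvec_pushl(&pack.args, "pack-objects", "-q",
		     git_path("objects/pack/pack"), NULL);
	/* make the quarantined objects visible to the child process */
	strvec_pushv(&pack.env, tmp_objdir_env(t));
	pack.git_cmd = 1;
	pack.in = -1;

	if (start_command(&pack))
		return -1;

	/* feed the generated object names on stdin, one per line */
	in = xfdopen(pack.in, "w");
	for (i = 0; i < written->nr; i++)
		fprintf(in, "%s\n", oid_to_hex(&written->oid[i]));
	fclose(in);

	ret = finish_command(&pack);
	if (!ret)
		tmp_objdir_discard(t); /* pack landed; drop the quarantine */
	return ret;
  }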