Hi Junio,

On Fri, 10 Nov 2023, Junio C Hamano wrote:

> Taylor Blau <me@xxxxxxxxxxxx> writes:
>
> > On Thu, Nov 09, 2023 at 02:40:28AM +0900, Junio C Hamano wrote:
> >> * tb/merge-tree-write-pack (2023-10-23) 5 commits
> > ...
> > This series received a couple of LGTMs from you and Patrick:
> >
> >   - https://lore.kernel.org/git/xmqqo7go7w63.fsf@gitster.g/#t
> >   - https://lore.kernel.org/git/ZTjKmcV5c_EFuoGo@tanuki/
>
> Yup, I am aware of them.
>
> > Johannes had posted some comments[1] about instead using a temporary
> > object store where objects are written as loose that would extend to git
> > replay....
>
> I was hoping to hear from Johannes saying he agrees with the above.
> It is not strictly required, but is much nice to have once we hear
> "let's step back a bit---are we going in the right direction?" and
> it has been responded.

When I wrote about `tmp_objdir`, there were a couple of things going on in
my mind:

- First of all, I was hesitant to write this at all because I knew that I
  lack the time to engage meaningfully in any follow-up discussion.

- To be honest, the approach to teach `merge-ort.c` anything about whether
  objects are written loosely or streamed into a pack strikes me as
  somewhat contrary to the goal of separating concerns. The merge
  machinery should not know, in my mind, how the objects are stored.

- A long-standing paradigm in Git is that pack files are not used until
  finalized. Breaking such a paradigm after being in effect for a long
  time, in my experience, is always followed by unwelcome "gifts that keep
  on giving".

- The streaming pack approach struck me as something that would only work
  properly if Git was designed with single-process operations in mind. But
  Git was originally designed around the process-proliferating Unix
  philosophy, and it is deeply ingrained in Git to this day.
  As such, I do not expect the streaming pack approach to generalize to a
  noteworthy fraction of Git operations, and I would love to focus on an
  approach that generalizes better.

- At the Git Contributor Summit, I had talked about my goals, and Elijah
  helpfully pointed out how `--remerge-diff` does it, and I wanted to
  pursue that idea further.

- The scenario I want to address (and that I assumed the patch series
  under discussion tried to address, too) is a very specific, server-side
  scenario where many `merge-tree`/`replay` runs produce _many_ loose
  objects. Quite a fraction of those are produced by processes that run
  into a SIGTERM-enforced timeout, and the `tmp_objdir` approach would
  naturally help: unneeded loose objects would not even be added to the
  primary object database but be reaped with the temporary object
  databases.

- While it may sound as if the sheer number of loose objects is the
  primary problem, an even more pressing issue I need to address is that
  competing processes that try to work on a snapshot of the loose objects
  (which does not exist, you cannot "take a snapshot", all you can do is
  to enumerate the directories sequentially) seem sometimes to process
  loose tree/commit objects that reference other objects that have been
  missed due to racy reads/writes/enumerations. The reason for this is
  that the loose objects produced by `merge-tree`/`replay` are added
  non-transactionally, and concurrent reads are prone to run into racy
  conditions where they only see a part of those objects.

- Even just using `tmp_objdir_migrate()` could help a lot by narrowing the
  window for those racy conditions.

- The number of inodes has been a concern, yes, but not such a pressing
  one that I could afford spending any further thought on the idea to
  reduce them. In any case, a working theory is that this concern would
  already be helped by avoiding the loose objects produced by failing
  merges/rebases (whose results are not used) or by merges/rebases running
  into a timeout.

- Streaming packs, if I understand correctly, do not do deltas. That in and
  of itself can cause file size issues, and light-weight maintenance may
  not even bother to try finding deltas, thereby causing follow-on
  problems.

With all this in mind, I do not think that I can afford to spend brain
cycles on the streaming-pack approach. I do not intend to discourage
anybody from working on that approach, yet I won't encourage anyone,
either.

Ciao,
Johannes
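
For illustration only, the quarantine idea behind the `tmp_objdir` points above can be sketched as a toy in Python. This is not Git's implementation and does not use Git's on-disk object format: the class `TmpObjdir`, its methods, and the helper `loose_objects` are hypothetical names made up for this sketch, and object IDs are simply the SHA-1 of the raw payload. It only shows the shape of the approach: objects are written into a temporary directory, moved into the primary store only when the operation succeeds, and thrown away wholesale otherwise.

```python
import hashlib
import os
import shutil
import tempfile
import zlib


class TmpObjdir:
    """Toy quarantine directory (hypothetical; not Git's tmp_objdir code)."""

    def __init__(self, primary):
        self.primary = primary
        os.makedirs(primary, exist_ok=True)
        # Keep the quarantine inside the primary store so that the final
        # renames stay on one filesystem.
        self.path = tempfile.mkdtemp(prefix="tmp_objdir-", dir=primary)

    def write_object(self, data):
        # Simplification: the ID is the SHA-1 of the raw payload only.
        oid = hashlib.sha1(data).hexdigest()
        fanout = os.path.join(self.path, oid[:2])
        os.makedirs(fanout, exist_ok=True)
        with open(os.path.join(fanout, oid[2:]), "wb") as f:
            f.write(zlib.compress(data))
        return oid

    def migrate(self):
        # Move every quarantined object into the primary store. This is not
        # atomic across objects; it only shortens the window during which a
        # concurrent reader can see a partial set.
        for fanout in sorted(os.listdir(self.path)):
            src_dir = os.path.join(self.path, fanout)
            dst_dir = os.path.join(self.primary, fanout)
            os.makedirs(dst_dir, exist_ok=True)
            for name in os.listdir(src_dir):
                os.replace(os.path.join(src_dir, name),
                           os.path.join(dst_dir, name))
        self.destroy()

    def destroy(self):
        # Failed or timed-out run: nothing ever reaches the primary store.
        shutil.rmtree(self.path, ignore_errors=True)


def loose_objects(primary):
    """Enumerate object IDs visible in the primary store (two-hex fan-out)."""
    found = set()
    for fanout in os.listdir(primary):
        if len(fanout) != 2:
            continue  # skips quarantine directories such as tmp_objdir-*
        for name in os.listdir(os.path.join(primary, fanout)):
            found.add(fanout + name)
    return found
```

A process that hits its timeout would call `destroy()` (or simply have its quarantine directory reaped later), so its objects never show up in the primary store; only a successful run calls `migrate()`.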