tb/merge-tree-write-pack, was Re: What's cooking in git.git (Nov 2023, #04; Thu, 9)

Hi Junio,

On Fri, 10 Nov 2023, Junio C Hamano wrote:

> Taylor Blau <me@xxxxxxxxxxxx> writes:
>
> > On Thu, Nov 09, 2023 at 02:40:28AM +0900, Junio C Hamano wrote:
> >> * tb/merge-tree-write-pack (2023-10-23) 5 commits
> > ...
> > This series received a couple of LGTMs from you and Patrick:
> >
> >   - https://lore.kernel.org/git/xmqqo7go7w63.fsf@gitster.g/#t
> >   - https://lore.kernel.org/git/ZTjKmcV5c_EFuoGo@tanuki/
>
> Yup, I am aware of them.
>
> > Johannes had posted some comments[1] about instead using a temporary
> > object store where objects are written as loose that would extend to git
> > replay....
>
> I was hoping to hear from Johannes saying he agrees with the above.
> It is not strictly required, but it is much nicer to have once we
> hear "let's step back a bit---are we going in the right direction?"
> and that question has been responded to.

When I wrote about `tmp_objdir`, there were a couple of things going on in
my mind:

- First of all, I was hesitant to write this at all because I knew that I
  lack the time to engage meaningfully in any follow-up discussion.

- To be honest, the approach of teaching `merge-ort.c` anything about
  whether objects are written loosely or streamed into a pack strikes me
  as somewhat contrary to the goal of separating concerns. The merge
  machinery should not, in my mind, know how the objects are stored.

- A long-standing paradigm in Git is that pack files are not used until
  they are finalized. Breaking such a paradigm after it has been in
  effect for a long time is, in my experience, always followed by
  unwelcome "gifts that keep on giving".

- The streaming pack approach struck me as something that would only
  work properly if Git had been designed with single-process operations
  in mind. But Git was originally designed around the
  process-proliferating Unix philosophy, which remains deeply ingrained
  in Git to this day. As such, I do not expect the streaming pack
  approach to generalize to a noteworthy fraction of Git operations, and
  I would love to focus on an approach that generalizes better.

- At the Git Contributor Summit, I had talked about my goals, and Elijah
  helpfully pointed out how `--remerge-diff` does it, and I wanted to
  pursue that idea further.

- The scenario I want to address (and that I assumed the patch series
  under discussion tried to address, too) is a very specific, server-side
  scenario where many `merge-tree`/`replay` runs produce _many_ loose
  objects. A sizable fraction of those are produced by processes that run
  into a SIGTERM-enforced timeout, and the `tmp_objdir` approach would
  naturally help: unneeded loose objects would never be added to the
  primary object database but would be reaped along with the temporary
  object databases.
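
  For illustration only, the quarantine semantics can be approximated
  from the command line with the `GIT_OBJECT_DIRECTORY` and
  `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variables; `tmp_objdir`
  does the equivalent in-process, and the demo repository/object below
  are made up:

```shell
# Sketch of quarantine-style object writes, approximating tmp_objdir
# from the command line. The demo repo and object are made up.
repo=$(mktemp -d)/repo
git init -q "$repo"
quarantine=$(mktemp -d)            # the temporary object database

# Writes land in the quarantine; reads fall back to the primary store.
oid=$(echo demo |
    GIT_OBJECT_DIRECTORY=$quarantine \
    GIT_ALTERNATE_OBJECT_DIRECTORIES=$repo/.git/objects \
    git -C "$repo" hash-object -w --stdin)

git -C "$repo" cat-file -e "$oid" || echo "not in primary store yet"

# Success path: migrate quarantined objects into the primary store
# (a crude stand-in for tmp_objdir_migrate(), which renames them).
cp -R "$quarantine"/. "$repo/.git/objects/"
git -C "$repo" cat-file -e "$oid" && echo "migrated"

# Timeout/failure path would instead reap everything in one go:
#     rm -rf "$quarantine"
```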

- While it may sound as if the sheer number of loose objects is the
  primary problem, an even more pressing issue I need to address is that
  competing processes trying to work on a snapshot of the loose objects
  (which does not exist; you cannot "take a snapshot", all you can do is
  enumerate the directories sequentially) sometimes seem to process
  loose tree/commit objects that reference other objects that were
  missed due to racy reads/writes/enumerations. The reason is that the
  loose objects produced by `merge-tree`/`replay` are added
  non-transactionally, and concurrent readers are prone to run into race
  conditions where they see only a part of those objects.
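
  The failure mode can be simulated deterministically: make a tree
  object readable while a blob it references has not been written yet,
  which is exactly what a racing reader may observe. In this toy demo,
  `mktree --missing` plays the part of the unlucky timing:

```shell
# Deterministic stand-in for the race: a tree becomes readable while
# a blob it references is not yet in the object store.
demo=$(mktemp -d) && cd "$demo"
git init -q .

oid=$(echo content | git hash-object --stdin)  # compute OID, no -w: not written

# Write a tree referencing the not-yet-present blob.
tree=$(printf '100644 blob %s\tfile.txt\n' "$oid" | git mktree --missing)

git cat-file -e "$tree" && echo "tree readable"
git cat-file -e "$oid" || echo "blob still missing"  # what a racing reader sees

echo content | git hash-object -w --stdin >/dev/null  # writer catches up
git cat-file -e "$oid" && echo "blob present"
```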

- Even just using `tmp_objdir_migrate()` could help a lot by narrowing
  the window for those race conditions.

- The number of inodes has been a concern, yes, but not such a pressing
  one that I could afford to spend further thought on ideas to reduce
  it. In any case, a working theory is that this concern would already
  be helped by avoiding the loose objects produced by failing
  merges/rebases (whose results are not used) or by merges/rebases that
  run into a timeout.

- Streaming packs, if I understand correctly, do not do deltas. That in
  and of itself can cause file-size issues, and lightweight maintenance
  may not even bother to try finding deltas, thereby causing follow-on
  problems.
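
  A toy demonstration of the size impact (real-world numbers will vary):
  packing two near-identical, incompressible ~100 kB blobs with delta
  search disabled (`--window=0`) yields a pack roughly twice the size of
  the delta-enabled one:

```shell
# Two ~100 kB blobs differing in one line, packed with and without
# delta search (--window=0 disables it entirely).
demo=$(mktemp -d) && cd "$demo" && git init -q .
head -c 100000 /dev/urandom > big.bin
{ echo extra-line; cat big.bin; } > big2.bin
a=$(git hash-object -w big.bin)
b=$(git hash-object -w big2.bin)
printf '%s\n%s\n' "$a" "$b" > objects.txt

git pack-objects with-deltas          < objects.txt >/dev/null
git pack-objects --window=0 no-deltas < objects.txt >/dev/null

# The no-delta pack stores two full, incompressible copies: roughly
# twice the size of the delta-enabled pack.
ls -l with-deltas-*.pack no-deltas-*.pack
```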

With all this in mind, I do not think that I can afford to spend brain
cycles on the streaming-pack approach. I do not intend to discourage
anybody from working on that approach, yet I won't encourage anyone,
either.

Ciao,
Johannes
