Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs

Jeff King <peff@xxxxxxxx> · Tue, 28 Sep 2021 19:17:22 -0400

On Tue, Aug 31, 2021 at 02:26:38AM +0000, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@xxxxxxxxx>
> 
> The tmp_objdir API provides the ability to create temporary object
> directories, but was designed with the goal of having subprocesses
> access these object stores, followed by the main process migrating
> objects from it to the main object store or just deleting it.  The
> subprocesses would view it as their primary datastore and write to it.
> 
> For the --remerge-diff option we want to add to show & log, we want all
> writes of intermediate merge results (both blobs and trees) to go to
> this alternate object store; since those writes will be done by the main
> process, we need this "alternate" object store to actually be the
> primary object store.  When show & log are done, they'll simply remove
> this temporary object store.

I think this is consistent with the original design of tmp_objdir. I
just never needed to do anything in-process before, so overriding the
environment of sub-processes was sufficient.

I do think these are dangerous and may cause bugs, though. Anything you
write while the tmp_objdir is marked as "primary" is going to go away
when you remove it. So any object names you reference outside of that,
either:

  - by another object that you create after the tmp_objdir is removed;
    or

  - in a ref

are going to turn into repository corruption. Of course that's true for
the existing sub-processes, too, but here we're touching a wider variety
of code.

Obviously the objects we write as part of remerge-diff are meant to be
temporary, and we'll never reference them in any other way. And usually
we would not expect a diff to write any other objects. But one thing
that comes to mind if textconv caching.

If you do a remerge diff on a blob that uses textconv, and the caching
feature is turned on, then we'll write out a new blob with the cached
value, and eventually a new tree and refs/notes/ pointer referencing it.
I'm not sure of the timing of all of that, but it seems likely to me
that at least some of that will end up in your tmp_objdir.

If you remove the tmp_objdir as the primary as soon as you're done with
the merge, but before you run the diff, you might be OK, though.

If not, then I think the solution is probably not to install this as the
"primary", but rather:

  - do the specific remerge-diff writes we want using a special "write
    to this object dir" variant of write_object_file()

  - install the tmp_objdir as an alternate, so the reading side (which
    is diff code that doesn't otherwise know about our remerge-diff
    trickery) can find them

And that lets other writers avoid writing into the temporary area
accidentally.

In that sense this is kind of like the pretend_object_file() interface,
except that it's storing the objects on disk instead of in memory. Of
course another way of doing that would be to stuff the object data into
tempfiles and just put pointers into the in-memory cached_objects array.

It's also not entirely foolproof (nor is the existing
pretend_object_file()). Any operation which is fooled into thinking we
have object X because it's in the fake-object list or the tmp_objdir may
reference that object erroneously, creating a corruption. But it's still
safer than allowing arbitrary writes into the tmp_objdir.

  Side note: The pretend_object_file() approach is actually even better,
  because we know the object is fake. So it does not confuse
  write_object_file()'s "do we already have this object" freshening
  check.

  I suspect it could even be made faster than the tmp_objdir approach.
  From our perspective, these objects really are tempfiles. So we could
  write them as such, not worrying about things like fsyncing them,
  naming them into place, etc. We could just write them out, then mmap
  the results, and put the pointers into cached_objects (currently it
  insists on malloc-ing a copy of the input buffer, but that seems like
  an easy extension to add).

  In fact, I think you could get away with just _one_ tempfile per
  merge. Open up one tempfile. Write out all of the objects you want to
  "store" into it in sequence, and record the lseek() offsets before and
  after for each object. Then mmap the whole result, and stuff the
  appropriate pointers (based on arithmetic with the offsets) into the
  cached_objects list.

  As a bonus, this tempfile can easily be in $TMPDIR, meaning
  remerge-diff could work on a read-only repository (and the OS can
  perhaps handle the data better, especially if $TMPDIR isn't a regular
  filesystem).

> We also need one more thing -- `git log --remerge-diff` can cause the
> temporary object store to fill up with loose objects.  Rather than
> trying to gc that are known garbage anyway, we simply want to know the
> location of the temporary object store so we can purge the loose objects
> after each merge.

This paragraph confused me. At first I thought you were talking about
how to avoid calling "gc --auto" because we don't want to waste time
thinking all those loose objects were worth gc-ing. But we wouldn't do
that anyway (git-log does not expect to make objects so doesn't call it
at all, and anyway you'd expect it to happen at the end of a process,
after we've already removed the tmp_objdir).

But I think you just mean: we can build up a bunch of loose objects,
which is inefficient. We don't want to gc them, because that's even less
efficient. So we want to clean them out between merges.

But I don't see any code to do that here. I guess that's maybe why
you've added tmp_objdir_path(), to find them later. But IMHO this would
be much better encapsulated as tmp_objdir_clear_objects() or something.

But simpler still: is there any reason not to just drop and re-create
the tmp-objdir for each merge? It's not very expensive to do so (and
certainly not compared to the cost of actually writing out the objects).

-Peff