Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs

Elijah Newren <newren@xxxxxxxxx> · Thu, 30 Sep 2021 19:23:14 -0700

On Thu, Sep 30, 2021 at 12:26 AM Jeff King <peff@xxxxxxxx> wrote:
>
> On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote:
>
> > So there are other parts of running `git log` that are not read-only,
> > besides what I'm doing.  That's unfortunate.  Can we just disable
> > those things in this section of the code?
>
> It might be possible for the cache (we'd still generate entries, but
> just not write them). It's "just" an optimization, though it's possible
> there are cases where it's really annoying to do so. I'd imagine those
> are a stretch. I do use textconv to decrypt gpg blobs with a key on a
> hardware token, for example. I don't require a touch for each one,
> though, nor would I use textconv caching anyway (since the point is not
> to store the unencrypted contents).
>
> It wouldn't necessarily be OK to disable object writes in general.
> Though I'd think since a diff is _conceptually_ a read-only operation
> that any writes it wants to do would be in the same caching /
> optimization boat.  I'd also worry about the maintenance cost of having
> to annotate any such case.
>
> > Also, I think it makes sense that textconv caching causes a problem
> > via writing new blobs and trees *and* refs that reference these blobs
> > and trees.  However, I'm not sure I'm really grasping it, because I
> > don't see how your solutions are safe either.  I'll mention more after
> > each one.
>
> Heh. So I originally wrote more on this in my earlier email, but worried
> I was getting too into the weeds and would confuse you, so I pared it
> down. But I managed to be confusing anyway. :)
>
> Here's one mental model. If you have a temporary object store, then
> these are the bad things that can happen:
>
>   1. Some code may write an object to it, thinking the data is saved,
>      but we later throw it away.
>
>   2. Some code may _skip_ writing an object, thinking that we already
>      have that object available (when in fact we'll later throw it
>      away).
>
>   3. Some code may record the fact that we have an object in memory, and
>      then later (after the temporary object store is not in use), may
>      point to it when writing an object or reference.
>
> Pointing the primary object-dir at a temporary directory (like your
> patch does) means we're subject to all three.
>
> Problem (1) happens because unrelated code may not realize we've swapped
> out the primary object writing code for this throw-away directory. But
> we could require writers to explicitly indicate that they want to write
> to the temporary area, while allowing other writes to go to the regular
> object store. This could be done with a flag or variant of
> write_object_file(). Or if the temporary area isn't a regular object
> directory at all (like the whole tempfile / pretend thing), then this
> happens automatically, because the regular object-write paths don't even
> know about our temporary area.
>
> Problem (2) is mostly handled inside write_object_file(). It will
> optimize out writes of objects we already have. So it needs to know when
> we have an object "for real", and when we just have it in our temporary
> store. If we make the tmp_objdir the primary, then it doesn't know this
> (though we could teach it to check. In the pretend_object_file() system,
> it already knows this (it uses the "freshen" code paths which actually
> want to find the entry in the filesystem).
>
> Problem (3) is the hardest one, because we don't really distinguish
> between readers and writers. So if I call has_object_file(), is it
> because I want to access the object? Or is it because I want to generate
> a new reference to it, which must ensure that it exists? One of those is
> OK to look at the tmp_objdir (and indeed, is the whole point of making
> it in the first place), and the other is the first step to corrupting
> the repository. ;)
>
> With the environment variables we set up for running external commands
> (like hooks) in the receive-pack quarantine, one bit of safety we have
> is to prevent ref writes entirely in those commands. That works
> reasonably well. Even if they may write new objects that might get
> thrown away (if the push is rejected), they'd ultimately need to point
> to those objects with a ref.  And because those hooks spend their
> _entire_ run-time in the quarantine state, they can't ever write a ref.
>
> Whereas doing it in-process is a little dicier. Right now the textconv
> cache writes out a new tree for every entry we add, so it happens to do
> the ref write while the tmp-objdir is still in place. But as this
> comment notes:
>
>   $ git grep -B6 notes_cache_write diff.c
>   diff.c-         /*
>   diff.c-          * we could save up changes and flush them all at the end,
>   diff.c-          * but we would need an extra call after all diffing is done.
>   diff.c-          * Since generating a cache entry is the slow path anyway,
>   diff.c-          * this extra overhead probably isn't a big deal.
>   diff.c-          */
>   diff.c:         notes_cache_write(textconv->cache);
>
> this is slow, and it would be perfectly reasonable to flush the cache at
> the end of the process (e.g., using an atexit() handler). In which case
> we'd hit this problem (3) exactly: we'd generate and remember objects
> during the tmp-objdir period, but only later actually reference them.
> This is more likely than the receive-pack hook case, because we're doing
> it all in process.
>
> > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > the merge, but before you run the diff, you might be OK, though.
> >
> > It has to be after I run the diff, because the diff needs access to
> > the temporary files to diff against them.
>
> Right, of course. I was too fixated on the object-write part, forgetting
> that the whole point of the exercise is to later read them back. :)
>
> > > If not, then I think the solution is probably not to install this as the
> > > "primary", but rather:
> > >
> > >   - do the specific remerge-diff writes we want using a special "write
> > >     to this object dir" variant of write_object_file()
> > >
> > >   - install the tmp_objdir as an alternate, so the reading side (which
> > >     is diff code that doesn't otherwise know about our remerge-diff
> > >     trickery) can find them
> > >
> > > And that lets other writers avoid writing into the temporary area
> > > accidentally.
> >
> > Doesn't this have the same problem?  If something similar to textconv
> > caching were to create new trees that referenced the existing blobs
> > and then added a ref that referred to that tree, then you have the
> > exact same problem, right?  The new tree and the new ref would be
> > corrupt as soon as the tmp-objdir went away.  Whether or not other
> > callers write to the temporary area isn't the issue, it's whether they
> > write refs that can refer to anything from the temporary area.  Or am
> > I missing something?
>
> The key thing here is in the first step, where remerge-diff is
> explicitly saying "I want to write to this temporary object-dir". But
> the textconv-cache code does not; it gets to write to the normal spot.
> So it avoids problem (1).
>
> You're right that it does not avoid problem (3) exactly. But triggering
> that would require some code not just writing other objects or
> references while the tmp-objdir is in place, but specifically
> referencing the objects that remerge-diff did put into the tmp-objdir.
> That seems a lot less likely to me (because the thing we're most worried
> about is unrelated code that just happens to write while the tmp-objdir
> is in place).
>
> > > In that sense this is kind of like the pretend_object_file() interface,
> > > except that it's storing the objects on disk instead of in memory. Of
> > > course another way of doing that would be to stuff the object data into
> > > tempfiles and just put pointers into the in-memory cached_objects array.
> > >
> > > It's also not entirely foolproof (nor is the existing
> > > pretend_object_file()). Any operation which is fooled into thinking we
> > > have object X because it's in the fake-object list or the tmp_objdir may
> > > reference that object erroneously, creating a corruption. But it's still
> > > safer than allowing arbitrary writes into the tmp_objdir.
> >
> > Why is the pretend_object_file() interface safer?  Does it disable the
> > textconv caching somehow?  I don't see why it helps avoid this
> > problem.
>
> So hopefully it's clearer now from what I wrote above, but just
> connecting the dots:
>
>   1. Unrelated code calling write_object_file() would still write real,
>      durable objects, as usual.
>
>   2. The write_object_file() "skip this write" optimization already
>      ignores the pretend_object_file() objects while checking "do we
>      have this object".
>
> > Interesting.  I'm not familiar with the pretend_object_file() code,
> > but I think I catch what you're saying.  I don't see anything in
> > object-file.c that allows removing items from the cache, as I would
> > want to do, but I'm sure I could add something.
>
> Right, it's pretty limited now, and there's only one caller. And in fact
> I've wanted to get rid of it, exactly because it can create the danger
> we're discussing here. But I still think it's _less_ dangerous than
> fully replacing the object directory.
>
> > (I'm still not sure how this avoids the problem of things like
> > textconv caching trying to write an object that references one of
> > these, though.  Should we just manually disable textconv caching
> > during a remerge-diff?)
>
> It can't avoid it completely, but in general, the textconv cache should
> be writing and referencing its own objects. Even if they happen to be
> the same as something remerge-diff generated, it won't know that until
> it has tried to write it (and so that's why having write_object_file()
> continue to really write the object is important).
>
>
> Hopefully that clears up my line of thinking. I do think a lot of this
> is kind of theoretical, in the sense that:
>
>   - textconv caching is an obscure feature that we _could_ just disable
>     during remerge-diffs
>
>   - there's a reasonable chance that there isn't any other code that
>     wants to write during a diff
>
> But this is all scary and error-prone enough that I'd prefer an approach
> that has the "least surprise". So any solution where random code calling
> write_object_file() _can't_ accidentally write to a throw-away directory
> seems like a safer less surprising thing.

Yes, thanks for taking the time to go into this detail.

> It does mean that the remerge-diff code needs to be explicit in its
> object writes to say "this needs to go to the temporary area" (whether
> it's a flag, or calling into a totally different function or subsystem).
> I'm hoping that's not hard to do (because its writes are done explicitly
> by the remerge-diff code), but I worry that it may be (because you are
> relying on more generic tree code to the write under the hood).

All blob and tree objects written during merge-ort are done by
explicit write_object_file() calls that exist within merge-ort.c.  So
this approach should be doable, it just needs the external caller to
set some flag saying to write these objects elsewhere.