Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Thu, 3 Aug 2017 12:08:49 -0700

On Wed, 02 Aug 2017 13:51:37 -0700
Junio C Hamano <gitster@xxxxxxxxx> wrote:

> > The complication is in the "git gc" operation for the case (*).
> > Today, "git gc" uses a reachability walk to decide which objects to
> > remove --- an object referenced by no other object is fair game to
> > remove.  With (*), there is another kind of object that must not be
> > removed: if an object that I made, M, points to a missing/promised
> > object, O, pointed to by an object I fetched, F, then I cannot prune
> > F unless there is another fetched object present to anchor O.
> 
> Absolutely.  Lazy-objects support comes with certain cost and this
> is one of them.  
> 
> But I do not think it is realistic to expect that you can prune
> anything you fetched from the "other place" (i.e. the source
> 'lazy-objects' hook reads from).  After all, once they give out
> objects to their clients (like us in this case), they cannot prune
> it, if we take the "implicit promise" approach to avoid the cost to
> transmit and maintain a separate "object list".

By this, do you mean that the client cannot prune anything lazily loaded
from the server? If yes, I understand that the server cannot prune
anything (for the reasons that you describe), but I don't see how that
applies to the client.
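
To make the client-side question concrete, here is a toy model of the
(*) case in Python. It is only a sketch, not Git code: the graph
representation and the names (must_keep, promisor) are made up for
illustration, and it assumes the "implicit promise" approach, in which
every fetched object vouches for whatever it references.

```python
# Toy model only; not Git code. All names here are hypothetical.
def must_keep(objects, promisor, refs):
    """Return unreachable fetched objects that gc still must not prune.

    objects:  dict mapping each present object to the objects it references
    promisor: set of present objects that were fetched from the server
    refs:     names that refs point to (the roots of the reachability walk)
    """
    reachable, missing, stack = set(), set(), list(refs)
    while stack:
        name = stack.pop()
        if name in reachable:
            continue
        if name not in objects:
            missing.add(name)  # referenced but absent: needs a promise
            continue
        reachable.add(name)
        stack.extend(objects[name])

    # Missing objects already vouched for by a reachable fetched object.
    anchored = {o for f in promisor & reachable
                for o in objects[f] if o in missing}
    unanchored = missing - anchored

    # An unreachable fetched object stays if it is the only anchor left.
    return {f for f in promisor - reachable
            if f in objects and unanchored & set(objects[f])}


# M (made locally) and F (fetched) both point to the missing object O.
print(must_keep({"M": ["O"], "F": ["O"]}, {"F"}, ["M"]))  # {'F'}
```

In this model the client can prune F as soon as some other fetched
object that references O is itself reachable, so the restriction is
narrower than "never prune anything that was fetched".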

> > For example: suppose I have a sparse checkout and run
> >
> > 	git fetch origin refs/pulls/x
> > 	git checkout -b topic FETCH_HEAD
> > 	echo "Some great modification" >> README
> > 	git add README
> > 	git commit --amend
> >
> > When I run "git gc", there is nothing pointing to the commit that was
> > pointed to by the remote ref refs/pulls/x, so it can be pruned.  I
> > would naively also expect that the tree pointed to by that commit
> > could be pruned.  But pruning it means pruning the promise that made
> > it permissible to lack various blobs that my topic branch refers to
> > that are outside the sparse checkout area.  So "git gc" must notice
> > that it is not safe to prune that tree.
> >
> > This feels hacky.  I prefer the promised object list over this
> > approach.
> 
> I think they are moral equivalents implemented differently with
> different assumptions.  The example we are discussing makes an extra
> assumption: In order to reduce the cost of transferring and
> maintaining the list, we assume that all objects that came during
> that transfer are implicitly "promised", i.e. everything behind each
> of these objects will later be available on demand.  How these
> objects are marked is up to the exact mechanism (my preference is to
> mark the resulting packfile as special; Jon Tan's message to which
> my message was a response alluded to using an alternate object
> store).  If you choose to maintain a separate "object list" and have
> the "other side" explicitly give it, perhaps you can lift that
> assumption and replace it with some other assumption that assumes
> less.

Marking might be an issue if we expect the lazy loader to emit an object
after every hash, like in the current design. There would thus be one
mark per object, with overhead similar to the promise list. (Having said
that, batching is possible - I plan to optimize common cases like
checkout, and have such a patch online for an older "promised blob"
design [1].)
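
As a rough illustration of the overhead argument, the following Python
sketch counts marks under the assumption that each loader invocation
produces one unit (a loose object or a pack) that needs exactly one
mark. The function name and the batch_size parameter are hypothetical
and do not correspond to anything in the patch set.

```python
# Hypothetical sketch; assumes one mark per loader invocation.
def marks_needed(hashes, batch_size=1):
    """Return how many marks result from loading `hashes` in batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Split the requested hashes into consecutive batches.
    batches = [hashes[i:i + batch_size]
               for i in range(0, len(hashes), batch_size)]
    return len(batches)


wanted = ["hash%d" % i for i in range(10)]
print(marks_needed(wanted))                 # one mark per object
print(marks_needed(wanted, batch_size=10))  # a single mark for the batch
```

With batch_size=1 the number of marks equals the number of objects,
which is the case that resembles a promise list.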

Overhead could be reduced by embedding the mark in both the packed and
loose objects, requiring a different format (instead of having a
separate "catalog" of marked files). This seems more complicated than
using an alternate object store, hence my suggestion.

[1] https://github.com/jonathantanmy/git/commit/14f07d3f06bc3a1a2c9bca85adc8c42b230b9143