Re: Missing Promisor Objects in Partial Repo Design Doc

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Wed, 9 Oct 2024 11:53:11 -0700

Junio C Hamano <gitster@xxxxxxxxx> writes:
> > (C2b is a bit of a special case. Despite not being in a promisor pack,
> > it is still considered to be a promisor object since C3 directly
> > references it.)
> 
> Yes, and I suspect the root cause of this confusion is because
> "promisor object", as defined today, is a flawed concept.  If C2b
> were pointed by a local ref, just like the case the ref points at
> C2a, they should be treated the same way, as both of them are
> locally created.  To put it another way, presumably the local have
> already been pushed out to elsewhere and the promisor remote got
> hold of them, and that is why C3 can build on top of them.  And the
> fact C2b is directly reachable from C3 and C2a is not should not
> have any relevance if C2a or C2b are not _included_ in promisor
> packs (hence both of them need to be included in the local pack).
> 
> Two concepts that would have been useful are (1) objects that are in
> promisor packs and (2) objects that are reachable from an object
> that is in a promisor pack.  I do not see how the current definition
> of "promisor objects" (i.e. in a promisor pack, or one hop from an
> object in a promisor pack) is useful in any context.

The one-hop part in the current definition is meant to (a) explain what
objects the client knows the remote has (in theory the client has no
knowledge of objects beyond the first hop, but we now know this theory
to not be true) and (b) explain what objects a non-promisor object can
reference (in particular, a non-promisor tree can reference promisor
blobs, even when our knowledge of that promisor blob only comes from a
tree in a promisor pack).

If we regard a promisor commit being a child of a non-promisor commit
as a "bad state" that needs to be fixed [1], then the current one-hop
definition seems to be equivalent to (2).
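
To make the comparison concrete, here is a rough sketch on a toy object
graph. The graph, the pack contents and the helper names are all made up
for illustration (this is not how Git stores or walks objects). It only
shows that the one-hop set and the "reachable from a promisor pack" set
differ in the bad state and coincide once the bad state is fixed.

```python
from collections import deque

# Hypothetical toy object graph: name -> names it directly references.
# C* are commits, T3 is a tree, B3 is a blob.
GRAPH = {
    "C3": ["C2b", "T3"],
    "C2b": ["C2a"],
    "C2a": ["C1"],
    "C1": [],
    "T3": ["B3"],
    "B3": [],
}


def one_hop(graph, promisor_pack):
    """Current definition: in a promisor pack, or one hop from such an object."""
    result = set(promisor_pack)
    for obj in promisor_pack:
        result.update(graph[obj])
    return result


def reachable(graph, promisor_pack):
    """Concept (2): everything reachable from an object in a promisor pack."""
    seen = set(promisor_pack)
    queue = deque(promisor_pack)
    while queue:
        obj = queue.popleft()
        for ref in graph[obj]:
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return seen


# "Bad state": promisor commit C3 is a child of locally packed C2b/C2a.
bad_pack = {"C3", "T3", "C1"}
# Assumed fixed state: the local commits were moved into the promisor pack.
good_pack = {"C3", "T3", "C2b", "C2a", "C1"}

only_reachable = reachable(GRAPH, bad_pack) - one_hop(GRAPH, bad_pack)
same_when_fixed = one_hop(GRAPH, good_pack) == reachable(GRAPH, good_pack)
```

In this toy case only C2a falls outside the one-hop set in the bad
state, and the two sets match in the fixed state.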

As for (1), we do use that concept in Git, although it's limited to the
repack during GC (or maybe there are others that I don't recall), so the
concept doesn't have a widely-used name like "promisor object".

[1] https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@xxxxxxxxxx/

> > Garbage Collection repack
> > -------------------------
> > Not yet implemented.
> >
> > Same concept as “fetch repack”, but happens during garbage collection
> > instead. The traversal is more expensive since we no longer have access
> > to what was recently fetched so we have to traverse through all promisor
> > packs to collect tips of “bad” history.
> 
> In other words, with the status quo, "git gc" that attempts to
> repack "objects in promisor packs" and "other objects that did not
> get repacked in the step that repack objects in promisor packs"
> separately, it implements the latter in a buggy way and discards
> some objects.  And fixing that bug by doing the right thing is
> expensive.
> 
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"?  Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them?  IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?

I believe the re-fetch didn't happen because the objects were accessed
from a command that runs with fetch_if_missing=0. (But even if we decide
that we shouldn't use fetch_if_missing, and then change all commands to
not use it, the performance issue still remains, so we should still fix
it.)

> > Cons: Packing local objects into promisor packs means that it is no
> > longer possible to detect if an object is missing due to repository
> > corruption or because we need to fetch it from a promisor remote.
> 
> Is this true?  Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption?  After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?
> 
> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again?  Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in?  Or are we failing to refetch them
> for some reason?

This is under the "repack all" option, which repacks all objects
(wherever they came from) into promisor packs. Suppose we locally
created commit A and then its child commit B, and the repo got corrupted
so that we lost A. Once everything has been repacked into promisor
packs, B is in a promisor pack, so a missing A looks like any other
promised object and we could never detect that its loss is a problem.
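
A minimal sketch of that scenario, assuming the one-hop rule (a missing
object counts as promised if something in a promisor pack references it
directly). classify_missing and the tiny graph are hypothetical and only
model the bookkeeping, not Git's actual object lookup.

```python
def classify_missing(missing, graph, promisor_pack):
    """Say how a missing object would be interpreted under the one-hop rule."""
    for obj in promisor_pack:
        if missing in graph.get(obj, []):
            # Some promisor object references it: assumed lazily fetchable.
            return "promised"
    # Nothing in a promisor pack vouches for it: this is corruption.
    return "corrupt"


# B is a locally created child of A; A itself has been lost.
GRAPH = {"B": ["A"]}

# Local objects kept out of promisor packs: the loss of A is detectable.
separate = classify_missing("A", GRAPH, promisor_pack=set())

# "Repack all": B now sits in a promisor pack, so A merely looks promised.
repack_all = classify_missing("A", GRAPH, promisor_pack={"B"})
```

With separate packs the loss of A is classified as corruption. After
"repack all" the same loss is indistinguishable from an object we simply
have not fetched yet.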