Re: Missing Promisor Objects in Partial Repo Design Doc

Calvin Wan <calvinwan@xxxxxxxxxx> · Tue, 8 Oct 2024 14:35:56 -0700

On Tue, Oct 1, 2024 at 7:54 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> True.  Will it become even worse, if a protocol extension Christian
> proposes starts suggesting a repository that is not lazy to add a
> promisor remote?  In such a set-up, perhaps all history leading to
> C2b down to the root are local, but C3 may have come from a promisor
> remote (hence in a promisor pack).

Yes, if we, and consequently Git, consider this state to be problematic.

> > Bad State Solutions
> > ===================
> >
> > Fetch negotiation
> > -----------------
> > Implemented at
> > https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@xxxxxxxxxx/
> >
> > During fetch negotiation, if a commit is not in a promisor pack and
> > therefore local, do not declare it as "have" so they can be fetched into
> > a promisor pack.
> >
> > Cost:
> > - Creation of set of promisor pack objects (by iterating through every
> >   .idx of promisor packs)
>
> What is "promisor PACK objects"?  Is it different from the "promisor
> objects" (i.e. what I called the useless definition above)?

Objects that are in promisor packs, that is, packs that have the
packed_git::pack_promisor flag set. However, since this design doc
was sent out, it has turned out that creating a set of promisor pack
objects in a large repository (such as Android or Chrome) is very
expensive, so this design is infeasible in my opinion.
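
As a rough illustration of where the cost comes from, here is a toy
model in Python (not Git's C code). The Pack class, the function names,
and the commit C1 are made up for this sketch; only the pack_promisor
flag and the C2/C3 names come from the discussion above. Building the
set touches every index entry of every promisor pack before a single
"have" can be filtered.

```python
from dataclasses import dataclass, field


@dataclass
class Pack:
    # Hypothetical stand-in for struct packed_git; only what the sketch needs.
    name: str
    pack_promisor: bool
    oids: list = field(default_factory=list)  # object IDs listed in the .idx


def promisor_pack_objects(packs):
    """Collect every object ID stored in a pack whose pack_promisor flag is set."""
    result = set()
    for pack in packs:
        if pack.pack_promisor:
            # Visits every entry of that pack's index: this is the expensive part.
            result.update(pack.oids)
    return result


def haves_to_advertise(candidate_commits, packs):
    """Keep only candidates that live in a promisor pack; local ones are withheld."""
    promised = promisor_pack_objects(packs)
    return [c for c in candidate_commits if c in promised]


packs = [
    Pack("pack-from-remote", True, ["C1", "C3"]),
    Pack("pack-local", False, ["C2", "C2a", "C2b"]),
]
# C2 is local, so it is not declared as "have" and would be refetched.
haves = haves_to_advertise(["C3", "C2", "C1"], packs)
```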

>
> > - Refetch number of local commits
> >
> > Pros: Implementation is simple, client doesn’t have to repack, prevents
> > state from ever occurring in the repository.
> >
> > Cons: Network cost of refetching could be high if many local commits
> > need to be refetched.
>
> What if we get into the same state by creating local C4, which gets
> to outside and on top of which C5 is built, which is now sitting at
> the tip of the remote history and we fetch from them?  In order to
> include C4 in the "promisor pack", we refrain from saying C4 is a
> "have" for us and refetch.  Would C2 be fetched again?
>
> I do not think C2 would be, because we made it an object in a
> promisor pack when we "fixed" the history for C3.
>
> So the cost will not grow proportionally to the depth of the
> history, which makes it OK from my point of view.

Correct, the cost of refetching is only a one-time cost, but
unfortunately the creation of the set of promisor pack objects isn't.

>
> > Garbage Collection repack
> > -------------------------
> > Not yet implemented.
> >
> > Same concept as “fetch repack”, but happens during garbage collection
> > instead. The traversal is more expensive since we no longer have access
> > to what was recently fetched so we have to traverse through all promisor
> > packs to collect tips of “bad” history.
>
> In other words, with the status quo, "git gc" that attempts to
> repack "objects in promisor packs" and "other objects that did not
> get repacked in the step that repack objects in promisor packs"
> separately, it implements the latter in a buggy way and discards
> some objects.  And fixing that bug by doing the right thing is
> expensive.
>
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"?  Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them?  IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?

My first thought is that from both the user and developer perspective,
we don't expect our reachable objects to be gc'ed, so all of the "bad
state" solutions try, in one way or another, to ensure that they
aren't. However, if it turns out that all of these solutions are much
more expensive and disruptive to the user than accepting that local
objects can be gc'ed and refetching them just in time, then the latter
seems much more palatable. It is inevitable that we take some
performance hit to fix this problem, and we may just have to accept
this as one of the costs of having partial clones to begin with.

>
> > Cons: Packing local objects into promisor packs means that it is no
> > longer possible to detect if an object is missing due to repository
> > corruption or because we need to fetch it from a promisor remote.
>
> Is this true?  Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption?  After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?

I should have been clearer: "detecting if an object is missing due to
repository corruption" refers to fsck, which currently does not have
the functionality to do that. We only "accidentally" discover the
corruption when we try to access the missing object, but we can
still fetch it from the promisor remote afterwards.
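
To make that concrete, here is a minimal sketch of the access path in
Python. It is a toy model, not Git's implementation: read_object,
ObjectMissing, and the two dictionaries standing in for the local
object store and the promisor remote are all assumptions made for
illustration. The point is that on a miss we cannot tell why the object
is absent, and we only find out it is truly gone if the fetch fails.

```python
class ObjectMissing(Exception):
    """Raised when an object is neither local nor available from the remote."""


def read_object(oid, local_store, promisor_remote):
    """Return an object's contents, fetching it lazily on a local miss."""
    if oid in local_store:
        return local_store[oid]
    # Local miss: corruption, gc, or "never had it" all look the same here.
    if oid in promisor_remote:
        local_store[oid] = promisor_remote[oid]  # just-in-time fetch
        return local_store[oid]
    raise ObjectMissing(oid)


remote = {"C2": "commit C2", "C3": "commit C3"}
local = {"C3": "commit C3"}  # C2 was discarded from the local store

recovered = read_object("C2", local, remote)
```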

> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again?  Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in?  Or are we failing to refetch them
> for some reason?

Yes, if such a feature existed, then it would be a feasible solution
for this issue (I'm leaning strongly towards this now after testing
out some of the other designs).