Han Young <hanyang.tony@xxxxxxxxxxxxx> writes:

> On Wed, Oct 9, 2024 at 5:36 AM Calvin Wan <calvinwan@xxxxxxxxxx> wrote:
> >
> > Objects that are in promisor packs, specifically the ones that have
> > the flag, packed_git::pack_promisor, set. However, since this design
> > doc was sent out, it turns out the creation of a set of promisor
> > pack objects in a large repository (such as Android or Chrome) is
> > very expensive, so this design is infeasible in my opinion.
>
> I wonder if a set of local loose/pack objects would be cheaper to
> construct? Normally loose objects are always non-promisor objects,
> unless the user is running something like `unpack-objects`.

We had a similar idea at $JOB. Note that you don't actually need to
create the set - when looking up an object using
oid_object_info_extended(), we know whether it's a loose object and, if
not, which pack it is in. The pack has a promisor bit that we can check
(see the first sketch at the end of this message).

Note that there is a possibility of a false positive. If the same
object is in two packs - one promisor and one non-promisor - I believe
there's no guarantee as to which pack will be preferred. So we may see
that the object is in a non-promisor pack, but there's no guarantee
that it's not also in a promisor pack (the second sketch at the end
shows what an exhaustive check would look like). For the
omit-local-commits-in-"have" solution, this is a fatal flaw (we
absolutely must guarantee that we don't send any promisor commits), but
for the repack-on-fetch solution, it's no big deal (we are looking for
objects to repack into a promisor pack, and repacking a promisor object
into a promisor pack is perfectly fine). For this reason, I think the
repack-on-fetch solution is the most promising one so far.

Loose objects are always non-promisor objects, yes. (I don't think a
user running `unpack-objects` counts - running a command directly on a
packfile in the .git directory is out of scope, I think.)

> > > After a lazy clone that omits a lot of objects acquires many
> > > objects over time by fetching missing objects on demand, wouldn't
> > > we want to have an option to "slim" the local repository by
> > > discarding some of these objects (the ones that are least
> > > frequently used), relying on the promise by the promisor remote
> > > that even if we did so, they can be fetched again? Can we treat
> > > loss of C2a/C2b/C2 as if such a feature prematurely kicked in? Or
> > > are we failing to refetch them for some reason?
> >
> > Yes, if such a feature existed, it would be feasible and a possible
> > solution for this issue (I'm leaning quite towards this now after
> > testing out some of the other designs).
>
> Since no partial clone filter omits commit objects, we always assume
> commits are available in the codebase. `merge` reports "cannot merge
> unrelated history" if one of the commits is missing, instead of trying
> to fetch it.
> Another problem is that the current lazy fetching code does not report
> "haves" to the remote, so a lazy fetch of a commit ends up pulling all
> the trees and blobs associated with that commit.
> I also prefer the "fetching the missing objects" approach; making sure
> the repo has all the "correct" objects is difficult to get right.

If I remember correctly, our intention (or, at least, my intention) in
not treating missing commits differently was to avoid limiting the
solutions we could implement. For example, we had the idea of
server-assisted merge-base computation - this and other features would
make it feasible to omit commits locally. That has not been
implemented, though.
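
To make the single-lookup check concrete, here is a minimal, untested
sketch against Git's internal API roughly as it stood around this
thread (header names, OBJECT_INFO_SKIP_FETCH_OBJECT, and the exact
object_info fields may have drifted in your version):

    #include "git-compat-util.h"
    #include "object-store.h"
    #include "packfile.h"

    /*
     * Returns 1 if the lookup found the object in a promisor pack,
     * 0 if it was found loose or in a non-promisor pack, and -1 if
     * the object is not present locally. Because only the first hit
     * is inspected, a 0 result does not prove the object is absent
     * from every promisor pack (the false positive discussed above).
     */
    static int found_in_promisor_pack(struct repository *r,
                                      const struct object_id *oid)
    {
            struct object_info oi = OBJECT_INFO_INIT;

            /* Don't trigger a lazy fetch just to classify the object. */
            if (oid_object_info_extended(r, oid, &oi,
                                         OBJECT_INFO_SKIP_FETCH_OBJECT))
                    return -1;

            if (oi.whence == OI_PACKED && oi.u.packed.pack->pack_promisor)
                    return 1;
            return 0;
    }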
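
And a sketch of the exhaustive check that the
omit-local-commits-in-"have" solution would need, scanning every
promisor pack instead of trusting the first hit. I'm assuming
packfile.h's get_all_packs() and find_pack_entry_one() here;
find_pack_entry_one() took raw hash bytes at the time of writing, so
the signature may differ in your version:

    #include "git-compat-util.h"
    #include "object-store.h"
    #include "packfile.h"

    /*
     * Returns 1 if the object exists in at least one promisor pack,
     * 0 otherwise. Unlike the single lookup above, this cannot miss a
     * promisor copy of the object, but it probes every promisor
     * pack's index, which is the kind of cost that made building a
     * full set of promisor objects too expensive on repositories like
     * Android or Chrome.
     */
    static int in_any_promisor_pack(struct repository *r,
                                    const struct object_id *oid)
    {
            struct packed_git *p;

            for (p = get_all_packs(r); p; p = p->next) {
                    if (!p->pack_promisor)
                            continue;
                    if (find_pack_entry_one(oid->hash, p))
                            return 1;
            }
            return 0;
    }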