Re: [PATCH 0/1] revision: fix reachable objects being gc'ed in no blob clone repo

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Mon, 12 Aug 2024 17:45:08 -0700

Han Young <hanyang.tony@xxxxxxxxxxxxx> writes:
> Here are the minimal steps to recreate issue.
[snip]

I think the following is what is happening. Before the final gc, the
repo looks as follows:

  commit  tree  blob
   C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
   |
   C2 ---- T2 -- B2 (created locally, in non-promisor pack)
   |
   C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

After the final gc, {C,T,B}3 and {C,T,B}1 are in a promisor pack, but
all of {C,T,B}2 are deleted because they are thought to be promisor
objects.

> The last `git gc` will error out on fsck with error message like this:
> 
>   error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94
>   error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94

I'm not sure how `git gc` (or `git fsck`) knows the name of this
object (what is the type of this object, and which object refers to
this object?) but I think that if we implement one of the solutions I
describe below, this problem will go away.

> `git gc` will call `git repack`, which will call `git pack-objects`
> twice on a partially cloned repo. The first call to pack-objects 
> combines all the promisor packfiles, and the second pack-objects 
> command packs all reachable non-promisor objects into a normal packfile.

Yes, this is what I remember.

> However, a bug in setup_revision caused some non-promisor objects 
> to be mistakenly marked as in promisor packfiles in the second call 
> to pack-objects.

I think they ({C,T,B}2 in the example above) should be considered as
promisor objects, actually. From the partial clone doc, it says of a
promisor object: "the local repository has that object in one of its
promisor packfiles, or because another promisor object refers to it".
C3 (in a promisor packfile) refers to C2, so C2 is a promisor object. It
refers to T2, which refers to B2, so all of them are promisor objects.
However, in the Git code, I don't think this definition is applied
recursively - if I remember correctly, we made the assumption (e.g.
in is_promisor_object() in packfile.c) that we only need to care about
objects in promisor packfiles and the objects they directly reference,
because how would we know what objects they indirectly reference?
But this does not cover the case in which we know what objects they
indirectly reference because we pushed them to the server in the first
place.

Solutions I can think of:

 - When fetching from a promisor remote, never declare commits that are
   not in promisor packfiles as HAVE. This means that we would refetch
   C2 and T2 as being in promisor packfiles. But it's wasteful in
   network bandwidth, and does not repair problems in Git repos created
   by Git versions that do not have this solution.

 - When fetching from a promisor remote, parse every object and repack
   any local objects referenced (directly or indirectly) into a promisor
   packfile. Also does not repair problems.

 - When repacking all objects in promisor packfiles, if any object they
   refer to is present in a non-promisor packfile, do a revwalk on that
   object and pack those objects too. The repack will probably be slower
   because each object now has to be parsed. The revwalks themselves
   probably will not take too long, since they can stop at known promisor
   objects.

(One other thing that might be considered is, whenever pushing to a
promisor remote, to write the pack that's pushed as a promisor packfile.
But I don't think this is a good idea - the server may not retain any
packs that were pushed.)