At $DAYJOB, we recently ran into a situation in which, through a bug (not necessarily in Git) [1], the object store of a partial clone was corrupted. In this particular case, the problem was exposed when "git gc" tried to expire reflogs, which calls repo_parse_commit(), which in turn triggers fetches of the missing commits.

We don't want to go to great lengths to improve the user experience in a relatively rare case caused by a bug in another program at the expense of the regular user experience, so this constrains the solution space. But I think there is a solution that works: if we have reason to believe that we are parsing a commit, we shouldn't lazy-fetch if it is missing. (I'm not proposing a hard guarantee that commits are never lazy-fetched; this just relatively increases resilience to object store corruption, and does not guarantee absolute defense.) I think that we can treat a missing commit as a sign of object store corruption in this case because currently, Git does not support excluding commits in partial clones.

There are other possible solutions, including passing an argument from "git gc" to "git reflog" to inhibit all lazy fetches, but I think that fix is at the wrong level - fixing "git reflog" means that this particular command works fine, or so we think (it will still fail if it somehow needs to read a legitimately missing blob, say, a .gitmodules file), whereas fixing repo_parse_commit() will fix a whole class of bugs.

A question remains of whether we would need to undo all this work if we decide to support commit filters in partial clones. Firstly, there are good arguments against (and, of course, for) commit filters in partial clones, so commit filters may not work out in the end anyway. Secondly, even if we do have commit filters, we at $DAYJOB think that we still need to differentiate, in some way, between a fetch that we have accounted for in our design and a fetch that we haven't; commit chains are much longer than tree chains, and users wouldn't want to wait for Git to fetch commit by commit (or segment by segment, if we end up batch-fetching commits, as we probably will). So we would be building on the defensiveness of not fetching commits in this case, not tearing it down.

My next step will be to send a patch modifying repo_parse_commit() to not lazy-fetch (a rough sketch of what I have in mind appears at the end of this mail), and I think that future work will lie in identifying when we know that we are reading a commit and inhibiting lazy fetches in those cases. If anyone has an opinion on this, feel free to let us know (hence the "RFC" in the subject).

[1] For the curious, we ran a script that ran "git gc" on a repo that had a symlink to itself configured as its alternate, which resulted in many objects being deleted.
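
Appendix: to make the repo_parse_commit() direction above more concrete, here is a rough sketch (not the final patch) of how the commit contents could be read without triggering a lazy fetch, by going through oid_object_info_extended() with the existing OBJECT_INFO_SKIP_FETCH_OBJECT flag instead of repo_read_object_file(). The helper name and the exact flag combination are illustrative only.

#include "git-compat-util.h"
#include "repository.h"
#include "object-store.h"

/*
 * Illustrative sketch: read a commit's contents without falling back to
 * a lazy fetch. OBJECT_INFO_SKIP_FETCH_OBJECT makes the object store
 * report the object as missing instead of fetching it from the promisor
 * remote, so a missing commit surfaces as an error (likely corruption)
 * rather than a fetch.
 */
static void *read_commit_no_lazy_fetch(struct repository *r,
				       const struct object_id *oid,
				       enum object_type *type,
				       unsigned long *size)
{
	void *buffer = NULL;
	struct object_info oi = { 0 };

	oi.typep = type;
	oi.sizep = size;
	oi.contentp = &buffer;

	if (oid_object_info_extended(r, oid, &oi,
				     OBJECT_INFO_LOOKUP_REPLACE |
				     OBJECT_INFO_SKIP_FETCH_OBJECT) < 0)
		return NULL; /* missing or unreadable: do NOT lazy-fetch */
	return buffer;
}

repo_parse_commit() (or its internal helper) would then call something like this instead of repo_read_object_file(), and report the usual "could not read <oid>" error when NULL is returned.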