Hi ZheNing,

first of all: thank you for working on this. In the past, I thought that
this feature would likely be something we would want to have in Git. But
Stolee's concerns are valid, and made me think about it more. See below
for a more detailed analysis.

On Thu, 1 Sep 2022, Derrick Stolee wrote:

> On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
>
> > [...]
> >
> > Disadvantages of git clone --filter=blob:none with git
> > sparse-checkout: The git client needs to send a lot of missing
> > objects' id to the server, this can be very wasteful of network
> > traffic.
>
> Asking for a list of blobs (especially limited to a sparse-checkout) is
> much more efficient than what will happen when a user tries to do almost
> anything in a repository formed the way you did here.

I agree. When you have all the commit and tree objects on the local
side, you can enumerate all the blob objects you need in one fell swoop,
then fetch them in a single network round trip.

When you lack tree objects, or worse, commit objects, this is not true.
You may very well need to fetch _quite_ a bunch of objects, then inspect
them to find out that you need to fetch more tree/commit objects, and
then do a couple more round trips, before you can enumerate all of the
objects you need.

Concrete example: let's assume that you clone git.git with a "partial
depth" of 50. That is, while cloning, each tip commit's history is
traversed up until the commits that are 49 edges away in the commit
graph. For example, v0.99~49 will be present locally after cloning, but
not v0.99~50.

Now, the first-parent depth of v0.99 is 955 (verify with `git rev-list
--count --first-parent v0.99`). None of the commits reachable from v0.99
other than the tip itself seem to be closer to any other tag, so all
commits reachable from v0.99~49 will be missing locally. And since
reverts are rare, we must assume that the vast majority of the
associated root tree objects are missing, too.
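To make the "single round trip" point concrete, here is a small sketch
(the throw-away upstream repository and all names in it are made up for
illustration, and it assumes a Git with partial-clone support): after a
`--filter=blob:none` clone, the commit and tree objects are all local,
so the missing blobs can be enumerated without talking to the server at
all, and then fetched in one batch.

```shell
# Sketch: set up a tiny throw-away upstream, then clone it with
# --filter=blob:none. All names here are made up for illustration.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/upstream.git"
git clone -q "$tmp/upstream.git" "$tmp/seed"
( cd "$tmp/seed" &&
  echo content >file.txt && git add file.txt &&
  git -c user.name=demo -c user.email=demo@example.com commit -qm initial &&
  git push -q origin HEAD )
# Allow filtered fetches from this (file://) upstream:
git -C "$tmp/upstream.git" config uploadpack.allowFilter true
git clone -q --no-checkout --filter=blob:none \
	"file://$tmp/upstream.git" "$tmp/partial"
# Commits and trees are present locally; only the blob is missing
# ("?<oid>"), and it can be enumerated entirely locally:
git -C "$tmp/partial" rev-list --objects --missing=print HEAD | grep '^?'
```

A subsequent `git checkout` (or `git sparse-checkout set`) would then
fetch exactly those enumerated blobs in one batch.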
Digging through history, a contributor might need to investigate where,
say, `t/t4100/t-apply-7.expect` was introduced (it was in v0.99~206),
because they found something looking like a bug and they need to read
the commit message to see whether it was intentional. They know that
this file was already present in v0.99. Naturally, the command-line to
investigate that is:

	git log --diff-filter=A v0.99 -- t/t4100/t-apply-7.expect

So what does Git do in that operation? It traverses the commits starting
from v0.99, following the chain along the commit parents. When it
encounters v0.99~49, it figures out that it has to fetch v0.99~50. To
see whether v0.99~49 introduced that file, it then has to inspect that
commit object and then fetch the tree object (v0.99~50^{tree}). Then,
Git inspects that tree to find out the object ID of v0.99~50^{tree}:t/,
sees that it is identical to v0.99~49^{tree}:t/, and therefore the
pathspec filter skips this commit from the output of the `git log`
command.

A couple of parent traversals later (always fetching the parent commit
object individually, then the associated tree object, then figuring out
that `t/` is unchanged), Git will encounter v0.99~55 where `t/` _did_
change. So now it also has to fetch _that_ tree object.

In total, we are looking at 400+ individual network round trips just to
fetch the required tree/commit objects, i.e. before Git can show you the
output of that `git log` command. And that's just for back-filling the
missing tree/commit objects.

If we had done this using a shallow clone, Git would have stopped at the
shallow boundary, and the user would have had a chance to increase the
depth in bigger chunks (probably first extending the depth by 50, then
maybe 100, then maybe going for 500). While that would have been a lot
of manual labor, the total time would still be a lot shorter than those
400+ network round trips (which likely would incur some throttling on
the server side).
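That shallow-clone workflow looks roughly like this sketch (a
throw-away local upstream stands in for the real server here; with a
real server you would clone its URL instead, and the chunk sizes would
be 50/100/500 rather than the tiny numbers used below):

```shell
# Sketch: deepen a shallow clone in chunks instead of back-filling
# history one object at a time. The upstream is made up for illustration.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/upstream"
for i in 1 2 3 4 5; do
	( cd "$tmp/upstream" &&
	  echo "$i" >f && git add f &&
	  git -c user.name=demo -c user.email=demo@example.com \
		commit -qm "commit $i" )
done
# The clone stops at the shallow boundary:
git clone -q --depth=2 "file://$tmp/upstream" "$tmp/shallow"
git -C "$tmp/shallow" rev-list --count HEAD    # 2
# Extend the history past the boundary in one bigger chunk, i.e. in a
# single round trip:
git -C "$tmp/shallow" fetch -q --deepen=2
git -C "$tmp/shallow" rev-list --count HEAD    # 4
```

Each `--deepen` step is one negotiated fetch, no matter how many
commits and trees it brings in.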
> Thinking about this idea, I don't think it is viable. I would need to
> see a lot of work done to test these scenarios closely to believe that
> this type of partial clone is a desirable working state.

Indeed, it is hard to think of a way in which the design could result in
anything but undesirable behavior, both on the client and the server
side.

We also have to consider that our experience with large repositories
demonstrates that tree and commit objects delta pretty well and are
virtually never a concern when cloning. It is always the sheer amount of
blob objects that causes the poor user experience when performing
non-partial clones of large repositories.

Now, I could be totally wrong in my expectation that there is _no_
scenario where cloning with a "partial depth" would result in anything
but poor performance. If I am wrong, then there is value in having this
feature; but since it causes undesirable performance in all cases I can
think of, it definitely should be guarded behind an opt-in flag.

Ciao,
Dscho