Re: [PATCH 0/3] list-object-filter: introduce depth filter

Derrick Stolee <derrickstolee@xxxxxxxxxx> wrote on Fri, Sep 2, 2022 at 03:24:
>
> On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
> > This patch lets partial clone have capabilities similar to those of the
> > shallow clone option git clone --depth=<depth>.
> ...
> > Now we can use git clone --filter="depth=<depth>" to omit all commits whose
> > depth is >= <depth>. This way, we get the advantages of both shallow clone
> > and partial clone: limiting the depth of commits while fetching other
> > objects on demand.
>
> I have several concerns about this proposal.
>
> The first is that "depth=X" doesn't mean anything after the first
> clone. What will happen when we fetch the remaining objects?
>

According to my test results, yes, it still downloads a large number
of commits.

Let me do a little test again:

$ git clone --filter=depth:2 git.git git
Cloning into 'git'...
remote: Enumerating objects: 4311, done.
remote: Counting objects: 100% (4311/4311), done.
remote: Compressing objects: 100% (3788/3788), done.

Let's see how many objects we have...
$ git cat-file --batch-check --batch-all-objects | grep blob | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    4098
$ git cat-file --batch-check --batch-all-objects | grep tree | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
     211
$ git cat-file --batch-check --batch-all-objects | grep commit | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
       2

$ git checkout HEAD~

This fetches nothing, because depth=2 already includes HEAD~.

$ git checkout HEAD~
remote: Enumerating objects: 198514, done.
remote: Counting objects: 100% (198514/198514), done.
remote: Compressing objects: 100% (68511/68511), done.
remote: Total 198514 (delta 128408), reused 198509 (delta 128406), pack-reused 0
Receiving objects: 100% (198514/198514), 77.07 MiB | 9.58 MiB/s, done.
Resolving deltas: 100% (128408/128408), done.
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (1/1), 14.35 KiB | 14.35 MiB/s, done.
remote: Enumerating objects: 198014, done.
remote: Counting objects: 100% (198014/198014), done.
remote: Compressing objects: 100% (68362/68362), done.
remote: Total 198014 (delta 128056), reused 198012 (delta 128055), pack-reused 0
Receiving objects: 100% (198014/198014), 76.55 MiB | 14.00 MiB/s, done.
Resolving deltas: 100% (128056/128056), done.
Previous HEAD position was 624a936234 Merge branch 'en/merge-multi-strategies'
HEAD is now at 014a9ea207 Merge branch 'en/t4301-more-merge-tree-tests'

It fetched a lot of objects... (in three rounds!)

$ git cat-file --batch-check --batch-all-objects | grep blob | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    4099
$ git cat-file --batch-check --batch-all-objects | grep tree | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    130712
$ git cat-file --batch-check --batch-all-objects | grep commit | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    67815

It fetched far too many commits and trees... but surprisingly, only one
more blob was downloaded.

I admit this is very bad behavior; it happens because we have almost no
commits locally...

Maybe one solution: we could also provide a commit-id parameter inside
the depth filter, like --filter="commit:014a9ea207, depth:1"... We could
clone with the blob:none filter to download all trees/commits, then
fetch blobs with this "commit-depth" filter... We could even provide a
more complex filter: --filter="commit:014a9ea207, depth:1, type=blob".
This may avoid downloading too many unneeded commits and trees:

git fetch --filter="commit:014a9ea207, depth:1, type=blob"
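
To make the workflow concrete, here is a sketch of what I have in mind
(only the first two commands work today; the commit-depth filter syntax
in the last one is hypothetical):

# Step 1: partial clone with the existing blob:none filter,
# which downloads all commits and trees but no blobs.
$ git clone --filter=blob:none git.git git
$ cd git

# Step 2 (hypothetical): backfill only the blobs reachable from one
# commit, without transferring any extra commits or trees.
$ git fetch origin --filter="commit:014a9ea207, depth:1, type=blob"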

If git fetch learned this filter, then git checkout or other commands
could use it internally as a heuristic:

e.g.

git checkout HEAD~
  if HEAD~ is missing, or most (say 75%) of the blobs/trees in HEAD~
  are missing -> use the "commit-depth" filter
  else -> use the blob:none filter

We can even make this commit-depth filter support multiple commits later.
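
For example, reusing the two commit IDs from the test above (again,
purely hypothetical syntax):

git fetch --filter="commit:014a9ea207, commit:624a936234, depth:1, type=blob"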

> Partial clone is designed to download a subset of objects, but make
> the remaining reachable objects downloadable on demand. By dropping
> reachable commits, the normal partial clone mechanism would result
> in a 'git rev-list' call asking for a missing commit. Would this
> inherit the "depth=X" but result in a huge amount of over-downloading
> the trees and blobs in that commit range? Would it result in downloading
> commits one-by-one, and then their root trees (and all reachable objects
> from those root trees)?
>

I don't know if it's possible to let git rev-list know that commits are
missing and stop downloading them (just like git cat-file --batch
--batch-all-objects does).

Similarly, git log or other commands could be taught to understand this...

Probably a config variable: fetch.skipmissingcommits...
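
For blobs and trees, git rev-list already has options in this direction,
which might serve as a model (whether the commit walk itself can be
taught the same trick is exactly what I am unsure about):

# Print missing objects with a '?' prefix instead of fetching them:
$ git rev-list --objects --missing=print HEAD

# Prefilter the traversal at the promisor boundary
# (documented as being for internal use):
$ git rev-list --objects --exclude-promisor-objects HEAD

The hypothetical config variable would then be set like:

$ git config fetch.skipmissingcommits true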

> Finally, computing the set of objects to send is just as expensive as
> if we had a shallow clone (we can't use bitmaps). However, we get the
> additional problem where fetches do not have a shallow boundary, so
> the server will send deltas based on objects that are not necessarily
> present locally, triggering extra requests to resolve those deltas.
>

Agreed. I think this may be a problem, but I don't see a good solution for it.

> This fallout remains undocumented and unexplored in this series, but I
> doubt the investigation would result in positive outcomes.
>
> > Disadvantages of git clone --depth=<depth> --filter=blob:none: we must call
> > git fetch --unshallow to lift the shallow clone restriction, and that will
> > download the entire history of the current commit.
>
> How does your proposal fix this? Instead of unshallowing, users will
> stumble across these objects and trigger huge downloads by accident.
>

As mentioned above, I would expect a commit-depth filter to fix this.

> > Disadvantages of git clone --filter=blob:none with git sparse-checkout: the
> > git client needs to send a lot of missing object IDs to the server, which
> > can be very wasteful of network traffic.
>
> Asking for a list of blobs (especially limited to a sparse-checkout) is
> much more efficient than what will happen when a user tries to do almost
> anything in a repository formed the way you did here.
>

Yes. Also as mentioned above, we could enable this filter only in
specific cases, e.g. when we have the commit but not all of its trees/blobs.

> Thinking about this idea, I don't think it is viable. I would need to
> see a lot of work done to test these scenarios closely to believe that
> this type of partial clone is a desirable working state.
>

Agreed.

> Thanks,
> -Stolee

Thanks for these reviews and criticisms; they make me think more :)

ZheNing Hu



