Hi ZheNing,

first of all: thank you for working on this. In the past, I thought that
this feature would likely be something we would want to have in Git. But
Stolee's concerns are valid, and made me think about it more. See below
for a more detailed analysis.

On Thu, 1 Sep 2022, Derrick Stolee wrote:

> On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
>
> > [...]
> >
> > Disadvantages of git clone --filter=blob:none with git
> > sparse-checkout: The git client needs to send a lot of missing
> > objects' id to the server, this can be very wasteful of network
> > traffic.
>
> Asking for a list of blobs (especially limited to a sparse-checkout) is
> much more efficient than what will happen when a user tries to do almost
> anything in a repository formed the way you did here.

I agree. When you have all the commit and tree objects on the local
side, you can enumerate all the blob objects you need in one fell swoop,
then fetch them in a single network round trip.

When you lack tree objects, or worse, commit objects, this is not true.
You may very well need to fetch _quite_ a bunch of objects, then inspect
them to find out that you need to fetch more tree/commit objects, and
then do a couple more round trips, before you can enumerate all of the
objects you need.

Concrete example: let's assume that you clone git.git with a "partial
depth" of 50. That is, while cloning, each tip commit's history is
traversed up until the commits that are 49 edges away in the commit
graph. For example, v0.99~49 will be present locally after cloning, but
not v0.99~50.

Now, the first-parent depth of v0.99 is 955 (verify with `git rev-list
--count --first-parent v0.99`). None of the commits reachable from v0.99
other than the tip itself seem to be closer to any other tag, so all
commits reachable from v0.99~49 will be missing locally. And since
reverts are rare, we must assume that the vast majority of the
associated root tree objects are missing, too.
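To make the "single round trip" point concrete, here is a small sketch
(the throw-away upstream repository and all names in it are made up for
illustration, and it assumes a Git with partial-clone support): after a
`--filter=blob:none` clone, the commit and tree objects are all local,
so the missing blobs can be enumerated without talking to the server at
all, and then fetched in one batch.

```shell
# Sketch: set up a tiny throw-away upstream, then clone it with
# --filter=blob:none. All names here are made up for illustration.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/upstream.git"
git clone -q "$tmp/upstream.git" "$tmp/seed"
( cd "$tmp/seed" &&
  echo content >file.txt && git add file.txt &&
  git -c user.name=demo -c user.email=demo@example.com commit -qm initial &&
  git push -q origin HEAD )
# Allow filtered fetches from this (file://) upstream:
git -C "$tmp/upstream.git" config uploadpack.allowFilter true
git clone -q --no-checkout --filter=blob:none \
	"file://$tmp/upstream.git" "$tmp/partial"
# Commits and trees are present locally; only the blob is missing
# ("?<oid>"), and it can be enumerated entirely locally:
git -C "$tmp/partial" rev-list --objects --missing=print HEAD | grep '^?'
```

A subsequent `git checkout` (or `git sparse-checkout set`) would then
fetch exactly those enumerated blobs in one batch.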
Digging through history, a contributor might need to investigate where,
say, `t/t4100/t-apply-7.expect` was introduced (it was in v0.99~206),
because they found something looking like a bug and they need to read
the commit message to see whether it was intentional. They know that
this file was already present in v0.99. Naturally, the command-line to
investigate that is:

	git log --diff-filter=A v0.99 -- t/t4100/t-apply-7.expect

So what does Git do in that operation? It traverses the commits starting
from v0.99, following the chain along the commit parents. When it
encounters v0.99~49, it figures out that it has to fetch v0.99~50. To
see whether v0.99~49 introduced that file, it then has to inspect that
commit object and then fetch the tree object (v0.99~50^{tree}). Then,
Git inspects that tree to find out the object ID of v0.99~50^{tree}:t/,
sees that it is identical to v0.99~49^{tree}:t/, and therefore the
pathspec filter skips this commit from the output of the `git log`
command.

A couple of parent traversals later (always fetching the parent commit
object individually, then the associated tree object, then figuring out
that `t/` is unchanged), Git will encounter v0.99~55 where `t/` _did_
change. So now it also has to fetch _that_ tree object.

In total, we are looking at 400+ individual network round trips just to
fetch the required tree/commit objects, i.e. before Git can show you the
output of that `git log` command. And that's just for back-filling the
missing tree/commit objects.

If we had done this using a shallow clone, Git would have stopped at the
shallow boundary, and the user would have had a chance to increase the
depth in bigger chunks (probably first extending the depth by 50, then
maybe 100, then maybe going for 500). While that would have been a lot
of manual labor, the total time would still be a lot shorter than those
400+ network round trips (which likely would incur some throttling on
the server side).
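That shallow-clone workflow looks roughly like this sketch (a
throw-away local upstream stands in for the real server here; with a
real server you would clone its URL instead, and the chunk sizes would
be 50/100/500 rather than the tiny numbers used below):

```shell
# Sketch: deepen a shallow clone in chunks instead of back-filling
# history one object at a time. The upstream is made up for illustration.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/upstream"
for i in 1 2 3 4 5; do
	( cd "$tmp/upstream" &&
	  echo "$i" >f && git add f &&
	  git -c user.name=demo -c user.email=demo@example.com \
		commit -qm "commit $i" )
done
# The clone stops at the shallow boundary:
git clone -q --depth=2 "file://$tmp/upstream" "$tmp/shallow"
git -C "$tmp/shallow" rev-list --count HEAD    # 2
# Extend the history past the boundary in one bigger chunk, i.e. in a
# single round trip:
git -C "$tmp/shallow" fetch -q --deepen=2
git -C "$tmp/shallow" rev-list --count HEAD    # 4
```

Each `--deepen` step is one negotiated fetch, no matter how many
commits and trees it brings in.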
> Thinking about this idea, I don't think it is viable. I would need to
> see a lot of work done to test these scenarios closely to believe that
> this type of partial clone is a desirable working state.

Indeed, it is hard to think of a way in which the design could result in
anything but undesirable behavior, both on the client and the server
side.

We also have to consider that our experience with large repositories
demonstrates that tree and commit objects delta pretty well and are
virtually never a concern when cloning. It is always the sheer amount of
blob objects that causes the poor user experience when performing
non-partial clones of large repositories.

Now, I could be totally wrong in my expectation that there is _no_
scenario where cloning with a "partial depth" would result in anything
but poor performance. If I am wrong, then there is value in having this
feature; but since it causes undesirable performance in all cases I can
think of, it definitely should be guarded behind an opt-in flag.

Ciao,
Dscho