On Sat, Jan 22, 2022 at 5:36 PM Junio C Hamano <gitster@xxxxxxxxx> wrote: > > Josh Steadmon <steadmon@xxxxxxxxxx> writes: > > > When cloning a repo with a --filter and with --recurse-submodules > > enabled, the partial clone filter only applies to > > the top-level repo. This can lead to unexpected bandwidth and disk > > usage for projects which include large submodules. For example, a user > > might wish to make a partial clone of Gerrit and would run: > > `git clone --recurse-submodules --filter=blob:5k > > https://gerrit.googlesource.com/gerrit`. However, only the superproject > > would be a partial clone; all the submodules would have all blobs > > downloaded regardless of their size. With this change, the same filter > > applies to submodules, meaning the expected bandwidth and disk savings > > apply consistently. > > > > Plumb the --filter argument from git-clone through git-submodule and > > git-submodule--helper, such that submodule clones also have the filter > > applied. > > > > This applies the same filter to the superproject and all submodules. > > Users who prefer the current behavior (i.e., a filter only on the > > superproject) would need to clone with `--no-recurse-submodules` and > > then manually initialize each submodule. > > Two concerns (I do not say "issues", because I honestly do not know > how much this will hurt in the future). > > - Obviously, this changes the end user experience. To users in the > scenario that motivated this change (described above), obviously > it is a change in a good way, and but I wonder if there are > workflows that are hurt and actually have to resort to the > workaround to preserve the current behaviour. > > - Passing the filter down to submodules means that the filter > settings are universal across projects. The current set of > filters, I do not think such an assumption is too bad. If 5k > blob is too large for the top-level superproject, it is OK for > the superproject to dictate that 5k blob is too large for any of > the submodules the superproject uses. But can we forever limit > the filter vocabulary to the ones that can sensibly be applied > recursively? If we had a filter that goes with pathnames > (e.g. "I only want src/ and test/ directories and nothing else > initially"), such a set of pathnames appropriate in the context > of the superproject is unlikely to apply to its submodules. Even > the existing "depth" filter is iffy, if a toplevel superproject > is fairly flat and one of the submodules has a directory > hierarchy that is ultra deep. This second item matches the concern I wanted to raise. I've wanted to do "sparse clones" for over a decade[*]. In my opinion, disconnected development should not require a full clone or constant network connectivity. We're inching closer to this goal: (1) partial clones provide some of the base ability for partial downloads augmented by updates as needed, (2) sparse-checkout makes the working tree handle a subset so we can work without downloading more, (3) cone-mode corrects the mistake of specifying paths via gitignore-style patterns, and (4) merge-ort has some huge wins for partial clones to avoid downloading objects by both avoiding unnecessary rename detections and avoiding traversing into unmodified directories. There's a number of next steps, of course, one of which is that I would really like to add a path-based filter (corresponding to the directories in the sparsity cone) so that the initial partial clone can get all commits, all trees, and the paths within the sparsity cone. But how would that interact with this patch? There's a bit of a conflict. There's a few ways out: * Make your change be explicit rather than implicit: Based on Junio's first concern, you could modify this patch so that it requires a new flag like --filter-submodules-too (or some better name), and perhaps folks with a path filter just wouldn't use that. * Make these incompatible: Maybe a path filter is incompatible with --recurse-submodules, and we should throw an error if both are specified. * Attempt to marry the two options: Each submodule could perhaps extract the subset of paths with itself as a leading directory and remove that leading prefix and then use that as the path portion of the filter. (And perhaps even taking this a step farther: each level of cloning will only recurse into submodules which match the specified paths). I'm inclined to prefer an explicit flag for this behavior, but I feel bad asking for that due to behavior and code that doesn't exist and isn't even being worked on yet. If your patch does go through as-is, I'd probably implement the make-path-filters-and-recurse-submodules-incompatible option when adding path based filters. [*] https://lore.kernel.org/git/AANLkTikJhSVJw2hXkp0j6XA+k-J-AtSYzKWumjnqqsgz@xxxxxxxxxxxxxx/, the old code may be dead but the dream isn't