Re: [PATCH] clone, submodule: pass partial clone filters to submodules

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Jan 22, 2022 at 5:36 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Josh Steadmon <steadmon@xxxxxxxxxx> writes:
>
> > When cloning a repo with a --filter and with --recurse-submodules
> > enabled, the partial clone filter only applies to
> > the top-level repo. This can lead to unexpected bandwidth and disk
> > usage for projects which include large submodules. For example, a user
> > might wish to make a partial clone of Gerrit and would run:
> > `git clone --recurse-submodules --filter=blob:5k
> > https://gerrit.googlesource.com/gerrit`. However, only the superproject
> > would be a partial clone; all the submodules would have all blobs
> > downloaded regardless of their size. With this change, the same filter
> > applies to submodules, meaning the expected bandwidth and disk savings
> > apply consistently.
> >
> > Plumb the --filter argument from git-clone through git-submodule and
> > git-submodule--helper, such that submodule clones also have the filter
> > applied.
> >
> > This applies the same filter to the superproject and all submodules.
> > Users who prefer the current behavior (i.e., a filter only on the
> > superproject) would need to clone with `--no-recurse-submodules` and
> > then manually initialize each submodule.
>
> Two concerns (I do not say "issues", because I honestly do not know
> how much this will hurt in the future).
>
>  - Obviously, this changes the end user experience.  To users in the
>    scenario that motivated this change (described above), obviously
>    it is a change in a good way, and but I wonder if there are
>    workflows that are hurt and actually have to resort to the
>    workaround to preserve the current behaviour.
>
>  - Passing the filter down to submodules means that the filter
>    settings are universal across projects.  The current set of
>    filters, I do not think such an assumption is too bad.  If 5k
>    blob is too large for the top-level superproject, it is OK for
>    the superproject to dictate that 5k blob is too large for any of
>    the submodules the superproject uses.  But can we forever limit
>    the filter vocabulary to the ones that can sensibly be applied
>    recursively?  If we had a filter that goes with pathnames
>    (e.g. "I only want src/ and test/ directories and nothing else
>    initially"), such a set of pathnames appropriate in the context
>    of the superproject is unlikely to apply to its submodules.  Even
>    the existing "depth" filter is iffy, if a toplevel superproject
>    is fairly flat and one of the submodules has a directory
>    hierarchy that is ultra deep.

This second item matches the concern I wanted to raise.  I've wanted
to do "sparse clones" for over a decade[*].  In my opinion,
disconnected development should not require a full clone or constant
network connectivity.  We're inching closer to this goal: (1) partial
clones provide some of the base ability for partial downloads
augmented by updates as needed, (2) sparse-checkout makes the working
tree handle a subset so we can work without downloading more, (3)
cone-mode corrects the mistake of specifying paths via gitignore-style
patterns, and (4) merge-ort has some huge wins for partial clones to
avoid downloading objects by both avoiding unnecessary rename
detections and avoiding traversing into unmodified directories.

There's a number of next steps, of course, one of which is that I
would really like to add a path-based filter (corresponding to the
directories in the sparsity cone) so that the initial partial clone
can get all commits, all trees, and the paths within the sparsity
cone.

But how would that interact with this patch?  There's a bit of a
conflict.  There's a few ways out:
  * Make your change be explicit rather than implicit: Based on
Junio's first concern, you could modify this patch so that it requires
a new flag like --filter-submodules-too (or some better name), and
perhaps folks with a path filter just wouldn't use that.
  * Make these incompatible: Maybe a path filter is incompatible with
--recurse-submodules, and we should throw an error if both are
specified.
  * Attempt to marry the two options: Each submodule could perhaps
extract the subset of paths with itself as a leading directory and
remove that leading prefix and then use that as the path portion of
the filter.  (And perhaps even taking this a step farther: each level
of cloning will only recurse into submodules which match the specified
paths).

I'm inclined to prefer an explicit flag for this behavior, but I feel
bad asking for that due to behavior and code that doesn't exist and
isn't even being worked on yet.  If your patch does go through as-is,
I'd probably implement the
make-path-filters-and-recurse-submodules-incompatible option when
adding path based filters.


[*] https://lore.kernel.org/git/AANLkTikJhSVJw2hXkp0j6XA+k-J-AtSYzKWumjnqqsgz@xxxxxxxxxxxxxx/,
the old code may be dead but the dream isn't



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux