Re: [PATCH 0/6] [RFC] partial-clone: add ability to refetch with expanded filter

"Robert Coup via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> If a filter is changed on a partial clone repository, for example from
> blob:none to blob:limit=1m, there is currently no straightforward way to
> bulk-refetch the objects that match the new filter for existing local
> commits. This is because the client will report commits as "have" during
> negotiation and any dependent objects won't be included in the transferred
> pack.

It sounds like a useful thing to have such a "refetch things"
option.
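
To make sure I follow the scenario (the URL and paths below are
made up by me for illustration):

    $ git clone --filter=blob:none https://example.com/repo.git
    $ cd repo
    $ # later, decide that blobs up to 1m should be present locally
    $ git fetch --filter=blob:limit=1m origin
    $ # existing commits are reported as "have" during negotiation,
    $ # so blobs reachable only from them never make it into the pack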

A lazy/partial clone is narrower than the full tree in the width
dimension, while a shallow clone is shallower than the full history
in the time dimension.  The latter already has the "--deepen"
support to say "the commits listed in my shallow boundary list may
claim that I already have these, but I actually don't have them;
please stop lying to the other side and refetch what I should have
fetched earlier".  I understand that this works in the other
dimension to "--widen" things?
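
In command-line terms, the parallel I have in mind is something
like the following, where the second command is my reading of what
this series proposes (not an option that exists today):

    $ git fetch --deepen=100 origin
    $ # extend history past the current shallow boundary

    $ git fetch --refilter --filter=blob:limit=1m origin
    $ # extend objects past the current partial-clone filter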

Makes me wonder how well these two features would work together
(or if they are mutually exclusive, that is also fine as a
starting point).

If you update the filter specification to make it narrower (e.g.
going from blob:limit=1m down to blob:limit=512k), would we
transfer nothing (which would be ideal), or would we end up
refetching everything that is smaller than 512k?

> This patch series proposes adding a --refilter option to fetch & fetch-pack
> to enable doing a full fetch with a different filter, as if the local has no
> commits in common with the remote. It builds upon cbe566a071
> ("negotiator/noop: add noop fetch negotiator", 2020-08-18).

I guess the answer to the last question is ...

> To note:
>
>  1. This will produce duplicated objects between the existing and newly
>     fetched packs, but gc will clean them up.

... it is not smart enough to ask them to exclude what we _ought_
to already have, by telling them what the _old_ filter spec was.
That's OK for a starting point, I guess.  Hopefully, at the end of
this operation, we would garbage collect the duplicated objects by
default (with an option to turn it off)?
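
For what it is worth, the manual clean-up I have in mind after such
a refetch would be along these lines (my illustration, not something
the series does by itself):

    $ git gc
    $ # repacks everything into a single pack, dropping the duplicate
    $ # copies left behind by the refetch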

>  2. This series doesn't check that there's a new filter in any way, whether
>     configured via config or passed via --filter=. Personally I think that's
>     fine.

In other words, a repository that used to be a partial clone can
become a full clone by using the option _and_ not giving any filter.
I think that is an intuitive enough behaviour and a natural
consequence of taking the feature to its extreme.  Compared to
making a full "git clone", fetching from the old local (and narrow)
repository into it, and then discarding the old one, it would not
have any performance or storage advantage, but it probably is more
convenient.
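
That manual alternative would be something like (paths made up by
me):

    $ git clone https://example.com/repo.git repo-full
    $ # salvage anything that exists only in the old narrow clone
    $ git -C repo-full fetch ../repo-narrow 'refs/*:refs/old/*'
    $ rm -rf repo-narrow

so being able to do it with a single fetch does sound nicer.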

>  3. If a user fetches with --refilter applying a more restrictive filter
>     than previously (eg: blob:limit=1m then blob:limit=1k) the eventual
>     state is a no-op, since any referenced object already in the local
>     repository is never removed. Potentially this could be improved in
>     future by more advanced gc, possibly along the lines discussed at [2].

OK.  That matches my reaction to 1. above.


