Hi Robert,

Documentation for the config setting is an acceptable solution! Apologies
for the late response -- I wanted to wait and see if anyone else on the
list had any last thoughts.

Also, I noticed you were hoping that Jonathan Tan could take a look at
your patch on the What's Cooking thread. Before I sent my first review
out, I discussed your patch with him, so he has been briefed.

Reviewed-by: Calvin Wan <calvinwan@xxxxxxxxxx>

On Thu, Mar 10, 2022 at 6:29 AM Robert Coup <robert@xxxxxxxxxxx> wrote:
>
> Hi,
>
> On Wed, 9 Mar 2022 at 21:32, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> >
> > The way I read Calvin's suggestion was that you won't allow such a
> > random series of "git fetch"es without updating the "this is the
> > filter that is consistent with the contents of this repository"
> > record, which will lead to inconsistencies. I.e.
> >
> >  - we must maintain the "filter that is consistent with the contents
> >    of this repository", which this series does not do, but we should.
>
> I don't think we should strive to keep this "consistency" --
>
> >  - the "--refetch" is unnecessary and redundant, as long as such a
> >    record is maintained; when a filter setting changes, we should
> >    do the equivalent of "--refetch" automatically.
>
> -- we don't know how much data has been pulled in by fetches from
> different promisor and non-promisor remotes (past & present), or
> dynamically faulted in through branch switching or history
> exploration. And I can't see any particular benefit in attempting to
> keep track of that.
>
> Ævar suggested that in future we might be able to figure out which
> commits a user definitively has all the blobs & trees for, and refetch
> could negotiate from that position to improve efficiency: nothing in
> this series precludes such an enhancement.
>
> > ... isn't "git fetch --filter" that does not update the configured
> > filter (and does not do a refetch automatically) a bug that made the
> > "refetch" necessary in the first place?
>
> I don't believe it's a bug. A fairly obvious partial clone example
> I've used before, on repos where I want the commit history but not all
> the associated data (especially when the history is littered with
> giant blobs I don't care about):
>
> git clone example.com/myrepo --filter=blob:none
> # does a partial clone with no blobs
> # checkout faults in the blobs present at HEAD in bulk to populate
> # the working tree
> git config --unset remote.origin.partialclonefilter
> # going forward, future fetches include all associated blobs for new commits
>
> Getting all the blobs for all history is something I'm explicitly
> trying not to do in this example, but if the next fetch from origin
> automatically did a "refetch" after I removed the filter, that's
> exactly what would happen.
>
> We don't expect users to update `diff.algorithm` in config to run a
> minimal diff: using the `--diff-algorithm=` option on the command line
> overrides the config. The same philosophy applies to fetch:
> `remote.<name>.partialclonefilter` provides the default filter for
> fetches, and a user can override it via `git fetch --filter=`. To me
> this is how Git commands are expected to work.
>
> Partial clones are still relatively new and advanced, and I don't
> believe we should try to over-predict what the correct behaviour is
> for a user.
>
> I'd be happy adding something to the documentation for the
> `remote.<name>.partialclonefilter` config setting to explain that
> changing or removing the filter won't backfill the local object DB and
> the user would need `fetch --refetch` for that.
>
> Thanks,
> Rob :)
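
To make that documentation note concrete, here's a rough sketch of the
sequence I'd expect it to cover (reusing your example.com/myrepo example
and assuming the remote is named "origin"; treat the exact commands and
wording as a suggestion only):

git clone example.com/myrepo --filter=blob:none
# partial clone: commit history is present, but most blobs are omitted
git config --unset remote.origin.partialclonefilter
# future fetches now bring in full objects, but only for new commits;
# blobs missing from the existing history are still not fetched
git fetch --refetch origin
# explicitly backfills the local object DB so the existing history is
# complete under the new (unset) filter

Feel free to adjust -- it's only meant as a starting point for the doc
text.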