Re: [PATCH v3 0/7] fetch: add repair: full refetch without negotiation

Hi,

On Wed, 9 Mar 2022 at 21:32, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> The way I read Calvin's suggestion was that you won't allow such a
> random series of "git fetch"es without updating the "this is the
> filter that is consistent with the contents of this repository"
> record, which will lead to inconsistencies.  I.e.
>
>  - we must maintain the "filter that is consistent with the contents
>    of this repository", which this series does not do, but we should.

I don't think we should strive to keep this "consistency" —

>  - the "--refetch" is unnecessary and redundant, as long as such a
>    record is maintained; when a filter settings changes, we should
>    do the equivalent of "--refetch" automatically.

— we don't know how much data has been pulled in by fetches from
different promisor and non-promisor remotes (past & present); or
dynamically faulted in through branch switching or history
exploration. And I can't see any particular benefit in attempting to
keep track of that?

Ævar suggested that in future we might be able to work out which
commits the user definitively has all the blobs & trees for, so that
refetch could negotiate from that position to improve efficiency;
nothing in this series precludes such an enhancement.

> ... isn't "git fetch --filter" that does not update the configured
> filter (and does not do a refetch automatically) a bug that made the
> "refetch" necessary in the first place?

I don't believe it's a bug. Here's a fairly common partial-clone
workflow I've used before on repos where I want the commit history but
not all of the associated data (especially when the history is
littered with giant blobs I don't care about):

  git clone --filter=blob:none example.com/myrepo
  # does a partial clone with no blobs
  # checkout faults in the blobs present at HEAD in bulk to populate
  # the working tree
  git config --unset remote.origin.partialclonefilter
  # going forward, future fetches include all associated blobs for new commits

Fetching all the blobs for all of history is something I'm explicitly
trying not to do in this example, but if the next fetch from origin
automatically did a "refetch" after I removed the filter, that's
exactly what would happen.
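
If I ever do change my mind and want the complete history locally,
opting in explicitly is a one-liner with this series applied (the
filter value below is purely illustrative):

  # deliberately backfill everything the old filter skipped,
  # without negotiating against what we already have locally
  git fetch --refetch origin
  # or refetch under a different filter in the same step
  git fetch --refetch --filter=blob:limit=1m origin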

We don't expect users to update `diff.algorithm` in their config just
to run a one-off minimal diff: passing `--diff-algorithm=` on the
command line overrides the config. The same philosophy applies to
fetch: `remote.<name>.partialclonefilter` provides the default filter
for fetches, and a user can override it for a single fetch with
`git fetch --filter=`. To me this is how Git commands are expected to
work.
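
Concretely, the parallel looks like this (the filter values are picked
purely for illustration):

  # configured defaults ...
  git config diff.algorithm patience
  git config remote.origin.partialclonefilter blob:none
  # ... and one-off command-line overrides that leave the config untouched
  git diff --diff-algorithm=minimal
  git fetch --filter=blob:limit=1m origin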

Partial clones are still relatively new and advanced, and I don't
believe we should try to over-predict what the correct behaviour is
for a user.

I'd be happy to add something to the documentation for the
`remote.<name>.partialclonefilter` config setting explaining that
changing or removing the filter won't backfill the local object DB,
and that the user would need `fetch --refetch` for that.
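
Something along these lines, perhaps (wording is just a first stab):

  Changing or clearing this filter only affects future fetches; objects
  that earlier filtered fetches omitted are not backfilled. Run
  `git fetch --refetch` to fetch the complete history again under the
  new filter settings.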

Thanks,
Rob :)



