Re: [PATCH 0/3] Implement filtering repacks

Christian Couder <christian.couder@xxxxxxxxx> · Thu, 20 Oct 2022 13:23:02 +0200

On Fri, Oct 14, 2022 at 6:46 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
> > For example one might want to clone with a filter to avoid too much
> > space being taken by some large blobs, and one might realize after
> > some time that a number of the large blobs have still been downloaded
> > because some old branches referencing them were checked out. In this
> > case a filtering repack could remove some of those large blobs.
> >
> > Some of the comments on the patch series that John sent were related
> > to the possible data loss and repo corruption that a filtering repack
> > could cause. It's indeed true that it could be very dangerous, and we
> > agree that improvements were needed in this area.
>
> The wish is understandable, but I do not think this gives a good UI.
>
> This feature is, from an end-user's point of view, very similar to
> "git prune-packed", in that we prune data that is not necessary due
> to redundancy.  Nobody runs "prune-packed" directly; most people are
> even unaware of it being run on their behalf when they run "git gc".

I am OK with adding the --filter option to `git gc`, or a config
option with a similar effect. I wonder how `git gc` should implement
that option though.

If we implement a new command called for example `git filter-packed`,
similar to `git prune-packed`, then this new command will call `git
pack-objects --filter=...`.

`git gc` is already running `git repack` under the hood in a number of
cases though. So running `git gc --filter=...` would in many cases
call `git pack-objects` twice, as it would call it once through `git
repack` and once through `git filter-packed`. Or am I missing something
here?

If on the other hand --filter was implemented in some way in `git
repack`, then `git gc --filter=...` could just call `git repack` once.

So even if the new feature should be run only through `git gc` and
perhaps a new command possibly called `git filter-packed`, I think it
might make sense for efficiency to implement it in some way, perhaps
as an undocumented option or flag, in `git repack`.

> Reusing pack-objects as an underlying mechanism is OK, but this
> needs to be plumbed through to "git gc" for a more consistent
> experience for the end users.

It seems to me that `git prune-packed` only removes loose objects that
are also present in pack files. So there is no risk of losing data or
corrupting the repo.

Instead, the new feature could in some cases lose data and corrupt the
repo if some removed objects haven't yet been pushed. So on the client
side, it seems dangerous to me to make it run automatically without a
check that everything has been pushed.
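
To make the kind of check I have in mind more concrete, here is a
minimal sketch in Python. It is only an illustration, not part of the
patch series: the helper names (`parse_rev_list`, `unpushed_commits`,
`safe_to_filter`) are hypothetical, and it assumes that "pushed" can be
approximated by "reachable from some remote-tracking ref". That
approximation ignores stale remote-tracking refs, stashes, reflog-only
commits, a detached HEAD, and objects that were never committed, so a
real check would need to be stricter.

```python
from __future__ import annotations

import subprocess


def parse_rev_list(output: str) -> list[str]:
    """Return the commit IDs found in `git rev-list` output, one per line."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def unpushed_commits(repo: str = ".") -> list[str]:
    """List commits reachable from local branches or tags but from no
    remote-tracking ref (assumption: those are the unpushed ones)."""
    result = subprocess.run(
        ["git", "-C", repo, "rev-list",
         "--branches", "--tags", "--not", "--remotes"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_rev_list(result.stdout)


def safe_to_filter(unpushed: list[str]) -> bool:
    """A filtering repack is considered safe only if nothing is unpushed."""
    return not unpushed


# Hypothetical usage in a wrapper script:
#   if not safe_to_filter(unpushed_commits("/path/to/repo")):
#       raise SystemExit("unpushed commits found, refusing to filter")
```

A wrapper or cron script could run such a check before invoking the
filtering repack, and skip the filtering step when it fails.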

Unfortunately some users might already run `git gc` automatically, in
cron scripts for example, and they might be tempted to just add
`--filter=...` to their `git gc` script, or to set up a config option
with a similar effect, without always properly checking that
everything has been pushed.

So I am OK with trying to make the experience consistent, but I would
worry that it would let people shoot themselves in the foot too
easily.

I feel that an obscure `git repack` option would be less likely to be
run automatically.

> Is there a way to check if the "promisor remote" is still willing to
> keep the previous promise it made, so that the users do not have to
> see "we may corrupt the repository as the result of this operation,
> you have been warned", by the way?  Possibly with a protocol
> extension?
>
> In a sense, once you made a partial clone, your repository is at the
> mercy of the remote.  They can disappear any time with the majority of
> the data you depend on, leaving only what you created locally and
> haven't pruned away, in a repository that may technically pass
> "fsck", because the things that are supposed to exist locally
> exist, but may not be usable in practice.

Yeah, when a user clones using --filter=..., the remote can disappear
anytime, and we haven't been very worried about that.

> So from that point of
> view, a simple check that asks "I earlier fetched from you with this
> filter and these tips of histories; are you still willing to support
> me?" and gets yes/no answer might be sufficient.  A remote that is
> not trustworthy can say "yes" and still change their mind later, so
> such a check may not even be needed.

Yeah, or a remote that is using some kind of high availability system
underneath might consider that it's too expensive and useless to check
if everything is properly saved everywhere, as the underlying system
has been designed for that purpose and already runs enough internal
checks.

> The above two big paragraphs is a way to say that I am not all that
> worried about losing objects that we should be able to refetch again
> by adding this feature.

I agree. I think it's more important to worry about objects that might
have been added locally to the repo, but might not have been pushed
somewhere else (yet).

> The perceived need for "--force" or "must
> run from terminal" may be overblown.  I do not think this negatively
> affects correctness or robustness at all (as long as the pruning is
> not buggy, of course).

I am OK with removing the "must run from terminal" and "--force"
checks if we consider that people using this feature should know what
they are doing.

> HOWEVER
>
> Unlike prune-packed, pruning objects that could be refetched has
> negative performance impact.  So adding an option to enable/disable
> such pruning is needed.

I think a command line option like `--filter=...` is what makes it the
most obvious that something special is going on, compared to a config
option.

> If the frontmost UI entry point were "gc",
> it needs an opt-in option to invoke the "filtering repack", in other
> words.  "pack-objects" should not need any more work than what you
> have here (with the "terminal" and "force" discarded), as "--filter"
> is such an opt-in option already.

Yeah. So to sum up, it looks like you are OK with `git gc
--filter=...`, which is fine for me, even if I wonder if `git repack
--filter=...` could be a good first step, as it is less likely to be
used automatically (so safer in a way) and it might be better for
implementation-related performance reasons.