Re: [PATCH 6/9] repack: add `--filter=<filter-spec>` option

On Fri, Jun 16, 2023 at 2:43 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
> > After cloning with --filter=<filter-spec>, for example to avoid
> > getting unneeded large files on a user machine, it's possible
> > that some of these large files still get fetched for some reasons
> > (like checking out old branches) over time.
> >
> > In this case the repo size could grow too much for no good reason and a
> > way to filter out some objects would be useful to remove the unneeded
> > large files.
>
> Makes sense.
>
> If we repack without these objects, when the repository has a
> promisor remote, we should be able to rely on that remote to supply
> them on demand, once we need them again, no?

Yeah, sure.

> > Deleting objects right away could corrupt a repo though,...
>
> Hmph, could you elaborate why it is the case?  Isn't it the whole
> point to have promisor remote and use a lazy clone with the --filter
> option, so that objects that _ought_ to exist from connectivity's
> point of view _are_ allowed to be missing because the promisor
> promises to make them available on-demand?
>
>         Side note: I think I know the answer. While trying to remove
>         UNNEEDED large files, doing so may discard NEEDED large
>         files when done carelessly (e.g. the file may have been
>         created locally and haven't been pushed back).

Yeah, right.

>         But (1) if
>         that is the problem, perhaps we should be more careful in
>         the first place?

Yeah, but in an earlier version of this series `repack --filter=...`
removed the filtered-out objects outright, and saying in the docs that
one should be very careful, plus implementing safeguards, didn't seem
safe enough for reviewers. They said the feature would still give
users too easy a way to shoot themselves in the foot.

>        (2) if it inherently is impossible to tell
>         which ones are unneeded reliably, the reason why it is
>         impossible, and the reason why "try sifting into two bins,
>         one that we _think_ are unneeded and another for the rest,
>         and verify what we _thought_ are unneeded are all available
>         from the promisor remote" is the best we can do, must be
>         described, I think.

You mean described in the `repack --filter=` doc? Yeah, I can describe
this use case in the doc, but see below.

> > ... so it might be
> > better to put those objects into a separate packfile instead of
> > deleting them. The separate pack could then be removed after checking
> > that all the objects in it are still available on a promisor remote it
> > can access.
>
> Surely, sifting the objects into two bins (i.e. those that we
> wouldn't have received if we cloned from the promisor remote just
> now, which are prunable, and those that we cannot lose because the
> promisor remote would not have them, e.g. we created them and have
> not pushed them to the remote yet) without removing anything would
> be safe, but if the result of such sifting must be verified, doesn't
> it indicate that the sifting step was buggy or misdesigned?

It might rather indicate that we prefer to be safe, to do things in
separate steps, and to not provide an easy way for users to shoot
themselves in the foot. For example, the following sequence of steps
seems pretty safe (a rough script sketch is given after the list):

  1) put all the objects we think should be on the promisor remote in
a separate packfile
  2) start checking that each object in that packfile is available on
the promisor remote
  3) if an object in that packfile isn't on the promisor remote, try
to send it there
  4) if we couldn't send the object, error out
  5) if we haven't errored out after checking all the objects in the
packfile, it means all these objects are now available from the
promisor remote and we can safely delete the packfile
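
To make steps 2) to 5) concrete, here is a rough, untested sketch in
shell. The pack name, the refs/tmp/* namespace, the offload-by-push
mechanism, and the assumption that the promisor remote allows fetching
arbitrary objects by ID (uploadpack.allowAnySHA1InWant) are all
illustrative, not something this series implements:

  #!/bin/sh
  # Rough sketch only, not part of the series. $SEPARATE_PACK and the
  # refs/tmp/* namespace are made-up names, and the availability probe
  # assumes the promisor remote allows fetching arbitrary objects by
  # their ID (uploadpack.allowAnySHA1InWant).

  SEPARATE_PACK=.git/objects/pack/pack-filtered   # hypothetical name
  PROMISOR=origin

  # Probe the promisor remote from a scratch repo, so that our local
  # copy of an object cannot mask its absence on the remote.
  scratch=$(mktemp -d)
  git init --quiet --bare "$scratch"
  url=$(git remote get-url "$PROMISOR")

  # Step 2: check that each object in the separate packfile is
  # available on the promisor remote.
  git show-index <"$SEPARATE_PACK.idx" | awk '{ print $2 }' |
  while read oid
  do
      if ! git -C "$scratch" fetch --quiet "$url" "$oid"
      then
          # Steps 3 and 4: try to make the object available on the
          # promisor remote, e.g. by pushing a temporary ref to it,
          # and error out if that fails.
          git push "$PROMISOR" "$oid:refs/tmp/$oid" || exit 1
      fi
  done || exit 1
  rm -rf "$scratch"

  # Step 5: every object is now known to be retrievable from the
  # promisor remote, so the separate packfile can safely be deleted.
  rm -f "$SEPARATE_PACK.pack" "$SEPARATE_PACK.idx"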

The above steps can be performed while new objects are being created
in the repo, fetched into it, or pushed into it. And, at least for
now, they would be performed by a custom script, so users writing and
installing such a script should know what they are doing and would
hopefully not complain that we provided an easy way for them to shoot
themselves in the foot.

If we don't even document the above in the --filter=... doc, it
becomes even less likely that users will try this at all, let alone
get their script wrong. So even though I could document it in version
2, I am not sure I should.

>  It does
> not sound like a very good justification to save them in a separate
> packfile.  It does smell somewhat similar to the cruft packs but not
> really (the choice over there is between exploding to loose and
> keeping in a pack, and never involves loss of objects).

If we are still worried about possible loss of objects, I am OK with
not mentioning use cases involving such a loss at all.

> > Also splitting a packfile into 2 packs depending on a filter could be
> > useful in other usecases. For example some large blobs might take a lot
> > of precious space on fast storage while they are rarely accessed, and
> > it could make sense to move them in a separate cheaper, though slower,
> > storage.
>
> This one, outside the context of partial clone client, does make
> tons of sense.

OK, so perhaps that use case alone is enough to justify this feature
and patch series, and I can just avoid talking about the other use
cases at all?

> I guess what I suspect is that this option, while it would be very
> useful for the "in other usecases" scenario above, may not become
> all that useful in the "our lazy clone got bloated and we want to
> trim objects we know we can retrieve from the promisor remote again
> if necessary" scenario, until the repack machinery learns to use an
> extra piece of information (namely "these are objects that we can
> fetch from the promisor remote") at the same time.

Yeah, perhaps we should wait until a command, a repack option, or
some helper scripts can perform steps 2) to 4) (or 2) to 5)) above
before talking about use cases involving a promisor remote.

On the other hand, it's possible to imagine steps other than steps 2)
to 4) described above. For example, if we want to repack on a server
where new large blobs can hardly be created locally, and where a
receive hook automatically sends all large blobs to a promisor remote
as soon as they are received, we might not need steps 3) and 4) to
send objects to the promisor remote. Just checking that they are on
the promisor remote might be enough (a rough sketch of such a hook
follows).
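
Purely as an illustration of that scenario, a post-receive hook on
such a server might look roughly like the sketch below; the size
threshold, the "promisor" remote name, the refs/tmp/* namespace and
the push-based offload are all assumptions, and handling of ref
creation and deletion (all-zero object IDs) is omitted:

  #!/bin/sh
  # Hypothetical post-receive hook sketch, not part of this series.
  # Reads "<old> <new> <ref>" lines from stdin and offloads newly
  # received large blobs, e.g. by pushing temporary refs to them.

  THRESHOLD=1048576    # 1 MiB, arbitrary
  PROMISOR=promisor    # assumed remote name

  while read old new ref
  do
      # List objects reachable from the new tip but not from the old
      # one, and keep only blobs above the size threshold.
      git rev-list --objects "$new" --not "$old" |
      awk '{ print $1 }' |
      git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname)' |
      while read type size oid
      do
          if test "$type" = blob && test "$size" -ge "$THRESHOLD"
          then
              git push "$PROMISOR" "$oid:refs/tmp/$oid"
          fi
      done
  done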

Also, even if we think we should have features covering all five
steps, should we cover all the ways blobs could be sent to the
promisor remote as part of step 3)? Some people or server platforms
might want to use Git for that, but others might prefer, for example,
FTP or plain HTTP(S), so that a transfer can be restarted if it
fails.

So should we really wait until all such possible use cases are
covered by some features or scripts, or not? When does it become OK
to talk about this? And then how much is it OK to talk about?

> > This commit implements a new `--filter=<filter-spec>` option in
> > `git repack` that moves filtered out objects into a separate pack.
> >
> > This is done by reading filtered out objects from `git pack-objects`'s
> > output and piping them into a separate `git pack-objects` process that
> > will put them into a separate packfile.
>
> So, for example, you may say "no blobs" in the filter, and while
> packing the local repository with the filter, resulting in a pack
> that exclude all blobs, we will learn what blob objects we did not
> pack into that packfile.  We can pack them into a separate one, and
> most of the blobs are what we could retrieve again from the promisor
> remote, but some of the blobs are what we locally created ourselves
> and haven't pushed back to the promisor remote yet.  Now what?  My
> earlier suspicion that this mechanism may not be all that useful for
> the "slim bloated lazy clone" comes from that I cannot think of a
> good answer to this "Now what?" question---my naive solution would
> involve enumerating the objects in that "separate packfile" that is
> a mixture of precious ones and expendable ones, and then learning
> which ones are precious, and creating a new pack that is a subset of
> that "separate packfile" with only the precious ones.  But if I do
> so, I do not think we need this new mechanism that seems to go only
> the half-way.

I hope steps 2) to 5) above and the related explanations are a good
answer to the "Now what?" question.

Thanks,
Christian.



