On Wed, May 15, 2024 at 7:09 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
> > From: Christian Couder <chriscool@xxxxxxxxxxxxx>
> >
> > In case some objects are missing from a server, it might still be
> > useful to be able to fetch or clone from it if the client already has
> > the missing objects or can get them in some way.
>
> Be more assertive. We do not want to add a new feature randomly
> only because it _might_ be useful to somebody in a strange and
> narrow use case that _might_ exist.

Ok, I have changed "it might still be useful" to "it is sometimes
useful" in the v3 I just sent.

> > For example, in case both the server and the client are using a
> > separate promisor remote that contain some objects, it can be better
> > if the server doesn't try to send such objects back to the client, but
> > instead let the client get those objects separately from the promisor
> > remote. (The client needs to have the separate promisor remote
> > configured, for that to work.)
>
> Is it "it can be better", or is it "it is always better"? Pick an
> example that you can say the latter to make your example more
> convincing.
>
> Repository S borrows from its "promisor" X, and repository C which
> initially cloned from S borrows from its "promisor" S. Even if C
> wants an object in order to fill in the gap in its object graph, S
> may not have it (and S itself may have no need for that object), and
> in such a case, bypassing S and let C go directly to X would make
> sense.

Ok, I use your example above in v3.

> I am puzzled by this new option.
>
> It feels utterly irresponsible to give an option to set up a server
> that essentially declares: I'll serve objects you ask me as best
> efforts basis, the pack stream I'll give you may not have all
> objects you asked for and missing some objects, and when I do so, I
> am not telling you which objects I omitted.

I don't think it's irresponsible.
The client anyway checks that it got something usable, in the same
way as it does when it performs a partial fetch or clone. The fetch
or clone fails if that's not the case. For example, if the checkout
part of a clone needs some objects but cannot get them, the whole
clone fails.

> How do you ensure that a response with an incomplete pack data would
> not corrupt the repository when the sending side configures
> upload-pack with this option? How does the receiving end know which
> objects it needs to ask from elsewhere?

Git already has support for multiple promisor remotes. When a repo is
configured with 2 promisor remotes, let's call them A and B, then A
cannot guarantee that B has all the objects that are missing in A.
And it's the same for B: it cannot guarantee that A has all the
objects missing in B. Also, when fetching or cloning from A, for
example, no list of missing objects is transferred.

> Or is the data integrity of the receiving repository is the
> responsibility of the receiving user that talks with such a server?

Yes, it's the same as when a repo is configured with multiple
promisor remotes. When using bundle-uri, it's also the responsibility
of the receiving user to check that the bundle it gets from a
separate endpoint is correct.

Also, when large blobs are only on remotes other than the main
remote, then, after cloning or fetching from the main remote, the
client knows the hashes of the objects it should get from the other
remotes. So there is still data integrity.

> If that is the case, I am not sure if I want to touch such a feature
> even with 10ft pole.

To give you some context, at GitLab we have a very large part of the
disk space on our servers taken by large blobs which often don't
compress well (see
https://gitlab.com/gitlab-org/gitaly/-/issues/5699#note_1794464340).
It makes a lot of sense for us at GitLab to try to move those large
blobs to some cheaper separate storage.

With repack --filter=...
we can put large blobs on a separate path, but it has some drawbacks:

  - they are still part of a packfile, which means that storing and
    accessing the objects is expensive in terms of CPU and memory
    (large binary files are often already compressed and might not
    delta well with other objects),

  - mounting object storage on a machine might not be easy or might
    not perform as well as using an object storage API.

So using a separate remote along with a remote helper for large blobs
makes sense. And when a client is cloning, it makes sense for a
regular Git server to let the client fetch the large blobs directly
from a remote that has them.

> Is there anything the sender can do but does not do to help the
> receiver locate where to fetch these missing objects to fill the
> "unfilled promises"?
>
> For example, the sending side _could_ say that "Sorry, I do not have
> all objects that you asked me to---but you could try these other
> repositories".

In the cover letter of this v2 and the v3, I suggested the following:

---
For example in case of a client cloning, something like the following
is currently needed:

GIT_NO_LAZY_FETCH=0 git clone -c remote.my_promisor.promisor=true \
  -c remote.my_promisor.fetch="+refs/heads/*:refs/remotes/my_promisor/*" \
  -c remote.my_promisor.url=<MY_PROMISOR_URL> \
  --filter="blob:limit=5k" server

But it would be nice if there was a capability for the client to say
that it would like the server to give it information about the
promisor that it could use, so that the user doesn't have to pass all
the "remote.my_promisor.XXX" config options on the command line. (It
would then be a bit similar to the bundle-uri feature where all the
bundle related information comes from the server.)
---

So yes, we are thinking about adding such a way to "help the receiver
locate where to fetch these missing objects" soon.
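On the question of how the receiving end knows which objects it still
needs: a client can already enumerate the promised-but-missing objects
of a partial clone locally with rev-list's --missing=print. A minimal
sketch, using a local file:// partial clone in a temporary directory
(all repo and file names here are made up for illustration):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Set up a small "server" repo that allows filtered fetches.
git init -q src
git -C src -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m init
echo "some large content" > src/big.bin
git -C src add big.bin
git -C src -c user.name=t -c user.email=t@example.com commit -q -m blob
git -C src config uploadpack.allowFilter true

# Blobless partial clone; --no-checkout keeps the blobs unfetched.
git clone -q --no-checkout --filter=blob:none "file://$tmp/src" dst

# Objects reachable from HEAD that are locally missing are printed
# with a leading "?" (here, the blob for big.bin).
git -C dst rev-list --objects --missing=print HEAD | grep '^?'
```

So even when a pack is incomplete by design, the set of objects the
client still has to obtain elsewhere is computable on the client side.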
We needed to start somewhere, and we decided to start with this
series because it is quite small and already lets us experiment with
offloading blobs to object storage (see
https://gitlab.com/gitlab-org/gitaly/-/issues/5987).

Also, the client will very likely store the information about where
it can get the missing objects as a promisor remote configuration in
its config anyway. So after the clone, the resulting repo will very
likely be very similar to what it would be with the clone command
above.
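For reference, the configuration such a client would end up with is
roughly the following; a sketch mirroring the -c options of the clone
command above, with "repo" and "my_promisor" as illustrative names and
<MY_PROMISOR_URL> left as a placeholder:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo  # stands in for the repo resulting from the clone

# Persist the promisor remote that the -c options would otherwise
# have to provide on every clone command line.
git -C repo config remote.my_promisor.promisor true
git -C repo config remote.my_promisor.fetch \
    '+refs/heads/*:refs/remotes/my_promisor/*'
git -C repo config remote.my_promisor.url '<MY_PROMISOR_URL>'
# The filter used at clone time is remembered the same way:
git -C repo config remote.my_promisor.partialclonefilter 'blob:limit=5k'

git -C repo config --get-regexp '^remote\.my_promisor\.'
```

A capability letting the server send this information would then only
have to fill in these few config keys on the client.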