On Wed, May 15, 2024 at 7:09 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
> > From: Christian Couder <chriscool@xxxxxxxxxxxxx>
> >
> > In case some objects are missing from a server, it might still be
> > useful to be able to fetch or clone from it if the client already has
> > the missing objects or can get them in some way.
>
> Be more assertive. We do not want to add a new feature randomly
> only because it _might_ be useful to somebody in a strange and
> narrow use case that _might_ exist.

Ok, I have changed "it might still be useful" to "it is sometimes
useful" in the v3 I just sent.

> > For example, in case both the server and the client are using a
> > separate promisor remote that contain some objects, it can be better
> > if the server doesn't try to send such objects back to the client, but
> > instead let the client get those objects separately from the promisor
> > remote. (The client needs to have the separate promisor remote
> > configured, for that to work.)
>
> Is it "it can be better", or is it "it is always better"? Pick an
> example that you can say the latter to make your example more
> convincing.
>
> Repository S borrows from its "promisor" X, and repository C which
> initially cloned from S borrows from its "promisor" S. Even if C
> wants an object in order to fill in the gap in its object graph, S
> may not have it (and S itself may have no need for that object), and
> in such a case, bypassing S and let C go directly to X would make
> sense.

Ok, I use your example above in v3.

> I am puzzled by this new option.
>
> It feels utterly irresponsible to give an option to set up a server
> that essentially declares: I'll serve objects you ask me as best
> efforts basis, the pack stream I'll give you may not have all
> objects you asked for and missing some objects, and when I do so, I
> am not telling you which objects I omitted.

I don't think it's irresponsible.
The client anyway checks that it got something usable, in the same
way as it does when it performs a partial fetch or clone. The fetch
or clone fails if that's not the case. For example, if the checkout
part of a clone needs some objects but cannot get them, the whole
clone fails.

> How do you ensure that a response with an incomplete pack data would
> not corrupt the repository when the sending side configures
> upload-pack with this option? How does the receiving end know which
> objects it needs to ask from elsewhere?

Git already has support for multiple promisor remotes. When a repo is
configured with 2 promisor remotes, let's call them A and B, then A
cannot guarantee that B has all the objects that are missing in A.
And it's the same for B: it cannot guarantee that A has all the
objects missing in B. Also, when fetching or cloning from A, for
example, no list of missing objects is transferred.

> Or is the data integrity of the receiving repository is the
> responsibility of the receiving user that talks with such a server?

Yes, it's the same as when a repo is configured with multiple
promisor remotes. When using bundle-uri, it's also the responsibility
of the receiving user to check that the bundle it gets from a
separate endpoint is correct.

Also, when large blobs are only on remotes other than the main
remote, then, after cloning or fetching from the main remote, the
client knows the hashes of the objects it should get from the other
remotes. So there is still data integrity.

> If that is the case, I am not sure if I want to touch such a feature
> even with 10ft pole.

To give you some context, at GitLab we have a very large part of the
disk space on our servers taken by large blobs which often don't
compress well (see
https://gitlab.com/gitlab-org/gitaly/-/issues/5699#note_1794464340).
It makes a lot of sense for us at GitLab to try to move those large
blobs to some cheaper separate storage.

With repack --filter=...
we can put large blobs on a separate path, but it has some drawbacks:

  - they are still part of a packfile, which means that storing and
    accessing the objects is expensive in terms of CPU and memory
    (large binary files are often already compressed and might not
    delta well with other objects),

  - mounting object storage on a machine might not be easy or might
    not perform as well as using an object storage API.

So using a separate remote along with a remote helper for large blobs
makes sense. And when a client is cloning, it makes sense for a
regular Git server to let the client fetch the large blobs directly
from a remote that has them.

> Is there anything the sender can do but does not do to help the
> receiver locate where to fetch these missing objects to fill the
> "unfilled promises"?
>
> For example, the sending side _could_ say that "Sorry, I do not have
> all objects that you asked me to---but you could try these other
> repositories".

In the cover letter of this v2 and the v3, I suggested the following:

---
For example in case of a client cloning, something like the following
is currently needed:

GIT_NO_LAZY_FETCH=0 git clone -c remote.my_promisor.promisor=true \
  -c remote.my_promisor.fetch="+refs/heads/*:refs/remotes/my_promisor/*" \
  -c remote.my_promisor.url=<MY_PROMISOR_URL> \
  --filter="blob:limit=5k" server

But it would be nice if there was a capability for the client to say
that it would like the server to give it information about the
promisor that it could use, so that the user doesn't have to pass all
the "remote.my_promisor.XXX" config options on the command line. (It
would then be a bit similar to the bundle-uri feature where all the
bundle related information comes from the server.)
---

So yes, we are thinking about adding such a way to "help the receiver
locate where to fetch these missing objects" soon.
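On the question of how the receiving end knows which objects it still
needs: a client can already enumerate the promised-but-missing objects
of a partial clone locally with rev-list's --missing=print. A minimal
sketch, using a local file:// partial clone in a temporary directory
(all repo and file names here are made up for illustration):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Set up a small "server" repo that allows filtered fetches.
git init -q src
git -C src -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m init
echo "some large content" > src/big.bin
git -C src add big.bin
git -C src -c user.name=t -c user.email=t@example.com commit -q -m blob
git -C src config uploadpack.allowFilter true

# Blobless partial clone; --no-checkout keeps the blobs unfetched.
git clone -q --no-checkout --filter=blob:none "file://$tmp/src" dst

# Objects reachable from HEAD that are locally missing are printed
# with a leading "?" (here, the blob for big.bin).
git -C dst rev-list --objects --missing=print HEAD | grep '^?'
```

So even when a pack is incomplete by design, the set of objects the
client still has to obtain elsewhere is computable on the client side.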
We needed to start somewhere, and we decided to start with this
series because it is quite small and already lets us experiment with
offloading blobs to object storage (see
https://gitlab.com/gitlab-org/gitaly/-/issues/5987).

Also, the client will very likely store the information about where
it can get the missing objects as a promisor remote configuration in
its config anyway. So after the clone, the resulting repo will very
likely be very similar to what it would be with the clone command
above.
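For reference, the configuration such a client would end up with is
roughly the following; a sketch mirroring the -c options of the clone
command above, with "repo" and "my_promisor" as illustrative names and
<MY_PROMISOR_URL> left as a placeholder:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo  # stands in for the repo resulting from the clone

# Persist the promisor remote that the -c options would otherwise
# have to provide on every clone command line.
git -C repo config remote.my_promisor.promisor true
git -C repo config remote.my_promisor.fetch \
    '+refs/heads/*:refs/remotes/my_promisor/*'
git -C repo config remote.my_promisor.url '<MY_PROMISOR_URL>'
# The filter used at clone time is remembered the same way:
git -C repo config remote.my_promisor.partialclonefilter 'blob:limit=5k'

git -C repo config --get-regexp '^remote\.my_promisor\.'
```

A capability letting the server send this information would then only
have to fill in these few config keys on the client.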