Re: [PATCH v6 0/6] cat-file: add remote-object-info to batch-command

Peijian Ju <eric.peijian@xxxxxxxxx> · Mon, 18 Nov 2024 11:28:23 -0500

On Sun, Nov 10, 2024 at 11:39 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Eric Ju <eric.peijian@xxxxxxxxx> writes:
>
> > This is a continuation of Calvin Wan's (calvinwan@xxxxxxxxxx)
> > patch series [PATCH v5 0/6] cat-file: add --batch-command remote-object-info command at [1].
> >
> > Sometimes it is useful to get information about an object without having to download
> > it completely. The server logic for retrieving size has already been implemented and merged in
> > "a2ba162cda (object-info: support for retrieving object info, 2021-04-20)"[2].
> > This patch series implement the client option for it.
> >
> > This patch series add the `remote-object-info` command to `cat-file --batch-command`.
> > This command allows the client to make an object-info command request to a server
> > that supports protocol v2. If the server is v2, but does not have
> > object-info capability, the entire object is fetched and the
> > relevant object info is returned.
> >
> > A few questions open for discussions please:
> >
> > 1. In the current implementation, if a user puts `remote-object-info` in protocol v1,
> >    `cat-file --batch-command` will die. Which way do we prefer? "error and exit (i.e. die)"
> >    or "warn and wait for new command".
>
> In the primary use case envisioned, would it be a program that is
> driving the "cat-file --batch-command" process?  Can it sensibly
> react to "warn and wait" and throw different commands to achieve
> what it wanted to do with the remote-object-info command?
>
> If the answer is "no", die would be more appropriate.
>

Thank you, sir.

I’m inclined to answer "no."

Our primary use case is to use git cat-file remote-object-info in a
promisor remote setup to retrieve metadata about an object stored in
the promisor remote, without fetching it back to the local repository.
This approach helps conserve disk space. I don’t believe other
commands can achieve this functionality, particularly without
requiring the object to be downloaded.

In the context of GitLab, we can mandate a specific version of Git to
be used alongside GitLab. Therefore, it is acceptable to error out if
the required Git version is not available, as we can ensure
compatibility by enforcing the version requirement.

Also, Mr. Christian Couder provided me with another more concrete example:

For example, consider a partial clone user initially interested in
only the foo/ and bar/ directories. They might execute git clone
--filter=blob:none --no-checkout <url> followed by git sparse-checkout
set foo bar. Later, they decided to estimate how much space would be
required to fetch everything related to the baz/ directory.

The challenge arises because remote-object-info might need to operate
recursively to calculate the total size for all objects related to
baz/. A driver program that drives these recursive operations would
struggle to handle a “warn and wait” mechanism effectively, as it
would need to issue additional commands dynamically based on the
warning. If the only alternative is to fetch the objects directly, it
would defeat the purpose of using remote-object-info—which is intended
to provide size information without actually downloading the objects.

> > 2. Right now, only the size is supported. If the batch command format
> >    contains objectsize:disk or deltabase, it will die. The question
> >    is about objecttype. In the current implementation, it will die too.
> >    But dying on objecttype breaks the default format. We have changed the
> >    default format to %(objectname) %(objectsize) when remote-object-info is used.
> >    Any suggestions on this approach?
>
> Why bend the default format to the shortcoming of the new feature?
> What makes it impossible to learn what type of object it is?  If the
> limitation that makes it impossible cannot be avoided, would it make
> more sense to fall back to the "fetch and locally inspect" just like
> "the other side does not know how to do object-info" case?
>

Thank you.

It is indeed possible to determine the type of an object, and the plan
was to include this functionality in a follow-up patch series to
ensure an iterative development approach. We expect that developers
using this feature will have some experience with Git and will notice
the warnings in the documentation, which caution against relying on
the default format remaining unchanged.

While the “fetch and locally inspect” approach is an option, it would
undermines the purpose of the feature, as highlighted by Christian’s
partial clone and sparse checkout example. This feature is
specifically designed to provide information without requiring the
objects to be fetched, making such an alternative counterproductive.

> Another thing you did not list, which is related, is where the
> "fetch and locally inspect" fallback fetch the object into.  Would
> we use a quarantine mechanism, so that a mere request for remote
> object info for an object will not contaminate our local object
> store until the next gc realizes that such an object is dangling?
>
> Thanks.

Thank you.

Currently, the fetched object becomes a loose object in the local
object store. We have a bunch of test cases covering it in
t1017-cat-file-remote-object-info.sh to cover it. For example:
'remote-object-info fallback git://: fetch objects to client' '
'remote-object-info fallback file://: fetch objects to client' '
'remote-object-info fallback http://: fetch objects to client' '
where transfer.advertiseobjectinfo is set to false.

I am not sure about adding a quarantine mechanism at this stage:

Pros:
- The quarantine area can be garbage collected, preventing
contamination of the local storage.
- The quarantine area could be utilized to calculate metadata
information, such as in the partial clone and sparse checkout example
mentioned above. Objects in the quarantine can be selectively included
or excluded from such calculations.

Cons:
- Implementing a quarantine mechanism seems like a separate feature.
This patch series already introduces a number of changes, and
including the quarantine mechanism might make it too extensive.

- Based on Mr. Patrick Steinhardt’s comment at [1], since
remote-object-info operates only on protocol v2, adding a quarantine
mechanism may lead to differing client-side behavior depending on the
protocol, which could complicate the feature’s consistency.

In my opinion, the quarantine mechanism appears to have a broader
scope that extends beyond just remote-object-info. If deemed
necessary, it would be more appropriate to address it in its own
dedicated patch series.

[1] https://gitlab.com/gitlab-org/git/-/merge_requests/168#note_2212333586