On Sun, Nov 10, 2024 at 11:39 PM Junio C Hamano <gitster@xxxxxxxxx> wrote: > > Eric Ju <eric.peijian@xxxxxxxxx> writes: > > > This is a continuation of Calvin Wan's (calvinwan@xxxxxxxxxx) > > patch series [PATCH v5 0/6] cat-file: add --batch-command remote-object-info command at [1]. > > > > Sometimes it is useful to get information about an object without having to download > > it completely. The server logic for retrieving size has already been implemented and merged in > > "a2ba162cda (object-info: support for retrieving object info, 2021-04-20)"[2]. > > This patch series implement the client option for it. > > > > This patch series add the `remote-object-info` command to `cat-file --batch-command`. > > This command allows the client to make an object-info command request to a server > > that supports protocol v2. If the server is v2, but does not have > > object-info capability, the entire object is fetched and the > > relevant object info is returned. > > > > A few questions open for discussions please: > > > > 1. In the current implementation, if a user puts `remote-object-info` in protocol v1, > > `cat-file --batch-command` will die. Which way do we prefer? "error and exit (i.e. die)" > > or "warn and wait for new command". > > In the primary use case envisioned, would it be a program that is > driving the "cat-file --batch-command" process? Can it sensibly > react to "warn and wait" and throw different commands to achieve > what it wanted to do with the remote-object-info command? > > If the answer is "no", die would be more appropriate. > Thank you, sir. I’m inclined to answer "no." Our primary use case is to use git cat-file remote-object-info in a promisor remote setup to retrieve metadata about an object stored in the promisor remote, without fetching it back to the local repository. This approach helps conserve disk space. I don’t believe other commands can achieve this functionality, particularly without requiring the object to be downloaded. In the context of GitLab, we can mandate a specific version of Git to be used alongside GitLab. Therefore, it is acceptable to error out if the required Git version is not available, as we can ensure compatibility by enforcing the version requirement. Also, Mr. Christian Couder provided me with another more concrete example: For example, consider a partial clone user initially interested in only the foo/ and bar/ directories. They might execute git clone --filter=blob:none --no-checkout <url> followed by git sparse-checkout set foo bar. Later, they decided to estimate how much space would be required to fetch everything related to the baz/ directory. The challenge arises because remote-object-info might need to operate recursively to calculate the total size for all objects related to baz/. A driver program that drives these recursive operations would struggle to handle a “warn and wait” mechanism effectively, as it would need to issue additional commands dynamically based on the warning. If the only alternative is to fetch the objects directly, it would defeat the purpose of using remote-object-info—which is intended to provide size information without actually downloading the objects. > > 2. Right now, only the size is supported. If the batch command format > > contains objectsize:disk or deltabase, it will die. The question > > is about objecttype. In the current implementation, it will die too. > > But dying on objecttype breaks the default format. We have changed the > > default format to %(objectname) %(objectsize) when remote-object-info is used. > > Any suggestions on this approach? > > Why bend the default format to the shortcoming of the new feature? > What makes it impossible to learn what type of object it is? If the > limitation that makes it impossible cannot be avoided, would it make > more sense to fall back to the "fetch and locally inspect" just like > "the other side does not know how to do object-info" case? > Thank you. It is indeed possible to determine the type of an object, and the plan was to include this functionality in a follow-up patch series to ensure an iterative development approach. We expect that developers using this feature will have some experience with Git and will notice the warnings in the documentation, which caution against relying on the default format remaining unchanged. While the “fetch and locally inspect” approach is an option, it would undermines the purpose of the feature, as highlighted by Christian’s partial clone and sparse checkout example. This feature is specifically designed to provide information without requiring the objects to be fetched, making such an alternative counterproductive. > Another thing you did not list, which is related, is where the > "fetch and locally inspect" fallback fetch the object into. Would > we use a quarantine mechanism, so that a mere request for remote > object info for an object will not contaminate our local object > store until the next gc realizes that such an object is dangling? > > Thanks. Thank you. Currently, the fetched object becomes a loose object in the local object store. We have a bunch of test cases covering it in t1017-cat-file-remote-object-info.sh to cover it. For example: 'remote-object-info fallback git://: fetch objects to client' ' 'remote-object-info fallback file://: fetch objects to client' ' 'remote-object-info fallback http://: fetch objects to client' ' where transfer.advertiseobjectinfo is set to false. I am not sure about adding a quarantine mechanism at this stage: Pros: - The quarantine area can be garbage collected, preventing contamination of the local storage. - The quarantine area could be utilized to calculate metadata information, such as in the partial clone and sparse checkout example mentioned above. Objects in the quarantine can be selectively included or excluded from such calculations. Cons: - Implementing a quarantine mechanism seems like a separate feature. This patch series already introduces a number of changes, and including the quarantine mechanism might make it too extensive. - Based on Mr. Patrick Steinhardt’s comment at [1], since remote-object-info operates only on protocol v2, adding a quarantine mechanism may lead to differing client-side behavior depending on the protocol, which could complicate the feature’s consistency. In my opinion, the quarantine mechanism appears to have a broader scope that extends beyond just remote-object-info. If deemed necessary, it would be more appropriate to address it in its own dedicated patch series. [1] https://gitlab.com/gitlab-org/git/-/merge_requests/168#note_2212333586