Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command

Peijian Ju <eric.peijian@xxxxxxxxx> · Fri, 21 Mar 2025 14:24:05 -0400

On Thu, Mar 13, 2025 at 2:02 AM Jeff King <peff@xxxxxxxx> wrote:
>
> On Tue, Mar 11, 2025 at 10:19:55PM -0400, Peijian Ju wrote:
>
> > > BTW, this strstr() isn't quite sufficient to prevent problems, as it
> > > would not find placeholders which _do_ exist but which aren't handled.
> > > One of the first things I tried was:
> > >
> > >   git cat-file --batch-command='%(objecttype) %(objectsize)'
> > >
> > > and feeding it "remote-object-info /path/to/repo some-oid". And it
> > > segfaulted.
> > >
> > > -Peff
> >
> > Thank you, Peff. Yes, you are right. It is a bug. I am adding a new
> > logic in v12:
> > 1. Iterating on the `opt->format` to see if there are any unsupported
> > placeholders. If there is,  error with unspported placeholders.
> > 2. Adding more test cases to cover different formats, e.g., just
> > `%(objectsize)`, just `%(objectname)`, mixed usage of supported and
> > unsupported placeholders.
>
> Yes, though it would be nice for step 1 to avoid re-parsing the string.
> I think you could either:
>
>   1. After the mark_query pass in batch_objects(), check for unsupported
>      pointers in expand_data. The downside here is that you'd have to
>      match each one that you _don't_ allow (so if somebody adds a new
>      one and forgets to update your list, it wouldn't be caught).
>
>   2. In expand_atom() or expand_format(), check an allow-list using
>      is_atom(), when remote-mode is in use. The downside here is that I
>      think we'd eventually want to move that parsing and formatting to
>      the shared ref-filter API. But maybe that API could provide some
>      kind of "check that this atom is allowed" function pointer.
>

Thank you, Peff. I prefer option 2. Maintaining an allow-list of
supported placeholders seems more practical than tracking unsupported
ones with a disallow-list. This approach has the added benefit that
any newly added placeholders would automatically be treated as
unsupported until explicitly added to the allow-list, reducing the
chance of oversights.

> I do wonder if there might be a way to also just notice that we don't
> have the requested information and handle it gracefully. I didn't
> reproduce it again just now, but I'd guess the segfault is due to
> feeding garbage to type_name() in expand_atom().
>
> So maybe if we initialized expand_data fully (so that data->type is
> always OBJ_BAD or something) and then checked for a NULL return from
> type_name(), we could do something sensible in expand_atom(), like
> insert a blank string or similar. And then it is not an error to ask for
> %(objecttype), but you will just not get useful data for those entries.
> From the description of the protocol, it sounds like you could actually
> intermix remote and local object requests?
>
> -Peff

Thank you Peff. I like the idea "... it is not an error to ask for
%(objecttype), but you will just not get useful data for those
entries."

So if we do remote-object-info with format "%(objectname)
%(objectsize) %(objecttype) %(objectsize:disk)", the response can be:

4346b22767c07e31d0f9b524fcb377972d957313 199 ??? ???

Where ??? means the placeholder is not yet supported. In this way we
don't have to change the default format, and as new support for the
placeholders is added, ??? will be replaced by meaningful data.

About intermixing remote and local object requests, do you mean what
happens when remote-object-info is passed oids of objects that are
available locally instead of on a remote? If so, I have these
scenarios:

1. An object is on remote but not on local. This is what
`remote-object-info` primarily focuses on: we retrieve info from
remote without downloading the object.
2. An object is on remote as well as on local. I think
`remote-object-info` should still retrieve info from remote instead of
checking local data. After all, if the user knows the object is on
local, they can use the `info` command. If remote-object-info is used,
it means we are interested in the information stored on the remote.
3. An object is not on remote, but only on local. I think
remote-object-info should fail in this case, since the remote doesn't
have the object. The info command should be used in this case.