Re: [PATCH] rev-list: print missing object type with --missing=print-type

Justin Tobler <jltobler@xxxxxxxxx> · Wed, 8 Jan 2025 16:18:35 -0600

On 25/01/08 07:17AM, Junio C Hamano wrote:
> The users need to be told what this "type" information really means,
> as its meaning is quite different from what "git cat-file -t <oid>"
> would give them.  We do not have the object, so we are not learning
> its type from the object itself.  How much trust should the users
> put in this information, for example?
> 
> That comes back to the "where does it come from" that the future
> readers of "git log" and reviewers need to be told by the proposed
> log message.  Knowing the internals, I know you'd be getting it from
> the "containing" objects, e.g., an object name that was found on the
> "parent" object header field of another commit, which is _expected_
> to be a commit, or an object name that was found in a tree entry
> whose mode bits were 100644, which is _expected_ to be a blob, etc.

I'll update the log message in the next version to explain how
information about a missing object gets inferred. As you explained, this
is relying on the containing object to figure this out. This is why for
some missing objects it may not be possible to infer the type. For
example, if a missing object is only reffered to by a reference, there
is not a containing object that can be used.

> There are other places that you _could_ glean information about
> (possibly missing) objects.  An object that is found during
> "rev-list --objects" traversal (which is the topic of this patch
> after all) but turned out to be missing may not just have an
> expected type (because it was found in a tree object that we
> successfully read) but also the full path to the object in the
> top-level tree, for example.
> 
> In modern Git, there are even more places that you may be able to
> use, like commit-graph that not just hints the object itself is a
> commit, but what its parents are and when the commit was created.
> 
> Note that I am not suggesting to implement more code to learn "type"
> information from more places than the current patch is doing.  At
> least not in this iteration of the patch.  What I am getting at is
> that it would help us to avoid unnecessarily limiting ourselves by
> stressing on "type" too much if we at least imagine what the
> possible sources of these extra pieces of information are and what
> they could provide.
> 
> As I suspect that we would want to leave the door open for us to
> extend this later, I would perhaps suggest an output format format
> like:
> 
>     ?<object name> [<token>=<value>]...

I think this is a great idea. To select which attributes get printed
with the missing object we could add an option. Something like:

  $ git rev-list --objects --missing=print \
  --missing-attr=path --missing-attr=type

I like the idea of also adding a path attribute, but this raises a
couple of questions. The way `--missing=print` currently works is that
it prints the unique set of missing object IDs. A missing object could
possibly be referenced by multiple trees and thus have multiple valid
paths. There are a couple of different ways this situation could be
handled:

  - We could record each of the encounted paths for an object and print
    out each. Something like: `?<oid> path=foo path=bar`

  - Historically, `--missing=print` would only ever print a single
    instance of the OID, but we could print a missing object with
    multiple paths each on a separate line. Something like this:

      ?e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 path=foo
      ?e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 path=bar

  - We could keep it simple a just report a single path per missing
    object. It doesn't capture the whole picture, but it does provide
    some insight into the missing object.

Since this missing object type is also inferred from the containing
object, in theory different containing objects could indicate different
types for the missing object. I'm not sure though if this is a scenario
worth accounting for. Maybe it would be fine to rely on a single
containing object to provide the type and assume it is consistent across
the others.

For paths, I'm currently leaning towards having each identified path
printed out as a separate attribute on the same line. For types, I'm
thinking we can just print a single type and assume it is consistent.
I'm certainly open to suggestions though. :)

> where <token> tells what kind of extra information it is.  I expect
> that the initial implementation only knows about "type" as the
> <token>.  For future extensibility, we only need to say that under
> the syntax:
> 
>  (1) How multiple attributes are shown?
>  (2) How would a <value> with SP or LF in it is represented?
> 
> My suggestion is to have multiple <token>=<value> on the same line,
> with a SP in between, and problematic bytes in <value> are quoted,
> using cquote(). i.e. a <token>=<value> whose <value> part does not
> begin with a double-quote ends at the first SP after it, otherwise
> <value> is taken as a C-quoted string inside a pair of double-quote.

Ok, so using a path value as an example, if it contains a SP, or a
special character that would be handled by `quote_c_style()` it should
be wrapped in double-quotes to indicate that it is all part of the
single attribute. That makes sense.

> If you are adventurous, I would not mind seeing "path" implemented
> as another token, since that would be fairly easily obtainable, but
> it does not have to be in the initial attempt.

Handling paths appears pretty straight-forward to add and seems like a
good idea. In my next version I'll also add support for a path
attribute. Thanks for the feedback.

-Justin