Re: [PATCH 0/4] rev-list: introduce NUL-delimited output mode

Justin Tobler <jltobler@xxxxxxxxx> · Tue, 11 Mar 2025 17:59:23 -0500

On 25/03/10 06:38PM, D. Ben Knoble wrote:
> On Mon, Mar 10, 2025 at 3:32 PM Justin Tobler <jltobler@xxxxxxxxx> wrote:
> >
> > When walking objects, git-rev-list(1) prints each object entry on a
> > separate line in the form:
> >
> >         <oid> LF
> >
> > Some options, such as `--objects`, may print additional information
> > about the object on the same line:
> >
> >         <oid> SP [<path>] LF
> >
> > In this mode, if the object path contains a newline it is truncated at
> > the newline.
> >
> > When the `--missing={print,print-info}` option is provided, information
> > about any missing objects encountered during the object walk is also
> > printed in the form:
> >
> >         ?<oid> [SP <token>=<value>]... LF
> >
> > where values containing LF or SP are printed in a token-specific fashion
> > so that the resulting encoded value does not contain either of these two
> > problematic bytes. For example, missing object paths are quoted in the C
> > style when they contain LF or SP.
> >
> > To make machine parsing easier, this series introduces a NUL-delimited
> > output mode for git-rev-list(1) via a `-z` option following a suggestion
> > from Junio in a previous thread[1]. In this mode, instead of LF, each
> > object is delimited with two NUL bytes and any object metadata is
> > separated with a single NUL byte. Examples:
> >
> >         <oid> NUL NUL
> >         <oid> [NUL <path>] NUL NUL
> >         ?<oid> [NUL <token>=<value>]... NUL NUL
> >
> > In this mode, path and value info are printed as-is without any special
> > encoding or truncation.
> >
> > For now this series only adds support for use with the `--objects` and
> > `--missing` options. Usage of `-z` with other options is rejected, so
> > support for them can potentially be added in the future.
> >
> > One idea I had, but did not implement in this version, was to also use
> > the `<token>=<value>` format for regular non-missing object info while
> > in the NUL-delimited mode. I could see this being a bit more flexible
> > instead of relying strictly on order. Interested if anyone has thoughts
> > on this. :)
> 
> Without taking a deeper look, I think token=value has the benefit of
> being self-describing at the cost of more output bytes (which might
> matter over the wire, for example). Generally I like the idea;
> sometimes I find it troublesome having to parse prose manuals for the
> specifics of output formats like field order, especially when I end up
> coding a parser for the format. If the field order doesn’t matter to
> the consumer, then perhaps using ordered fields AWK-style is
> inappropriately terse?
> 
> OTOH, the -z format is for machines, and they don’t need human labels
> ;) [I think token labels would be a great parser-writing and debugging
> aid]

One of the challenges with parsing git-rev-list(1) is all the various
forms it can take based on the options provided. For example:

    $ git rev-list --timestamp --objects --parents <rev>

    <timestamp> SP <oid> [SP <parent oid>]... LF   (commit)
    <oid> SP [<path>] LF                           (tree/blob)
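
A consumer of this output has to infer the record type from the shape of
the first field. A minimal Python sketch of that, for illustration only:
the `parse_line` helper and its returned dictionary layout are
hypothetical, and the "all decimal digits and shorter than 40 characters"
test is a heuristic assumption, not something git guarantees.

```python
def parse_line(line):
    """Classify one record from the LF-delimited output shown above."""
    line = line.rstrip("\n")
    first, _, rest = line.partition(" ")
    # Heuristic (assumption): a timestamp is all decimal digits and much
    # shorter than a hex object ID, so it marks a commit record.
    if first.isdecimal() and len(first) < 40:
        oid, *parents = rest.split(" ")
        return {
            "type": "commit",
            "timestamp": int(first),
            "oid": oid,
            "parents": parents,
        }
    # Otherwise the first field is the object ID and everything after the
    # first space is the (possibly empty) path, which may contain spaces.
    return {"type": "object", "oid": first, "path": rest}
```

Adding or removing an option such as `--timestamp` changes which branch
applies, so a parser like this has to know the exact command line that
produced the output.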

Relying strictly on order makes the output tricky to parse because the
format can change from line to line. So even for machine parsing,
labels may help simplify things if all object records follow something
along the lines of:

    <oid> NUL [<token>=<value> NUL]...

As you mentioned, this could potentially also be useful for users since
the attributes would be self-describing. This series is currently
focussed on the machine parsing side, but I think support for this mode
in a human-readable format could be added via a separate option in the
future.
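
To make the labelled idea concrete, here is a rough Python sketch of a
parser for the `<oid> NUL [<token>=<value> NUL]...` layout above. This
format is only a proposal and is not implemented by this series; the
`parse_records` name, the returned structure, and the rule "a field
without `=` starts a new record" are assumptions made for illustration.

```python
def parse_records(data: bytes):
    """Parse b'<oid>\\0[<token>=<value>\\0]...' into a list of records."""
    records = []
    for field in data.split(b"\0"):
        if not field:
            # Skip the empty piece left behind by the trailing NUL.
            continue
        token, sep, value = field.partition(b"=")
        if not sep:
            # Assumption: object IDs never contain '=', so a field
            # without one starts a new record.
            records.append({"oid": field.decode("ascii"), "attrs": {}})
        elif records:
            # Values stay as bytes: paths may hold arbitrary bytes,
            # including LF, SP, and further '=' characters.
            records[-1]["attrs"][token.decode("ascii")] = value
        else:
            raise ValueError("attribute field before any object ID")
    return records
```

Because each attribute carries its own label, the same loop works no
matter which options were passed or in what order the attributes appear.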

-Justin