Re: [PATCH 0/4] rev-list: introduce NUL-delimited output mode

"D. Ben Knoble" <ben.knoble@xxxxxxxxx> · Mon, 10 Mar 2025 18:38:54 -0400

On Mon, Mar 10, 2025 at 3:32 PM Justin Tobler <jltobler@xxxxxxxxx> wrote:
>
> When walking objects, git-rev-list(1) prints each object entry on a
> separate line in the form:
>
>         <oid> LF
>
> Some options, such as `--objects`, may print additional information
> about the object on the same line:
>
>         <oid> SP [<path>] LF
>
> In this mode, if the object path contains a newline it is truncated at
> the newline.
>
> When the `--missing={print,print-info}` option is provided, information
> about any missing objects encountered during the object walk is also
> printed in the form:
>
>         ?<oid> [SP <token>=<value>]... LF
>
> where values containing LF or SP are printed in a token specific fashion
> so that the resulting encoded value does not contain either of these two
> problematic bytes. For example, missing object paths are quoted in the C
> style so they do not contain LF or SP.
>
> To make machine parsing easier, this series introduces a NUL-delimited
> output mode for git-rev-list(1) via a `-z` option following a suggestion
> from Junio in a previous thread[1]. In this mode, instead of LF, each
> object is delimited with two NUL bytes and any object metadata is
> separated with a single NUL byte. Examples:
>
>         <oid> NUL NUL
>         <oid> [NUL <path>] NUL NUL
>         ?<oid> [NUL <token>=<value>]... NUL NUL
>
> In this mode, path and value info are printed as-is without any special
> encoding or truncation.
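
To check my reading of the format, here is a rough Python sketch of a
consumer. It assumes no field is ever empty, so that two consecutive NULs
can only be a record terminator. `parse_records`, the shortened object IDs,
and the `path` token name are placeholders of mine; nothing here is taken
from the patches.

```python
def parse_records(data: bytes):
    """Split -z style output into (oid, fields, missing) tuples.

    Assumes the layout described in the cover letter: fields separated by
    one NUL, records terminated by two NULs, and no empty fields.
    """
    records = []
    for chunk in data.split(b"\0\0"):
        if not chunk:
            continue  # empty piece left after the final terminator
        oid, *fields = chunk.split(b"\0")
        missing = oid.startswith(b"?")  # "?" prefix marks a missing object
        if missing:
            oid = oid[1:]
        records.append((oid.decode("ascii"), fields, missing))
    return records


# Placeholder OIDs; the first path contains a newline, the second value a space.
sample = b"1111\0dir/file\nname\0\0?2222\0path=a b\0\0"
for record in parse_records(sample):
    print(record)
```

The newline and the space come through untouched, which is the point of the
mode. If a path or value can be empty, splitting on a double NUL like this
would misparse, so that case may be worth spelling out in the docs.
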
>
> For now this series only adds support for use with the `--objects` and
> `--missing` options. Usage of `-z` with other options is rejected, so
> support for them can be added in the future.
>
> One idea I had, but did not implement in this version, was to also use
> the `<token>=<value>` format for regular non-missing object info while
> in the NUL-delimited mode. I could see this being a bit more flexible
> instead of relying strictly on order. Interested if anyone has thoughts
> on this. :)

Without taking a deeper look, I think token=value has the benefit of
being self-describing at the cost of more output bytes (which might
matter over the wire, for example). Generally I like the idea;
sometimes I find it troublesome having to parse prose manuals for the
specifics of output formats like field order, especially when I end up
coding a parser for the format. If the field order doesn’t matter to
the consumer, then perhaps using ordered fields AWK-style is
inappropriately terse?

OTOH, the -z format is for machines, and they don’t need human labels
;) [I think token labels would be a great parser-writing and debugging
aid]
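
To make the trade-off concrete, a rough Python sketch of both consumer
styles, each taking the fields of one already-split record. The function
names and the `path` token for non-missing objects are only my illustration
of the idea, not anything in the series.

```python
def decode_positional(fields):
    # Ordered fields: the consumer has to know from the manual that
    # field 0 is the path.
    return {"path": fields[0]} if fields else {}


def decode_labeled(fields):
    # token=value fields: split on the first '=' only, because values are
    # printed as-is and may themselves contain '='.
    info = {}
    for field in fields:
        token, _, value = field.partition(b"=")
        info[token.decode("ascii")] = value
    return info


print(decode_positional([b"src/a=b.c"]))
print(decode_labeled([b"path=src/a=b.c"]))
```

Both calls yield the same mapping. The labeled form costs five extra bytes
per record here, but the parser keeps working if fields are added or
reordered later.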

Best,
Ben

>
> This series is structured as follows:
>
>         - Patches 1 and 2 do some minor preparatory refactors.
>
>         - Patch 3 adds the `-z` option to git-rev-list(1) to print
>           objects in a NUL-delimited fashion. Printed object paths with
>           the `--objects` option are also handled.
>
>         - Patch 4 teaches the `--missing` option how to print info in a
>           NUL-delimited fashion.
>
> Thanks for taking a look,
> -Justin
>
> [1]: <xmqq5xlor0la.fsf@gitster.g>
>
> Justin Tobler (4):
>   rev-list: inline `show_object_with_name()` in `show_object()`
>   rev-list: refactor early option parsing
>   rev-list: support delimiting objects with NUL bytes
>   rev-list: support NUL-delimited --missing option
>
>  Documentation/rev-list-options.adoc | 26 +++++++++
>  builtin/rev-list.c                  | 86 ++++++++++++++++++++++-------
>  revision.c                          |  8 ---
>  revision.h                          |  2 -
>  t/t6000-rev-list-misc.sh            | 34 ++++++++++++
>  t/t6022-rev-list-missing.sh         | 30 ++++++++++
>  6 files changed, 155 insertions(+), 31 deletions(-)
>
>
> base-commit: 87a0bdbf0f72b7561f3cd50636eee33dcb7dbcc3
> --
> 2.49.0.rc2
>
>

-- 
D. Ben Knoble