Re: [PATCH 0/4] rev-list: introduce NUL-delimited output mode

Justin Tobler <jltobler@xxxxxxxxx> · Tue, 11 Mar 2025 18:19:45 -0500

On 25/03/10 01:37PM, Junio C Hamano wrote:
> Justin Tobler <jltobler@xxxxxxxxx> writes:
> > To make machine parsing easier, this series introduces a NUL-delimited
> > output mode for git-rev-list(1) via a `-z` option following a suggestion
> > from Junio in a previous thread[1]. In this mode, instead of LF, each
> > object is delimited with two NUL bytes and any object metadata is
> > separated with a single NUL byte. Examples:
> >
> >         <oid> NUL NUL
> >         <oid> [NUL <path>] NUL NUL
> 
> Why do we need double-NUL in the above two cases?

In the `<oid> [NUL <path>] NUL NUL` case, an object path could itself
look like an OID, so a parser could not tell a path from the start of
the next record. The two NUL bytes signal where the object record ends.

Without some other mechanism to know when a record starts/stops, even the
`<oid> NUL NUL` case would need the two trailing NUL bytes to keep the
next OID from being considered a potential path.

If an output mode never appends additional object metadata, a single NUL
byte would be enough to delimit objects, but always using two NUL bytes
keeps the format consistent across modes.
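
To illustrate the ambiguity, here is a rough reader-side sketch in Python
of the double-NUL format as described above. It is not part of the
series; the function name `parse_double_nul` and the sample values are
made up for illustration. The second field of the first record is a path
that happens to look like an OID, and only the NUL NUL terminator keeps
it attached to the right record.

```python
def parse_double_nul(data: bytes):
    """Split a `<oid> [NUL <meta>]... NUL NUL` stream into records.

    Hypothetical helper: returns a list of (oid, [metadata fields]) tuples.
    Fields are kept as bytes because paths are not guaranteed to be UTF-8.
    """
    records = []
    # Two NUL bytes terminate a record; neither an OID nor a path can be
    # empty, so the first NUL NUL pair found is always a record boundary.
    for chunk in data.split(b"\0\0"):
        if not chunk:
            continue  # empty piece after the final terminator
        # Inside a record, a single NUL separates the OID from metadata.
        oid, *meta = chunk.split(b"\0")
        records.append((oid, meta))
    return records


# Made-up sample: the path in the first record is 40 hex-looking characters.
oid_a = b"a" * 40
oid_c = b"c" * 40
oid_like_path = b"b" * 40
stream = oid_a + b"\0" + oid_like_path + b"\0\0" + oid_c + b"\0\0"
records = parse_double_nul(stream)
```

With single NUL delimiters and no other marker, the same bytes could
equally be read as three records with no paths.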

> 
> >         ?<oid> [NUL <token>=<value>]... NUL NUL
> 
> This one I understand; we could do without double-NUL and take the
> lack of "=" in the token after NUL termination as the sign that the
> previous record ended, though, to avoid double-NUL while keeping the
> format extensible.
> 
> As this topic is designing essentially a new and machine parseable
> format, we could even unify all three formats into one.  For example,
> the format could be like this:
> 
> 	<oid> NUL [<attr>=<value> NUL]...

I was also considering something similar. A more flexible format like
this would also allow other object metadata, such as `--timestamp`, to
be supported in the future. In the next version I'll implement a
unified format here.

> 
> where
> 
>  (1) A record ends when a new record begins.
> 
>  (2) The beginning of a new record is signaled by <oid> that is all
>      hexadecimal and does not have any '=' in it.

I think this is a good idea. By always appending printed object metadata
in the form `<token>=<value>`, we know that any entry without '=' must
be the start of a new record. This removes the need for the two NUL
bytes to indicate the end of a record.

I'll use only a single NUL byte to delimit in the next version.
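
As a rough sketch of how a consumer could rely on that rule, assuming
the `<oid> NUL [<attr>=<value> NUL]...` layout proposed in this thread
(the final format may differ), a reader in Python might look like the
following. The name `parse_unified` and the sample values are
hypothetical.

```python
def parse_unified(data: bytes):
    """Parse a `<oid> NUL [<attr>=<value> NUL]...` stream.

    Hypothetical helper: returns a list of (oid, {attr: value}) tuples.
    Values stay as bytes because paths are not guaranteed to be UTF-8.
    """
    tokens = data.split(b"\0")
    # Every token is NUL-terminated, so the split leaves one empty tail.
    if tokens and tokens[-1] == b"":
        tokens.pop()

    records = []
    for token in tokens:
        if b"=" not in token:
            # No '=' means this token is an OID: a new record begins.
            records.append((token, {}))
        else:
            if not records:
                raise ValueError("attribute seen before any OID")
            # Split on the first '=' only; the value may contain '='.
            attr, value = token.split(b"=", 1)
            records[-1][1][attr.decode("ascii")] = value
    return records


# Made-up sample stream with three records.
oid1, oid2, oid3 = b"1" * 40, b"2" * 40, b"3" * 40
stream = (
    oid1 + b"\0" + b"path=Makefile" + b"\0"
    + oid2 + b"\0"
    + oid3 + b"\0" + b"missing=yes" + b"\0" + b"path=a=b" + b"\0"
)
parsed = parse_unified(stream)
```

Because attributes always carry an `<attr>=` prefix, a path that looks
like an OID can no longer be mistaken for the start of a record.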

> 
>  (3) The traditional "rev-list --objects" output that gives path in
>      addition to the object name uses "path" as the <attr> name,
>      i.e. such a record looks like "<oid> NUL path=<path> NUL".
> 
>  (4) The traditional "rev-list --missing" output loses the leading
>      "?"; it is replaced by "missing" as the <attr> name, i.e. such
>      a record may look like "<oid> NUL missing=yes NUL..." together
>      with other "<token>=<value> NUL" pairs appended as needed at
>      the end.

I think this is good. Instead of prefixing missing OIDs with '?', we can
just append another token/value pair `missing=yes`.

Thanks,
-Justin