Re: Structured (ie: json) output for query commands?

Jeff King <peff@xxxxxxxx> · Thu, 1 Jul 2021 12:00:19 -0400

On Wed, Jun 30, 2021 at 08:19:49PM +0000, brian m. carlson wrote:

> On 2021-06-30 at 17:59:43, Jeff King wrote:
> > One complication we faced is that a lot of Git's data is bag-of-bytes,
> > not utf8. And json technically requires utf8. I don't remember if we
> > simply fudged that and output possibly non-utf8 sequences, or if we
> > actually encode them.
> 
> I think we just emit invalid UTF-8 in that case, which is a problem.
> That's why Git is not well suited to JSON output and why it isn't a good
> choice for structured data here.  I'd like us not to do more JSON in our
> codebase, since it's practically impossible for users to depend on our
> output if we do that due to encoding issues[0].
> 
> We could emit data in a different format, such as YAML, which does have
> encoding for arbitrary byte sequences.  However, in YAML, binary data is
> always base64 encoded, which is less readable, although still
> interchangeable.  CBOR is also a possibility, although it's not human
> readable at all.

I don't love the invalid-utf8-in-json thing in general. But I think it
may be the least-bad solution. I seem to recall that YAML has its own
complexities, and losing human-readability (even to base64) is a pretty
big downside. And the tooling for working with json seems more common
and mature (certainly over something like CBOR, but I think even YAML
doesn't have anything nearly as nice as jq).

Our sloppy json encoding does work correctly if you use utf8 paths, and
I think we could provide options to cover other common cases (e.g., a
single option for "assume my paths are latin1"). I think life is hardest
on somebody writing a script/service which is meant to process arbitrary
repositories (and isn't in control of the strictness of whatever is
parsing the json).
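
To make the consumer side concrete, here is a minimal Python sketch of
coping with non-utf8 bytes in otherwise well-formed json. The sample
document and the helper names (load_lossless, path_bytes) are
hypothetical, not anything Git emits or provides; it only assumes the
parser is handed the raw bytes:

```python
import json

# Hypothetical output: the path is latin1-encoded, so the 0xe9 byte
# is not valid utf8 and the document is not strictly valid json.
raw = b'{"path": "caf\xe9.txt"}'

def load_lossless(data: bytes):
    """Decode as utf8, carrying invalid bytes through as lone surrogates."""
    return json.loads(data.decode("utf-8", errors="surrogateescape"))

def path_bytes(s: str) -> bytes:
    """Recover the original bytes of a string returned by load_lossless()."""
    return s.encode("utf-8", errors="surrogateescape")

# A strict parse of the raw bytes rejects the document outright.
try:
    json.loads(raw)
    strict_ok = True
except ValueError:  # UnicodeDecodeError is a subclass of ValueError
    strict_ok = False

# Lossless route: works for arbitrary bytes, but the strings are only
# meaningful after converting them back with path_bytes().
lossless = load_lossless(raw)

# "Assume my paths are latin1" route: readable strings, but only
# correct if the assumption actually holds for the repository.
as_latin1 = json.loads(raw.decode("latin-1"))
```

The lossless route only helps a consumer that controls its own decoding
step; a service that is handed already-decoded text, or that sits behind
a strict parser, still has the problem described above.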

I'm sensitive to the issue of implementing something that works most of
the time, but then fails spectacularly when somebody does something
unusual. But it also sucks for many users not to have that "something
that works most of the time" if it would make their lives easier.

> I'm personally fine with the ad-hoc approach we use now, which is
> actually very convenient to script and, in my view, not too terrible to
> parse in other tools and languages.  Your mileage may vary, though.

There are a lot of gotchas there, too. When the data gets complex, "-z"
splitting becomes ambiguous (e.g., "git log -z --raw" uses a NUL to
separate commits from their diffs, diffs from each other, and diffs from
subsequent commits, so you have to pattern-match each type). It's also
context-dependent (e.g., you can't parse a "--raw -z" entry without
interpreting its type character, since "R" and "C" will have multiple
path fields; there are almost certainly a lot of "works most of the
time" parsers out there).
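
As an illustration of that context dependence, here is a rough Python
sketch of a parser for "git diff --raw -z" output. It assumes plain
two-way diffs (no combined "::" entries for merges) and a well-formed
stream; the function name parse_raw_z and the sample data are made up
for the example:

```python
def parse_raw_z(data: bytes):
    """Parse `git diff --raw -z` output (non-combined diffs only)."""
    fields = data.split(b"\0")
    # A well-formed stream ends with a NUL, leaving one empty trailing field.
    if fields and fields[-1] == b"":
        fields.pop()

    entries = []
    i = 0
    while i < len(fields):
        meta = fields[i]
        if not meta.startswith(b":"):
            raise ValueError(f"expected metadata at field {i}: {meta!r}")
        old_mode, new_mode, old_id, new_id, status = meta[1:].split(b" ")

        # The number of path fields cannot be known without looking at
        # the status: renames (R) and copies (C) carry a source and a
        # destination path, everything else carries a single path.
        npaths = 2 if status[:1] in (b"R", b"C") else 1
        paths = fields[i + 1 : i + 1 + npaths]
        if len(paths) != npaths:
            raise ValueError("truncated entry")

        entries.append({
            "old_mode": old_mode, "new_mode": new_mode,
            "old_id": old_id, "new_id": new_id,
            "status": status, "paths": paths,
        })
        i += 1 + npaths
    return entries


# Made-up sample: one modification followed by one rename.
sample = (
    b":100644 100644 1111111 2222222 M\0a.txt\0"
    b":100644 100644 3333333 3333333 R100\0old.txt\0new.txt\0"
)
entries = parse_raw_z(sample)
```

A parser that instead treats every NUL-separated field as "metadata,
then one path" handles the first entry and then misreads "new.txt" as
the start of the next one, which is exactly the "works most of the time"
failure mode.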

-Peff