On Thu, Jul 01 2021, brian m. carlson wrote:

> [[PGP Signed Part:Undecided]]
> On 2021-07-01 at 16:00:19, Jeff King wrote:
>> On Wed, Jun 30, 2021 at 08:19:49PM +0000, brian m. carlson wrote:
>>
>> > On 2021-06-30 at 17:59:43, Jeff King wrote:
>> > > One complication we faced is that a lot of Git's data is bag-of-bytes,
>> > > not utf8. And json technically requires utf8. I don't remember if we
>> > > simply fudged that and output possibly non-utf8 sequences, or if we
>> > > actually encode them.
>> >
>> > I think we just emit invalid UTF-8 in that case, which is a problem.
>> > That's why Git is not well suited to JSON output and why it isn't a good
>> > choice for structured data here. I'd like us not to do more JSON in our
>> > codebase, since it's practically impossible for users to depend on our
>> > output if we do that due to encoding issues[0].
>> >
>> > We could emit data in a different format, such as YAML, which does have
>> > encoding for arbitrary byte sequences. However, in YAML, binary data is
>> > always base64 encoded, which is less readable, although still
>> > interchangeable. CBOR is also a possibility, although it's not human
>> > readable at all.
>>
>> I don't love the invalid-utf8-in-json thing in general. But I think it
>> may be the least-bad solution. I seem to recall that YAML has its own
>> complexities, and losing human-readability (even to base64) is a pretty
>> big downside. And the tooling for working with json seems more common
>> and mature (certainly over something like CBOR, but I think even YAML
>> doesn't have anything nearly as nice as jq).
>
> I'm not opposed to JSON as long as we don't write landmines. We could
> URI-encode anything that contains a bag-of-bytes, which lets people have
> the niceties of JSON without the breakage when people don't write valid
> UTF-8. Most things will still be human-readable.
>
> We could even have --json be an alias for --json=encoded (URI-encoding)
> and also have --json=strict for the situation where you assert
> everything is valid UTF-8 and explicitly said you wanted us to die() if
> we saw non-UTF-8. I don't want us to say that something is JSON and
> then emit junk, since that's a bad user experience.
>
> Ideally, we'd have some generic serializer support for this case, so if
> people _do_ want to add YAML or CBOR output, it can be stuffed in.

I'd think the ideal end-state is for us to have some standardized way to
pass structs of structured data around at the C-level.

Then everything that now supports a format, such as git-log,
for-each-ref, cat-file --batch etc., could share the same formatting
logic. Our human-readable output would just be a special-case of
providing a default format, as it already is for some of these commands.

If we had a bit of an extension of the %(if) etc. syntax that
for-each-ref uses to handle such nested structures, we could emit
arbitrary structured data, e.g. the formatting language would be
sufficient to start a nested structure, only emit commas between
elements etc. You could then emit JSON, XML or whatever you'd like with a
"simple" (well, it would be quite verbose) format specification. We
could then ship some default formats.

A related (but not quite the same) benefit would be to make the logic
driving the built-ins reentrant, so a formatting feature like this could
be combined with a "--batch" mode supported by every (or most)
commands. So you could also e.g. run "log --batch", not just "cat-file
--batch", if you needed some of the formatting it provides.

This would be immensely useful to editor implementations and things
invoking git on the server, where we often need to pay the startup cost
for invoking N commands that are built into the "git" binary anyway. So
if they could be sent as a --batch ...
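
To make the "human-readable output is just a default format" idea a bit
more concrete, here is a rough sketch in Python rather than C. Nothing in
it exists in git: the record list, render(), and the two format
specifications are all made up for illustration, and Python's %(name)s
stands in for the %(name) atoms that for-each-ref uses. The only point is
that one routine, given a per-item format plus a header, separator and
footer, can produce both the plain output and a JSON array, with commas
emitted only between elements.

```python
import json

# Hypothetical records, standing in for the structs a command would pass around.
REFS = [
    {"refname": "refs/heads/main", "objecttype": "commit"},
    {"refname": "refs/tags/v1.0", "objecttype": "tag"},
]


def render(records, item_format, header="", separator="", footer="", quote=str):
    """Expand item_format once per record; emit separator only between items."""
    items = []
    for record in records:
        # Quote every field with the format's own quoting rule before expansion.
        quoted = {key: quote(value) for key, value in record.items()}
        items.append(item_format % quoted)
    return header + separator.join(items) + footer


# The default human-readable output is just one more format specification.
HUMAN_FORMAT = dict(item_format="%(objecttype)s %(refname)s", separator="\n")

# A (verbose) JSON specification built from the same primitives.
JSON_FORMAT = dict(
    item_format='{"refname": %(refname)s, "objecttype": %(objecttype)s}',
    header="[",
    separator=", ",
    footer="]",
    quote=json.dumps,
)

if __name__ == "__main__":
    print(render(REFS, **HUMAN_FORMAT))
    print(render(REFS, **JSON_FORMAT))
```

A real version would also need the nesting and conditional support
discussed above; this sketch only covers a flat list of records.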

>> Our sloppy json encoding does work correctly if you use utf8 paths, and
>> I think we could provide options to cover other common cases (e.g., a
>> single option for "assume my paths are latin1"). I think life is hardest
>> on somebody writing a script/service which is meant to process arbitrary
>> repositories (and isn't in control of the strictness of whatever is
>> parsing the json).
>
> I think I'd rather provide a general encoding functionality than try to
> handle random encodings. I _do_ want people to be able to do things
> like store arbitrary bytes in paths, because many people do use that
> functionality for shipping test files that verify their code works
> correctly on Unix systems. I also want us to handle arbitrary bytes
> where we've stated that's a thing we support (e.g., in refs). I _don't_
> want to encourage people to use non-UTF-8 text encodings, because I
> firmly believe those are obsolete.
>
> So, correct binary data support, yes; non-UTF-8 text, no.

I don't know how widely they're used with git, but there are several
non-Unicode encodings in wide use, as well as Unicode encodings other
than UTF-8, such as UTF-16, in some contexts/platforms:
https://stackoverflow.com/questions/1200063/why-does-anyone-use-an-encoding-other-than-utf-8/2470079
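
For reference, the two modes proposed above ("encoded" and "strict") could
behave roughly like the Python sketch below. This is only an illustration
of the proposal, not anything git does today: json_string() and its mode
names are hypothetical, and percent-encoding every byte outside the
unreserved set (plus "/") is my assumption about what "URI-encode" would
mean here. The sample path is Latin-1, i.e. one of the non-UTF-8 encodings
in question.

```python
import json
from urllib.parse import quote_from_bytes, unquote_to_bytes


def json_string(raw: bytes, mode: str = "encoded") -> str:
    """Return a valid JSON string literal for an arbitrary byte sequence.

    "encoded": percent-encode the bytes, so the result is always ASCII.
    "strict":  require valid UTF-8 and fail loudly otherwise.
    (Both mode names are hypothetical, mirroring the proposal above.)
    """
    if mode == "strict":
        # Raises UnicodeDecodeError instead of emitting junk.
        return json.dumps(raw.decode("utf-8"))
    if mode == "encoded":
        # "%" itself becomes "%25", so decoding is unambiguous.
        return json.dumps(quote_from_bytes(raw, safe="/"))
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    latin1_path = "caf\xe9.txt".encode("latin-1")  # not valid UTF-8

    encoded = json_string(latin1_path)
    print(encoded)

    # A consumer can recover the exact original bytes.
    assert unquote_to_bytes(json.loads(encoded)) == latin1_path

    try:
        json_string(latin1_path, mode="strict")
    except UnicodeDecodeError as err:
        print("strict mode refused the path:", err.reason)
```

Note that this naive version also percent-encodes valid non-ASCII UTF-8,
which costs readability; an implementation could choose to leave valid
UTF-8 sequences alone and only escape the bytes that are not.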