On Wed, Jun 30, 2021 at 02:20:09PM -0400, Martin Langhoff wrote:

> > One complication we faced is that a lot of Git's data is bag-of-bytes,
>
> Great point -- hadn't thought of that. Don't see anything in
> json-writer.c but we do use iconv already.

We do, but the problem is deeper than that. We don't always know the
intended encoding of bytes in the repository. For commits, there's an
"encoding" header and we default to utf8 if it's not specified. But
filenames in trees do not have an encoding (nor are two entries in a
single tree even required to be in the same encoding). They really are
just sequences of NUL-terminated binary bytes from Git's perspective.

Most of the time that just works, of course. People tend to use utf8
these days anyway. And even if they aren't utf8, as long as the user's
terminal is configured to match, then everything will look OK to them
(you do have to turn off core.quotepath to see any high-bit characters
in filenames).

So in practice I suspect it is fine to just output them as-is in json.
Things will Just Work for people using utf8 consistently. People using
other encodings will have things look OK in their terminal, but probably
JSON parsers would choke.

We could provide an option to say "when you generate json, assume paths
are in encoding XYZ (say, latin1) and convert to utf8". That wouldn't
help people who have mix-and-match encodings in their trees, but that
seems even more rare.

-Peff
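
The "assume paths are in encoding XYZ and convert to utf8" idea can be
sketched roughly as follows. This is only an illustration in Python, not
anything Git implements: the function name path_for_json and its
assumed_encoding parameter are hypothetical, and Git itself is written in C
and would presumably go through iconv instead.

```python
import json


def path_for_json(raw_path: bytes, assumed_encoding: str = "utf-8") -> str:
    """Return a JSON string literal for a path stored as raw bytes.

    Hypothetical helper: the bytes carry no encoding of their own, so the
    caller has to say which encoding to assume (utf-8 by default).
    """
    # Strict decoding raises UnicodeDecodeError if the bytes are not valid
    # in the assumed encoding, which is the "JSON parsers would choke" case
    # surfaced early instead of being written into the output.
    text = raw_path.decode(assumed_encoding)
    # ensure_ascii=False keeps non-ASCII characters as-is, so the result is
    # UTF-8 once the returned string is encoded for output.
    return json.dumps(text, ensure_ascii=False)


# A latin1-encoded filename only converts cleanly when we are told it is latin1.
latin1_name = b"caf\xe9.txt"
print(path_for_json(latin1_name, "latin-1"))

try:
    path_for_json(latin1_name)  # assumed utf-8: 0xe9 is not valid here
except UnicodeDecodeError as err:
    print("not valid utf-8:", err.reason)
```

Note that a single assumed encoding applies to every path, so this sketch
does nothing for trees that mix encodings between entries.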