On Wed, Mar 21 2018, git@xxxxxxxxxxxxxxxxx wrote:

> So, I'm not sure we have a route to get UTF-8-clean data out of Git, and if
> we do it is beyond the scope of this patch series.
>
> So I think for our uses here, defining this as "JSON-like" is probably the
> best answer. We write the strings as we received them (from the file system,
> the index, or whatever). These strings are properly escaped WRT double
> quotes, backslashes, and control characters, so we shouldn't have an issue
> with decoders getting out of sync -- only with them rejecting non-UTF-8
> sequences.
>
> We could blindly \uXXXX encode each of the hi-bit characters, if that would
> help the parsers, but I don't want to do that right now.
>
> WRT binary data, I had not intended using this for binary data. And without
> knowing what kinds or quantity of binary data we might use it for, I'd like
> to ignore this for now.

I agree we should just ignore this problem for now given the immediate
use-case.
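
For illustration only, here is a rough Python sketch of the two behaviours
discussed above: passing hi-bit bytes through untouched versus blindly
\uXXXX-encoding them. The helper name `escape_json_like` and its
`escape_high_bit` flag are hypothetical and are not taken from the patch
series; Git's actual writer is C code and may differ in detail.

```python
import json

# Short escapes for the bytes that must never appear raw in a JSON string.
SHORT_ESCAPES = {
    0x22: b'\\"', 0x5C: b"\\\\", 0x08: b"\\b", 0x0C: b"\\f",
    0x0A: b"\\n", 0x0D: b"\\r", 0x09: b"\\t",
}


def escape_json_like(raw: bytes, escape_high_bit: bool = False) -> bytes:
    """Quote raw bytes as a JSON-like string (hypothetical helper).

    Double quotes, backslashes, and control characters are always escaped.
    Bytes >= 0x80 are copied as received unless escape_high_bit is set, in
    which case each such byte is blindly written as \\u00XX.
    """
    out = bytearray(b'"')
    for byte in raw:
        if byte in SHORT_ESCAPES:
            out += SHORT_ESCAPES[byte]
        elif byte < 0x20 or (escape_high_bit and byte >= 0x80):
            out += b"\\u%04x" % byte
        else:
            out.append(byte)
    out += b'"'
    return bytes(out)


# A path name in Latin-1, which is not valid UTF-8.
name = b'caf\xe9 "x"\n'

as_received = escape_json_like(name)
blindly_escaped = escape_json_like(name, escape_high_bit=True)

# A strict decoder rejects the pass-through form but accepts the escaped one.
try:
    json.loads(as_received)
except ValueError as err:
    print("strict decoder rejected it:", err)
print(json.loads(blindly_escaped))
```

Note that the blind variant treats every hi-bit byte as its own code point, so
input that already is valid UTF-8 decodes to different characters than the
ones intended; the original bytes are only recoverable if the consumer
re-encodes the decoded string as Latin-1.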