On Mon, Mar 13, 2017 at 09:14:16AM +0100, Julian Reschke wrote:
> So the changes in RFC 7159 allow top-level strings, so we can't rely on the
> first *two* characters being US-ASCII. But we *can* rely on the first one
> being US-ASCII, no?

Correct.  If one OR two bytes of the first four are NULs, then the
encoding is UTF-16 (or something else or invalid):

> So the following should still be correct:
>
> > Since the first character of a JSON text will always be an ASCII
> > character [RFC0020], it is possible to determine whether an octet
> > stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
> > at the pattern of nulls in the first four octets.
> >
> > 00 00 00 xx UTF-32BE
> > 00 xx xx xx UTF-16BE
> > xx 00 00 00 UTF-32LE
> > xx 00 xx xx UTF-16LE
> > xx xx xx xx UTF-8

Count the number of NULs in the first four bytes:

 - if zero -> UTF-8
 - if one or two -> UTF-16
 - if three -> UTF-32

Nico
--
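
For illustration, the NUL-counting rule above could be sketched in Python roughly as follows. The function name guess_json_encoding is hypothetical. The sketch assumes the first character of the JSON text is ASCII and that there is no byte order mark. It only guesses: it does not validate the input, and the ValueError cases are one possible reading of "something else or invalid".

```python
def guess_json_encoding(data: bytes) -> str:
    """Guess the encoding of a JSON text from the NULs in its first four bytes.

    Assumes the first character is ASCII and that no BOM is present.
    Returns a Python codec name; raises ValueError if no pattern matches.
    """
    head = data[:4]
    nuls = head.count(0)

    if nuls == 0:
        # xx xx xx xx
        return "utf-8"

    if nuls in (1, 2):
        # An ASCII first character puts a NUL in byte 0 (BE) or byte 1 (LE).
        if head[0] == 0:
            return "utf-16-be"
        if len(head) > 1 and head[1] == 0:
            return "utf-16-le"
    elif nuls == 3:
        # Three NULs imply four bytes were available.
        if head[3] != 0:
            return "utf-32-be"   # 00 00 00 xx
        if head[0] != 0:
            return "utf-32-le"   # xx 00 00 00

    # Four NULs, or NULs in positions no listed pattern allows.
    raise ValueError("cannot guess encoding from first four bytes")
```

A caller would then do something like data.decode(guess_json_encoding(data)) and hand the result to a JSON parser, which still has to reject input for which the guess was wrong.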