On 2017-03-11 03:08, John Cowan wrote:
> On Thu, Mar 9, 2017 at 12:53 AM, Benjamin Kaduk <kaduk@xxxxxxx> wrote:
>
>> If that's what's supposed to happen, it should probably be more
>> clear, yes. (But aren't there texts that have valid interpretations
>> in multiple encodings?)
>
> Not if the content is well-formed JSON and the only possible encodings
> are UTF-8, UTF-16, and UTF-32. It suffices to examine the first four
> bytes of the input. If there are no NUL bytes in the first four bytes,
> it is UTF-8; if there are two NUL bytes, it is UTF-16; if there are
> three NUL bytes, it is UTF-32. This works because the grammar requires
> the first character to be in the ASCII repertoire, and the NUL
> *character* (U+0000) is not allowed at all.

Good explanation. Maybe the spec should include it.

+1

This exact issue just came up in a media type review, where someone
specified a charset parameter because they weren't aware of this
algorithm. It would be very helpful to have this text in the RFC.

    Ned
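
For illustration, the NUL-counting rule quoted above can be sketched in
Python roughly as follows. This is only a sketch under stated assumptions:
the function name detect_json_encoding is made up for this example, the
input is assumed to carry no byte order mark, to be at least four bytes
long, and to start with two ASCII characters (as any JSON object or array
does). Telling big-endian from little-endian by the position of the NUL
bytes goes beyond the quoted text; it follows the byte patterns given in
RFC 4627, section 3.

```python
import json


def detect_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON text from its first four bytes.

    Hypothetical helper: assumes no BOM, at least four bytes of input,
    and that the first two characters are ASCII.
    """
    head = data[:4]
    if len(head) < 4:
        # Very short texts (for example b"[]") are outside this sketch.
        raise ValueError("need at least four bytes to apply the rule")

    nul_count = head.count(0)  # number of 0x00 bytes among the first four

    if nul_count == 0:
        return "utf-8"
    if nul_count == 2:
        # 00 xx 00 xx is big-endian, xx 00 xx 00 is little-endian.
        return "utf-16-be" if head[0] == 0 else "utf-16-le"
    if nul_count == 3:
        # 00 00 00 xx is big-endian, xx 00 00 00 is little-endian.
        return "utf-32-be" if head[0] == 0 else "utf-32-le"

    raise ValueError("byte pattern does not match UTF-8, UTF-16 or UTF-32 JSON")


def load_json_bytes(data: bytes):
    """Decode with the detected encoding, then parse as JSON."""
    return json.loads(data.decode(detect_json_encoding(data)))
```

With this, load_json_bytes('{"a": 1}'.encode("utf-16-le")) parses the same
way as the UTF-8 form of the text, with no charset parameter needed.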