Re: [Json] secdir review of draft-ietf-jsonbis-rfc7159bis-03

Peter Cordell <petejson@xxxxxxxxxxxxx> · Sun, 12 Mar 2017 09:06:24 +0000

On 11/03/2017 15:41, Ned Freed wrote:
On 2017-03-11 03:08, John Cowan wrote:
>
> On Thu, Mar 9, 2017 at 12:53 AM, Benjamin Kaduk <kaduk@xxxxxxx> wrote:
>
>     If that's what's supposed to happen, it should probably be more
>     clear, yes.  (But aren't there texts that have valid interpretations
>     in multiple encodings?)
>
>
> Not if the content is well-formed JSON and the only possible encodings
> are UTF-8, UTF-16, and UTF-32.  It suffices to examine the first four
> bytes of the input.  If there are no NUL bytes in the first four bytes,
> it is UTF-8; if there are two NUL bytes, it is UTF-16; if there are
> three NUL bytes, it is UTF-32.  This works because the grammar requires
> the first character to be in the ASCII repertoire, and the NUL
> *character* (U+0000) is not allowed at all.

Good explanation. Maybe the spec should include it.

+1

This exact issue just came up in a media type review, where someone
specified a charset parameter because they weren't aware of this algorithm.

It would be very helpful to have this text in the RFC.

It does need slightly more detail, though, to take byte order into account
in the case of UTF-16 and UTF-32.
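
As a rough illustration of what that extra detail might look like, here is a
minimal Python sketch that looks at where the NUL bytes fall rather than only
counting them. It is not text from the draft. The function name
guess_json_encoding is hypothetical, and it assumes the input is well-formed
JSON with no byte order mark, in one of UTF-8, UTF-16 or UTF-32. The patterns
are the ones RFC 4627 section 3 listed.

```python
def guess_json_encoding(data: bytes) -> str:
    """Guess the encoding of a JSON text from its leading bytes.

    Assumptions (not from the draft): the text is well-formed JSON,
    carries no byte order mark, and is UTF-8, UTF-16 or UTF-32.
    """
    head = data[:4]

    # UTF-32 first: its patterns would otherwise look like UTF-16.
    if len(head) == 4:
        nul = [b == 0 for b in head]
        if nul == [True, True, True, False]:    # 00 00 00 xx
            return "utf-32-be"
        if nul == [False, True, True, True]:    # xx 00 00 00
            return "utf-32-le"

    # UTF-16: the first character is ASCII, so exactly one of the
    # first two bytes is NUL, and its position gives the byte order.
    if len(head) >= 2:
        if head[0] == 0 and head[1] != 0:       # 00 xx
            return "utf-16-be"
        if head[0] != 0 and head[1] == 0:       # xx 00
            return "utf-16-le"

    # No NUL in the first character: UTF-8.
    return "utf-8"
```

Checking UTF-32 before UTF-16 matters because xx 00 00 00 also starts with
xx 00. The two cannot be confused in valid JSON, since a UTF-16 text starting
xx 00 00 00 would need U+0000 as its second character.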

The XML spec may offer some example text:

https://www.w3.org/TR/2008/REC-xml-20081126/#sec-guessing

Pete Cordell
Codalogic Ltd
Read & write XML in C++, http://www.xml2cpp.com