Re: [Json] BOMs

Bjoern Hoehrmann <derhoermi@xxxxxxx> · Thu, 21 Nov 2013 18:41:08 +0100

* Allen Wirfs-Brock wrote:
>On Nov 21, 2013, at 5:28 AM, Henri Sivonen wrote:
>> On Thu, Nov 21, 2013 at 7:53 AM, Allen Wirfs-Brock
>> <allen@xxxxxxxxxxxxxxx> wrote:
>>> Just to be clear about this.  My tests directly tested JavaScript built-in
>>> JSON parsers WRT to BOM support in three major browsers.  The tests directly
>>> invoked the built-in JSON.parse functions and directly passed to them a
>>> source strings that was explicitly constructed to contain a BOM code point .

>> It would be surprising if JSON.parse() accepted a BOM, since it
>> doesn't take bytes as input.
>
>ECMAScript's JSON.parse accepts an ECMAScript string value as its input.
>ECMAScript strings are sequences of 16-bit values.  JSON.parse (and most
>other ECMAScript functions) interpret those values  as Unicode code 
>units.  The value U+FEFF can appear at any position within a string. 
>When defining a string as an ECMAScript literal, a sequence like \ufeff 
>is an escape sequence that means place the code unit value 0xefff into 
>the string at this position in the sequence. Also note that the actual 
>strings passed below to JSON.parse contain the actual code point value 
>U+FEFF not the escape sequence that was used to express it.  To include 
>the actual escape sequence characters in the string it would have to be 
>expressed as '\\feff'.

A byte order mark indicates the order of bytes in a sequence of bytes.
An ecmascript string is not a sequence of bytes and therefore it cannot
have a byte order mark inside it. Your test is not for BOM support but
for an egregious semantic error in the implementation of JSON.parse.

  http://shadowregistry.org/js/misc/#t2ea25a961255bb1202da9497a1942e09

That is a similar test. It makes Firefox see UTF-8 BOMs in ecmascript
strings. Firefox is not supposed to look for UTF-8 BOMs in ecmascript
strings because ecmascript strings are not sequences of bytes at that
level of reasoning.

Is there any chance, by the way, to change `JSON.stringify` so it does
not output strings that cannot be encoded using UTF-8? Specifically,

  JSON.stringify(JSON.parse("\"\uD800\""))

would need to escape the surrogate instead of emitting it literally.
-- 
Björn Höhrmann · mailto:bjoern@xxxxxxxxxxxx · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/