Re: BOMs

"Martin J. Dürst" <duerst@xxxxxxxxxxxxxxx> · Tue, 19 Nov 2013 13:32:37 +0900

Okay, here are some more tests.

http://www.sw.it.aoyama.ac.jp/2013/pub/json_tests/test1_utf8_nobom.json
http://www.sw.it.aoyama.ac.jp/2013/pub/json_tests/test2_utf8_bom.json

They are self-describing JSON files served with application/json, the 
first without a BOM, and the second with a BOM.

They contain some Japanese, and a tiny bit of Spanish.
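
For anyone who wants to reproduce the pair locally rather than fetch it, here is a minimal Python sketch. The document content is made up for illustration and is not the content of the actual test files; only the difference in the leading bytes matters.

```python
import codecs
import json

# Hypothetical stand-in for the test documents: some Japanese, a bit of Spanish.
doc = {"greeting": "こんにちは", "note": "año"}

# ensure_ascii=False keeps the non-ASCII characters as real UTF-8 bytes
# instead of \uXXXX escapes.
text = json.dumps(doc, ensure_ascii=False)

no_bom = text.encode("utf-8")        # plain UTF-8, no signature
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

# The two payloads differ only by the three-byte UTF-8 signature.
print(with_bom[:3] == codecs.BOM_UTF8)
print(with_bom[3:] == no_bom)
```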

[see more below]

On 2013/11/18 21:59, Henry S. Thompson wrote:
Bjoern Hoehrmann writes:

Perl's JSON module gives me

   malformed JSON string, neither array, object, number, string
   or atom, at character offset 0 (before "\x{ef}\x{bb}\x{bf}[]")

Python's json module gives me

   ValueError: No JSON object could be decoded

Go's "encoding/json" module gives me

   invalid character 'ï' looking for beginning of value
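
The Python case is easy to reproduce at the string level. This is a minimal sketch, assuming Python 3, where the wording of the message may differ from the one quoted above but the exception is still a ValueError.

```python
import json

raw = b"\xef\xbb\xbf[]"      # UTF-8 signature followed by an empty array
text = raw.decode("utf-8")   # plain "utf-8" keeps the signature as U+FEFF

try:
    json.loads(text)
    rejected = False
except ValueError as exc:    # json.JSONDecodeError is a subclass of ValueError
    rejected = True
    print("parser rejected input:", exc)
```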

I'm curious to know what level you're invoking the parser at.  As
implied by my previous post about the Python 'requests' package, it
handles application/json resources by stripping any initial BOM it
finds -- you can try this with

import requests
r = requests.get("http://www.ltg.ed.ac.uk/ov-test/b16le.json")
r.json()

I get a 404 on this example. I can put up UTF-16 examples, too.

Regards,   Martin.

Signatures are not part of the text of a document, as the Unicode spec
makes clear, so asking what happens when you pass a string beginning
with a BOM to a parser is not really the right question in this
context, is it?

As I tried to say in an earlier post, there's a distinction which
needs to be carefully insisted on between, on the one hand, languages
and their parsers, where I agree signatures/BOMs have no place, and,
on the other hand, (media-typed) resources/entities/payloads and _their_
processing, where a discussion of BOMs/signatures _is_ appropriate
and, often, necessary.
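
To make the two layers concrete, here is a rough Python sketch. The helper names `decode_payload` and `parse_json_payload`, the UTF-8 default, and the restriction to BOM-based detection are assumptions for illustration; this is not how any particular library does it. The payload layer looks for a signature and removes it, so the parser layer only ever sees text.

```python
import codecs
import json

# Signatures to look for. The UTF-32 entries come first because the
# UTF-32 LE signature begins with the same two bytes as the UTF-16 LE one.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_payload(body: bytes, default: str = "utf-8") -> str:
    """Payload layer: strip a leading signature, if any, and decode."""
    for bom, encoding in _SIGNATURES:
        if body.startswith(bom):
            return body[len(bom):].decode(encoding)
    # No signature found: fall back to the assumed default encoding.
    return body.decode(default)


def parse_json_payload(body: bytes):
    """Parser layer: json.loads only ever sees signature-free text."""
    return json.loads(decode_payload(body))


print(parse_json_payload(b"\xef\xbb\xbf[]"))
print(parse_json_payload(codecs.BOM_UTF16_LE + '["año"]'.encode("utf-16-le")))
```

A real handler would also have to take the charset parameter of the media type, if any, into account; the sketch leaves that out.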

BTW I agree that the status of the UTF-8 BOM as a signature is slightly
hazy, but again the Unicode spec itself [1] says

   "this sequence can serve as signature for UTF-8 encoded text where
    the character set is unmarked"

ht

[1] http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf