Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

Lars Schneider <larsxschneider@xxxxxxxxx> · Wed, 28 Feb 2018 22:34:17 +0100

> On 27 Feb 2018, at 06:17, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:
> 
> On Sun, Feb 25, 2018 at 6:35 AM, Lars Schneider
> <larsxschneider@xxxxxxxxx> wrote:
>>> On 25 Feb 2018, at 04:41, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:
>>> Is this interpretation correct? When I read [1], I interpret it as
>>> saying that no BOM _of any sort_ should be present when the encoding
>>> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
>> 
>> Correct!
>> 
>>> This code, on the other hand, only checks for BOMs corresponding
>>> to the declared size (16 or 32 bits).
>> 
>> Hmm. Interesting thought. You are saying that my code won't complain if
>> a document declared as UTF-16LE has a UTF32-LE BOM, correct?
> 
> Well, not specifically that case since UTF-16LE BOM is a subset of UTF32-LE BOM.

Correct - bad example on my part!

> My observation was more general in that [1] seems to say that there
> should be _no_ BOM whatsoever if one of UTF-16BE, UTF-16LE, UTF-32BE,
> or UTF-32LE is declared.

You are saying that a document declared as UTF-16LE must not start
with 0000feff (UTF-32BE BOM)? I interpreted that situation as a "feff"
in the middle of a file, in which case it should be treated as
ZWNBSP, as explained here: http://unicode.org/faq/utf_bom.html#bom6
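
To make the two readings concrete, here is a rough sketch. It is not
the code from the patch; the function names and the "width" parameter
are made up for illustration. The first check is the narrow reading
(only a BOM of the declared width is prohibited), the second is the
broad reading (any UTF-16 or UTF-32 BOM is prohibited).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does buf start with the given byte sequence? */
static int starts_with_bytes(const char *buf, size_t len,
			     const char *bom, size_t bom_len)
{
	return len >= bom_len && !memcmp(buf, bom, bom_len);
}

/* Narrow reading: only a BOM of the declared width (16 or 32) counts. */
static int has_same_width_bom(const char *buf, size_t len, int width)
{
	if (width == 16)
		return starts_with_bytes(buf, len, "\xfe\xff", 2) ||
		       starts_with_bytes(buf, len, "\xff\xfe", 2);
	if (width == 32)
		return starts_with_bytes(buf, len, "\x00\x00\xfe\xff", 4) ||
		       starts_with_bytes(buf, len, "\xff\xfe\x00\x00", 4);
	return 0;
}

/* Broad reading: any UTF-16 or UTF-32 BOM counts, whatever was declared. */
static int has_any_utf16_32_bom(const char *buf, size_t len)
{
	return has_same_width_bom(buf, len, 16) ||
	       has_same_width_bom(buf, len, 32);
}
```

With a buffer starting with 0000feff and a 16-bit declaration, the
narrow check stays silent and the broad check complains. A buffer
starting with fffe0000 trips both, because the UTF-16LE BOM is a
prefix of the UTF-32LE BOM.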

Plus, if the rule really is "_no_ BOM whatsoever", then wouldn't we need
to check for UTF-1, UTF-7, and UTF-8 BOMs too?
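
For reference, such a check might look roughly like the sketch below.
The struct, table, and function names are hypothetical and nothing
like this exists in the patch. The byte sequences are the usual BOM
signatures for these encodings; for UTF-7 only the fixed three-byte
prefix is matched, since the fourth byte varies.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table of BOM signatures outside the UTF-16/32 family. */
struct other_bom {
	const char *name;
	const char *bytes;
	size_t len;
};

static const struct other_bom other_boms[] = {
	{ "UTF-8", "\xef\xbb\xbf", 3 },
	{ "UTF-7", "+/v", 3 },		/* 2b 2f 76, fourth byte varies */
	{ "UTF-1", "\xf7\x64\x4c", 3 },
};

/* Return the name of the matching encoding, or NULL if none matches. */
static const char *find_other_bom(const char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(other_boms) / sizeof(other_boms[0]); i++) {
		const struct other_bom *b = &other_boms[i];
		if (len >= b->len && !memcmp(buf, b->bytes, b->len))
			return b->name;
	}
	return NULL;
}
```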

I dunno.

>> I would say
>> this is correct behavior in context of this function. This function assumes
>> that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly
>> declared with respect to its BOM in the .gitattributes. Would this
>> comment make it more clear to you?
>>        /*
>>         * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
>>         * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
>>         * The function returns true if this rule is violated.
>>         *
>>         * [1] http://unicode.org/faq/utf_bom.html#bom10
>>         */
>> I think what you are referring to is a different class of error and
>> would therefore warrant its own checker function. Would you agree?
> 
> I don't understand to what different class of error you refer. The
> FAQ[1] seems pretty clear to me in that if one of those declarations
> is used explicitly, then there should be _no_ BOM, period. It doesn't
> say anything about allowing a BOM for a differently-sized encoding (16
> vs 32).
> 
> If I squint very hard, I _guess_ I can see how you interpret [1] with
> the more narrow meaning of the restriction applying only to a BOM of
> the same size as the declared encoding, though reading it that way
> doesn't come easily to me.

For me it is somewhat the other way around :-)
Since I am not sure what is right, I decided to ask the Internet:
https://stackoverflow.com/questions/49038872/is-a-utf-32be-bom-valid-in-an-utf-16le-declared-data-stream

Let's see if someone has a good answer.

- Lars