Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

Eric Sunshine <sunshine@xxxxxxxxxxxxx> · Tue, 27 Feb 2018 00:17:44 -0500

On Sun, Feb 25, 2018 at 6:35 AM, Lars Schneider
<larsxschneider@xxxxxxxxx> wrote:
>> On 25 Feb 2018, at 04:41, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:
>> Is this interpretation correct? When I read [1], I interpret it as
>> saying that no BOM _of any sort_ should be present when the encoding
>> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
>
> Correct!
>
>> This code, on the other hand, only checks for BOMs corresponding
>> to the declared size (16 or 32 bits).
>
> Hmm. Interesting thought. You are saying that my code won't complain if
> a document declared as UTF-16LE has a UTF-32LE BOM, correct?

Well, not specifically that case, since the UTF-16LE BOM (FF FE) is a
prefix of the UTF-32LE BOM (FF FE 00 00), so a check for the former
also matches the latter.
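
To make the overlap concrete, here is a rough sketch. The helper and
function names are made up for illustration and are not taken from the
patch; only the BOM byte values are fixed by Unicode.

```c
#include <stddef.h>
#include <string.h>

/* BOM byte sequences as they appear at the start of a stream. */
static const unsigned char utf16_le_bom[] = { 0xFF, 0xFE };
static const unsigned char utf32_le_bom[] = { 0xFF, 0xFE, 0x00, 0x00 };

/* Hypothetical helper: does `data` begin with the given BOM bytes? */
static int has_bom_prefix(const unsigned char *data, size_t len,
			  const unsigned char *bom, size_t bom_len)
{
	return len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Treat the UTF-32LE BOM itself as the start of a stream and test it
 * for a UTF-16LE BOM. Returns non-zero because FF FE is a prefix of
 * FF FE 00 00.
 */
static int utf32_le_bom_matches_utf16_le_check(void)
{
	return has_bom_prefix(utf32_le_bom, sizeof(utf32_le_bom),
			      utf16_le_bom, sizeof(utf16_le_bom));
}
```

So a stream that starts with a UTF-32LE BOM would also trip a test for
the UTF-16LE BOM.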

My observation was more general in that [1] seems to say that there
should be _no_ BOM whatsoever if one of UTF-16BE, UTF-16LE, UTF-32BE,
or UTF-32LE is declared.

> I would say
> this is correct behavior in context of this function. This function assumes
> that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly
> declared with respect to its BOM in the .gitattributes. Would this
> comment make it more clear to you?
>         /*
>          * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
>          * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
>          * The function returns true if this rule is violated.
>          *
>          * [1] http://unicode.org/faq/utf_bom.html#bom10
>          */
> I think what you are referring to is a different class of error and
> would therefore warrant its own checker function. Would you agree?

I don't understand to what different class of error you refer. The
FAQ[1] seems pretty clear to me in that if one of those declarations
is used explicitly, then there should be _no_ BOM, period. It doesn't
say anything about allowing a BOM for a differently-sized encoding (16
vs 32).
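
For reference, the stricter reading I have in mind could look roughly
like the sketch below. Everything here is hypothetical (the function
names, the case-sensitive comparison of the encoding name) and is not
what the patch implements; it only shows "any BOM is prohibited when
the endianness is declared explicitly".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does `data` begin with the given bytes? */
static int starts_with(const unsigned char *data, size_t len,
		       const unsigned char *bom, size_t bom_len)
{
	return len >= bom_len && !memcmp(data, bom, bom_len);
}

/* True if the data starts with any UTF-16 or UTF-32 BOM. */
static int has_any_utf_bom(const unsigned char *data, size_t len)
{
	static const unsigned char be16[] = { 0xFE, 0xFF };
	static const unsigned char le16[] = { 0xFF, 0xFE };
	static const unsigned char be32[] = { 0x00, 0x00, 0xFE, 0xFF };

	/*
	 * The UTF-32LE BOM (FF FE 00 00) needs no separate test
	 * because it begins with the UTF-16LE BOM. Note that the be32
	 * test can misfire on UTF-16BE data that genuinely starts
	 * with U+0000 followed by U+FEFF.
	 */
	return starts_with(data, len, be16, sizeof(be16)) ||
	       starts_with(data, len, le16, sizeof(le16)) ||
	       starts_with(data, len, be32, sizeof(be32));
}

/* Assumption: encoding names are compared case-sensitively. */
static int is_explicit_endian_encoding(const char *enc)
{
	return !strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE") ||
	       !strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE");
}

/* Strict reading of [1]: explicit endianness means no BOM at all. */
static int has_prohibited_bom_strict(const char *enc,
				     const unsigned char *data, size_t len)
{
	return is_explicit_endian_encoding(enc) &&
	       has_any_utf_bom(data, len);
}
```

With this, data declared as UTF-16BE but starting with FF FE 00 00
would be flagged, while the same bytes declared as plain "UTF-16"
would not.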

If I squint very hard, I _guess_ I can see how you interpret [1] with
the narrower meaning of the restriction applying only to a BOM of
the same size as the declared encoding, though reading it that way
doesn't come easily to me.

>> I suppose the intention of [1] is to detect a mismatch between the
>> declared encoding and how the stream is actually encoded. The check
>> implemented here will fail to detect a mismatch between, say, declared
>> encoding UTF-16BE and actual encoding UTF-32BE.
>
> As stated above the intention is to detect wrong BOMs! I think we cannot
> detect the "declared as UTF-16BE but actually UTF-32BE" error.
>
> Consider this:
>
> printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od -c
> 0000000   \0   t  \0   e  \0   s  \0   t
> 0000010
>
> In the first step we "encode" the string to UTF-32BE and then we "decode" it as
> UTF-16BE. The result is valid although not correct. Does this make sense?
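
At the byte level, I take that pipeline to be doing something like the
following sketch (function names are mine, ASCII-only input assumed,
no iconv involved): each character becomes 00 00 00 xx in UTF-32BE,
and reading those bytes as UTF-16BE code units yields U+0000 followed
by the character.

```c
#include <stddef.h>

/*
 * Encode ASCII text as UTF-32BE without a BOM. `out` must have room
 * for 4 bytes per input character. Returns the number of bytes
 * written.
 */
static size_t ascii_to_utf32be(const char *s, unsigned char *out)
{
	size_t n = 0;

	for (; *s; s++) {
		out[n++] = 0x00;
		out[n++] = 0x00;
		out[n++] = 0x00;
		out[n++] = (unsigned char)*s;
	}
	return n;
}

/* Read the i-th 16-bit big-endian code unit from raw bytes. */
static unsigned int utf16be_unit(const unsigned char *buf, size_t i)
{
	return ((unsigned int)buf[2 * i] << 8) | buf[2 * i + 1];
}
```

For "test" this gives the code units 0, 't', 0, 'e', 0, 's', 0, 't',
which matches the od output quoted above.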

I'm probably being dense, but I don't understand what this is trying
to illustrate in relation to has_prohibited_utf_bom().