Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM

Lars Schneider <larsxschneider@xxxxxxxxx> · Sun, 25 Feb 2018 12:35:35 +0100

> On 25 Feb 2018, at 04:41, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:
> 
> On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@xxxxxxxxxxxx> wrote:
>> Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
>> or UTF-32LE a BOM must not be used [1]. The function returns true if
>> this is the case.
>> 
>> [1] http://unicode.org/faq/utf_bom.html#bom10
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx>
>> ---
>> diff --git a/utf8.c b/utf8.c
>> @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
>> +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +       return (
>> +         (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
>> +         (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> +          has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +       ) || (
>> +         (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
>> +         (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> +          has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +       );
>> +}
> 
> Is this interpretation correct? When I read [1], I interpret it as
> saying that no BOM _of any sort_ should be present when the encoding
> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.

Correct!

> This
> code, on the other hand, only checks for BOMs corresponding to the
> declared size (16 or 32 bits).

Hmm. Interesting thought. You are saying that my code won't complain if
a document declared as UTF-16LE has a UTF-32LE BOM, correct? I would say
this is correct behavior in the context of this function. This function
assumes that the document is proper UTF-16/UTF-16BE/UTF-16LE but is wrongly
declared with respect to its BOM in the .gitattributes. Would this
comment make it clearer to you?

	/*
	 * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
	 * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
	 * The function returns true if this rule is violated.
	 *
	 * [1] http://unicode.org/faq/utf_bom.html#bom10
	 */

I think what you are referring to is a different class of error and
would therefore warrant its own checker function. Would you agree?
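
To make the scope of the check concrete, here is a standalone sketch of
the function outside of utf8.c. It is not the patch itself: the BOM
arrays and has_bom_prefix() are reproduced here under the assumption
that they look roughly like this in utf8.c.

```c
#include <stddef.h>
#include <string.h>

/* BOM byte sequences (assumed to match the arrays in utf8.c). */
static const char utf16_be_bom[] = {'\xFE', '\xFF'};
static const char utf16_le_bom[] = {'\xFF', '\xFE'};
static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};

/* Assumed helper: true if data starts with the given BOM bytes. */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
}

/*
 * True if the declared encoding forbids a BOM and the data starts
 * with a BOM of the same code unit size (16 or 32 bits).
 */
int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
{
	return (
	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
	) || (
	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
	);
}
```

With this sketch, data declared as UTF-16BE that starts with the
UTF-32BE BOM (00 00 FE FF) is not reported. One wrinkle: the UTF-32LE
BOM (FF FE 00 00) starts with the same two bytes as the UTF-16LE BOM,
so a prefix check like this one would still fire for that particular
combination.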

> I suppose the intention of [1] is to detect a mismatch between the
> declared encoding and how the stream is actually encoded. The check
> implemented here will fail to detect a mismatch between, say, declared
> encoding UTF-16BE and actual encoding UTF-32BE.

As stated above, the intention is to detect wrong BOMs! I think we cannot
detect the "declared as UTF-16BE but actually UTF-32BE" error.

Consider this:

printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od -c
0000000   \0   t  \0   e  \0   s  \0   t
0000010

In the first step we "encode" the string to UTF-32BE and then we "decode" it as
UTF-16BE. The result is valid UTF-16BE, although it is not the original text.
Does this make sense?
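
The same point can be sketched in C. The helper below is hypothetical
(its name and signature are made up for illustration and are not part
of the patch): it only checks that a buffer is well-formed UTF-16BE,
that is, it has an even length and no unpaired surrogates.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if buf is well-formed UTF-16BE
 * (even length, no unpaired surrogates), 0 otherwise.
 */
int is_wellformed_utf16be(const unsigned char *buf, size_t len)
{
	size_t i;

	if (len % 2)
		return 0;
	for (i = 0; i < len; i += 2) {
		unsigned int unit = (buf[i] << 8) | buf[i + 1];

		if (unit >= 0xDC00 && unit <= 0xDFFF)
			return 0; /* low surrogate without a high one before it */
		if (unit >= 0xD800 && unit <= 0xDBFF) {
			unsigned int next;

			if (i + 3 >= len)
				return 0; /* high surrogate at end of input */
			next = (buf[i + 2] << 8) | buf[i + 3];
			if (next < 0xDC00 || next > 0xDFFF)
				return 0; /* high surrogate not followed by a low one */
			i += 2; /* skip the low surrogate */
		}
	}
	return 1;
}
```

Feeding it the UTF-32BE bytes of "test" (00 00 00 74 00 00 00 65 ...)
returns 1: every 16-bit unit is either U+0000 or an ASCII letter, so
nothing in the byte stream itself reveals the mismatch.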

Thanks,
Lars