> On 25 Feb 2018, at 04:41, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:
>
> On Sat, Feb 24, 2018 at 11:27 AM, <lars.schneider@xxxxxxxxxxxx> wrote:
>> Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
>> or UTF-32LE a BOM must not be used [1]. The function returns true if
>> this is the case.
>>
>> [1] http://unicode.org/faq/utf_bom.html#bom10
>>
>> Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx>
>> ---
>> diff --git a/utf8.c b/utf8.c
>> @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
>> +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +       return (
>> +         (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
>> +         (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> +          has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +       ) || (
>> +         (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
>> +         (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> +          has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +       );
>> +}
>
> Is this interpretation correct? When I read [1], I interpret it as
> saying that no BOM _of any sort_ should be present when the encoding
> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.

Correct!

> This
> code, on the other hand, only checks for BOMs corresponding to the
> declared size (16 or 32 bits).

Hmm. Interesting thought. You are saying that my code won't complain if
a document declared as UTF-16LE has a UTF-32LE BOM, correct? I would say
this is correct behavior in the context of this function. This function
assumes that the document is proper UTF-16/UTF-16BE/UTF-16LE but that it
is wrongly declared with respect to its BOM in the .gitattributes.

Would this comment make it clearer to you?

    /*
     * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
     * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
     * The function returns true if this rule is violated.
     *
     * [1] http://unicode.org/faq/utf_bom.html#bom10
     */

I think what you are referring to is a different class of error and
would therefore warrant its own checker function. Would you agree?

> I suppose the intention of [1] is to detect a mismatch between the
> declared encoding and how the stream is actually encoded. The check
> implemented here will fail to detect a mismatch between, say, declared
> encoding UTF-16BE and actual encoding UTF-32BE.

As stated above, the intention is to detect wrong BOMs! I think we cannot
detect the "declared as UTF-16BE but actually UTF-32BE" error.

Consider this:

    printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od -c
    0000000   \0   t  \0   e  \0   s  \0   t
    0000010

In the first step we "encode" the string to UTF-32BE and then we "decode"
it as UTF-16BE. The result is valid although not correct. Does this make
sense?

Thanks,
Lars
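
PS: To make the "own checker function" idea concrete, here is a rough,
untested-in-tree sketch of what the stricter check (no BOM of any width
for the four endian-specific encodings) might look like. The names
has_any_utf_bom() and bom_prefix() are hypothetical and not part of the
patch; bom_prefix() only stands in for has_bom_prefix(), and the BOM
arrays are redefined here so that the snippet compiles on its own.

```c
#include <stddef.h>
#include <string.h>

/* Byte sequences of the UTF-16 and UTF-32 BOMs. */
static const unsigned char utf16_be_bom[] = { 0xFE, 0xFF };
static const unsigned char utf16_le_bom[] = { 0xFF, 0xFE };
static const unsigned char utf32_be_bom[] = { 0x00, 0x00, 0xFE, 0xFF };
static const unsigned char utf32_le_bom[] = { 0xFF, 0xFE, 0x00, 0x00 };

/* Hypothetical stand-in for the has_bom_prefix() helper in utf8.c. */
static int bom_prefix(const char *data, size_t len,
		      const unsigned char *bom, size_t bom_len)
{
	return data && len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Hypothetical stricter variant: returns 1 if the declared encoding is
 * UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE and the data starts with a
 * UTF-16 or UTF-32 BOM of either width, 0 otherwise.
 *
 * The UTF-32LE BOM starts with the UTF-16LE BOM, so the last test is
 * redundant; it is kept only to make the intent explicit.
 */
int has_any_utf_bom(const char *enc, const char *data, size_t len)
{
	if (strcmp(enc, "UTF-16BE") && strcmp(enc, "UTF-16LE") &&
	    strcmp(enc, "UTF-32BE") && strcmp(enc, "UTF-32LE"))
		return 0;

	return bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
	       bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)) ||
	       bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
	       bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom));
}
```

One caveat with such a check: a UTF-16BE stream that legitimately starts
with U+0000 followed by U+FEFF has the same leading bytes as a UTF-32BE
BOM, so the stricter variant could report a false positive in that
(admittedly unusual) case.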