On Sat, Feb 24, 2018 at 11:27 AM, <lars.schneider@xxxxxxxxxxxx> wrote:
> Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
> or UTF-32LE a BOM must not be used [1]. The function returns true if
> this is the case.
>
> [1] http://unicode.org/faq/utf_bom.html#bom10
>
> Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx>
> ---
> diff --git a/utf8.c b/utf8.c
> @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
> +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
> +{
> +	return (
> +	   (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
> +	   (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +	    has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> +	) || (
> +	   (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
> +	   (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +	    has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> +	);
> +}

Is this interpretation correct? When I read [1], I interpret it as
saying that no BOM _of any sort_ should be present when the encoding is
declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE. This
code, on the other hand, only checks for BOMs corresponding to the
declared size (16 or 32 bits).

I suppose the intention of [1] is to detect a mismatch between the
declared encoding and how the stream is actually encoded. The check
implemented here will fail to detect a mismatch between, say, declared
encoding UTF-16BE and actual encoding UTF-32BE.
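One way to read [1] is that, once the declared encoding is one of the
four byte-order-specific names, _any_ of the four BOMs should be
rejected. Just as a sketch of that reading (it reuses has_bom_prefix()
and the utf16_*/utf32_* BOM constants from your patch; the
is_utf16_or_utf32_name() helper is made up for illustration):

/* Hypothetical helper, not in the patch: is the name byte-order-specific? */
static int is_utf16_or_utf32_name(const char *enc)
{
	return !strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE") ||
	       !strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE");
}

int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
{
	/*
	 * For the byte-order-specific names, a BOM of any sort is
	 * prohibited, so check all four known BOMs instead of only the
	 * two matching the declared width.
	 */
	return is_utf16_or_utf32_name(enc) &&
	       (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
		has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)) ||
		has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
		has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)));
}

That still can't catch a width mismatch in a stream that carries no BOM
at all, of course, but it would at least flag the UTF-16BE-declared,
UTF-32BE-encoded case (BOM 00 00 FE FF), which the posted version lets
through.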