On Sun, Feb 25, 2018 at 6:35 AM, Lars Schneider <larsxschneider@xxxxxxxxx> wrote:
>> On 25 Feb 2018, at 04:41, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:
>> Is this interpretation correct? When I read [1], I interpret it as
>> saying that no BOM _of any sort_ should be present when the encoding
>> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
>
> Correct!
>
>> This code, on the other hand, only checks for BOMs corresponding
>> to the declared size (16 or 32 bits).
>
> Hmm. Interesting thought. You are saying that my code won't complain if
> a document declared as UTF-16LE has a UTF-32LE BOM, correct?

Well, not specifically that case, since the UTF-16LE BOM is a prefix of
the UTF-32LE BOM. My observation was more general: [1] seems to say that
there should be _no_ BOM whatsoever if one of UTF-16BE, UTF-16LE,
UTF-32BE, or UTF-32LE is declared.

> I would say this is correct behavior in the context of this function.
> This function assumes that the document is proper UTF-16/UTF-16BE/UTF-16LE
> but is wrongly declared with respect to its BOM in the .gitattributes.
> Would this comment make it more clear to you?
>
> /*
>  * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
>  * BOM must not be used [1]. The same applies to the UTF-32 equivalents.
>  * The function returns true if this rule is violated.
>  *
>  * [1] http://unicode.org/faq/utf_bom.html#bom10
>  */
>
> I think what you are referring to is a different class of error and
> would therefore warrant its own checker function. Would you agree?

I don't understand what different class of error you are referring to.
The FAQ [1] seems pretty clear to me: if one of those declarations is
used explicitly, then there should be _no_ BOM, period. It doesn't say
anything about allowing a BOM for a differently-sized encoding (16 vs
32). (I've put a rough sketch of the kind of check that reading implies
at the end of this message.)

If I squint very hard, I _guess_ I can see how you arrive at the
narrower reading that the restriction applies only to a BOM of the same
size as the declared encoding, though reading it that way doesn't come
easily to me.

>> I suppose the intention of [1] is to detect a mismatch between the
>> declared encoding and how the stream is actually encoded. The check
>> implemented here will fail to detect a mismatch between, say, declared
>> encoding UTF-16BE and actual encoding UTF-32BE.
>
> As stated above, the intention is to detect wrong BOMs! I think we cannot
> detect the "declared as UTF-16BE but actually UTF-32BE" error.
>
> Consider this:
>
>     printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od -c
>     0000000  \0   t  \0   e  \0   s  \0   t
>     0000010
>
> In the first step we "encode" the string to UTF-32BE, and then we
> "decode" it as UTF-16BE. The result is valid although not correct.
> Does this make sense?

I'm probably being dense, but I don't understand what this is trying
to illustrate in relation to has_prohibited_utf_bom().
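
Coming back to the "no BOM whatsoever" reading above: purely for
illustration, here is a rough, untested sketch of the kind of check I
have in mind. The function and helper names are made up for this
message (it is not the code from your patch); it only tries to show
"reject any BOM, 16-bit or 32-bit, when the declared encoding already
names the byte order":

#include <string.h>
#include <strings.h>

static int starts_with_bytes(const char *data, size_t len,
			     const char *sig, size_t siglen)
{
	return len >= siglen && !memcmp(data, sig, siglen);
}

/*
 * Illustrative only: returns 1 if 'enc' names an encoding whose byte
 * order is already explicit (UTF-16BE/LE, UTF-32BE/LE) and 'data'
 * nevertheless begins with any recognizable UTF-16 or UTF-32 BOM.
 */
static int has_any_prohibited_bom(const char *enc,
				  const char *data, size_t len)
{
	if (strcasecmp(enc, "UTF-16BE") && strcasecmp(enc, "UTF-16LE") &&
	    strcasecmp(enc, "UTF-32BE") && strcasecmp(enc, "UTF-32LE"))
		return 0;	/* the "no BOM" rule only applies to these */

	/*
	 * The UTF-16LE BOM (FF FE) is a prefix of the UTF-32LE BOM
	 * (FF FE 00 00), so for a plain yes/no answer the order of
	 * these tests doesn't matter.
	 */
	return starts_with_bytes(data, len, "\x00\x00\xfe\xff", 4) || /* UTF-32BE */
	       starts_with_bytes(data, len, "\xff\xfe\x00\x00", 4) || /* UTF-32LE */
	       starts_with_bytes(data, len, "\xfe\xff", 2) ||         /* UTF-16BE */
	       starts_with_bytes(data, len, "\xff\xfe", 2);           /* UTF-16LE */
}

With something like that, a stream declared UTF-16BE that begins with
00 00 FE FF is rejected for its UTF-32BE BOM, which a size-specific
check would let through; a stream declared UTF-16LE that begins with
FF FE 00 00 happens to be caught either way, since FF FE is a prefix,
which is why I said "not specifically that case" above.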