> On 27 Feb 2018, at 06:17, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:
>
> On Sun, Feb 25, 2018 at 6:35 AM, Lars Schneider
> <larsxschneider@xxxxxxxxx> wrote:
>>> On 25 Feb 2018, at 04:41, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:
>>> Is this interpretation correct? When I read [1], I interpret it as
>>> saying that no BOM _of any sort_ should be present when the encoding
>>> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
>>
>> Correct!
>>
>>> This code, on the other hand, only checks for BOMs corresponding
>>> to the declared size (16 or 32 bits).
>>
>> Hmm. Interesting thought. You are saying that my code won't complain if
>> a document declared as UTF-16LE has a UTF32-LE BOM, correct?
>
> Well, not specifically that case since UTF-16LE BOM is a subset of UTF32-LE BOM.

Correct - bad example on my part!

> My observation was more general in that [1] seems to say that there
> should be _no_ BOM whatsoever if one of UTF-16BE, UTF-16LE, UTF-32BE,
> or UTF-32LE is declared.

You are saying that a document declared as UTF-16LE must not start with
0000feff (UTF-32BE BOM)? I interpreted that situation as a "feff" in the
middle of a file and therefore the BOM should be treated as ZWNBSP as
explained here: http://unicode.org/faq/utf_bom.html#bom6

Plus, if "_no_ BOM whatsoever" is the rule, then wouldn't we need to check
for UTF-1, UTF-7, and UTF-8 BOMs too? I dunno.

>> I would say
>> this is correct behavior in context of this function. This function assumes
>> that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly
>> declared with respect to its BOM in the .gitattributes. Would this
>> comment make it more clear to you?
>> /*
>>  * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
>>  * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
>>  * The function returns true if this rule is violated.
>>  *
>>  * [1] http://unicode.org/faq/utf_bom.html#bom10
>>  */
>> I think what you are referring to is a different class of error and
>> would therefore warrant its own checker function. Would you agree?
>
> I don't understand to what different class of error you refer. The
> FAQ[1] seems pretty clear to me in that if one of those declarations
> is used explicitly, then there should be _no_ BOM, period. It doesn't
> say anything about allowing a BOM for a differently-sized encoding (16
> vs 32).
>
> If I squint very hard, I _guess_ I can see how you interpret [1] with
> the more narrow meaning of the restriction applying only to a BOM of
> the same size as the declared encoding, though reading it that way
> doesn't come easily to me.

For me it is somewhat the other way around :-)
Since I am not sure what is right, I decided to ask the Internet:
https://stackoverflow.com/questions/49038872/is-a-utf-32be-bom-valid-in-an-utf-16le-declared-data-stream

Let's see if someone has a good answer.

- Lars
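
To make the two readings of [1] concrete, here is a minimal sketch in C.
It is not the code from the patch under review: the function names
has_prefix(), has_same_width_bom(), and has_any_bom() are hypothetical,
and the sketch only covers the UTF-16 and UTF-32 BOMs discussed in this
thread (no UTF-1, UTF-7, or UTF-8).

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the buffer starts with the given BOM bytes. */
static int has_prefix(const char *data, size_t len,
		      const char *bom, size_t bom_len)
{
	return len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Narrow reading: only a BOM of the declared width (16 or 32 bits)
 * counts as a violation. Hypothetical helper, not the patch's function.
 */
static int has_same_width_bom(const char *data, size_t len, int bits)
{
	if (bits == 16)
		return has_prefix(data, len, "\xFE\xFF", 2) ||
		       has_prefix(data, len, "\xFF\xFE", 2);
	if (bits == 32)
		return has_prefix(data, len, "\x00\x00\xFE\xFF", 4) ||
		       has_prefix(data, len, "\xFF\xFE\x00\x00", 4);
	return 0;
}

/*
 * Broad reading: any UTF-16 or UTF-32 BOM counts as a violation,
 * regardless of the declared width.
 */
static int has_any_bom(const char *data, size_t len)
{
	return has_same_width_bom(data, len, 16) ||
	       has_same_width_bom(data, len, 32);
}
```

With the 4-byte input 00 00 fe ff declared as UTF-16LE,
has_same_width_bom(data, 4, 16) reports no violation while
has_any_bom(data, 4) reports one, which is the case in question. A
UTF-32LE BOM (ff fe 00 00) is caught by both, because it begins with the
UTF-16LE BOM.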