tboegi@xxxxxx writes:

> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>
> If the endianness is not defined in the encoding name, then let's
> be strict and require a BOM to avoid any encoding confusion. The
> has_missing_utf_bom() function returns true if a required BOM is
> missing.
>
> The Unicode standard instructs to assume big-endian if there is no BOM
> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
> in HTML5 recommends to assume little-endian to "deal with deployed
> content" [3]. Strictly requiring a BOM seems to be the safest option
> for content in Git.

I do not have a strong opinion on encoding such policy-ish behaviour as
our default, but am I alone in finding that "has missing X" is a
confusing name for a helper function? "is missing X" (or "lacks X") is
a bit more understandable, I guess.

> +int has_missing_utf_bom(const char *enc, const char *data, size_t len)
> +{
> +	return (
> +	   !strcmp(enc, "UTF-16") &&
> +	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> +	) || (
> +	   !strcmp(enc, "UTF-32") &&
> +	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> +	);
> +}
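
To make the naming suggestion concrete, here is a minimal standalone sketch of the same check under the name is_missing_utf_bom(). It is an illustration only, not a proposed replacement hunk. The BOM arrays and has_bom_prefix() are not shown in the quoted patch, so their definitions below are assumptions that follow the standard UTF-16/32 byte order marks, and example_checks() is a hypothetical helper that exists only to show the intended behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte order marks. Names mirror the quoted patch; definitions are assumed. */
static const char utf16_be_bom[] = {'\xFE', '\xFF'};
static const char utf16_le_bom[] = {'\xFF', '\xFE'};
static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};

/* Assumed helper: true if data starts with the given BOM bytes. */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Same logic as the quoted has_missing_utf_bom(), under the suggested name:
 * a BOM is required only when the encoding name does not state endianness.
 */
int is_missing_utf_bom(const char *enc, const char *data, size_t len)
{
	if (!strcmp(enc, "UTF-16"))
		return !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
			 has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)));
	if (!strcmp(enc, "UTF-32"))
		return !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
			 has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)));
	return 0;
}

/* Hypothetical usage checks, for illustration only. */
void example_checks(void)
{
	static const char with_bom[] = {'\xFE', '\xFF', '\0', 'a'};
	static const char without_bom[] = {'\0', 'a'};

	assert(!is_missing_utf_bom("UTF-16", with_bom, sizeof(with_bom)));
	assert(is_missing_utf_bom("UTF-16", without_bom, sizeof(without_bom)));
	/* Endianness is part of the name here, so no BOM is required. */
	assert(!is_missing_utf_bom("UTF-16BE", without_bom, sizeof(without_bom)));
}
```

With this name, a call site reads as a plain question about the data, e.g. "if (is_missing_utf_bom(enc, data, len))", which is the readability gain I have in mind.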