Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

Lars Schneider <larsxschneider@xxxxxxxxx> · Tue, 6 Mar 2018 23:39:16 +0100

> On 06 Mar 2018, at 21:50, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> 
> lars.schneider@xxxxxxxxxxxx writes:
> 
>> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +	return (
>> +	   !strcmp(enc, "UTF-16") &&
>> +	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> +	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +	) || (
>> +	   !strcmp(enc, "UTF-32") &&
>> +	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> +	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +	);
>> +}
> 
> These strcmp() calls seem inconsistent with the principle embodied
> by utf8.c::fallback_encoding(), i.e. "be lenient to what we accept",
> and make the interface uneven. I am wondering if we also want to
> complain when the user gave us "utf16" and there is no byte order
> mark in the contents, for example?

Well, if I use stricmp() then I don't need to call xstrdup_toupper()
and clean up its result, as discussed with Eric [1]. Is there a
case-insensitive starts_with() function?

[1] https://public-inbox.org/git/CAPig+cQE0pKs-AMvh4GndyCXBMnx=70jPpDM6K4jJTe-74FecQ@xxxxxxxxxxxxxx/
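
To make the idea concrete, here is a rough sketch of what a more lenient
check could look like. It is only an illustration under assumptions:
skip_utf_prefix() and is_missing_required_utf_bom_lenient() are names made
up for this mail, not existing helpers; the BOM arrays and has_bom_prefix()
are re-declared here as I assume them to be, so that the snippet compiles
on its own; and it relies on POSIX strncasecmp().

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strncasecmp() (POSIX) */

/* Assumed definitions, repeated so the sketch is self-contained. */
static const char utf16_be_bom[] = {'\xFE', '\xFF'};
static const char utf16_le_bom[] = {'\xFF', '\xFE'};
static const char utf32_be_bom[] = {'\x00', '\x00', '\xFE', '\xFF'};
static const char utf32_le_bom[] = {'\xFF', '\xFE', '\x00', '\x00'};

static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
}

/*
 * Hypothetical helper: if enc starts with "utf" in any case, optionally
 * followed by one '-', return a pointer to the rest; otherwise NULL.
 */
static const char *skip_utf_prefix(const char *enc)
{
	if (strncasecmp(enc, "utf", 3))
		return NULL;
	enc += 3;
	if (*enc == '-')
		enc++;
	return enc;
}

/* Hypothetical lenient variant: accepts "UTF-16", "utf16", "Utf-32", ... */
static int is_missing_required_utf_bom_lenient(const char *enc,
					       const char *data, size_t len)
{
	const char *rest = skip_utf_prefix(enc);

	if (!rest)
		return 0;
	if (!strcmp(rest, "16"))
		return !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
			 has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)));
	if (!strcmp(rest, "32"))
		return !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
			 has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)));
	return 0;	/* e.g. "UTF-16LE" or "UTF-8": no BOM required */
}
```

With this, "utf16" and "UTF-16" are treated the same, while names that
carry an explicit byte order (such as "UTF-16LE") are not flagged because
the remainder after the prefix is not exactly "16" or "32".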

>  Also "UTF16" or other spelling
> the platform may support but this code fails to recognise will go
> unchecked.

That is true. However, I would assume all iconv implementations use the
same encoding names for the UTF encodings, no? If so, "UTF16" (without
the hyphen) would never be a valid name. Would you agree?

- Lars