Re: [RFC PATCH 6/6] utf8.c: avoid char overflow

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 09 Jul 2018 09:33:18 -0700

Beat Bolli <dev+git@xxxxxxxxx> writes:

>>> -static const char utf16_be_bom[] = {0xFE, 0xFF};
>>> -static const char utf16_le_bom[] = {0xFF, 0xFE};
>>> -static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
>>> -static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
>>> +static const unsigned char utf16_be_bom[] = {0xFE, 0xFF};
>>> +static const unsigned char utf16_le_bom[] = {0xFF, 0xFE};
>>> +static const unsigned char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
>>> +static const unsigned char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
>>
>> An alternative approach that might be easier to read (and avoids the
>> confusion arising from our use of (signed) chars for strings pretty
>> much everywhere):
>>
>> #define FE ((char)0xfe)
>> #define FF ((char)0xff)
>>
>> ...
>
> I have tried this first (without the macros, though), and thought
> it looked really ugly. That's why I chose this solution. The usage
> is pretty local and close to the function has_bom_prefix().

I found that what you posted was already OK, as has_bom_prefix()
appears only locally in this file and that is the only thing that
cares about these foo_bom[] constants.  Casting the elements in
these arrays to (char) type is also fine and not all that ugly,
I think, and between the two (but without the macro) I have no
strong preference.  I wonder if writing them as '\376' and '\377',
as old timers would, helps the compiler, though.
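
For illustration, the three spellings discussed above could look
roughly like this.  It is only a sketch: the array names and the
starts_with_bom() helper are made up for this example and are not the
code in utf8.c.  It also assumes an 8-bit char, and that the (char)
cast of 0xFE keeps the same bit pattern, which is
implementation-defined but true of the usual platforms.

```c
#include <stddef.h>
#include <string.h>

/* (1) What the patch does: store the bytes as unsigned char. */
static const unsigned char bom_unsigned[] = {0xFE, 0xFF};

/* (2) Keep plain char and cast each element explicitly. */
static const char bom_cast[] = {(char)0xFE, (char)0xFF};

/* (3) Octal escapes in character constants; these fit in a char. */
static const char bom_octal[] = {'\376', '\377'};

/*
 * Hypothetical stand-in for a BOM check.  Taking the BOM as
 * "const void *" lets any of the arrays above be passed without a
 * cast; memcmp() compares the bytes as unsigned char regardless of
 * how the arrays were declared.
 */
static int starts_with_bom(const char *data, size_t len,
			   const void *bom, size_t bom_len)
{
	return data && bom && len >= bom_len && !memcmp(data, bom, bom_len);
}
```

Under those assumptions all three arrays hold the same two bytes, so
the choice among them is about readability and compiler warnings, not
about behavior.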