Re: [RFC PATCH 6/6] utf8.c: avoid char overflow

Beat Bolli <dev+git@xxxxxxxxx> · Mon, 9 Jul 2018 19:56:38 +0200

On 09.07.18 18:33, Junio C Hamano wrote:
> Beat Bolli <dev+git@xxxxxxxxx> writes:
> 
>>>> -static const char utf16_be_bom[] = {0xFE, 0xFF};
>>>> -static const char utf16_le_bom[] = {0xFF, 0xFE};
>>>> -static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
>>>> -static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
>>>> +static const unsigned char utf16_be_bom[] = {0xFE, 0xFF};
>>>> +static const unsigned char utf16_le_bom[] = {0xFF, 0xFE};
>>>> +static const unsigned char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
>>>> +static const unsigned char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
>>>
>>> An alternative approach that might be easier to read (and avoids the
>>> confusion arising from our use of (signed) chars for strings pretty
>>> much
>>> everywhere):
>>>
>>> #define FE ((char)0xfe)
>>> #define FF ((char)0xff)
>>>
>>> ...
>>
>> I have tried this first (without the macros, though), and thought
>> it looked really ugly. That's why I chose this solution. The usage
>> is pretty local and close to function has_bom_prefix().
> 
> I found that what you posted was already OK, as has_bom_prefix()
> appears only locally in this file and that is the only thing that
> cares about these foo_bom[] constants.  Casting the elements in
> these arrays to (char) type is also fine and not all that ugly,
> I think, and between the two (but without the macro) I have no
> strong preference.  I wonder if writing them as '\376' and '\377'
> as old timers would helps the compiler, though.
> 

Yes, it does, as I found out in
https://public-inbox.org/git/e3df2644b59b170e26b2a7c0d3978331@xxxxxxxxx/

But I prefer hex; it's closer to the usual definition of the BOM bytes.

Beat