Re: [RFC PATCH 6/6] utf8.c: avoid char overflow

Beat Bolli <dev+git@xxxxxxxxx> · Mon, 09 Jul 2018 16:48:28 +0200

Hi Dscho

On 09.07.2018 15:14, Johannes Schindelin wrote:
Hi Beat,

On Sun, 8 Jul 2018, Beat Bolli wrote:

In ISO C, char constants must be in the range -128..127. Change the BOM
constants to unsigned char to avoid overflow.

Signed-off-by: Beat Bolli <dev+git@xxxxxxxxx>
---
 utf8.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/utf8.c b/utf8.c
index d55e20c641..833ce00617 100644
--- a/utf8.c
+++ b/utf8.c
@@ -561,15 +561,15 @@ char *reencode_string_len(const char *in, int insz,
 #endif

 static int has_bom_prefix(const char *data, size_t len,
-			  const char *bom, size_t bom_len)
+			  const unsigned char *bom, size_t bom_len)
 {
 	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
 }

-static const char utf16_be_bom[] = {0xFE, 0xFF};
-static const char utf16_le_bom[] = {0xFF, 0xFE};
-static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
-static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+static const unsigned char utf16_be_bom[] = {0xFE, 0xFF};
+static const unsigned char utf16_le_bom[] = {0xFF, 0xFE};
+static const unsigned char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const unsigned char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};

An alternative approach that might be easier to read (and avoids the
confusion arising from our use of (signed) chars for strings pretty much
everywhere):

#define FE ((char)0xfe)
#define FF ((char)0xff)

...

I tried this first (without the macros, though) and thought it looked
really ugly. That's why I chose this solution. The usage is pretty local
and close to the function has_bom_prefix().

Would an explanatory comment help?
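
For illustration only, here is a minimal standalone sketch of what such a
comment might look like. The comment wording is an assumption and not part
of the patch, and only the two UTF-16 constants are shown. It relies on the
fact that memcmp() compares its arguments as unsigned char, so passing a
"const char *" buffer together with unsigned char constants is well defined:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical comment wording, not taken from the patch:
 *
 * The BOM bytes 0xFE and 0xFF do not fit into a signed char, so the
 * constants are declared as unsigned char. memcmp() interprets both
 * buffers as unsigned char, so comparing them against the
 * "const char *" data is fine.
 */
static const unsigned char utf16_be_bom[] = {0xFE, 0xFF};
static const unsigned char utf16_le_bom[] = {0xFF, 0xFE};

static int has_bom_prefix(const char *data, size_t len,
			  const unsigned char *bom, size_t bom_len)
{
	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
}
```

A caller would keep passing ordinary char buffers, for example
has_bom_prefix(buf, buf_len, utf16_be_bom, sizeof(utf16_be_bom)), where
buf and buf_len are placeholder names.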

Beat