Re: [RFC PATCH 6/6] utf8.c: avoid char overflow

Beat Bolli <dev+git@xxxxxxxxx> · Mon, 09 Jul 2018 17:45:05 +0200

On 09.07.2018 16:48, Beat Bolli wrote:
Hi Dscho

On 09.07.2018 15:14, Johannes Schindelin wrote:
Hi Beat,

On Sun, 8 Jul 2018, Beat Bolli wrote:

In ISO C, char constants must be in the range -128..127. Change the BOM
constants to unsigned char to avoid overflow.

Signed-off-by: Beat Bolli <dev+git@xxxxxxxxx>
---
 utf8.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/utf8.c b/utf8.c
index d55e20c641..833ce00617 100644
--- a/utf8.c
+++ b/utf8.c
@@ -561,15 +561,15 @@ char *reencode_string_len(const char *in, int insz,
 #endif

 static int has_bom_prefix(const char *data, size_t len,
-			  const char *bom, size_t bom_len)
+			  const unsigned char *bom, size_t bom_len)
 {
 	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
 }

-static const char utf16_be_bom[] = {0xFE, 0xFF};
-static const char utf16_le_bom[] = {0xFF, 0xFE};
-static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
-static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+static const unsigned char utf16_be_bom[] = {0xFE, 0xFF};
+static const unsigned char utf16_le_bom[] = {0xFF, 0xFE};
+static const unsigned char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const unsigned char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};

An alternative approach that might be easier to read (and avoids the
confusion arising from our use of (signed) chars for strings pretty much
everywhere):

#define FE ((char)0xfe)
#define FF ((char)0xff)

...

I have tried this first (without the macros, though), and thought it
looked really ugly. That's why I chose this solution. The usage is pretty
local and close to the function has_bom_prefix().

Would an explanatory comment help?

I have found an even simpler solution: use proper char literals.

I will put this into v2.
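
For illustration, here is a minimal sketch of why the char-literal form
should be safe. It assumes has_bom_prefix() and the BOM array as in the
patch; the helper starts_with_utf16_be_bom() is hypothetical and only
there to show a call site. '\xFE' already has a value that fits in a
char, whether char is signed or not, and memcmp() compares bytes as
unsigned char, so the stored bytes still match 0xFE 0xFF in the input.

```c
#include <stddef.h>
#include <string.h>

/* Char literals: no out-of-range int constant is converted to char. */
static const char utf16_be_bom[] = {'\xFE', '\xFF'};

/* Same logic as in utf8.c: true if data starts with the given BOM. */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
}

/*
 * Hypothetical helper, not part of utf8.c: takes raw bytes as unsigned
 * char and checks them against the UTF-16BE BOM. memcmp() interprets
 * both buffers as unsigned char, so the signedness of char is irrelevant.
 */
static int starts_with_utf16_be_bom(const unsigned char *raw, size_t len)
{
	return has_bom_prefix((const char *)raw, len,
			      utf16_be_bom, sizeof(utf16_be_bom));
}
```

With this, a buffer starting with the bytes 0xFE 0xFF is reported as
having the BOM, while one starting with 0xFF 0xFE is not.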

Regards,
Beat


diff --git a/utf8.c b/utf8.c
index d55e20c641..982217eec9 100644
--- a/utf8.c
+++ b/utf8.c
@@ -566,10 +566,10 @@ static int has_bom_prefix(const char *data, size_t len,
        return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
 }

-static const char utf16_be_bom[] = {0xFE, 0xFF};
-static const char utf16_le_bom[] = {0xFF, 0xFE};
-static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
-static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+static const char utf16_be_bom[] = {'\xFE', '\xFF'};
+static const char utf16_le_bom[] = {'\xFF', '\xFE'};
+static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
+static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};

 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {