Re: [PATCH] mailinfo: support Unicode scissors

Jeff King <peff@xxxxxxxx> · Mon, 1 Apr 2019 06:11:57 -0400

On Mon, Apr 01, 2019 at 01:09:47AM +0200, SZEDER Gábor wrote:

> On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote:
> > diff --git a/mailinfo.c b/mailinfo.c
> > index b395adbdf2..4ef6cdee85 100644
> > --- a/mailinfo.c
> > +++ b/mailinfo.c
> > @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
> >  			c++;
> >  			continue;
> >  		}
> > +		if (!memcmp(c, "✂", 3)) {
> 
> This character is tiny.  Please add a comment that it's supposed to be
> a Unicode scissors character.

I think it might also be the first raw UTF-8 character in our source,
which is otherwise ASCII. Usually we'd spell out the binary (with a
comment).

I think I agree with Junio's response, tough, that this is probably not
a road we want to go down, unless this micro-format is being actively
used in the wild (I have no idea, but I have never seen it).

> Should we worry about this memcmp() potentially reading past the end
> of the string when 'c' points to the last character?

I also wondered if the existing memcmps for ">8", etc, would have this
problem. They don't, but it's somewhat subtle. They are only 2
characters long, and the outer loop guarantees we have at least 1
character. So at most we will look at the NUL. But obviously a 3-byte
sequence like this may invoke undefined behavior, and the existing
memcmps encourage anybody adding code to do it wrong.

I wonder if it's worth re-writing it like:

diff --git a/mailinfo.c b/mailinfo.c
index b395adbdf2..46b1b2a4a8 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -693,8 +693,8 @@ static int is_scissors_line(const char *line)
 			perforation++;
 			continue;
 		}
-		if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) ||
-		     !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) {
+		if ((starts_with(c, ">8") || starts_with(c, "8<") ||
+		     starts_with(c, ">%") || starts_with(c, "%<"))) {
 			in_perforation = 1;
 			perforation += 2;
 			scissors += 2;

-Peff