Re: [PATCH resend] Do not create commits whose message contains NUL

Jeff King <peff@xxxxxxxx> · Tue, 13 Dec 2011 12:59:32 -0500

On Tue, Dec 13, 2011 at 06:56:08PM +0700, Nguyen Thai Ngoc Duy wrote:

> We assume that the commit log messages are uninterpreted sequences of
> non-NUL bytes (see Documentation/i18n.txt). However the assumption
> does not really stand out and it's quite easy to set an editor to save
> in a NUL-included encoding. Currently we silently cut at the first NUL
> we see.
> 
> Make it more obvious that NUL is not welcome by refusing to create
> such commits. Those who deliberately want to create them can still do
> with hash-object.
> 
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>
> ---
>  This is from UTF-16 in commit message discussion [1] a few months
>  ago. I don't want to resurrect the discussion again. However I think
>  it's a good idea to stop users from shooting themselves in this case,
>  especially at porcelain level.
>
>  There were no comments on this patch previously. So, any comments
>  this time ? Should I drop it?

I think this is a sane thing to do. Having thought about and
experimented a little with utf-16 in the past few months, I really don't
see how you could be disrupting anybody's workflow. utf-16 messages get
butchered so badly already; we are much better off letting the user know
of the problem as soon as possible.

It looks like we already have a check for is_utf8, and this is not
failing that check. I guess because is_utf8 takes a NUL-terminated
buffer, so it simply sees the truncated result (i.e., depending on
endianness, "foo" in utf16 is something like "f\0o\0o\0", so we check
only "f"). We could make is_utf8 take a length parameter to be more
accurate, and then it would catch this.

However, I think that's not quite what we want. We only check is_utf8 if
the encoding field is not set. And really, we want to reject NULs no
matter _which_ encoding they've set, because git simply doesn't handle
them properly.

> diff --git a/commit.c b/commit.c
> index d67b8c7..0775eec 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -855,6 +855,9 @@ int commit_tree(const char *msg, size_t msg_len, unsigned char *tree,

Hmm. My version of commit_tree does not have a "msg_len" parameter, nor
do I have d67b8c7. Is there some refactoring patch this is based on that
I missed?

> +	if (strlen(msg) < msg_len)
> +		die(_("cannot commit with NUL in commit message"));
> +

Two nits:

  1. For some reason, checking strlen(msg) seems a subtle way of looking
     for NULs in a buffer. I would have found:

         if (memchr(msg, '\0', msglen))

     much more obvious. But perhaps it is just me. Certainly not a big
     deal either way.

  2. The error message could be a little friendlier. The likely reason
     for NULs is a bogus encoding setting in the user's editor. We
     already have a nice "your message isn't utf-8" message. Though it
     does suggest setting i18n.commitencoding, which probably _isn't_
     the solution here (since their encoding clearly isn't supported).
     But maybe it would be nicer to say something like:

       error: your commit message contains NUL characters.
       hint: This is often caused by using multibyte encodings such as
       hint: UTF-16. Please check your editor settings.

     We could even go further and detect some common NUL-containing
     encodings, but I don't think it's worth the effort.

> diff --git a/t/t3900/UTF-16.txt b/t/t3900/UTF-16.txt
> new file mode 100644
> index 0000000000000000000000000000000000000000..53296be684253f40964c0604be7fa7ff12e200cb
> GIT binary patch
> literal 32
> mcmezOpWz6@X@-jo=NYasZ~@^#h9rjP3@HpR7}6Nh8Mpw;r3yp<

I was disappointed not to find a secret message. :)

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html