Re: [PATCH 11/15] find multi-byte comment chars in unterminated buffers

Jeff King <peff@xxxxxxxx> · Tue, 12 Mar 2024 04:19:31 -0400

On Fri, Mar 08, 2024 at 04:20:12PM +0000, Phillip Wood wrote:

> On 08/03/2024 15:58, Junio C Hamano wrote:
> > Phillip Wood <phillip.wood123@xxxxxxxxx> writes:
> > 
> > > I agree with your analysis. I do wonder though if we should reject
> > > whitespace and control characters when parsing core.commentChar, it
> > > feels like accepting them is a bug waiting to happen. If
> > > comment_line_char starts with ' ' or '\t' that part will be eaten by
> > > the strspn() above and so starts_with_mem() wont match. Also we will
> > > never match a comment if comment_line_str contains '\n'.
> > 
> > Another thing I was wondering is what we want to do a random
> > byte-sequence that may match from the middle of a multi-byte UTF-8
> > character.
> > 
> > The reason I haven't mentioned these "nonsense input" is because
> > they will at worst only lead to self-denial-of-service to those who
> > are too curious, and will fall into "don't do it then" category.
> 
> We could certainly leave it as-is and tell users they are only hurting
> themselves if they complain when it does not work.

That was mostly my plan. To some degree I think this is orthogonal to my
series. You can already set core.commentChar to space or newline, and
I'm sure the results are not very good. Actually, I guess it is easy to
try:

  git -c core.commentChar=$'\n' commit --allow-empty

treats everything as not-a-comment.

Maybe it's worth forbidding this at the start of the series, and then
carrying it through. I really do think newline is the most special
character here, just because it's obviously going to be meaningful to
all of our line-oriented parsing. So you'll get weird results, as
opposed to broken multibyte characters, where things would still work if
you choose to consistently use them (and arguably we cannot even define
"broken" as the user can use a different encoding).

Likewise, I guess people might complain that their core.commentChar is
NFD and their editor writes out NFC characters or something, and we
don't match. I was hoping we could just punt on that and nobody would
ever notice (certainly I think it is OK to punt for now and somebody who
truly cares can make a utf8_starts_with() or similar).

> > Also, what exactly is the definition of "nonsense" will become can
> > of worms.  I can sympathise if somebody wants to use "#\t" to give
> > themselves a bit more room than usual on the left for visibility,
> > for example, so there might be a case to want whitespace characters.
> 
> That's fair, maybe we could just ban leading whitespace if we do decide to
> restrict core.commentChar

Leading whitespace actually does work, though I think you'd be slightly
insane to use it.

I'm currently using "! COMMENT !" (after using a unicode char for a few
days). It's horribly ugly, but I wanted to see if any bugs cropped up
(and vim's built-in git syntax highlighting colors it correctly ;) ).

-Peff