Re: [PATCH 11/15] find multi-byte comment chars in unterminated buffers

Phillip Wood <phillip.wood123@xxxxxxxxx> · Fri, 8 Mar 2024 16:20:12 +0000

On 08/03/2024 15:58, Junio C Hamano wrote:
Phillip Wood <phillip.wood123@xxxxxxxxx> writes:

I agree with your analysis. I do wonder though if we should reject
whitespace and control characters when parsing core.commentChar, it
feels like accepting them is a bug waiting to happen. If
comment_line_char starts with ' ' or '\t' that part will be eaten by
the strspn() above and so starts_with_mem() wont match. Also we will
never match a comment if comment_line_str contains '\n'.

Another thing I was wondering is what we want to do a random
byte-sequence that may match from the middle of a multi-byte UTF-8
character.

The reason I haven't mentioned these "nonsense input" is because
they will at worst only lead to self-denial-of-service to those who
are too curious, and will fall into "don't do it then" category.

We could certainly leave it as-is and tell users they are only hurting 
themselves if they complain when it does not work.

Also, what exactly is the definition of "nonsense" will become can
of worms.  I can sympathise if somebody wants to use "#\t" to give
themselves a bit more room than usual on the left for visibility,
for example, so there might be a case to want whitespace characters.

That's fair, maybe we could just ban leading whitespace if we do decide 
to restrict core.commentChar

Best Wishes

Phillip

Not sure why lines that start with CR are considered comment lines,
though.

I think it is a lazy way of looking for an empty line ending in CR LF,
it should really be

	|| (bol[0] == '\r' && bol[1] == '\n') ||

My recollection matches your speculation.

IIRC the lazy persono was probably me but I didn't run "git blame".