Re: b4: unicode control characters -- warn or remove?

Konstantin Ryabitsev <konstantin@xxxxxxxxxxxxxxxxxxx> · Mon, 1 Nov 2021 17:02:59 -0400

On Mon, Nov 01, 2021 at 09:49:14PM +0100, Pavel Machek wrote:
> > I think the following would be a sane check:
> > 
> > 1. are there unicode control characters (CCs) present?
> > 2. are there other characters from RTL languages present in the same line?
> > 
> > if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is
> > true, then it's likely worth a warning.
> > 
> > Maybe even relax #2 to just check for unicode characters above a certain
> > barrier where RTL languages live. I think everyone will agree that if there
> > are unicode CCs and no other unicode characters in that same line, it's likely
> > not a legitimate use of control characters.
> 
> If you are worried about malicious patches, then it should be easy for
> attackers to add some RTL characters and escape the check...

Well, the point of this attack was to trick the reviewer into accepting code
that the compiler would treat differently (e.g. something that looked to be
inside a comment block is actually outside of it).

So, if attackers include some actual RTL text, then the reviewer would no
longer be (as easily) tricked because there would be stuff other than just
invisible characters in the line of code.

This actually similar to how we treat unicode domains. Most browsers only
allow unicode domains when the entire domain name consists of unicode
characters. I suggest we take a similar approach.

-K
Attachment:
signature.asc

Description: PGP signature