On Mon, Nov 01, 2021 at 09:49:14PM +0100, Pavel Machek wrote: > > I think the following would be a sane check: > > > > 1. are there unicode control characters (CCs) present? > > 2. are there other characters from RTL languages present in the same line? > > > > if both 1 && 2 are true, this is a legitimate use of Unicode CCs. If only 1 is > > true, then it's likely worth a warning. > > > > Maybe even relax #2 to just check for unicode characters above a certain > > barrier where RTL languages live. I think everyone will agree that if there > > are unicode CCs and no other unicode characters in that same line, it's likely > > not a legitimate use of control characters. > > If you are worried about malicious patches, then it should be easy for > attackers to add some RTL characters and escape the check... Well, the point of this attack was to trick the reviewer into accepting code that the compiler would treat differently (e.g. something that looked to be inside a comment block is actually outside of it). So, if attackers include some actual RTL text, then the reviewer would no longer be (as easily) tricked because there would be stuff other than just invisible characters in the line of code. This actually similar to how we treat unicode domains. Most browsers only allow unicode domains when the entire domain name consists of unicode characters. I suggest we take a similar approach. -K
Attachment:
signature.asc
Description: PGP signature