Re: b4: unicode control characters -- warn or remove?

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Mon, 01 Nov 2021 21:02:34 +0100

On Mon, Nov 01 2021, Eric Wong wrote:

> Konstantin Ryabitsev <konstantin@xxxxxxxxxxxxxxxxxxx> wrote:
>> Hi, all:
>> 
>> Per exhibit a, what should we do in the situation where we discover unicode
>> control characters in an email?
>> 
>> 1. Warn and strip these chars out, because they are extremely unlikely to be
>>    doing anything legitimate in the context of a patch (unless someone is
>>    sending patches for docs actually written in RTL languages)
>> 2. Warn and error out, refusing to produce an mbox
>> 3. Just warn and produce an mbox anyway
>> 
>> I'd normally do #3, but with many people piping things to git-am, I'm not sure
>> if it's the safest choice.
>> 
>> Exhibit a: https://lwn.net/Articles/874546/
>
> +Cc: git@vger
>
> IMHO, defense for this belongs in git-am (which already checks
> things like whitespace).

It checks whitespace because that's something that's commonly a source
of patch corruption. I'm not averse to adding this to core.whitespace,
but trying to catch malicious injected code seems like a rather big
expansion of its scope, particularly since:

    "[...]sending patches for docs actually written in RTL languages[...]"

Or just code? People write comments, and even code, in their native
languages, and not all projects are as anglo-centric as those hosted on
kernel.org.

I haven't checked what the overlap is between solving this issue & i18n
support, but we definitely should not be assuming that git is only used
by kernel.org users & similar, even for something as relatively obscure
as git-am.
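
For illustration only, here is a minimal Python sketch of what the "warn"
and "strip" options from the original question might look like. Nothing
here is taken from b4 or git: the names BIDI_CONTROLS, find_bidi_controls
and strip_bidi_controls are hypothetical, and the character set is just
the bidirectional embedding, override and isolate controls, which is an
assumption about what a tool would want to flag.

```python
import unicodedata

# Assumption: an illustrative set of bidirectional formatting characters
# (embeddings, overrides, isolates). Not a list used by b4 or git.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_controls(text):
    """Return (line, column, character name) for each bidi control found.

    Lines and columns are 1-based; lines are split on "\n" only.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch, "UNKNOWN")))
    return hits


def strip_bidi_controls(text):
    """Return text with every character in BIDI_CONTROLS removed."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


if __name__ == "__main__":
    sample = "ok line\nif admin:\u202e # check\n"
    for line_no, col, name in find_bidi_controls(sample):
        print(f"warning: line {line_no}, column {col}: {name}")
```

Stripping unconditionally would also mangle legitimate RTL text, which is
the concern above, so a sketch like this probably only makes sense behind
an opt-in setting.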