Re: git-am doesn't strip CRLF line endings when the mbox is base64-encoded

George Dunlap <george.dunlap@xxxxxxxxxx> · Mon, 6 Jan 2020 18:10:34 +0000

On 1/6/20 5:07 PM, Junio C Hamano wrote:
> George Dunlap <george.dunlap@xxxxxxxxxx> writes:
> 
>> On 12/18/19 12:15 PM, George Dunlap wrote:
>>> On 12/18/19 11:42 AM, George Dunlap wrote:
>>>> Using git 2.24.0 from Debian testing.
>>>>
>>>> It seems that git-am will strip CRLF endings from mails before applying
>>>> patches when the mail isn't encoded in any way.  It will also decode
>>>> base64-encoded mails.  But it won't strip CRLF endings from
>>>> base64-encoded mails.
>>>>
>>>> Attached are two mbox files for two different recent series.
>>>> plainenc.am applies cleanly with `git am`, while base64enc.am doesn't.
>>>>
>>>> Poking around the man pages, it looks like part of the issue might be
>>>> that the CRLF stripping is done in `git mailsplit`, before the base64
>>>> encoding, rather than after.
>>>
>>> Poking around -- it looks like the CRLF stripping would be better done
>>> in `git mailinfo` after the decoding.
>>
>> Anyone want to take this up?  I mean, I could try to send a patch, but
>> since I've never looked at the git source code before, I'm sure it would
>> take me about 10x as much effort for me to do it as for someone already
>> familiar with the codebase.
> 
> Even before writing a patch, somebody needs to come up with a
> sensible design first.  --[no-]keep-cr is about "because transfer of
> e-mail messages between MTAs and to the receiving MUA is defined in
> terms of CRLF delimited lines per RFC, Git cannot tell if the CRLF
> in the input was meant to be part of the patch (i.e. the diff is
> describing a change between preimage and postimage of a file that
> uses CRLF line endings) or they are cruft added during transit.  By
> default we favor LF endings so we will strip, but we leave an option
> to keep CRs at the end of lines".  
> 
> What you are asking for is quite different, isn't it?  "We know the
> CRLF in the payload is from the original because they were protected
> from getting munged during the transfer by being MIME-encased.
> Please tell Git to preprocess that payload to convert CRLF to LF
> before treating it as a patch".

Actually that's not true. :-)   The RFC specifies CRLF for text sections
regardless of the encoding:

8<----
   The canonical form of any MIME "text" subtype MUST always represent a
   line break as a CRLF sequence.  Similarly, any occurrence of CRLF in
   MIME "text" MUST represent a line break.  Use of CR and LF outside of
   line break sequences is also forbidden.
----->8

[1] https://tools.ietf.org/html/rfc2046#section-4.1.1

Just as for plaintext encoding, we do not, in fact, know that CRLF is in
the original; in the example I included above, I'm confident that CRLF
is *not* in the original, but was added by an MTA afterwards.

As such, at the moment what `mailsplit` does is generate a load of
non-RFC-compliant email messages (since text-plain encoded messages
won't have CRLF, in contravention of the spec), and `mailinfo`
incorrectly interprets base64-encoded sections.  Moving the CR-stripping
from mailsplit to mailinfo would make them both more RFC-compliant.

(Sorry this wasn't in the original report -- I've been doing a lot more
digging since then.)

> So, if you are imagining to change the meaning of --[no-]keep-cr, I
> do not think it will fly (that is why I said that we need a sensible
> design before a patch).
> 
> And by stepping back a bit like so, and once we start viewing this
> as "after receiving a piece of e-mail from MUA (where --[no-]keep-cr
> may affect the outermost CRLF line endings) and unwrapping possible
> MIME-encasing, we can optionally tell Git to pass the payload
> further through a preprocess filter", we'd realize that this does
> not have to be limited to just running dos2unix (you may want to run
> iconv to fix encodings, for example), which would mean that the new
> flag may not just want to be --strip-cr, which is too limiting, but
> rather want to be --filter-message=<how> where <how> could be one of
> the canned preprocess filter (among which your dos2unix may exist)
> or an external script.
> 
> I am not saying that "--filter-message=<how>" must be the "sensible
> design" I mentioned at the beginning of this message---the above is
> to illustrate what kind of thought needs to go in before even the
> first line of the patch gets written.
As I mentioned in another thread, there's already an `applypatch-msg`
hook which can be used to do arbitrary modifications on commit messages
before applying.  Another way to fix this would be to add an
`applypatch-patch` hook, which allowed you to do arbitrary modifications
on the patch before applying.

I certainly think that `applypatch-patch` would be a useful thing to
add.  But since making `mailsplit` and `mailinfo` more RFC-compliant is
both good in itself, and probably easier, I still think that's the best
thing to do first.

 -George