On Mon, Sep 15, 2014 at 11:56:16AM -0700, Junio C Hamano wrote: > Jeff King <peff@xxxxxxxx> writes: > > > It looks like we have a reasonably sane is_from_line() function. So at > > least _we_ will not generally break on reading our own output, except in > > some extreme circumstances (you'd have to come up with something > > contrived like "From me, at 10:30 30 minutes before 11!"). > > > > We can pretty easily reuse this to make the from-line check in mailinfo > > more robust. Here's a replacement for the patch I sent earlier that > > keeps the "magically ignore extra >From headers" behavior but fixes the > > case that started this discussion. > > Why cache.h when this is still only between mail{info,split}.c both > of which do not really deal with any "Git" data? My thought process was basically: 1. It should go into libgit.a, so we don't end up with an accidental libgit->builtin dependency. 2. It is only one function, and we seem to stuff everything into cache.h, so it is probably not a big deal to do it there. I'd be happy with mbox.h, too. > For mailsplit, we are trying to detect mbox boundary various MUAs > would use in their output, and is_from_line() may be appropriate, > but I am not sure if the same logic is appropriate for mailinfo. > What are we trying to protect us against? Those who paste a single > e-mail output from format-patch in full? Do people paste a single > e-mail received to their mailbox from somebody else and do we need > to protect against that, skipping the ">From " thing, while risking > to skip "From me at 10:30 30 minutes..."? > > If we only want to skip ">?From" in pasted format-patch output, we > would want a rule in mailinfo that is tighter than is_from_line() in > mailsplit. Yes, the current is_from_line is much looser than we need here. It must parse lines from an arbitrary writer, and you are right that here we are specifically interested in git header-lines (at least that is the rationale given by your 81c5cf7, and what I was trying to preserve). I mostly just didn't think it mattered much. Despite coming up with a silly false positive, I doubt it would happen in practice (remember that in addition to matching, it must _also_ be in an in-body header block at the start of the body). > By the way, I see ">From " in is_rfc2822_header(), too. Do we have > to worry about this comparison as well? I don't think so. We call that from read_one_header_line, which is used when running through the _real_ rfc2822 header. Such a ">From" line there would not make any sense (we skip "From" at arbitrary points in the header, too, which is kind of weird; our real goal is to skip the initial one, which we could do separately). But it does not hurt to be overly aggressive in eating them there, as they cannot possibly be body lines. We do the same thing after a multipart boundary, so if you had: <regular rfc2822 email headers> Content-Type: multipart/mixed; boundary=foo --foo Content-Type: text/plain your message --foo Content-Type: message/rfc822 From blah blah blah <commit's rfc2822 email headers> --foo-- then I guess it would help. I am not sure that doing so is actually valid (you do not need the mbox splitter inside the multipart, and it is actively wrong and a potential danger to mbox readers to put it there). But if you did put it there, _and_ an mbox writer quoted that line to ">From", I guess we would want to ignore it. So from my reading (and I am not 100% sure that I am not missing something), I think we could be much stricter in our parsing there. But given that it is not causing any problems, and has been that way for quite a while, I am quite hesitant to change it. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html