On Mon, Nov 24, 2014 at 02:50:04AM +0300, Роман Донченко wrote: > The RFC says that they are to be concatenated after decoding (i.e. the > intervening whitespace is ignored). > > I change the sender's name to an all-Cyrillic string in the tests so that > its encoded form goes over the 76 characters in a line limit, forcing > format-patch to split it into multiple encoded words. > > Since I have to modify the regular expression for an encoded word anyway, > I take the opportunity to bring it closer to the spec, most notably > disallowing embedded spaces and making it case-insensitive (thus allowing > the encoding to be specified as both "q" and "Q"). The overall goal makes sense to me. Thanks for working on this. I have a few questions/comments, though. > sub unquote_rfc2047 { > local ($_) = @_; > + > + my $et = qr/[!->@-~]+/; # encoded-text from RFC 2047 > + my $sep = qr/[ \t]+/; > + my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i; The first $et in $encoded_word is actually the charset, which is defined by RFC 2047 as: charset = token ; see section 3 token = 1*<Any CHAR except SPACE, CTLs, and especials> especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / " <"> / "/" / "[" / "]" / "?" / "." / "=" Your regex is a little more liberal. I doubt that it is a big deal in practice (actually, in practice, I suspect [a-zA-Z0-9-] would be fine). But if we are tightening things up in general, it may make sense to do so here (and I notice that is_rfc2047_quoted does a more thorough $token definition, and it probably makes sense for the two functions to be consistent). For your definition of encoded-text, RFC 2047 says: encoded-text = 1*<Any printable ASCII character other than "?" or SPACE> It looks like you pulled the definition of $et from is_rfc2047_quoted, but I am not clear on where that original came from (it is from a3a8262, but that commit message does not explain the regex). Also, I note that we handle 'q'-style encodings here, but not 'b'. I wonder if it is worth adding that in while we are in the area (it is not a big deal if you always send-email git-generated patches, as we never generate it). > + s{$encoded_word(?:$sep$encoded_word)+}{ If I am reading this right, it requires at least two $encoded_words. Should this "+" be a "*"? > + my @words = split $sep, $&; > + foreach (@words) { > + m/$encoded_word/; > + $encoding = $1; > + $_ = $2; > + s/_/ /g; > + s/=([0-9A-F]{2})/chr(hex($1))/eg; In the spirit of your earlier change, should this final regex be case-insensitive? RFC 2047 says only "Upper case should be used for hexadecimal digits "A" through "F." but that does not seem like a "MUST" to me. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html