Junio C Hamano <gitster@xxxxxxxxx> writes: > Kirill Smelkov <kirr@xxxxxxxxxxxxxxxxxxx> writes: > ... >> http://marc.info/?l=git&m=123031899307286&w=2 > > I have not had chance to look at your patch at marc yet, but from the look > of your problem description, I presume you could trigger this with any > utf-8 b-encoded loooooong subject line? Ok, I took a look at it after downloading from the marc archive. > diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c > index e890f7a..d138bc3 100644 > --- a/builtin-mailinfo.c > +++ b/builtin-mailinfo.c > @@ -436,6 +436,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg) > * for now we just trust the data. > */ > c = 0; > + > + /* XXX: the following is needed not to output NUL in > + * the resulting string > + * > + * This seems to be ok, but I'm not 100% sure -- that's > + * why this is an RFC. > + */ > + continue; > } > else > continue; /* garbage */ B encoding (RFC 2045) encodes an octet stream into a sequence of groups of 4 letters from 64-char alphabet, each of which encodes 6-bit, plus zero or more padding char '=' to make the result multiple of 4. * If the length of the payload is a multiple of 3 octets, there is no special handling. Padding char '=' is not produced; * If it is a multiple of 3 octets plus one, the remaining one octet is encoded with two letters, and two more padding char '=' is added; * If it is a multiple of 3 octets plus two, the remaining two octets are encoded with three letters, and one padding char '=' is added. Hence, a "correct" implementation should decode the input as if '=' were the same as 'A' (which encodes 6 bits of 0) til the end, making sure that the padding char '=' appears only at the end of the input, that no char outside the Base64 encoding alphabet appears in the input, and that the length of the entire encoded string is multiple of 4. Finally it would discard either one or two octets (depending on the number of padding chars it saw) from the end of the output. Our decode_b_segment() however emits each octet as it completes, without waiting for the 24-bit group that contains it to complete. When decoding a correctly encoded input, by the time we see a padding '=', all the real payload octets are complete and we would not have any real information still kept in the variable "acc" (accumulator), so ignoring '=' (you do not even need to assign c = 0) like your patch did would work just fine. An alternative would be to count the number of padding at the end and drop the NULs from the output as necessary after the loop but that does not add any value to the current code. Ideally we should validate the encoded string a bit more carefully (see the "correct" implementation about), and warn if a malformed input is found (but probably not reject outright). But as a low-impact fix for the maintenance branches, I think your fix is very good. Side note: I suspect that the existing code was Ok before strbuf conversion as we assumed NUL terminated output buffer. > @@ -513,7 +521,15 @@ static int decode_header_bq(struct strbuf *it) > strbuf_reset(&piecebuf); > rfc2047 = 1; > > - if (in != ep) { > + /* XXX: the follwoing is needed not to output '\n' on every > + * multi-line segment in Subject. > + * > + * I suspect this is not 100% correct, but I'm not a MIME guy > + * -- that's why this is an RFC. > + */ > + > + /* if in does not end with '=?=', we emit it as is */ > + if (in <= (ep-2) && !(ep[-1]=='\n' && ep[-2]=='=')) { > strbuf_add(&outbuf, in, ep - in); > in = ep; > > } I am not a MIME guy either (and mailinfo has a big comment that says we do not really do MIME --- we just pretend to do), but let me give it a try. RFC2046 specifies that an encoded-word ("=?charset?encoding?...?=") may not be more than 75 characters long, and multiple encoded-words, separated by CRLF SPACE can be used to encode more text if needed. It further specifies that an encoded-word can appear next to ordinary text or another encoded-word but it must be separated by linear white space, and says that such linear white space is to be ignored when displaying. Which means that we should be eating the CRLF SPACE we see if we have seen an encoded-word immediately before and we are about to process another encoded-word. Based on the above discussion, here is what I came up with. It passes your test, but I ran out of energy to try breaking it seriously in any other way than just running the existing test suite. We might want to steal some test cases from the "8. Examples" section of RFC2047 and add them to t5100. Thanks. builtin-mailinfo.c | 27 +++++++++++++++++++-------- 1 files changed, 19 insertions(+), 8 deletions(-) diff --git c/builtin-mailinfo.c w/builtin-mailinfo.c index e890f7a..fcb32c9 100644 --- c/builtin-mailinfo.c +++ w/builtin-mailinfo.c @@ -430,13 +430,6 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg) c -= 'a' - 26; else if ('0' <= c && c <= '9') c -= '0' - 52; - else if (c == '=') { - /* padding is almost like (c == 0), except we do - * not output NUL resulting only from it; - * for now we just trust the data. - */ - c = 0; - } else continue; /* garbage */ switch (pos++) { @@ -514,7 +507,25 @@ static int decode_header_bq(struct strbuf *it) rfc2047 = 1; if (in != ep) { - strbuf_add(&outbuf, in, ep - in); + /* + * We are about to process an encoded-word + * that begins at ep, but there is something + * before the encoded word. + */ + char *scan; + for (scan = in; scan < ep; scan++) + if (!isspace(*scan)) + break; + + if (scan != ep || in == it->buf) { + /* + * We should not lose that "something", + * unless we have just processed an + * encoded-word, and there is only LWS + * before the one we are about to process. + */ + strbuf_add(&outbuf, in, ep - in); + } in = ep; } /* E.g. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html