The best place to raise this erratum for formal consideration would be
https://www.rfc-editor.org/errata.php

Lloyd Wood
http://about.me/lloydwood

________________________________________
From: ietf <ietf-bounces@xxxxxxxx> on behalf of A. Rothman <amichai2@xxxxxxxxxxxx>
Sent: Wednesday, 7 October 2015 7:21 PM
To: ietf@xxxxxxxx
Subject: RFC 2152 - UTF-7 clarification

Hi,

I hope this is the right place to discuss RFC 2152 - I couldn't find a
conclusive answer as to where and how one should comment on a specific
Request For Comments (which is a bit unsettling :-) )

I'd like to raise an issue with the UTF-7 decoding process as described
in the RFC with respect to trailing padding bits:

   "Next, the octet stream is encoded by applying the Base64 content
   transfer encoding algorithm as defined in RFC 2045, modified to omit
   the "=" pad character. Instead, when encoding, zero bits are added to
   pad to a Base64 character boundary. When decoding, any bits at the
   end of the Modified Base64 sequence that do not constitute a complete
   16-bit Unicode character are discarded. If such discarded bits are
   non-zero the sequence is ill-formed."

The way I understand this is that after decoding the modified-Base64
data and grouping the resulting octets into 16-bit Unicode characters,
any remaining zero bits at the end (up to 15 bits, theoretically) should
simply be ignored.

I'm not sure why an encoder would want to add extra zero bits at the end
beyond the minimum necessary, but it is arguably allowed to pad 'to *a*
Base64 character boundary', not specifically *the next* boundary.
Perhaps an encoder would use some version of a standard Base64 routine
and then replace the padding '=' characters with 'A' characters (which
are then decoded to all-zero bits). Such an encoding would obviously be
less space-efficient, since it adds unnecessary octets - but it seems
like there are valid reasons to do so.
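For concreteness, the decoding rule quoted above can be sketched in a few
lines of Python. This is only an illustration of the lenient reading - any
number of leftover bits is discarded, and the run is ill-formed only if
those discarded bits are non-zero. The function name and structure are my
own, not anything from the RFC:

```python
import string

# Modified-Base64 alphabet per RFC 2045 (RFC 2152 uses it without '=' padding).
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def decode_run(run):
    """Decode the modified-Base64 run of a UTF-7 shifted sequence
    (the text between '+' and '-') into 16-bit Unicode code units,
    under the lenient reading of the trailing-bits rule."""
    bits, nbits, units = 0, 0, []
    for ch in run:
        bits = (bits << 6) | B64.index(ch)   # append 6 bits per character
        nbits += 6
        if nbits >= 16:                      # a complete 16-bit unit is ready
            nbits -= 16
            units.append((bits >> nbits) & 0xFFFF)
            bits &= (1 << nbits) - 1         # keep only the leftover bits
    if bits:                                 # leftover bits must all be zero
        raise ValueError("ill-formed: non-zero trailing bits")
    return units
```

For example, decode_run("ACE") yields [0x0021] (so "+ACE-" decodes to "!"),
with the two leftover zero bits silently discarded.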
The issue is with the decoding, though, and the reason it came up is
that I've checked various existing UTF-7 decoder implementations and
resources (e.g. iconv, uconv, ICU, jutf7, jcharset, Wikipedia, etc.) and
they seem to disagree about this issue. Some generate an error after a
maximum number of trailing zero bits (the maximum varies between
implementations), and some agree with my interpretation above, where any
leftover partial group of zero bits is discarded and the input is always
valid.

So, since there is such discrepancy in practice in how this is being
interpreted, I submit that the description is not clear enough to make
this unambiguous. Could someone please clarify what is officially
valid/invalid according to the RFC regarding trailing zero bits? Can we
add an erratum that clarifies it in either case?

Finally, if it helps, here are some concrete test cases to consider:

   +A-
   +AA-
   +AAA-
   +AAAA-
   +AAAAA-
   +AAAAAA-
   +AAAAAAA-

Which of these are valid inputs? Which are invalid? How many 0x0000
16-bit characters should each one be decoded into?

Thanks!

Amichai
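Under the lenient reading discussed in the message above, the answer for
each of these test cases follows from simple arithmetic: a run of n
Base64 characters carries 6n bits, of which floor(6n/16) complete 16-bit
characters are kept and the remaining (all-zero) bits are discarded. A
quick sketch of that arithmetic, purely illustrative:

```python
# For each test case "+A...A-" with n 'A' characters (all-zero bits),
# count the complete 16-bit units and the leftover zero bits that the
# lenient reading would discard.
for n in range(1, 8):
    units, leftover = divmod(6 * n, 16)
    print(f"+{'A' * n}-  ->  {units} x 0x0000, {leftover} trailing zero bit(s)")
```

So under that reading all seven inputs are valid, decoding to 0, 0, 1,
1, 1, 2 and 2 NUL characters respectively; the stricter implementations
mentioned above would instead reject some of them for having too many
trailing zero bits.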