In message <DB4PR06MB4573125043060E318DC6A30AD350@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx tlook.com>, l.wood@xxxxxxxxxxxx writes:
> The best place to raise this erratum for formal consideration would be
> https://www.rfc-editor.org/errata.php
>
> Lloyd Wood
> http://about.me/lloydwood
> ________________________________________
> From: ietf <ietf-bounces@xxxxxxxx> on behalf of A. Rothman <amichai2@amichais.net>
> Sent: Wednesday, 7 October 2015 7:21 PM
> To: ietf@xxxxxxxx
> Subject: RFC 2152 - UTF-7 clarification
>
> Hi,
>
> I hope this is the right place to discuss RFC 2152 - I couldn't find a
> conclusive answer as to where and how one should comment on a specific
> Request For Comments (which is a bit unsettling :-) )
>
> I'd like to raise an issue with the UTF-7 decoding process as described
> in the RFC with respect to trailing padding bits:
>
> "Next, the octet stream is encoded by applying the Base64 content
> transfer encoding algorithm as defined in RFC 2045, modified to
> omit the "=" pad character. Instead, when encoding, zero bits are
> added to pad to a Base64 character boundary. When decoding, any
> bits at the end of the Modified Base64 sequence that do not
> constitute a complete 16-bit Unicode character are discarded. If
> such discarded bits are non-zero the sequence is ill-formed."
>
> The way I understand this is that after decoding the modified-base64
> data and grouping the resulting octets into 16-bit Unicode characters,
> any remaining zero bits at the end (up to 15 bits, theoretically) should
> simply be ignored. I'm not sure why an encoder would want to add extra
> zero bits at the end beyond the minimum necessary, but it is arguably
> allowed to pad 'to *a* Base64 character boundary', not specifically *the
> next* boundary. Perhaps an encoder would use some version of a standard
> Base64 routine and then replace the padding '=' characters with 'A'
> characters (which are then decoded to all zero bits).
> Such encoding would obviously be less space-efficient since it adds
> unnecessary octets to the encoding - but it seems like there are valid
> reasons to do so.

It says omit, not replace with 'A'.  In addition, just replacing '='
with 'A' can add a 0x0000 to the end of a Unicode string, as the pad
characters can cover 12 bits, with 4 more bits coming from the second
character of the 4-character base64 word.  e.g.

    AAAAAA==    0x0000, 0x0000 (discard 4 bits)
    AAAAAAAA    0x0000, 0x0000, 0x0000

Though I can see how you could think this was a valid strategy if
you only look at a single base64 word after encoding a single UTF-16
character.

    AAA=        0x0000 (discard 2 bits)
    AAAA        0x0000 (discard 8 bits)

Now you could safely replace all the '=' pad characters with a single
'A', but that would just be a perverse encoder, and if you were to use
such an encoder I wouldn't blame the decoder for rejecting the input.

> The issue is with the decoding though, and the reason it came up is that
> I've checked various existing UTF-7 decoder implementations and
> resources (e.g. iconv, uconv, icu, jutf7, jcharset, Wikipedia, etc.) and
> they seem to disagree about this issue. Some generate an error after
> a maximum number of trailing zero bits (the maximum changes between
> implementations), and some agree with my interpretation above where any
> leftover partial group of zero bits is discarded and it's always valid.
>
> So, since there is such discrepancy in practice in how this is being
> interpreted, I submit that the description is not clear enough to make
> this unambiguous. Could someone please clarify what is officially
> valid/invalid according to the RFC regarding trailing zero bits? Can we
> add errata that clarifies it in either case?
>
> Finally, if it helps, here are some concrete test cases to consider:
>
> +A-          illegal !modified base64
> +AA-         illegal !a multiple of 16 bits in modified base64
> +AAA-        legal   0x0000 (last 2 bits zero)
> +AAAA-       illegal !a multiple of 16 bits in modified base64
> +AAAAA-      illegal !modified base64
> +AAAAAA-     legal   0x0000, 0x0000 (last 4 bits zero)
> +AAAAAAA-    illegal !a multiple of 16 bits in modified base64
>
> Which of these are valid inputs? Which are invalid? How many 0x0000
> 16-bit characters should each one be decoded into?
>
> Thanks!
>
> Amichai
>
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: marka@xxxxxxx
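
For anyone who wants to check the annotations on the test cases above mechanically, here is a minimal Python sketch of that strict reading. It is only an illustration of one interpretation, not normative text from RFC 2152. The function name decode_modified_base64 and the exact error messages are made up for this example; it takes only the characters between the '+' and the '-' and does not handle the "+-" escape for a literal '+'.

```python
# Sketch of the strict reading: the run must be valid modified base64,
# must decode to a whole number of 16-bit units, and any discarded
# trailing bits must be zero. All names here are hypothetical.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def decode_modified_base64(run):
    """Return the list of 16-bit code units encoded by `run`.

    Raises ValueError when the run is rejected under the strict reading.
    """
    bits = 0    # all decoded bits, accumulated as one integer
    nbits = 0   # number of bits accumulated so far
    for ch in run:
        idx = B64.find(ch)
        if idx < 0:
            raise ValueError("character outside the base64 alphabet: %r" % ch)
        bits = (bits << 6) | idx
        nbits += 6

    # A 4-character word with only one character present carries 6 bits,
    # which cannot complete an octet, so no base64 encoder emits it.
    if len(run) % 4 == 1:
        raise ValueError("not modified base64")

    octets = nbits // 8
    if octets % 2:
        raise ValueError("not a multiple of 16 bits in modified base64")

    leftover = nbits - octets * 8          # 0, 2 or 4 bits at this point
    if bits & ((1 << leftover) - 1):
        raise ValueError("discarded padding bits are non-zero")

    value = bits >> leftover
    count = octets // 2
    return [(value >> (16 * i)) & 0xFFFF for i in reversed(range(count))]


if __name__ == "__main__":
    for run in ["A", "AA", "AAA", "AAAA", "AAAAA", "AAAAAA", "AAAAAAA"]:
        try:
            units = decode_modified_base64(run)
            print("+%s-" % run, "legal", ["0x%04X" % u for u in units])
        except ValueError as exc:
            print("+%s-" % run, "illegal", exc)
```

A lenient decoder of the kind Amichai describes would instead drop the two structural checks and only require the leftover bits to be zero; which of the two behaviours the RFC intends is exactly the open question in this thread.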