The best place to raise this erratum for formal consideration would be
https://www.rfc-editor.org/errata.php

Lloyd Wood
http://about.me/lloydwood

________________________________________
From: ietf <ietf-bounces@xxxxxxxx> on behalf of A. Rothman <amichai2@xxxxxxxxxxxx>
Sent: Wednesday, 7 October 2015 7:21 PM
To: ietf@xxxxxxxx
Subject: RFC 2152 - UTF-7 clarification

Hi,

I hope this is the right place to discuss RFC 2152 - I couldn't find a
conclusive answer as to where and how one should comment on a specific
Request For Comments (which is a bit unsettling :-) )

I'd like to raise an issue with the UTF-7 decoding process as described
in the RFC with respect to trailing padding bits:

   "Next, the octet stream is encoded by applying the Base64 content
   transfer encoding algorithm as defined in RFC 2045, modified to omit
   the "=" pad character. Instead, when encoding, zero bits are added to
   pad to a Base64 character boundary. When decoding, any bits at the
   end of the Modified Base64 sequence that do not constitute a complete
   16-bit Unicode character are discarded. If such discarded bits are
   non-zero the sequence is ill-formed."

The way I understand this is that after decoding the modified-Base64
data and grouping the resulting octets into 16-bit Unicode characters,
any remaining zero bits at the end (up to 15 bits, theoretically) should
simply be ignored.

I'm not sure why an encoder would want to add extra zero bits at the end
beyond the minimum necessary, but it is arguably allowed to pad 'to *a*
Base64 character boundary', not specifically *the next* boundary.
Perhaps an encoder would use some version of a standard Base64 routine
and then replace the padding '=' characters with 'A' characters (which
are then decoded to all-zero bits). Such an encoding would obviously be
less space-efficient, since it adds unnecessary octets - but it seems
like there are valid reasons to do so.
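For concreteness, the decoding rule quoted above can be sketched in a few
lines of Python. This is only an illustration of the lenient reading - any
number of leftover bits is discarded, and the run is ill-formed only if
those discarded bits are non-zero. The function name and structure are my
own, not anything from the RFC:

```python
import string

# Modified-Base64 alphabet per RFC 2045 (RFC 2152 uses it without '=' padding).
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def decode_run(run):
    """Decode the modified-Base64 run of a UTF-7 shifted sequence
    (the text between '+' and '-') into 16-bit Unicode code units,
    under the lenient reading of the trailing-bits rule."""
    bits, nbits, units = 0, 0, []
    for ch in run:
        bits = (bits << 6) | B64.index(ch)   # append 6 bits per character
        nbits += 6
        if nbits >= 16:                      # a complete 16-bit unit is ready
            nbits -= 16
            units.append((bits >> nbits) & 0xFFFF)
            bits &= (1 << nbits) - 1         # keep only the leftover bits
    if bits:                                 # leftover bits must all be zero
        raise ValueError("ill-formed: non-zero trailing bits")
    return units
```

For example, decode_run("ACE") yields [0x0021] (so "+ACE-" decodes to "!"),
with the two leftover zero bits silently discarded.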
The issue is with the decoding, though, and the reason it came up is
that I've checked various existing UTF-7 decoder implementations and
resources (e.g. iconv, uconv, ICU, jutf7, jcharset, Wikipedia, etc.) and
they seem to disagree about this issue. Some generate an error after a
maximum number of trailing zero bits (the maximum varies between
implementations), and some agree with my interpretation above, where any
leftover partial group of zero bits is discarded and the input is always
valid.

So, since there is such discrepancy in practice in how this is being
interpreted, I submit that the description is not clear enough to make
this unambiguous. Could someone please clarify what is officially
valid/invalid according to the RFC regarding trailing zero bits? Can we
add an erratum that clarifies it in either case?

Finally, if it helps, here are some concrete test cases to consider:

   +A-
   +AA-
   +AAA-
   +AAAA-
   +AAAAA-
   +AAAAAA-
   +AAAAAAA-

Which of these are valid inputs? Which are invalid? How many 0x0000
16-bit characters should each one be decoded into?

Thanks!

Amichai
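Under the lenient reading discussed in the message above, the answer for
each of these test cases follows from simple arithmetic: a run of n
Base64 characters carries 6n bits, of which floor(6n/16) complete 16-bit
characters are kept and the remaining (all-zero) bits are discarded. A
quick sketch of that arithmetic, purely illustrative:

```python
# For each test case "+A...A-" with n 'A' characters (all-zero bits),
# count the complete 16-bit units and the leftover zero bits that the
# lenient reading would discard.
for n in range(1, 8):
    units, leftover = divmod(6 * n, 16)
    print(f"+{'A' * n}-  ->  {units} x 0x0000, {leftover} trailing zero bit(s)")
```

So under that reading all seven inputs are valid, decoding to 0, 0, 1,
1, 1, 2 and 2 NUL characters respectively; the stricter implementations
mentioned above would instead reject some of them for having too many
trailing zero bits.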