Re: RFC 2152 - UTF-7 clarification

Mark Andrews <marka@xxxxxxx> · Thu, 08 Oct 2015 18:22:51 +1100

In message <DB4PR06MB4573125043060E318DC6A30AD350@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
tlook.com>, l.wood@xxxxxxxxxxxx writes:
> The best place to raise this erratum for formal consideration would be
> https://www.rfc-editor.org/errata.php
> 
> Lloyd Wood
> http://about.me/lloydwood
> ________________________________________
> From: ietf <ietf-bounces@xxxxxxxx> on behalf of A. Rothman <amichai2@amichais.net>
> Sent: Wednesday, 7 October 2015 7:21 PM
> To: ietf@xxxxxxxx
> Subject: RFC 2152 - UTF-7 clarification
> 
> Hi,
> 
> I hope this is the right place to discuss RFC 2152 - I couldn't find a
> conclusive answer as to where and how one should comment on a specific
> Request For Comments (which is a bit unsettling :-) )
> 
> I'd like to raise an issue with the UTF-7 decoding process as described
> in the RFC with respect to trailing padding bits:
> 
> "Next, the octet stream is encoded by applying the Base64 content
> transfer encoding algorithm as defined in RFC 2045, modified to
> omit the "=" pad character. Instead, when encoding, zero bits are
> added to pad to a Base64 character boundary. When decoding, any
> bits at the end of the Modified Base64 sequence that do not
> constitute a complete 16-bit Unicode character are discarded. If
> such discarded bits are non-zero the sequence is ill-formed."
> 
> The way I understand this is that after decoding the modified-base64
> data and grouping the resulting octets into 16-bit Unicode characters,
> any remaining zero bits at the end (up to 15 bits, theoretically) should
> simply be ignored. I'm not sure why an encoder would want to add extra
> zero bits at the end beyond the minimum necessary, but it is arguably
> allowed to pad 'to *a* Base64 character boundary', not specifically *the
> next* boundary. Perhaps an encoder would use some version of a standard
> Base64 routine and then replace the padding '=' characters with 'A'
> characters (which are then decoded to all zero bits). Such encoding
> would obviously be less space-efficient since it adds unnecessary octets
> to the encoding - but it seems like there are valid reasons to do so.

It says omit, not replace with 'A'.  In addition, just replacing
'=' with 'A' can add a 0x0000 to the end of a Unicode string, as the
two pad characters cover 12 bits which, together with the 4 spare bits
from the second character of the 4-character base64 word, make up a
full 16 bits.

e.g.
	AAAAAA== 0x0000, 0x0000 (discard 4 bits)
	AAAAAAAA 0x0000, 0x0000, 0x0000

Though I can see how you could think this was a valid strategy if
you only look at a single base64 word after encoding a single UTF-16
character.

	AAA=	 0x0000	(discard 2 bits)
	AAAA	 0x0000	(discard 8 bits)
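
The two pairs of examples above can be checked with a few lines of
Python.  This is only a sketch: utf16_units is a hypothetical helper,
and it uses the standard base64 module (which expects the '=' pads)
rather than a real UTF-7 codec.

```python
import base64

def utf16_units(b64_text):
    """Decode standard Base64 and return the complete 16-bit big-endian units."""
    raw = base64.b64decode(b64_text)
    whole = len(raw) - len(raw) % 2          # ignore a trailing odd octet, if any
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, whole, 2)]

for text in ("\x00\x00", "\x00"):            # two U+0000 characters, then one
    padded = base64.b64encode(text.encode("utf-16-be")).decode("ascii")
    replaced = padded.replace("=", "A")      # the strategy under discussion
    print(padded, utf16_units(padded))
    print(replaced, utf16_units(replaced))
```

For the two-character input the replaced form decodes to one more
0x0000 than was encoded; for the one-character input the count is
unchanged, which is why the single-word case looks harmless.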

Now you could safely replace all the '=' pad characters with a
single 'A', but that would just be a perverse encoder, and if you
were to use such an encoder I wouldn't blame the decoder for rejecting
the input.
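
The bit arithmetic behind that remark can be sketched as follows.
split_bits and min_chars are hypothetical helper names; the only
assumption is that each Base64 character carries 6 bits and each
character of the string occupies 16 bits.

```python
def split_bits(n_chars):
    """Return (complete 16-bit units, leftover bits) for n_chars Base64 characters."""
    return divmod(6 * n_chars, 16)

def min_chars(n_units):
    """Fewest Base64 characters that can hold n_units 16-bit units."""
    return -(-16 * n_units // 6)              # ceiling division

for n_units in range(1, 7):
    minimal = min_chars(n_units)
    pads = -minimal % 4                       # '=' characters standard Base64 would add
    one_a = minimal + (1 if pads else 0)      # all pads replaced by a single 'A'
    all_a = minimal + pads                    # every pad replaced by its own 'A'
    print(n_units, split_bits(minimal), split_bits(one_a), split_bits(all_a))
```

With a single extra 'A' the leftover never reaches 16 bits, so the
unit count is preserved; with one 'A' per pad it reaches exactly 16
bits whenever two pads were needed.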

> The issue is with the decoding though, and the reason it came up is that
> I've checked various existing UTF-7 decoder implementations and
> resources (e.g. iconv, uconv, icu, jutf7, jcharset, Wikipedia, etc.) and
> they seem to disagree about this issue. Some generate an error after
> a maximum number of trailing zero bits (the maximum changes between
> implementations), and some agree with my interpretation above where any
> leftover partial group of zero bits is discarded and it's always valid.
>
> So, since there is such discrepancy in practice in how this is being
> interpreted, I submit that the description is not clear enough to make
> this unambiguous. Could someone please clarify what is officially
> valid/invalid according to the RFC regarding trailing zero bits? Can we
> add errata that clarifies it in either case?
> 
> Finally, if it helps, here are some concrete test cases to consider:
> 
> +A-          illegal  !modified base64
> +AA-         illegal  !a multiple of 16 bits in modified base64
> +AAA-        legal    0x0000 (last 2 bits zero)
> +AAAA-       illegal  !a multiple of 16 bits in modified base64
> +AAAAA-      illegal  !modified base64
> +AAAAAA-     legal    0x0000, 0x0000 (last 4 bits zero)
> +AAAAAAA-    illegal  !a multiple of 16 bits in modified base64
> 
> Which of these are valid inputs? Which are invalid? How many 0x0000
> 16-bit characters should each one be decoded into?
> 
> Thanks!
> 
> Amichai
> 
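
The annotations on the test cases above follow one strict reading,
which can be sketched in Python as below.  classify is a hypothetical
helper, not taken from any existing decoder, and the rule it encodes
(whole octets, an even number of them, zero discarded bits) is an
assumption about that reading rather than text from the RFC.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def classify(run):
    """Classify the Base64 run found between '+' and '-' (hypothetical helper)."""
    n = len(run)
    if n % 4 == 1:                             # a lone character cannot finish an octet
        return "illegal: not modified base64"
    value = 0
    for ch in run:
        value = (value << 6) | B64.index(ch)   # ValueError on a non-Base64 character
    octets, spare = divmod(6 * n, 8)           # spare is 0, 2 or 4 bits here
    if octets % 2:
        return "illegal: not a multiple of 16 bits"
    if value & ((1 << spare) - 1):
        return "illegal: non-zero discarded bits"
    units = [(value >> (spare + 16 * i)) & 0xFFFF
             for i in reversed(range(octets // 2))]
    return "legal: " + ", ".join(f"0x{u:04X}" for u in units)

for case in ("A", "AA", "AAA", "AAAA", "AAAAA", "AAAAAA", "AAAAAAA"):
    print(f"+{case}-", classify(case))
```

A decoder following the more lenient reading would instead accept any
run whose bits beyond the last complete 16-bit unit are all zero.
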
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: marka@xxxxxxx