Re: RFC 2152 - UTF-7 clarification

"A. Rothman" <amichai2@xxxxxxxxxxxx> · Thu, 8 Oct 2015 14:21:36 +0300

This reply by Mark and the previous one by Viktor both point out
correctly that the example of replacing '=' with 'A' is not a good one.
So let's ignore that part and assume an encoder decides to add padding
for some other unknown reason just because it can (this appears to be
allowed by the RFC with the current wording, as suggested earlier).

Everything else still stands. Specifically, the two replies beautifully
illustrate my point about ambiguity - in their interpretation of the
actual test cases I submitted, one says that all inputs are valid, and
the other says some of them are invalid. That's exactly the problem I
saw when comparing libraries.

So before raising the erratum using the provided link, I thought this
would be the place to discuss what the desired outcome should be - i.e.
which of the two interpretations should be chosen as the correct one in
the erratum.

As a starting point, my suggestion would be that an encoder SHOULD add
the minimal amount of padding necessary, which is likely what encoders
already do, while a decoder MUST accept and discard any amount of zero
padding (less than 16 bits, of course). This is in line with being
lenient on inputs, and it simplifies the decoder by removing an extra
check and its documentation (KISS). It would be nice to add one of the
test cases to the erratum as well, to clarify the expected result.
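
To make the suggestion concrete, here is a minimal Python sketch of such
a lenient decoder for the Modified Base64 run between '+' and '-'. The
names decode_shifted and strict are my own illustrations, not anything
taken from the RFC or from an existing library; strict=True is only my
assumption of how the stricter reading (nothing beyond minimal padding)
could be modelled.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def decode_shifted(seq, strict=False):
    """Decode a Modified Base64 run into a list of 16-bit code units.

    seq    -- the Base64 characters only, e.g. "AAA" for "+AAA-"
    strict -- hypothetical switch: if True, also reject runs that carry
              6 or more leftover bits (a whole surplus Base64 character)
    """
    bits = 0    # accumulator of bits not yet emitted as a code unit
    nbits = 0   # how many bits in the accumulator are valid
    units = []
    for ch in seq:
        value = B64.find(ch)
        if value < 0:
            raise ValueError("not a Modified Base64 character: %r" % ch)
        bits = (bits << 6) | value
        nbits += 6
        if nbits >= 16:
            # A full 16-bit unit is available; emit it, keep the rest.
            nbits -= 16
            units.append((bits >> nbits) & 0xFFFF)
            bits &= (1 << nbits) - 1
    # At most 14 bits can be left over here; they must all be zero.
    if bits != 0:
        raise ValueError("non-zero trailing bits: ill-formed sequence")
    if strict and nbits >= 6:
        raise ValueError("more padding than the minimum necessary")
    return units


if __name__ == "__main__":
    for run in ["A", "AA", "AAA", "AAAA", "AAAAA", "AAAAAA", "AAAAAAA"]:
        print("+%s-" % run, decode_shifted(run))
```

With this sketch all seven test cases from my original message decode
without error in the default lenient mode, while strict=True accepts
only +AAA- and +AAAAAA-.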

Other suggestions are welcome - as long as there is a decision at the
end :-)

Amichai

On 10/08/2015 10:22 AM, Mark Andrews wrote:
> In message <DB4PR06MB4573125043060E318DC6A30AD350@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxtlook.com>,
> l.wood@xxxxxxxxxxxx writes:
>> The best place to raise this erratum for formal consideration would be
>> https://www.rfc-editor.org/errata.php
>>
>> Lloyd Wood
>> http://about.me/lloydwood
>> ________________________________________
>> From: ietf <ietf-bounces@xxxxxxxx> on behalf of A. Rothman <amichai2@amichais.net>
>> Sent: Wednesday, 7 October 2015 7:21 PM
>> To: ietf@xxxxxxxx
>> Subject: RFC 2152 - UTF-7 clarification
>>
>> Hi,
>>
>> I hope this is the right place to discuss RFC 2152 - I couldn't find a
>> conclusive answer as to where and how one should comment on a specific
>> Request For Comments (which is a bit unsettling :-) )
>>
>> I'd like to raise an issue with the UTF-7 decoding process as described
>> in the RFC with respect to trailing padding bits:
>>
>> "Next, the octet stream is encoded by applying the Base64 content
>> transfer encoding algorithm as defined in RFC 2045, modified to
>> omit the "=" pad character. Instead, when encoding, zero bits are
>> added to pad to a Base64 character boundary. When decoding, any
>> bits at the end of the Modified Base64 sequence that do not
>> constitute a complete 16-bit Unicode character are discarded. If
>> such discarded bits are non-zero the sequence is ill-formed."
>>
>> The way I understand this is that after decoding the modified-base64
>> data and grouping the resulting octets into 16-bit Unicode characters,
>> any remaining zero bits at the end (up to 15 bits, theoretically) should
>> simply be ignored. I'm not sure why an encoder would want to add extra
>> zero bits at the end beyond the minimum necessary, but it is arguably
>> allowed to pad 'to *a* Base64 character boundary', not specifically *the
>> next* boundary. Perhaps an encoder would use some version of a standard
>> Base64 routine and then replace the padding '=' characters with 'A'
>> characters (which are then decoded to all zero bits). Such encoding
>> would obviously be less space-efficient since it adds unnecessary octets
>> to the encoding - but it seems like there are valid reasons to do so.
> It says omit, not replace with 'A'.  In addition, just replacing
> '=' with 'A' can add a 0x0000 to the end of a Unicode string, as the
> pad characters can cover 12 bits with 4 bits from the second character
> of the 4 character base64 word.
>
> e.g.
> 	AAAAAA== 0x0000, 0x0000 (discard 4 bits)
> 	AAAAAAAA 0x0000, 0x0000, 0x0000
>
> Though I can see how you could think this was a valid strategy if
> you only look at a single base64 word after encoding a single UTF-16
> character.
>
> 	AAA=	 0x0000	(discard 2 bits)
> 	AAAA	 0x0000	(discard 8 bits)
>
> Now you could safely replace all the '=' pad characters with a
> single 'A', but that would just be a perverse encoder, and if you
> were to use such an encoder I wouldn't blame the decoder for rejecting
> the input.
>
>> The issue is with the decoding though, and the reason it came up is that
>> I've checked various existing UTF-7 decoder implementations and
>> resources (e.g. iconv, uconv, icu, jutf7, jcharset, Wikipedia, etc.) and
>> they seem to disagree about this issue. Some generate an error after
>> a maximum number of trailing zero bits (the maximum changes between
>> implementations), and some agree with my interpretation above where any
>> leftover partial group of zero bits is discarded and it's always valid.
>>
>> So, since there is such discrepancy in practice in how this is being
>> interpreted, I submit that the description is not clear enough to make
>> this unambiguous. Could someone please clarify what is officially
>> valid/invalid according to the RFC regarding trailing zero bits? Can we
>> add an erratum that clarifies it in either case?
>>
>> Finally, if it helps, here are some concrete test cases to consider:
>>
>> +A-          illegal  !modified base64
>> +AA-         illegal  !a multiple of 16 bits in modified base64
>> +AAA-        legal    0x0000 (last 2 bits zero)
>> +AAAA-       illegal  !a multiple of 16 bits in modified base64
>> +AAAAA-      illegal  !modified base64
>> +AAAAAA-     legal    0x0000, 0x0000 (last 4 bits zero)
>> +AAAAAAA-    illegal  !a multiple of 16 bits in modified base64
>>
>> Which of these are valid inputs? Which are invalid? How many 0x0000
>> 16-bit characters should each one be decoded into?
>>
>> Thanks!
>>
>> Amichai
>>