Re: RFC 2152 - UTF-7 clarification

Viktor Dukhovni <ietf-dane@xxxxxxxxxxxx> · Thu, 8 Oct 2015 06:53:01 +0000

On Wed, Oct 07, 2015 at 11:21:49AM +0300, A. Rothman wrote:

> I'd like to raise an issue with the UTF-7 decoding process as described
> in the RFC with respect to trailing padding bits:
> 
> "Next, the octet stream is encoded by applying the Base64 content
> transfer encoding algorithm as defined in RFC 2045, modified to
> omit the "=" pad character. Instead, when encoding, zero bits are
> added to pad to a Base64 character boundary. When decoding, any
> bits at the end of the Modified Base64 sequence that do not
> constitute a complete 16-bit Unicode character are discarded. If
> such discarded bits are non-zero the sequence is ill-formed."

It seems to me that the encoder's behaviour is specified clearly
enough.  Namely, the encoder outputs an unpadded base64 encoding
of the input octet-stream, zero-padded with 0, 2 or 4 bits (an odd
padding length can't happen) so that the total number of bits is
a multiple of 6, allowing each group of 6 bits to be encoded as one
base64 output character.
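
As a rough illustration of the encoder side (a sketch, not text from the
RFC): standard Base64 already zero-pads to a 6-bit boundary before it
appends "=", so stripping the "=" leaves exactly the minimal 0, 2 or 4
padding bits.  The function names below are made up for this example.

```python
import base64

def encode_modified_base64(text):
    """Body of a UTF-7 shifted sequence, without the leading '+' or trailing '-'."""
    octets = text.encode("utf-16-be")  # 16 bits per UTF-16 code unit, big-endian
    # b64encode zero-fills the last 6-bit group, then adds "=" to reach a
    # multiple of 4 characters; dropping "=" keeps only the zero fill.
    return base64.b64encode(octets).decode("ascii").rstrip("=")

def min_padding_bits(code_units):
    """Zero bits needed to bring 16 * code_units up to a multiple of 6."""
    return -(16 * code_units) % 6

print(encode_modified_base64("\u2262\u0391"))
print([min_padding_bits(k) for k in (1, 2, 3)])
```

The minimal padding cycles through 2, 4, 0 as the number of 16-bit units
grows, which is why an odd count never appears.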

The decoder's job is then to reverse this process.  The base64
input produces a stream of 6-bit blocks, which in total yields 16q
+ 2r bits where 0 <= r < 8.  The "q" groups of 16 bits are the
decoded text.  The "2r" extra bits must be zero and are discarded.
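
A minimal decoder along these lines might look as follows.  This is only
a sketch of the reading given above: it handles the base64 body alone
(no "+"/"-" handling), and the names are hypothetical.

```python
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def decode_modified_base64(body):
    """Return (list of 16-bit units, number of discarded padding bits)."""
    bits = 0    # bits collected so far that are not yet part of a full unit
    nbits = 0   # how many bits are held in `bits`
    units = []
    for ch in body:
        bits = (bits << 6) | ALPHABET.index(ch)  # ValueError on a bad character
        nbits += 6
        if nbits >= 16:
            nbits -= 16
            units.append(bits >> nbits)          # top 16 bits form one unit
            bits &= (1 << nbits) - 1             # keep only the leftover bits
    if bits != 0:
        raise ValueError("ill-formed: non-zero trailing bits")
    return units, nbits                          # nbits is the "2r" above

print(decode_modified_base64("ImIDkQ"))
```

Here `nbits` at the end is the 2r of the formula: always even and always
less than 16.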

An encoder should never generate 6 <= 2r <= 14 padding bits, since
0, 2 or 4 is enough.  However, it seems that an encoder can get away
with 6 or 12 "extra" padding bits beyond that minimum, so long as
the total count is less than 16.

> The way I understand this is that after decoding the modified-base64
> data and grouping the resulting octets into 16-bit Unicode characters,
> any remaining zero bits at the end (up to 15 bits, theoretically)

Well, 14, since the bit count is always even.

> should simply be ignored.

Correct, though in practice that count should always be 0, 2 or 4.
Encoders that produce 6, 8, 10, 12 or 14 padding bits are appending
extraneous "A" output octets to the base64 stream.

> I'm not sure why an encoder would want to add extra zero bits at the end
> beyond the minimum necessary, but it is arguably allowed to pad 'to *a*
> Base64 character boundary', not specifically *the next* boundary.

Correct, with 6, 8 or 10 padding bits, it would have been simpler
for the encoder to save one output "A" and emit 0, 2 or 4 padding
bits.  With 12 or 14, save outputting "AA" and emit 0 or 2 padding
bits.  What the extra "A" or "AA" might allow the encoder to do is
to round up the base64 output to a multiple of 4 octets, which
simplifies decoding.  The only time the encoder can't do that is
when the input length in bits is 24q + 2|4|6|8 (for whole 16-bit
characters that means 24q + 8, 1/3 of the time), because this would
require 16 or more padding bits.
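
The arithmetic can be checked with a few lines.  This is a sketch under
the assumption that the input is a whole number of 16-bit units; the
helper names are invented for the example.

```python
def padding_to_quad(code_units):
    """Zero bits needed to reach a multiple of 24 bits (4 base64 characters)."""
    return -(16 * code_units) % 24

def can_round_up(code_units):
    """True if that padding stays below 16 bits, i.e. adds no spurious unit."""
    return padding_to_quad(code_units) < 16

for k in range(1, 7):
    print(k, padding_to_quad(k), can_round_up(k))
```

The padding cycles through 8, 16, 0, so rounding up fails for every
third input length.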

I would not write an encoder that makes the base64 output an exact
multiple of 4 octets 2/3 of the time.  Too much trouble for incomplete
success, but it seems that the specification allows this.

> Perhaps an encoder would use some version of a standard
> Base64 routine and then replace the padding '=' characters with 'A'
> characters (which are then decoded to all zero bits).

This does not work, because the total number of padding bits may
then equal or exceed 16.
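
To make the failure concrete, here is a sketch of that approach using
Python's standard base64 module.  The function names are hypothetical,
and the decoder shown simply drops any incomplete trailing unit without
checking that the dropped bits are zero.

```python
import base64

def encode_equals_to_A(text):
    """Standard Base64 of the UTF-16BE octets, with '=' replaced by 'A'."""
    std = base64.b64encode(text.encode("utf-16-be")).decode("ascii")
    return std.replace("=", "A")

def decode_units(body):
    """Decode a body whose length is a multiple of 4; drop an odd trailing octet."""
    raw = base64.b64decode(body)
    whole = raw[: len(raw) - len(raw) % 2]  # keep complete 16-bit units only
    return whole.decode("utf-16-be")

one = "\u2262"
two = "\u2262\u0391"
print(repr(decode_units(encode_equals_to_A(one))))  # 8 padding bits
print(repr(decode_units(encode_equals_to_A(two))))  # 16 padding bits
```

With one input character the round trip is clean, but with two the "=="
becomes "AA", which adds 16 zero bits and decodes as an extra U+0000.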

> Such encoding
> would obviously be less space-efficient since it adds unnecessary octets
> to the encoding - but it seems like there are valid reasons to do so.

It would also be wrong, because it would not be able to represent
trailing zeros correctly.

> So, since there is such discrepancy in practice in how this is being
> interpreted, I submit that the description is not clear enough to make
> this unambiguous. Could someone please clarify what is officially
> valid/invalid according to the RFC regarding trailing zero bits? Can we
> add errata that clarifies it in either case?

Admittedly, I am just applying logic.  Perhaps the specification
is not as logical as I expect.  If it is logical, then it should
be as noted above.

> Finally, if it helps, here are some concrete test cases to consider:

+A-         : empty + 6 (unnecessary) padding bits
+AA-        : empty + 12 (unnecessary) padding bits
+AAA-       : \U+0000, and 2 (required) padding bits
+AAAA-      : \U+0000, and 8 (6 extra) padding bits
+AAAAA-     : \U+0000, and 14 (12 extra) padding bits
+AAAAAA-    : \U+0000\U+0000, and 4 (required) padding bits
+AAAAAAA-   : \U+0000\U+0000, and 10 (6 extra) padding bits

> Which of these are valid inputs? Which are invalid? How many 0x0000
> 16-bit characters should each one be decoded into?

They are all valid, because any padding bits are zero in all of
them.  They decode to floor(6n/16) == floor(3n/8) 16-bit Unicode
code points, where "n" is the length of the base64 input.
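
For what it's worth, the counts for the seven test cases follow directly
from that formula.  A quick sketch (the helper name is made up):

```python
def units_and_padding(n):
    """For n base64 characters: (complete 16-bit units, leftover padding bits)."""
    return divmod(6 * n, 16)

for n in range(1, 8):
    q, pad = units_and_padding(n)
    print("+" + "A" * n + "-", q, pad)
```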

-- 
	Viktor.