Re: RFC 2152 - UTF-7 clarification

Viktor Dukhovni <ietf-dane@xxxxxxxxxxxx> · Thu, 8 Oct 2015 16:06:00 +0000

On Thu, Oct 08, 2015 at 06:22:51PM +1100, Mark Andrews wrote:

> Though I can see how you could think this was a valid strategy if
> you only look at a single base64 word after encoding a single utf-16
> character.
> 
> 	AAA=	 0x0000	(discard 2 bits)
> 	AAAA	 0x0000	(discard 8 bits)
> 
> Now you could safely replace all the '=' pad characters with a
> single 'A' but that would just be a perverse encoder and if you
> were to use such an encoder I wouldn't blame the decoder for rejecting
> the input.

I don't read Mark's response as saying that non-minimal padding is
*invalid*.  He says the encoder is "perverse", and I agree that
the encoder would be better off not generating excess padding.  

He further says that he would not be surprised if some decoders
rejected non-minimally padded input, and frankly I would not be
surprised either, but that does not make the input invalid.  The
specification says that up to 14 (< 16) bits of zero padding are
to be discarded by decoders; it does not limit the discard bit
count to 4 (< 6).
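
To make that reading concrete, here is a rough Python sketch of a
decoder for the base64 part of one shifted sequence (the text between
'+' and '-').  The names decode_shifted and strict_minimal are
illustrative assumptions, not anything defined by RFC 2152.  The
lenient default discards any all-zero remainder shorter than 16 bits;
the strict mode shows what a decoder that insists on minimal padding
would reject.

```python
# Modified base64 alphabet (no '=' padding is used inside UTF-7).
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def decode_shifted(body, strict_minimal=False):
    """Decode the base64 body of one UTF-7 shifted sequence.

    Returns (utf16_code_units, discarded_bit_count).
    Hypothetical helper for illustration only.
    """
    acc = 0      # bit accumulator
    nbits = 0    # number of not-yet-consumed bits held in acc
    units = []
    for ch in body:
        acc = (acc << 6) | B64.index(ch)  # ValueError if ch is not base64
        nbits += 6
        if nbits >= 16:
            nbits -= 16
            units.append((acc >> nbits) & 0xFFFF)
            acc &= (1 << nbits) - 1       # keep only the leftover bits

    # Whatever remains is padding: fewer than 16 bits, and it must be zero.
    if acc != 0:
        raise ValueError("non-zero padding bits")
    # Assumption: a strict decoder treats 6 or more leftover bits as excess.
    if strict_minimal and nbits >= 6:
        raise ValueError("non-minimal padding (%d bits)" % nbits)
    return units, nbits


print(decode_shifted("AAA"))   # ([0], 2)
print(decode_shifted("AAAA"))  # ([0], 8)
```

With the default setting both of Mark's examples decode to the single
code unit 0x0000; with strict_minimal=True the second one raises
ValueError, which is the divergence between libraries under discussion.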

There are lots of lazy and fragile implementations of standards
out there, so encoders need to avoid generating non-mainstream
output if they want most decoders to handle the result.

On Thu, Oct 08, 2015 at 02:21:36PM +0300, A. Rothman wrote:

> Everything else still stands. Specifically, the two replies beautifully
> illustrate my point about ambiguousness - in their interpretation of the
> actual test cases I submitted, one says that all inputs are valid, and
> the other says some of them are invalid. That's exactly the problem I
> saw when comparing libraries.

Perhaps Mark really does consider 6 to 14 bits of padding as
"invalid" (not just "perverse").  If so, then indeed the specification
is open to multiple interpretations.  As I see it, though, Mark and I
are so far on the same page.

> As a starting point, my suggestion would be that an encoder SHOULD add
> the minimal amount of padding necessary, which is likely what encoders
> already do, while a decoder MUST accept and discard any amount of zero
> padding (less than 16 bits of course), in line with being more lenient
> on inputs, and simplifying/micro-optimizing the decoder by removing an
> extra check+documentation and applying KISS. It would be nice to add one
> of the test cases in the errata as well, to clarify the expected result.

The only thing "missing" from the specification is advice (or a
requirement) to make the padding "minimal".  That is, to pad only
to the *closest* base64 boundary (i.e. the next multiple of 6 bits).
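
As an illustration of "minimal", a short Python sketch of an encoder
that pads only up to the next 6-bit boundary.  The function names
minimal_pad_bits and encode_shifted are made up for this example, and
it produces only the base64 body, without the surrounding '+' and '-'.

```python
# Modified base64 alphabet (no '=' padding is used inside UTF-7).
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def minimal_pad_bits(n_units):
    """Zero bits needed to bring n_units 16-bit code units to a 6-bit boundary."""
    return (-16 * n_units) % 6   # always 0, 2 or 4


def encode_shifted(text):
    """Return the minimally padded base64 body for text (illustrative only)."""
    data = text.encode("utf-16-be")        # big-endian 16-bit code units
    nbits = 8 * len(data)
    pad = (-nbits) % 6                     # same value as minimal_pad_bits()
    acc = int.from_bytes(data, "big") << pad
    total = nbits + pad
    # Emit 6 bits at a time, most significant group first.
    return "".join(B64[(acc >> shift) & 0x3F]
                   for shift in range(total - 6, -1, -6))


print(encode_shifted("\x00"))                      # AAA
print(encode_shifted("\u00a3"))                    # AKM
print([minimal_pad_bits(n) for n in (1, 2, 3)])    # [2, 4, 0]
```

An encoder built this way never emits more than 4 padding bits, so its
output is accepted by lenient and strict decoders alike.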

-- 
	Viktor.