Re: RFC 2152 - UTF-7 clarification

Just in case anyone missed it (I almost did): Mark added his own
detailed comments on the test cases, but they were buried inside a long
quote from my original email and may have gone unnoticed. To recap, here
are the two interpretations:

+A-             empty + 6 (unnecessary) padding bits
+AA-            empty + 12 (unnecessary) padding bits
+AAA-           U+0000, and 2 (required) padding bits
+AAAA-          U+0000, and 8 (6 extra) padding bits
+AAAAA-         U+0000, and 14 (12 extra) padding bits
+AAAAAA-        U+0000 U+0000, and 4 (required) padding bits
+AAAAAAA-       U+0000 U+0000, and 10 (6 extra) padding bits


+A-             illegal  (not modified base64)
+AA-            illegal  (not a multiple of 16 bits in modified base64)
+AAA-           legal    0x0000 (last 2 bits zero)
+AAAA-          illegal  (not a multiple of 16 bits in modified base64)
+AAAAA-         illegal  (not modified base64)
+AAAAAA-        legal    0x0000, 0x0000 (last 4 bits zero)
+AAAAAAA-       illegal  (not a multiple of 16 bits in modified base64)
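
For concreteness, here is a rough Python sketch of the first (lenient)
interpretation - my own illustration, not code from any of the libraries
I compared. It decodes the modified base64 run between '+' and '-',
keeps whole 16-bit units, and discards whatever bits remain as long as
they are zero:

    import string

    # Modified base64 alphabet from RFC 2152 (no '=' padding characters).
    B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

    def decode_lenient(run):
        # Concatenate 6 bits per base64 character.
        bits = "".join(format(B64.index(ch), "06b") for ch in run)
        units, extra = divmod(len(bits), 16)
        if "1" in bits[units * 16:]:
            raise ValueError("non-zero padding bits")
        # Keep the whole 16-bit units; silently drop up to 15 zero bits.
        return [int(bits[i * 16:(i + 1) * 16], 2) for i in range(units)]

    for run in ["A", "AA", "AAA", "AAAA", "AAAAA", "AAAAAA", "AAAAAAA"]:
        print("+%s-" % run, decode_lenient(run))

Under the second (strict) interpretation the decoder would additionally
reject any run whose leftover bit count is 6 or more (extra >= 6 above),
which rules out exactly the five "illegal" cases in the second table.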


Does anyone else want to vote or comment on the two interpretations above?

On 10/08/2015 07:06 PM, Viktor Dukhovni wrote:
> On Thu, Oct 08, 2015 at 06:22:51PM +1100, Mark Andrews wrote:
>
>> Though I can see how you could think this was a valid strategy if
>> you only look at a single base64 word after encoding a single utf-16
>> character.
>>
>> 	AAA=	 0x0000	(discard 2 bits)
>> 	AAAA	 0x0000	(discard 8 bits)
>>
>> Now you could safely replace all the '=' pad characters with a
>> single 'A', but that would just be a perverse encoder, and if you
>> were to use such an encoder I wouldn't blame the decoder for
>> rejecting the input.
> I don't read Mark's response as saying that non-minimal padding is
> *invalid*.  He says the encoder is "perverse", and I agree that
> the encoder would be better off not generating excess padding.  
>
> He further says that he would not be surprised if some decoders
> rejected non-minimally padded input, and frankly I would also not
> be surprised, but that does not make the input invalid.  The
> specification says that up to 14 (< 16) bits of zero padding are to
> be discarded by decoders; it does not limit the discard bit count
> to 4 (< 6).
>
> There are lots of lazy and fragile implementations of standards
> out there, so encoders need to try to avoid generating non-mainstream
> outputs if they want most decoders to handle the result.
>
> On Thu, Oct 08, 2015 at 02:21:36PM +0300, A. Rothman wrote:
>
>> Everything else still stands. Specifically, the two replies beautifully
>> illustrate my point about ambiguity - in their interpretation of the
>> actual test cases I submitted, one says that all inputs are valid, and
>> the other says some of them are invalid. That's exactly the problem I
>> saw when comparing libraries.
> Perhaps Mark really does consider 8 to 14 bits of padding as
> "invalid" (not just "perverse").  If so, then indeed the specification
> is open to multiple interpretations.  As I see it, so far Mark and I
> are on the same page.
>
>> As a starting point, my suggestion would be that an encoder SHOULD add
>> the minimal amount of padding necessary, which is likely what encoders
>> already do, while a decoder MUST accept and discard any amount of zero
>> padding (less than 16 bits, of course), in line with being more lenient
>> on inputs, and with simplifying/micro-optimizing the decoder by removing
>> an extra check and its documentation (KISS). It would also be nice to
>> add one of the test cases to the errata, to clarify the expected result.
> The only thing "missing" from the specification is advice (or a
> requirement) to make the padding "minimal": that is, to pad only
> to the *closest* base64 boundary (i.e. a multiple of 6 bits).
>
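
As for the "closest boundary" rule: it reduces to simple arithmetic.
n UTF-16 code units occupy 16*n bits, so the minimal run length is
ceil(16*n / 6) base64 characters, and the minimal pad is always 0, 2 or
4 bits. A small sketch (again my own illustration, with Mark's strict
reading expressed as a length check) showing that minimal padding
always lands on a strictly legal run length:

    import math

    def minimal_run_length(n_units):
        # n UTF-16 units = 16*n bits, padded up to the closest 6-bit boundary.
        return math.ceil(16 * n_units / 6)

    def strict_run_is_legal(run_length):
        # Strict reading: after removing whole 16-bit units, the leftover
        # (6*len mod 16) must be smaller than one base64 character (< 6 bits).
        return (6 * run_length) % 16 < 6

    for n in range(1, 5):
        length = minimal_run_length(n)
        assert strict_run_is_legal(length)
        print(n, "unit(s) ->", length, "chars,", (6 * length) % 16, "pad bits")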





