Re: RFC 2152 - UTF-7 clarification

"A. Rothman" <amichai2@xxxxxxxxxxxx> · Wed, 14 Oct 2015 12:27:30 +0300

On 10/14/2015 11:30 AM, Mark Andrews wrote:
> In message <168699907.511444808763309.JavaMail.root@shefa>, "A. Rothman" writes:
>> I'm not sure what the debate was - are you referring to having multiple
>> encoded representations of the same unicode character sequence? If so,
>> then the question of padding is moot since it's explicitly allowed to
>> have a whole lot of different representations in the UTF-7 encoding
>> (with or without an explicit trailing '-' to end a shift sequence, the
>> explicit optional set that may or may not be represented in a shift
>> sequence, any character of the direct set that may also be in a shift
>> sequence, whether to combine consecutive shift sequences into one or
>> not, etc.). There's an exponential number of valid encodings for a
>> character sequence.
>>
>> Requiring the encoder to always use the minimal possible encoding is
>> quite a big change to the spec. Requiring a decoder to reject
>> non-minimal encodings would increase decoder complexity significantly.
> Absolute garbage unless you think getting the length of a
> sequence is hard.
>
> You can decode the shifted sequence and if the byte count is not a
> multiple of 2 or is zero then it is non-minimal.
>
> If you don't want to decode the entire shift sequence you can do
> the analysis on the shifted characters.  If the length of the shifted
> sequence * 3 / 4 is not even or zero it is non-minimal.
>
> A         1 * 3 / 4 -> 0
> AA        2 * 3 / 4 -> 1
> AAA       3 * 3 / 4 -> 2  (minimal) (2 zero bits)
> AAAA      4 * 3 / 4 -> 3
> AAAAA     5 * 3 / 4 -> 3
> AAAAAA    6 * 3 / 4 -> 4  (minimal) (4 zero bits)
> AAAAAAA   7 * 3 / 4 -> 5
> AAAAAAAA  8 * 3 / 4 -> 6  (minimal)
>
> The (length * 3 / 4) % 6 / 2 defines what values are legal in the
> last shifted character.
>
> 0 all base64 characters are legal
> 1 last 2 bits are zero  (base64[c - base64[0]] & 0x3) == 0
> 2 last 4 bits are zero  (base64[c - base64[0]] & 0xf) == 0
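
For reference, the per-shift-sequence check described above can be
sketched roughly as follows. This is only my reading of it, not code
from anyone's implementation: the name shift_is_minimal is made up, the
input is assumed to be just the characters between '+' and '-', and it
counts bits directly (6 per character) instead of going through the
byte count.

```python
import string

# Modified base64 alphabet (standard base64 alphabet, no '=' padding).
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def shift_is_minimal(body):
    """Hypothetical helper: True if a shift-sequence body carries only the
    padding bits it needs, and those padding bits are all zero."""
    if not body or any(c not in B64 for c in body):
        return False
    # Bits left over after the last complete 16-bit code unit.
    leftover = (6 * len(body)) % 16
    # 6 or more leftover bits means the last character contributed nothing
    # to any code unit, so the sequence is longer than necessary.
    if leftover >= 6:
        return False
    last = B64.index(body[-1])
    return last & ((1 << leftover) - 1) == 0
```

For "A" * n with n from 1 to 8 this accepts only n = 3, 6 and 8, which
agrees with the table above.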

All well and good, but you seem to be talking about modified base64
shift sequences. I'm talking about full UTF-7 encoded strings, to which
the multiple-representations security consideration is relevant.

The way I understand rules 1-2, any character in the D-set or O-set
(e.g. ASCII alphanumerics) can be validly encoded either directly as an
ASCII byte, or as a modified base64 shift sequence between '+' and '-'
literals. IIRC there are no rules as to if/when adjacent shift sequences
must be combined into a single shift sequence - it's all valid.

So that's 2 ways to encode every single alphanumeric character, for 2^n
valid UTF-7 encodings of the same original Unicode character sequence.
And that's not counting the other variations (optional '-' in some
cases, combining adjacent shift sequences, etc.).
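
To make the count concrete, here is a rough Python sketch. The helper
names shifted and all_encodings are made up for illustration, and it
only covers the direct-versus-shifted choice for alphanumerics, with one
character per shift sequence. Python's built-in utf-7 codec is used only
to confirm that every variant decodes to the same text.

```python
import base64
from itertools import product

def shifted(ch):
    """Encode one character as its own '+...-' shift sequence
    (UTF-16BE, then base64 with the '=' padding stripped)."""
    b64 = base64.b64encode(ch.encode("utf-16-be")).decode("ascii")
    return "+" + b64.rstrip("=") + "-"

def all_encodings(text):
    """Every mix of direct and per-character shifted forms.
    Assumes text contains only ASCII alphanumerics."""
    choices = [(ch, shifted(ch)) for ch in text]
    return ["".join(combo) for combo in product(*choices)]

variants = all_encodings("abc")
# Decode each variant and collect the distinct results.
decoded = {v.encode("ascii").decode("utf-7") for v in variants}
print(len(variants), decoded)
```

For "abc" that is 2^3 = 8 distinct byte strings, ranging from "abc" to
"+AGE-+AGI-+AGM-", all of which decode to the same three characters.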

I don't see how the length of either the decoded string or the encoded
representation can immediately tell you whether it's minimal. While
there are some shortcuts, I'm guessing you'd have to decode the full
UTF-7 string and then analyse the characters in a way that is not
unlike re-encoding it with a minimal encoder before you can find out
whether it was already minimal. I may be wrong.
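
Something like the following is what I have in mind, as a sketch only.
The name is_canonical is hypothetical, and "canonical" here simply means
byte-identical to what Python's own utf-7 encoder emits, which is one
possible choice of "minimal" and not something RFC 2152 defines.

```python
def is_canonical(data):
    """Hypothetical check: decode the whole UTF-7 string, re-encode it,
    and compare the result with the original bytes."""
    try:
        text = data.decode("utf-7")
    except UnicodeDecodeError:
        # Not decodable at all, so certainly not canonical.
        return False
    return text.encode("utf-7") == data

print(is_canonical(b"abc"))       # direct characters only
print(is_canonical(b"+AGE-bc"))   # 'a' hidden inside a shift sequence
```

The point is that the check needs a full decode plus what amounts to a
second encoding pass; nothing about the length alone distinguishes the
two inputs above.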

>
>> Changing the behavior regarding padding bits is meaningless without
>> these two additional requirements.
>>
>> That said, it would certainly be useful to add a warning in the security
>> considerations section about such issues, and recommend that all data be
>> decoded first before performing any validity checks, comparisons, etc.
>> (as detailed in RFC 3629 for UTF-8).
>>
>> Is this what you meant?
>>
>> On 10/14/2015 09:44 AM, Harald Alvestrand wrote:
>>> On 10/13/2015 11:11 PM, A. Rothman wrote:
>>>> Ok, thanks for your analysis and for looking into this (Mark as well).
>>>>
>>>> I shall change my decoder implementation to the lenient interpretation,
>>>> adjust my unit tests, and hope it is considered RFC-compliant by
>>>> everyone :-)
>>> Note that this is a reprise of the UTF-8 "overlong encoding" debate,
>>> where we ended up banning overlong encodings because of the security
>>> issues they posed (see the UTF-8 RFC for more details on the security
>>> issues found).
>>>
>>>> Amichai
>>>>
>>>> On 10/09/2015 08:08 AM, Viktor Dukhovni wrote:
>>>>> On Thu, Oct 08, 2015 at 09:40:25PM +0300, A. Rothman wrote:
>>>>>
>>>>>> Just in case someone missed it (I almost did): Mark added his own
>>>>>> detailed comments on the test cases, but they got buried within a long
>>>>>> quote from my original email so may have gone unnoticed. To recap, here
>>>>>> are the two interpretations:
>>>>>>
>>>>>> +A-             empty + 6 (unnecessary) padding bits
>>>>>> +AA-            empty + 12 (unnecessary) padding bits
>>>>>> +AAA-           \U+0000, and 2 (required) padding bits
>>>>>> +AAAA-          \U+0000, and 8 (6 extra) padding bits
>>>>>> +AAAAA-         \U+0000, and 14 (12 extra) padding bits
>>>>>> +AAAAAA-        \U+0000\U+0000, and 4 (required) padding bits
>>>>>> +AAAAAAA-       \U+0000\U+0000, and 10 (6 extra) padding bits
>>>>>>
>>>>>>
>>>>>> +A-             illegal !modified base64
>>>>>> +AA-            illegal !a multiple of 16 bits in modified base64
>>>>>> +AAA-           legal   0x0000 (last 2 bits zero)
>>>>>> +AAAA-          illegal !a multiple of 16 bits in modified base64
>>>>>> +AAAAA-         illegal !modified base64
>>>>>> +AAAAAA-        legal   0x0000, 0x0000 (last 4 bits zero)
>>>>>> +AAAAAAA-       illegal !a multiple of 16 bits in modified base64
>>>>>>
>>>>>>
>>>>>> Does anyone else want to vote or comment on the two interpretations above?
>>>>> Thanks for pointing this out more clearly.  Yes, they disagree.
>>>>> However, the manner in which they disagree is rather simple.
>>>>>
>>>>> They agree in all the cases where the padding is *minimal*.
>>>>>
>>>>> The first variant always tolerates non-minimal padding allowing
>>>>> anything less than 16 bits per the specification.  The second
>>>>> variant never tolerates non-minimal padding, because there's no
>>>>> need to produce it.
>>>>>
>>>>> It is clear that clients should produce minimal padding, and we
>>>>> seem to disagree on whether to apply Postel's principle to the
>>>>> decoder or not.
>>>>>
>>>>> This is not a major disagreement; such differences of interpretation
>>>>> are endemic whether the standard is clear or not.  Many implementors
>>>>> are lazy, and stop writing code when the expected cases work.
>>>>>
>>>>> While this is no excuse for ambiguous specifications, in this case
>>>>> I don't think a revision is warranted.  Encoders that generate
>>>>> sensibly minimal padding will not run into any friction with
>>>>> non-broken decoders.  Encoders that get creative might find that
>>>>> some decoders object whether the standard allows their creativity
>>>>> or not.
>>>>>
>>