Re: RFC 2152 - UTF-7 clarification

Mark Andrews <marka@xxxxxxx> · Wed, 14 Oct 2015 19:30:56 +1100

In message <168699907.511444808763309.JavaMail.root@shefa>, "A. Rothman" writes:
> 
> I'm not sure what the debate was - are you referring to having multiple
> encoded representations of the same Unicode character sequence? If so,
> then the question of padding is moot since it's explicitly allowed to
> have a whole lot of different representations in the UTF-7 encoding
> (with or without an explicit trailing '-' to end a shift sequence, the
> explicit optional set that may or may not be represented in a shift
> sequence, any character of the direct set that may also be in a shift
> sequence, whether to combine consecutive shift sequences into one or
> not, etc.). There's an exponential number of valid encodings for a
> character sequence.
> 
> Requiring the encoder to always use the minimal possible encoding is
> quite a big change to the spec. Requiring a decoder to reject
> non-minimal encodings would increase decoder complexity significantly.

Absolute garbage unless you think getting the length of a
sequence is hard.

You can decode the shifted sequence, and if the byte count is not a
multiple of 2 or is zero then it is non-minimal.
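
As a rough sketch of that decode-and-count check (Python; the helper
names decode_shifted and byte_count_ok are hypothetical, not taken from
RFC 2152, and the input is assumed to be only the characters between
the '+' and the optional '-'):

```python
import string

# Standard base64 alphabet, which modified base64 also uses (without '=' padding).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
VALUE = {ch: i for i, ch in enumerate(ALPHABET)}


def decode_shifted(shifted: str) -> bytes:
    """Decode shifted characters into whole bytes, dropping leftover bits."""
    bits = 0    # bit accumulator
    nbits = 0   # number of valid bits currently held in the accumulator
    out = bytearray()
    for ch in shifted:
        bits = (bits << 6) | VALUE[ch]
        nbits += 6
        if nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
            bits &= (1 << nbits) - 1  # keep only the bits not yet emitted
    return bytes(out)


def byte_count_ok(shifted: str) -> bool:
    """True when the decoded byte count is non-zero and a multiple of 2."""
    n = len(decode_shifted(shifted))
    return n != 0 and n % 2 == 0


for seq in ("A", "AA", "AAA", "AAAA", "AAAAA", "AAAAAA"):
    print(seq, len(decode_shifted(seq)), byte_count_ok(seq))
```

Each pair of decoded bytes is one UTF-16 code unit, which is why an odd
or zero byte count signals stray bits.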

If you don't want to decode the entire shift sequence you can do
the analysis on the shifted characters.  If the length of the shifted
sequence * 3 / 4 (integer division) is odd or zero, it is non-minimal.

A           1 * 3 / 4 -> 0
AA          2 * 3 / 4 -> 1
AAA         3 * 3 / 4 -> 2  (minimal) (2 zero bits)
AAAA        4 * 3 / 4 -> 3
AAAAA       5 * 3 / 4 -> 3
AAAAAA      6 * 3 / 4 -> 4  (minimal) (4 zero bits)
AAAAAAA     7 * 3 / 4 -> 5
AAAAAAAA    8 * 3 / 4 -> 6  (minimal)
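
The table can be reproduced with plain integer arithmetic.  This is
only a sketch in Python; passes_length_test and is_minimal_length are
hypothetical names.  The second helper is a stricter variant that is an
assumption of this sketch, derived from 16-bit code units packed into
6-bit characters, rather than something stated in the RFC.

```python
def decoded_bytes(n_chars: int) -> int:
    """Whole bytes carried by n_chars base64 characters (6 bits each)."""
    return n_chars * 3 // 4


def passes_length_test(n_chars: int) -> bool:
    """The test described above: byte count must be even and non-zero."""
    b = decoded_bytes(n_chars)
    return b != 0 and b % 2 == 0


def is_minimal_length(n_chars: int) -> bool:
    """Stricter variant (assumption): k code units need ceil(16 * k / 6)
    characters, and those lengths are exactly 0, 3 or 6 modulo 8."""
    return n_chars > 0 and n_chars % 8 in (0, 3, 6)


for n in range(1, 9):
    print("A" * n, decoded_bytes(n), passes_length_test(n))
```

For lengths 1 to 8 both helpers agree with the table.  They differ at
length 9, where 9 * 3 / 4 -> 6 is even although the ninth character
carries only padding bits, so a decoder that wants to reject every
non-minimal length may prefer the modulo-8 form.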

The value of (length * 3 / 4) % 6 / 2 (integer arithmetic) defines
which values are legal in the last shifted character.

0 all base64 characters are legal
1 last 2 bits are zero  (value(c) & 0x3) == 0
2 last 4 bits are zero  (value(c) & 0xf) == 0

where value(c) is the 6-bit base64 value of the last shifted character c.
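
A minimal sketch of that last-character check, again in Python with
hypothetical names (required_zero_mask, last_char_ok).  It assumes the
shifted sequence is non-empty, contains only base64 characters, and has
already passed the length test.

```python
import string

# Standard base64 alphabet, which modified base64 also uses (without '=' padding).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
VALUE = {ch: i for i, ch in enumerate(ALPHABET)}

# Index 0: no constraint, 1: low 2 bits zero, 2: low 4 bits zero.
MASKS = (0x0, 0x3, 0xF)


def required_zero_mask(n_chars: int) -> int:
    """Mask of low bits that must be zero in the last shifted character."""
    return MASKS[(n_chars * 3 // 4) % 6 // 2]


def last_char_ok(shifted: str) -> bool:
    """True when the padding bits of the final character are all zero.

    Assumes a non-empty sequence of valid base64 characters.
    """
    mask = required_zero_mask(len(shifted))
    return (VALUE[shifted[-1]] & mask) == 0


print(last_char_ok("AGE"))  # 'a' (U+0061) with 2 zero padding bits
print(last_char_ok("AGF"))  # same code unit, but a padding bit is set
```

Because (length * 3 / 4) % 6 only takes values 0 to 5, the index into
MASKS is always 0, 1 or 2.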

> Changing the behavior regarding padding bits is meaningless without
> these two additional requirements.
>
> That said, it would certainly be useful to add a warning in the security
> considerations section about such issues, and recommend that all data be
> decoded first before performing any validity checks, comparisons, etc.
> (as detailed in RFC 3629 for UTF-8).
> 
> Is this what you meant?
> 
> On 10/14/2015 09:44 AM, Harald Alvestrand wrote:
> > On 10/13/2015 11:11 PM, A. Rothman wrote:
> >> Ok, thanks for your analysis and for looking into this (Mark as well).
> >>
> >> I shall change my decoder implementation to the lenient interpretation,
> >> adjust my unit tests, and hope it is considered RFC-compliant by
> >> everyone :-)
> > Note that this is a reprise of the UTF-8 "overlong encoding" debate,
> > where we ended up banning overlong encodings because of the security
> > issues they posed (see the UTF-8 RFC for more details on the security
> > issues found).
> >
> >> Amichai
> >>
> >> On 10/09/2015 08:08 AM, Viktor Dukhovni wrote:
> >>> On Thu, Oct 08, 2015 at 09:40:25PM +0300, A. Rothman wrote:
> >>>
> >>>> Just in case someone missed it (I almost did): Mark added his own
> >>>> detailed comments on the test cases, but they got buried within a long
> >>>> quote from my original email so may have gone unnoticed. To recap, here
> >>>> are the two interpretations:
> >>>>
> >>>> +A-             empty + 6 (unnecessary) padding bits
> >>>> +AA-            empty + 12 (unnecessary) padding bits
> >>>> +AAA-           \U+0000, and 2 (required) padding bits
> >>>> +AAAA-          \U+0000, and 8 (6 extra) padding bits
> >>>> +AAAAA-         \U+0000, and 14 (12 extra) padding bits
> >>>> +AAAAAA-        \U+0000\U+0000, and 4 (required) padding bits
> >>>> +AAAAAAA-       \U+0000\U+0000, and 10 (6 extra) padding bits
> >>>>
> >>>>
> >>>> +A-             illegal !modified base64
> >>>> +AA-            illegal !a multiple of 16 bits in modified base64
> >>>> +AAA-           legal   0x0000 (last 2 bits zero)
> >>>> +AAAA-          illegal !a multiple of 16 bits in modified base64
> >>>> +AAAAA-         illegal !modified base64
> >>>> +AAAAAA-        legal   0x0000, 0x0000 (last 4 bits zero)
> >>>> +AAAAAAA-       illegal !a multiple of 16 bits in modified base64
> >>>>
> >>>>
> >>>> Does anyone else want to vote or comment on the two interpretations above?
> >>> Thanks for pointing this out more clearly.  Yes, they disagree.
> >>> However, the manner in which they disagree is rather simple.
> >>>
> >>> They agree in all the cases where the padding is *minimal*.
> >>>
> >>> The first variant always tolerates non-minimal padding allowing
> >>> anything less than 16 bits per the specification.  The second
> >>> variant never tolerates non-minimal padding, because there's no
> >>> need to produce it.
> >>>
> >>> It is clear that clients should produce minimal padding, and we
> >>> seem to disagree on whether to apply Postel's principle to the
> >>> decoder or not.
> >>>
> >>> This is not a major disagreement; such differences of interpretation
> >>> are endemic whether the standard is clear or not.  Many implementors
> >>> are lazy, and stop writing code when the expected cases work.
> >>>
> >>> While this is no excuse for ambiguous specifications, in this case
> >>> I don't think a revision is warranted.  Encoders that generate
> >>> sensibly minimal padding will not run into any friction with
> >>> non-broken decoders.  Encoders that get creative might find that
> >>> some decoders object, whether the standard allows their creativity
> >>> or not.
> >>>
> >>
> >
> 
> 
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: marka@xxxxxxx