On Wed, Oct 07, 2015 at 11:21:49AM +0300, A. Rothman wrote:

> I'd like to raise an issue with the UTF-7 decoding process as described
> in the RFC with respect to trailing padding bits:
>
>    "Next, the octet stream is encoded by applying the Base64 content
>    transfer encoding algorithm as defined in RFC 2045, modified to
>    omit the "=" pad character. Instead, when encoding, zero bits are
>    added to pad to a Base64 character boundary. When decoding, any
>    bits at the end of the Modified Base64 sequence that do not
>    constitute a complete 16-bit Unicode character are discarded. If
>    such discarded bits are non-zero the sequence is ill-formed."

It seems to me that the encoder's behaviour is specified clearly
enough. Namely, the encoder outputs an unpadded base64 encoding of
the input octet-stream that is zero padded with 0, 2 or 4 bits (an
odd padding length can't happen) to ensure that the total number
of bits is a multiple of 6, allowing each 6 bits to be encoded as
one base64 output character.

The decoder's job is then to reverse this process. The base64
input produces a stream of 6-bit blocks, which in total yields
16q + 2r bits where 0 <= r < 8. The "q" groups of 16 bits are the
decoded text. The "2r" extra bits must be zero and are discarded.

An encoder should never generate 6 <= 2r <= 14 extra bits, since
0, 2 or 4 is enough; however, it seems that an encoder can get away
with up to 12 "extra" padding bits so long as the total count is
less than 16.

> The way I understand this is that after decoding the modified-base64
> data and grouping the resulting octets into 16-bit Unicode characters,
> any remaining zero bits at the end (up to 15 bits, theoretically)

Well 14, due to an even bit count.

> should simply be ignored.

Correct, though in practice that count should always be 0, 2 or 4.
Encoders that produce 6, 8, 10, 12 or 14 padding bits are appending
extraneous "A" output octets to the base64 stream.
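The decoding rule described above can be sketched in Python as follows.
This is only an illustration of the 16q + 2r reasoning, not a complete
UTF-7 decoder: the function names are hypothetical, and the input is
assumed to be just the base64 characters between the "+" and the "-".

```python
import string

# Standard base64 alphabet; index in this string == 6-bit value.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def minimal_padding(n_units):
    """Zero bits an encoder needs to add after n_units 16-bit code units."""
    return (-16 * n_units) % 6  # always 0, 2 or 4


def decode_modified_base64(chars):
    """Return (code_units, padding_bits) for a Modified Base64 run.

    Raises ValueError if the discarded trailing bits are non-zero.
    """
    bits = 0
    nbits = 0
    for c in chars:
        bits = (bits << 6) | B64.index(c)
        nbits += 6
    # 6n = 16q + pad, where pad = 2r is even and less than 16.
    q, pad = divmod(nbits, 16)
    if bits & ((1 << pad) - 1):
        raise ValueError("ill-formed: non-zero padding bits")
    bits >>= pad
    units = [(bits >> (16 * (q - 1 - i))) & 0xFFFF for i in range(q)]
    return units, pad
```

For example, decode_modified_base64("ImIDkQ") gives the two code units
0x2262 and 0x0391 with 4 padding bits, and decode_modified_base64("AAAA")
gives a single 0x0000 with 8 padding bits.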
> I'm not sure why an encoder would want to add extra zero bits at the end
> beyond the minimum necessary, but it is arguably allowed to pad 'to *a*
> Base64 character boundary', not specifically *the next* boundary.

Correct, with 6, 8 or 10 extra bits, it would have been simpler for
the encoder to save one output "A" and emit 0, 2 or 4 padding bits.
With 12 or 14, save outputting "AA" and emit 0 or 2 extra bits.

What the extra "A" or "AA" might allow the encoder to do is to
"round-up" the base64 output to a multiple of 4 octets, which
simplifies decoding. The only time the encoder can't do that is
when the input length in bits is 24q + 2|4|6|8 (1/3 of the time),
because this would require 16 or more padding bits.

I would not write an encoder that makes the base64 output an exact
multiple of 4 octets 2/3 of the time. Too much trouble for incomplete
success, but it seems that the specification allows this.

> Perhaps an encoder would use some version of a standard
> Base64 routine and then replace the padding '=' characters with 'A'
> characters (which are then decoded to all zero bits).

This does not work, because the total number of padding bits may
then equal or exceed 16.

> Such encoding
> would obviously be less space-efficient since it adds unnecessary octets
> to the encoding - but it seems like there are valid reasons to do so.

It would also be wrong, because it would not be able to represent
trailing zeros correctly.

> So, since there is such discrepancy in practice in how this is being
> interpreted, I submit that the description is not clear enough to make
> this unambiguous. Could someone please clarify what is officially
> valid/invalid according to the RFC regarding trailing zero bits? Can we
> add errata that clarifies it in either case?

Admittedly, I am just applying logic. Perhaps the specification
is not as logical as I expect. If it is logical, then it should
be as noted above.
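To make the problem with replacing '=' by 'A' concrete, here is a small
Python sketch. It assumes the text is serialised as big-endian UTF-16
before base64 encoding; naive_utf7_b64 and decoded_unit_count are
hypothetical helper names used only for this illustration.

```python
import base64


def naive_utf7_b64(text):
    """Standard base64 of the UTF-16BE octets, with '=' replaced by 'A'."""
    raw = text.encode("utf-16-be")
    return base64.b64encode(raw).decode("ascii").replace("=", "A")


def decoded_unit_count(b64_chars):
    """Number of complete 16-bit units a decoder sees: floor(6n/16)."""
    return (6 * len(b64_chars)) // 16


for text in ("\u2262", "\u2262\u0391", "\u2262\u0391\u2262"):
    encoded = naive_utf7_b64(text)
    print(len(text), "unit(s) in,", encoded, "->",
          decoded_unit_count(encoded), "unit(s) out")
```

With two input code units the standard routine emits "==", which becomes
"AA": that is 16 padding bits in total, so the decoder sees a third,
spurious U+0000. One and three input code units happen to round-trip.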
> Finally, if it helps, here are some concrete test cases to consider:

    +A-       : empty + 6 (unnecessary) padding bits
    +AA-      : empty + 12 (unnecessary) padding bits
    +AAA-     : \U+0000, and 2 (required) padding bits
    +AAAA-    : \U+0000, and 8 (6 extra) padding bits
    +AAAAA-   : \U+0000, and 14 (12 extra) padding bits
    +AAAAAA-  : \U+0000\U+0000, and 4 (required) padding bits
    +AAAAAAA- : \U+0000\U+0000, and 10 (6 extra) padding bits

> Which of these are valid inputs? Which are invalid? How many 0x0000
> 16-bit characters should each one be decoded into?

They are all valid, because any padding bits are zero in all of
them. They decode to floor(6n/16) == floor(3n/8) 16-bit Unicode
code points, where "n" is the length of the base64 input.

-- 
	Viktor.
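The counts for the test cases above follow directly from the
floor(3n/8) formula; a short Python sketch of that arithmetic
(units_and_padding is a hypothetical name chosen for this example):

```python
def units_and_padding(n):
    """For n base64 characters: (16-bit units decoded, padding bits left)."""
    q = (3 * n) // 8        # floor(6n/16) complete 16-bit units
    pad = 6 * n - 16 * q    # leftover bits, always even and below 16
    return q, pad


for n in range(1, 8):
    q, pad = units_and_padding(n)
    print(f"+{'A' * n}- : {q} code unit(s), {pad} padding bit(s)")
```

Subtracting the minimal padding (0, 2 or 4 bits) from each "pad" value
gives the number of unnecessary bits in each case.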