Randy Presuhn wrote:

[Tom Petch said:]
>> I was using the 'illegal syntax' to float an alternative
>> approach, like using %xC1 - which is illegal in UTF-8

Illegal today; it wasn't for some time.  My UTF-8 "decoder"
script would return one SUB for a %xC1 plus the next octet.
%xFF and %xFE were always illegal; %xFD was the worst case,
5*6+1 bits for u+7FFFFFFF in UCS-4.

>> that idea does not seem to have caught on within the IETF.

u+FFFF (UTF-8 %xEFBFBF) is guaranteed to be no character; it
is AFAIK reserved for this purpose.  But not "on the wire".

> I think the use of explicitly encoded length, rather than
> special terminator or delimiter sequences, is simpler to
> code and debug, as well as being more robust in avoiding
> buffer overflow problems, etc.

Yes, abusing %xFF or similar tricks would be like a PDU with
an empty header and a constant trailer.  Your idea of "length
in the header" (and maybe a checksum as trailer?) is better.

If that hits the limit for encoded lengths, add a mechanism
for a "more" flag, or chunks where "length = 0 is the end",
etc.

> Reserving NUL as a special terminator is a C library-ism.

A leading length has its own drawbacks if you want a string
with more than 255 octets after one octet for the length. ;-)

> history has shown that the use of this kind of mechanism,
> rather than explicitly tracking the string's length, was a
> mistake.

<CRLF> or whatever isn't too bad with a decent maximal line
length (like 1000).  If you want arbitrary encoded lengths,
you would need a delimiter to separate the length from the
SDU, or another trick to the same effect.  Attackers could
then try their luck with huge encoded lengths.

 Bye, Frank
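
A minimal Python sketch of the lenient decoding described above
(Frank's actual script is not shown in the thread, so this is an
assumed illustration, not his code): one SUB (U+001A) replaces an
illegal lead octet such as %xC0/%xC1 together with the octet that
follows it, while %xFE/%xFF are replaced on their own.

# Illustrative sketch only, not the script mentioned in the message.
SUB = "\x1a"
ILLEGAL_LEADS = (0xC0, 0xC1, 0xFE, 0xFF)

def lenient_decode(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in (0xC0, 0xC1):        # overlong 2-octet leads, illegal today
            out.append(SUB)
            i += 2                   # one SUB for the lead plus the next octet
        elif b in (0xFE, 0xFF):      # never legal anywhere in UTF-8
            out.append(SUB)
            i += 1
        else:
            j = i                    # decode the next clean run strictly
            while j < len(data) and data[j] not in ILLEGAL_LEADS:
                j += 1
            out.append(data[i:j].decode("utf-8", errors="replace"))
            i = j
    return "".join(out)

print(lenient_decode(b"a\xc1\xbfb"))   # -> 'a\x1ab'

A strict decoder today would reject %xC0 and %xC1 outright, since
they can only begin overlong encodings.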
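
A sketch of the chunked length-prefix framing mentioned above, with
"length = 0 is the end" as the terminator.  The 2-octet network-order
length field is an assumption for the example; because it is fixed
size, no peer can announce more than 65535 octets in a single step,
which is one answer to the huge-encoded-length attack.

# Sketch of chunked, length-prefixed framing ("length = 0 is the end").
import struct

MAX_CHUNK = 0xFFFF                     # largest length a 2-octet field holds

def encode_chunked(payload: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(payload), MAX_CHUNK):
        chunk = payload[i:i + MAX_CHUNK]
        out += struct.pack("!H", len(chunk)) + chunk   # length, then data
    out += struct.pack("!H", 0)                        # terminator chunk
    return bytes(out)

def decode_chunked(data: bytes) -> bytes:
    out = bytearray()
    pos = 0
    while True:
        (length,) = struct.unpack_from("!H", data, pos)
        pos += 2
        if length == 0:                                # length 0 ends the SDU
            return bytes(out)
        out += data[pos:pos + length]
        pos += length

assert decode_chunked(encode_chunked(b"x" * 100000)) == b"x" * 100000

Compared with a single arbitrary-size length field, the receiver never
has to trust more than MAX_CHUNK octets of announced length at a time.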