Re: Troubles with UTF-8

"Tom.Petch" <sisyphus@xxxxxxxxxxxxxx> · Sat, 24 Dec 2005 19:07:12 +0100



From: "Ned Freed" <ned.freed@xxxxxxxxxxx>
To: "TomPetch" <sisyphus@xxxxxxxxxxxxxx>
Cc: "ietf" <ietf@xxxxxxxx>
Sent: Friday, December 23, 2005 7:13 PM
Subject: Re: Troubles with UTF-8
<snip>

> > (Unicode lacks a no-op, a meaningless octet, one that could be added or
> > removed without causing any change to the meaning of the text).
>
> NBSP is used for this purpose.
>
Thank you for that; it is not something I have seen documented before.

> > Other protocols use a terminating sequence.  NUL is widely used in *ix; some
> > protocols specify that NUL must terminate the text, some specify that it must
> > not, one at least specifies that embedded NUL means that text after a NUL must
> > not be displayed (interesting for security).  Since UTF-8 encompasses so much,
> > there is no natural terminating sequence.
>
> This simply isn't true. NUL is present in Unicode and is commonly used as a
> terminator.
>
Not sure which bit isn't true.  I agree that NUL is present in Unicode and that
some protocols use it as a terminator and prohibit it in the text.  But others
allow NUL in the text, in which case either another form of termination is
needed or the NUL must be escaped or encoded.  Presented with a comparable
problem where XML is in use, one WG has chosen an illegal XML sequence as a
terminator.  What I was fishing for is whether there are any parallels with
UTF-8: it has many illegal sequences of octets, so it would be easy to choose
one as a terminator.
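
To make the idea concrete, here is a rough sketch, not taken from any
protocol discussed in this thread.  It assumes the octet 0xFF is chosen as
the terminator, since 0xFF never occurs in well-formed UTF-8.  The names
TERMINATOR, frame and split_frames are made up for illustration only.

```python
from typing import List

# Assumption for this sketch: 0xFF is the chosen terminator. It cannot
# appear anywhere in well-formed UTF-8, so it never collides with text,
# even text that contains NUL.
TERMINATOR = b"\xff"


def frame(text: str) -> bytes:
    """Encode text as UTF-8 and append the terminator octet."""
    return text.encode("utf-8") + TERMINATOR


def split_frames(stream: bytes) -> List[str]:
    """Split a stream of terminated strings back into text."""
    *frames, tail = stream.split(TERMINATOR)
    if tail:
        # Octets after the last terminator mean an unterminated string.
        raise ValueError("stream does not end with the terminator")
    return [chunk.decode("utf-8") for chunk in frames]


if __name__ == "__main__":
    stream = frame("before\x00after") + frame("second string")
    print(split_frames(stream))
```

With this framing an embedded NUL needs no escaping, and a strict UTF-8
decoder would reject the terminator if it ever leaked into the text.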

Tom Petch

> Ned.

