Re: [Qemu-devel] KVM call minutes for Feb 15

Anthony Liguori <anthony@xxxxxxxxxxxxx> · Thu, 17 Feb 2011 07:37:26 -0600

On 02/17/2011 07:25 AM, Avi Kivity wrote:
On 02/17/2011 03:10 PM, Anthony Liguori wrote:
On 02/17/2011 06:23 AM, Avi Kivity wrote:
On 02/17/2011 02:12 PM, Anthony Liguori wrote:
(btw what happens in a non-UTF-8 locale? I guess we should just 
reject unencodable strings).

While QEMU is mostly ASCII internally, for the purposes of the JSON 
parser, we always encode and decode UTF-8.  We reject invalid UTF-8 
sequences.  But since JSON is string-encoded unicode, we can always 
decode a JSON string to valid UTF-8 as long as the string is well 
formed.
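
To make "we reject invalid UTF-8 sequences" concrete, here is a rough 
sketch of what such a check looks like.  This is not the actual QEMU 
parser code; utf8_is_valid() is a hypothetical helper written only for 
illustration, following the well-formedness rules of RFC 3629.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from QEMU: returns true if buf[0..len)
 * is well-formed UTF-8.  Rejects stray continuation bytes, truncated
 * sequences, overlong forms, surrogates, and values above U+10FFFF.
 */
bool utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;      /* number of continuation bytes expected */
        uint32_t cp, min; /* decoded code point, smallest legal value */

        if (c < 0x80) {   /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false; /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < need) {
            return false; /* sequence truncated at end of buffer */
        }
        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) {
                return false; /* not a 10xxxxxx continuation byte */
            }
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false; /* overlong, out of range, or surrogate */
        }
        i += need + 1;
    }
    return true;
}
```

With a check like this, the two-byte UTF-8 form of Greek alpha (0xCE 
0xB1) is accepted, while the single ISO-8859-7 byte for the same letter 
(0xE1) is rejected as a truncated sequence.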

That is wrong.  If the user passes a Unicode filename it is expected 
to be translated to the current locale encoding for the purpose of, 
say, filename lookup.

QEMU does not support anything but UTF-8.

Since when?

AFAICT, JSON string conversion is the only place where there is any 
dependency on UTF-8.  Anything else should just work.

That's pretty common with Unix software.  I don't think any modern 
Unix platform actually uses UCS2 or UTF-16.  It's either ASCII or UTF-8.

Most/all Linux distributions support UTF-8 as well as a zillion other 
encodings (single-byte ASCII + another charset, or multi-byte charsets 
for languages with many characters).

Maybe there's some confusion here.  UTF-8 is an encoding, not a locale.

The common encodings are ASCII, UTF-8, UCS2, UTF-16, and UTF-32.

An application has to explicitly support an encoding.  It is not 
transparent.  UCS2/UTF-16 means that strings are not 'const char *'s but 
'const wchar_t *', where wchar_t is a 16-bit type (on Windows, 
effectively typedef unsigned short wchar_t;).

QEMU assumes in lots of places that strings are single-byte and NUL 
terminated.  Basically, any use of snprintf, printf, strcpy, strlen, 
etc. pretty much ties you to ASCII/UTF-8, since a valid UCS2 string can 
contain a single NUL byte.
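
A small sketch of the NUL byte problem, assuming little-endian 16-bit 
code units.  The names hi_utf16le, hi_utf8 and byte_strlen() are made up 
for this example and do not come from QEMU.

```c
#include <stddef.h>
#include <string.h>

/* "Hi" as UTF-16LE code units, followed by a 16-bit terminator. */
static const unsigned char hi_utf16le[] = {
    'H', 0x00, 'i', 0x00, 0x00, 0x00
};

/* The same text as UTF-8 (identical to ASCII here). */
static const char hi_utf8[] = "Hi";

/*
 * What any byte-oriented C string function effectively does: scan for
 * the first zero byte and treat it as the end of the string.
 */
size_t byte_strlen(const void *s)
{
    return strlen((const char *)s);
}
```

byte_strlen(hi_utf8) sees both characters, but byte_strlen(hi_utf16le) 
stops at the zero high byte of 'H' and reports a length of 1.  UTF-8 
never produces a zero byte except for U+0000 itself, which is why the 
usual str* functions keep working with it.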

The only place it even matters is Windows and Windows has ASCII and 
UTF-16 versions of their APIs.  So on Windows, non-ASCII characters 
won't be handled correctly (yet another one of the many issues with 
Windows support in QEMU).  UTF-8 is self-recovering though so it 
degrades gracefully.
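
The self-recovering property comes from the bit layout: continuation 
bytes always look like 10xxxxxx and lead bytes never do, so a decoder 
that lands in the middle of a character can skip forward to the next 
boundary.  A minimal sketch, with utf8_resync() being a hypothetical 
name used only here:

```c
#include <stddef.h>

/*
 * Hypothetical helper: starting at pos, return the index of the next
 * byte that is not a UTF-8 continuation byte (10xxxxxx), or len if
 * there is none.  A decoder can restart from that index after an error.
 */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xC0) == 0x80) {
        pos++;
    }
    return pos;
}
```

So one damaged or misinterpreted character costs you that character, 
not the rest of the string.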

It matters on Linux with el_GR.iso88597, for example.

The whole series of iso8859 (8-bit encodings) are officially abandoned 
in favor of UCS and encodings that support the full UCS code page 
(UTF-8/UTF-16).

I see no strong reason to try and support deprecated encodings when 
there are perfectly valid replacements like el_GR.utf8.

Regards,

Anthony Liguori
