On Thu, May 08, 2008 at 04:44:38PM +0200, Jim Meyering wrote: > This change demonstrates that the new syntax-check rule's > regexp can be improved. It missed the unsafe tolower use, > since there was already a to_uchar use on that line. Since this shows up again, let me argue a bit more why I dislike using is* and to* in libvirt. <pedantic topic="string encodings"> We use to* only for comparing A-F/a-f in ethernet addresses comparisons, so those use are safe. The use of XML as the front end for all text parsed in libvirt also safeguards against what I consider a nasty problems with the use of strings in system code i.e. their dependance to locale. I think it's not acceptable to have a risk of different behaviour if the user locale has changed, this can lead to untractable bugs, and also mean we allowed at a given point the problem of string whose encoding is undefined to occur in the library (in the libxml2 docs I point to [1] and [2] as introductions to the problem). As long as we use XML for input, we are safe IMHO, because any string input will automatically come converted as UTF-8 by the way of libxml2, that means if someone want to use japanese characters for the name of their domain or files, at the libvirt level they can and we should not misparse or break those (it's likely to break down the line in xend, but i would hope that lxc and QEmu would be better at handling this though I don't know how we could enforce the exec'ec programs args to be interpreted as UTF-8 in the forked process, as a library we just can't touch the locale value, but maybe we can do that after fork() and before exec()). But each time we extract a string from libxml2 we should know and keep in mind it is an UTF-8 string. In practice the ethernet MAC addresses strings used with to_lower and to_upper in util.c are UTF-8, not strings in the user locale. We are safe here because most locales match with ASCII on the first 128 values, UTF-8 included, and the characters for the MAC address range cannot conflict for UTF-8 encoding, but from a pure correctness POV using those convenience macros on those UTF-8 strings is a mistake. And I would rather see automated checks on the code look for those mistakes, rather than doing a lookup for a wrong cast to then use a wrong macro with a false sense of security. </pedantic> okay now I feel better :-) Daniel [1] http://www.joelonsoftware.com/articles/Unicode.html [2] http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@xxxxxxxxxx | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/ -- Libvir-list mailing list Libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list