On Jan 22, 2008 10:42 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> I'm praising UTF-8 (without normalization) as a wonderful format where you
> can do 99.9% of everything without ever caring about all the expensive
> stuff.

*thanks* for these notes. Very useful, and...

...
> And then, the (actually relatively few) things that want to do things like
> show things on the screen, or check for equivalence, or worry about width
> of the characters, *those* can still do so.

I find the above amusing -- different worlds we live in. Programming
webapps means that 90% of the code deals with a bit of metaprogramming
(with lots of string manipulation) to talk SQL to a backend, and then
doing lots of string manipulation on the data the DB returns, which
ends up in humongous strings of goop otherwise known as HTML+CSS+JS.

After waiting for the DB to return data, over 50% of CPU time is spent
in regexes, concatenations, counting words, array ops, etc. So it is
pretty significant.

So now I have to worry about the cost and correctness of stuff that I
took for granted in the pre-Unicode days - strtolower() can be quite
expensive and... buggy! But that's mainly due to Unicode, not UTF-8. I
think the only slowdown I can pin on UTF-8 is in counting chars, and
probably slower regexes.

Not that I deal with the C implementation of any of this stuff -- and
so happy about it! ;-)

</offtopic>

(...)
> And that's the beauty of non-normalized (and possibly badly formed) UTF-8.

I had a few issues with Perl v5.6's UTF-8 handling, which wasn't binary
safe (fread() to a fixed-length buffer would break the input if a
Unicode char landed across the boundary - ouch!) -- made me think that
you couldn't do this in binary safe ways.

So I tend to tell Perl to treat files as binary, and switch to UTF-8 in
specially chosen spots. I suspect that 5.8 is a bit saner about this,
but I'm not taking chances.
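For anyone who hasn't been bitten by the buffer-boundary problem, here is a minimal sketch of it. It is in Python rather than Perl, and the function names (naive_decode, incremental_decode, count_code_points) are made up for illustration; it is not the code that misbehaved under 5.6. It shows a fixed-length read splitting a multi-byte character, one way to stay safe (keep the data as bytes and decode with a stateful decoder), and why counting chars in UTF-8 means walking the bytes.

```python
import codecs


def naive_decode(data: bytes, size: int) -> list[str]:
    """Decode every fixed-size chunk on its own (the fragile approach)."""
    out = []
    for i in range(0, len(data), size):
        chunk = data[i:i + size]
        # A multi-byte character cut by the chunk boundary cannot be
        # decoded; errors="replace" turns each broken piece into U+FFFD.
        out.append(chunk.decode("utf-8", errors="replace"))
    return out


def incremental_decode(data: bytes, size: int) -> str:
    """Decode the same chunks with a decoder that remembers partial bytes."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    for i in range(0, len(data), size):
        # Incomplete trailing bytes are buffered until the next call.
        parts.append(decoder.decode(data[i:i + size]))
    # final=True raises if the input ended in the middle of a character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


def count_code_points(data: bytes) -> int:
    """Count characters by skipping UTF-8 continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


text = "naïve café"
raw = text.encode("utf-8")  # "ï" and "é" take two bytes each

broken = "".join(naive_decode(raw, 3))   # the first boundary splits "ï"
intact = incremental_decode(raw, 3)      # same chunks, nothing lost
```

With a chunk size of 3 the first boundary falls inside "ï", so the naive version produces replacement characters while the incremental one returns the original string. Note also that the byte length of raw and the character count differ, which is the counting cost mentioned above: you cannot get the character count without scanning.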
cheers,

martin