On Tue, 22 Jan 2008, Martin Langhoff wrote:
>
> (slightly offtopic) are you praising UTF-8 as storage format (for disk
> and network) or in general? UTF-8-aware string ops like counting
> characters seem to me a horrendous thing at the ASM level.

I'm praising UTF-8 (without normalization) as a wonderful format where you
can do 99.9% of everything without ever caring about all the expensive
stuff.

But in order to do that, you really need to avoid normalization, and you
also need to accept mis-formed UTF-8 strings (because even if it is real
UTF-8, the string may actually be just a fragment of some larger string).

Once you do that (and _only_ if you do that), then UTF-8 is actually a
wonderful thing. You can consider it to be a traditional "everything is a
stream of bytes", and everything that only cares about a stream of bytes
will work wonderfully well.

And then, the (actually relatively few) things that want to do things like
show things on the screen, or check for equivalence, or worry about width
of the characters, *those* can still do so.

So the beauty of UTF-8 is that you can switch between thinking of it like
just a binary blob and thinking of it like text, and everything works
(including the traditional C null-termination).

And yes, that was obviously the explicit design goal. It's a good thing.

> More on topic, I suspect Kevin's experience is more on end-user apps,
> where input sanitization and even canonicalisation are common
> practice.

Sure. And I'm not arguing against them. Knowing the rules for combining
characters is really important for input and output.

> At least in Moodle we store *exactly* what the user POSTed and
> cleanup^Wcorrupt it when displaying it, so that if it does happen that
> the cleanup was buggy, we never corrupted the data.

Absolutely. It's what the kernel does, and I think that's what perl does
too for their "strings". It works really well.
It also allows you to handle binary data (ie data that *really* isn't
text) with shared routines etc etc.

And that's the beauty of non-normalized (and possibly badly formed) UTF-8.

		Linus
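
As an illustration of the "binary blob or text" point made above, here is a
minimal sketch in C. The helper names utf8_count_codepoints() and
bytes_equal() are hypothetical and chosen only for this example; they are not
taken from git or the kernel. The sketch assumes that "counting characters"
means counting code points, which it does by skipping UTF-8 continuation
bytes (those of the form 10xxxxxx), without validating or normalizing
anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count code points in a byte buffer by counting every byte that is
 * not a UTF-8 continuation byte (10xxxxxx). There is no validation and
 * no normalization: malformed input, or a fragment cut in the middle
 * of a multi-byte sequence, still yields a count instead of an error.
 * (Hypothetical helper, for illustration only.)
 */
size_t utf8_count_codepoints(const char *buf, size_t len)
{
	size_t n = 0;

	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		unsigned char c = (unsigned char)buf[i];

		if ((c & 0xC0) != 0x80)
			n++;
	}
	return n;
}

/*
 * The "stream of bytes" view: plain byte-wise comparison of two
 * NUL-terminated strings. Strings that differ only in normalization
 * form compare as different, because nothing is normalized.
 * (Hypothetical helper, for illustration only.)
 */
int bytes_equal(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}
```

With these definitions, strlen("na\xc3\xafve") is 6 bytes while
utf8_count_codepoints() reports 5 for the same buffer. The precomposed
"\xc3\xa9" and the decomposed "e\xcc\x81" both display as an accented "e"
but are not bytes_equal(). Code that only moves or compares bytes never
needs the first function at all.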