On Jan 22, 2008 7:12 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> Now, git _also_ heavily depends on the actual encoding of those
> codepoints, since we create hashes etc, so in fact, as far as git is
> concerned, names have to be in some particular encoding to be hashed, and
> UTF-8 is the only sane encoding for Unicode. People can blather about
> UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is
> simply technically superior in so many ways that I don't even understand
> why anybody ever uses anything else.
>
> So I would not disagree with using UTF-8 at all.

Linus, (slightly offtopic) are you praising UTF-8 as a storage format
(for disk and network) or in general? UTF-8-aware string ops like
counting characters seem to me a horrendous thing at the ASM level.

More on topic, I suspect Kevin's experience is more with end-user apps,
where input sanitization and even canonicalisation are common practice.
From a kernel and filesystems POV, a filename is data as sacred as file
data.

In the webapp world, we "corrupt" user input liberally to avoid XSS
attacks and the like. In some cases, these practices are stupid and can
be replaced with escaping data properly, but in other cases, the web
platform is so broken that there's no option. At least in Moodle we
store *exactly* what the user POSTed and cleanup^Wcorrupt it when
displaying it, so that if it does happen that the cleanup was buggy, we
never corrupted the data.

So no point in calling each other stupid this much. Once is enough ;-)
And no point in arguing that something that is ok for an end-user app
is a good design decision for an OS.

martin
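
A minimal sketch of the "store exactly what was POSTed, clean up only when
displaying" approach described above. It is written in Python for brevity and
is not Moodle's actual code; `stored_posts`, `save_post` and `render_post` are
hypothetical names, and the list stands in for a real database.

```python
import html

# Hypothetical in-memory storage; a real app would use its database here.
stored_posts = []

def save_post(raw_text):
    """Store the submitted text verbatim, with no cleanup at all."""
    stored_posts.append(raw_text)
    return len(stored_posts) - 1

def render_post(post_id):
    """Escape only at display time; the stored text is never modified."""
    return "<p>" + html.escape(stored_posts[post_id]) + "</p>"

raw = '<script>alert("xss")</script> caf\u00e9'
post_id = save_post(raw)
rendered = render_post(post_id)
```

If the escaping in `render_post` later turns out to be buggy, only the display
path needs fixing, since the stored text is still byte-for-byte what the user
sent.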