On Wed, 13 May 2009, Matthias Andree wrote:
> On 13.05.2009 at 19:12, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do
> > the actual normalization if you find characters with the high bit set. And
> > since I know that the OS X filesystems are so buggy as to not even do that
> > whole NFD thing right, there is probably some OS-X specific "use this for
> > filesystem names" conversion function.
>
> Sorry for interrupting, but NF_K_C? You don't want that (K for compatibility,
> rather than canonical, normalization) for anything except normalizing
> temporary variables inside strcasecmp(3) or similar. Probably not even that.
> The normalizations done are often irreversible and also surprising. You don't
> want to turn 2³.c into 23.c, do you?

No, you're right. We want just plain NFC.

I just googled for how some other projects handled this, and found the
stringprep thing in a post about rsync, and didn't look any closer. But
yes, you're absolutely right, stringprep is total crap, and nfkc is
horrible.

I have no idea what library to use, though. For perl, there's
Unicode::Normalize, but that's likely still subtly incorrect for the OS-X
case due to the filesystem not using _strict_ NFD.

I have this dim memory of somebody actually pointing to the documentation
of exactly which characters OS X ends up decomposing. Maybe we could just
do a git-specific inverse of that, knowing that NOBODY ELSE IN THE WHOLE
UNIVERSE IS SO TERMINALLY STUPID AS TO DO THAT DECOMPOSITION, and thus the
OS X case is the only one we need to care about?

		Linus
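
To make the NFC versus NFKC point concrete, here is a minimal sketch in Python using the standard-library unicodedata module. The helper name normalize_name and the sample filenames are made up for illustration. The sketch uses the strict Unicode normalization forms only, so it does not model the not-quite-NFD decomposition that OS X applies to filenames.

```python
import unicodedata


def normalize_name(name: str, form: str = "NFC") -> str:
    """Return `name` in the given Unicode normalization form.

    Hypothetical helper for illustration; it only wraps
    unicodedata.normalize() from the standard library.
    """
    return unicodedata.normalize(form, name)


# The example from the thread: a filename with SUPERSCRIPT THREE (U+00B3).
superscript = "2\u00b3.c"

# Canonical composition (NFC) leaves the superscript alone.
nfc_result = normalize_name(superscript, "NFC")

# Compatibility composition (NFKC) folds it into a plain digit,
# which is irreversible: the original name cannot be recovered.
nfkc_result = normalize_name(superscript, "NFKC")

# A decomposed name: 'a' followed by COMBINING DIAERESIS (U+0308),
# which is what a strict NFD decomposition of U+00E4 looks like.
decomposed = "a\u0308.txt"
composed = "\u00e4.txt"

# NFC recombines the pair into the single precomposed code point.
recomposed = normalize_name(decomposed, "NFC")

if __name__ == "__main__":
    print(nfc_result == superscript)   # NFC kept the name intact
    print(nfkc_result == "23.c")       # NFKC changed the name
    print(recomposed == composed)      # NFC undid the decomposition
```

The design point is that NFC only undoes canonical decomposition, so distinct names stay distinct, while NFKC also collapses compatibility characters and can make two different filenames collide.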