On Sun, 20 Jan 2008, Mike Hommey wrote:
>
> But there is no way to know whether 'ä' in a document is the Finnish 'ä'
> or a 'ä' from, say, German, that sorts after 'a'.

... without knowing the locale.

Correct. That's why sorting is locale-dependent, even in Unicode. And why
you should always sort using the *combined* character, not think that you
can sort by the decomposed sequence.

That said, even then you get the wrong thing. Some things cannot be sorted
character by character at all, and have semantic sorting rules at a higher
level entirely. I think most European family names are traditionally sorted
by effectively using the prefixes (i.e. d', von, etc.) as a secondary sort
key (so even though they are in front, they sort as if they were at the
_end_ of the name).

So Unicode doesn't help with sorting, and you shouldn't even try to find
sort rules in the Unicode spec or tech reports.

But in general, decomposing the characters just makes things worse, not
better. To sort well, you tend to need the bigger picture, not the details.

Of course, for something like git, we sort by binary value, because we
also require the sort to be not just well-defined, but *stable*. A sort
based on any kind of Unicode rule is rather likely to change over time.

		Linus
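
To make the three points above concrete, here is a minimal Python sketch. It is an illustration only, not code from git or from this thread: the PREFIXES list and the helper names binary_key and family_name_key are made up for the example, and real name-sorting conventions vary by country. It shows (1) a byte-wise sort, which depends on no locale and cannot change over time, (2) a prefix-as-secondary-key sort for family names, and (3) how a decomposed 'ä' lands in a different place than the combined one under a byte-wise sort.

```python
import unicodedata

# Hypothetical prefix list, for illustration only; real conventions differ
# from country to country.
PREFIXES = ("d'", "von ", "van ", "de ")

def binary_key(name):
    """Sort key: the raw UTF-8 bytes, independent of any locale."""
    return name.encode("utf-8")

def family_name_key(name):
    """Sort by the main name first; a leading prefix is only a tiebreaker."""
    lowered = name.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            return (lowered[len(prefix):], prefix)
    return (lowered, "")

names = ["von Braun", "Zeller", "d'Alembert", "Adams"]

# Byte order: uppercase ASCII sorts before lowercase, so the prefixed names
# end up after "Zeller".
by_bytes = sorted(names, key=binary_key)

# Prefix-aware order: "d'Alembert" sorts under A, "von Braun" under B.
by_family = sorted(names, key=family_name_key)

# The same 'ä' in combined (NFC) and decomposed (NFD) form.
nfc = unicodedata.normalize("NFC", "\u00e4")   # one code point, U+00E4
nfd = unicodedata.normalize("NFD", "\u00e4")   # 'a' followed by U+0308

# Byte-wise, the decomposed form sorts next to plain 'a', while the
# combined form sorts after every ASCII letter.
mixed = sorted(["b", nfc, nfd], key=binary_key)

print(by_bytes)
print(by_family)
print([s.encode("utf-8") for s in mixed])
```

Note that family_name_key still compares the main name by code point, so it is not locale-aware either; it only sketches the "prefix as secondary key" idea.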