Re: git on MacOSX and files with decomposed utf-8 file names

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sat, 19 Jan 2008 21:45:40 -0800 (PST)

On Sun, 20 Jan 2008, Mike Hommey wrote:
> 
> But there is no way to know whether 'ä' in a document is the Finnish 'ä'
> or a 'ä' from, say, German, that sorts after 'a'.

... without knowing the locale. Correct.

That's why sorting is locale-dependent, even in Unicode. And why you 
should always sort using the *combined* character, not think that you can 
sort by the decomposed sequence.
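
To make the locale point concrete, here is a toy Python sketch. The two
tables are simplifications made up for illustration (German treating 'ä'
like 'a', Finnish treating it as its own letter after 'z'); real locale
collation is far more involved, and the key function names are hypothetical.
Both keys compose the input first, so they work on the combined character.

```python
import unicodedata

# Toy tables, assumed for illustration only; real collation data is much larger.
GERMAN_FOLD = {"\u00e4": "a"}  # German dictionary order: treat ä like a
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"  # ä follows z

def german_key(word):
    # Compose first, then fold ä onto a.
    word = unicodedata.normalize("NFC", word)
    return [GERMAN_FOLD.get(ch, ch) for ch in word]

def finnish_key(word):
    # Compose first, then rank each letter by its place in the alphabet.
    word = unicodedata.normalize("NFC", word)
    return [FINNISH_ALPHABET.index(ch) for ch in word]

words = ["zeta", "\u00e4ra", "arm"]
german_order = sorted(words, key=german_key)    # ära, arm, zeta
finnish_order = sorted(words, key=finnish_key)  # arm, zeta, ära
```

Same three strings, two different "correct" orders, and nothing in the
strings themselves says which one applies.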

That said, even then you get the wrong thing. Some things cannot be sorted 
character by character at all, and have semantic sorting at a higher 
level entirely. I think most European family names are traditionally 
sorted by effectively using the prefixes (i.e. d', von, etc.) as a secondary 
sort key (so even though they are in front, they sort as if they were 
at the _end_ of the name).
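
A minimal sketch of that prefix-as-secondary-key idea in Python. The prefix
list and the function name are made up for the example, and actual
conventions differ from country to country, so take it as an illustration
of the shape of the problem rather than a rule.

```python
# Hypothetical, deliberately tiny prefix list; real conventions vary by country.
PREFIXES = ("von ", "van ", "de ", "d'")

def name_sort_key(name):
    """Sort on the name without its prefix; the prefix only breaks ties."""
    lowered = name.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            return (lowered[len(prefix):], prefix.strip())
    return (lowered, "")

names = ["von Neumann", "Newton", "d'Alembert", "Nash"]
by_name = sorted(names, key=name_sort_key)
# by_name == ["d'Alembert", "Nash", "von Neumann", "Newton"]
```

No per-character rule gets you there: the key has to know what a prefix is,
which is information about names, not about characters.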

So Unicode doesn't help with sorting, and you shouldn't even try to find 
sort rules in the Unicode spec or tech reports. But in general, 
decomposing the characters just makes things worse, not better. To sort 
well, you tend to need the bigger picture, not the details.
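
As a rough illustration of what decomposing does to a naive sort, the
Python sketch below sorts the same three strings by code point, once
composed (NFC) and once decomposed (NFD). The word list and the helper name
are made up for the example.

```python
import unicodedata

words = ["az", "\u00e4a", "b"]

def codepoint_sorted(form):
    # Normalize to the requested form, then sort by raw code point.
    normalized = sorted(unicodedata.normalize(form, w) for w in words)
    # Re-compose afterwards so the two results can be compared directly.
    return [unicodedata.normalize("NFC", w) for w in normalized]

nfc_order = codepoint_sorted("NFC")  # az, b, äa
nfd_order = codepoint_sorted("NFD")  # az, äa, b
```

In the decomposed form 'ä' becomes 'a' followed by a combining mark, so
"äa" ends up after "az" but before "b". That matches neither the
German-style nor the Finnish-style convention; it is just an artifact of
the encoding.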

Of course, for something like git, we sort by binary value, because we 
also require the sort to be not just well-defined, but *stable*. A sort 
based on any kind of Unicode rule is rather likely to change over time.
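
A sketch of what sorting by binary value means, in Python. This is only an
illustration, not git's actual code (git is written in C), and the file
names and helper name are made up.

```python
import unicodedata

def bytewise_sorted(names):
    # Compare the raw UTF-8 bytes; no locale or Unicode tables are consulted.
    return sorted(names, key=lambda n: n.encode("utf-8"))

names = ["b.txt", "\u00e4.txt", "Z.txt", "a.txt"]
order = bytewise_sorted(names)
# order == ["Z.txt", "a.txt", "b.txt", "ä.txt"]

# The composed and decomposed spellings of the same name are different bytes.
composed = unicodedata.normalize("NFC", "\u00e4.txt").encode("utf-8")
decomposed = unicodedata.normalize("NFD", "\u00e4.txt").encode("utf-8")
```

The order is not pretty for humans ("Z" before "a"), but it depends on
nothing except the bytes, so it cannot change underneath you. It also means
the composed and decomposed forms of a name are simply two different names.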

			Linus