On Tue, 22 Jan 2008, Reece Dunn wrote:
>
> 1. File name representation
>
> For Linux file systems (correct me if I am wrong here), they all store
> the file name as-is. The question here is what happens on
> Windows-based file systems (e.g. NTFS) that are being read on Linux?

Generally, Linux tries to follow the conventions of the filesystem, so
for NTFS it's generally case-preserving and case-insensitive (but not
normalizing in any way - the case insensitivity is literally an upcase
lookup table, so you do "upcase(c1) == upcase(c2)" for each UCS-2
character, no combining or decomposition).

But the fs volume can specify if it's a case-sensitive volume or not.

And the volume will also actually contain the "upcase[]" array that
defines the case-sensitivity, so exactly *which* characters are
equivalent isn't actually defined by any external entity, it's defined
by the particular filesystem instance itself! There's a default upcase
table which is probably the one almost everybody uses.

Caveat: I've never used NTFS myself, so I don't have any personal
knowledge. I can see the sources, and what it thinks it is doing, but
whether it works that way or not I'll leave to others.

Also, note: at least as far as Linux is concerned, NTFS is pure UCS-2.
Not UTF-16.

> For Mac filesystems, you have the Unicode character decomposition
> issues to deal with.
>
> For Windows, you have UTF-16 filename support.

.. and for pretty much all unixes, you also have potentially Latin1 or
any other local convention (eg EUC-JP or EUC-KR). Sometimes you'd have
to guess from the name itself what it is (ie there might be a mixture).

In those cases, it's probably best to *not* even try to convert to
unicode.

> There are two basic usages for file/directory names: passing the name
> to the Operating System; getting the name from the Operating System.
> Therefore, you have:
>
> os_to_git_path( const NATIVECHAR * ospath, strbuf * gitpath );
> git_to_os_path( const char * gitpath, const NATIVECHAR * ospath, int oslen );

It's not going to be that simple. And if you want type safety, it's the
"ospath" that needs to be "char *", since that's what you get from the
OS and it's really the "index form" that you want to protect from giving
unconverted by mistake to "lstat()" and friends.

And it's going to be really quite painful even with the compiler
pointing out each point where you use an "index name" with an operation
that wants an "OS name".

It would be interesting to see how painful it is (make the actual
"conversion" be a no-op at first, just casting the pointer), but I
suspect the answer is "very".

		Linus
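
For illustration only, here is roughly what the per-character
"upcase(c1) == upcase(c2)" comparison on UCS-2 names mentioned above
could look like. This is a sketch, not the kernel's NTFS code: the names
upcase, init_default_upcase and names_equal are made up for the example,
and the table only folds ASCII a-z, whereas a real volume carries its
own full table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UPCASE_LEN 65536

/*
 * Hypothetical stand-in for the table stored on the volume: one
 * entry per UCS-2 code unit.
 */
static uint16_t upcase[UPCASE_LEN];

/* Toy table (an assumption): identity everywhere except ASCII a-z. */
static void init_default_upcase(void)
{
	uint32_t c;

	for (c = 0; c < UPCASE_LEN; c++)
		upcase[c] = (uint16_t)c;
	for (c = 'a'; c <= 'z'; c++)
		upcase[c] = (uint16_t)(c - 'a' + 'A');
}

/*
 * Compare two names one code unit at a time through the table.
 * No normalization, combining or decomposition is attempted, so
 * names of different length can never match.
 */
static int names_equal(const uint16_t *a, size_t alen,
		       const uint16_t *b, size_t blen)
{
	size_t i;

	assert(a && b);
	if (alen != blen)
		return 0;
	for (i = 0; i < alen; i++)
		if (upcase[a[i]] != upcase[b[i]])
			return 0;
	return 1;
}
```

With a table like this, "Readme" and "README" compare equal, while a
precomposed U+00E9 and the two-unit sequence 'e' + U+0301 do not, since
nothing is decomposed before the lookup.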
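
A minimal sketch of the type-safety experiment described above, assuming
a made-up wrapper type: OS names stay plain "char *", the index form
gets its own struct, and the conversion is a no-op for now. None of
these names (struct index_name, os_to_index_name, index_to_os_name,
lstat_index_name) exist in git, and the sketch wraps the pointer in a
one-member struct rather than casting it, so that mixing the two forms
is a compile error.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>

/* Hypothetical: a name in "index form". Distinct from char * on purpose. */
struct index_name {
	const char *s;
};

/* OS name -> index name. A no-op for now; real conversion would go here. */
static struct index_name os_to_index_name(const char *ospath)
{
	struct index_name n;

	assert(ospath);
	n.s = ospath;
	return n;
}

/* Index name -> OS name. Also a no-op for now. */
static const char *index_to_os_name(struct index_name name)
{
	assert(name.s);
	return name.s;
}

/*
 * Wrapper that takes the index form and converts before calling
 * lstat(). Passing a struct index_name straight to lstat() does
 * not compile, which is the whole point of the exercise.
 */
static int lstat_index_name(struct index_name name, struct stat *st)
{
	return lstat(index_to_os_name(name), st);
}
```

Every call site that currently hands an index name to lstat() and
friends would need either a wrapper like this or an explicit
conversion, which is where the pain would show up.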