Re: [RFC] File system difference handling in git

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Tue, 22 Jan 2008 08:56:28 -0800 (PST)

On Tue, 22 Jan 2008, Reece Dunn wrote:
> 
>   1.  File name representation
> 
> For Linux file systems (correct me if I am wrong here), they all store
> the file name as-is. The question here is what happens on
> Windows-based file systems (e.g. NTFS) that are being read on Linux?

Generally, Linux tries to follow the conventions of the filesystem, so 
NTFS on Linux is case-preserving and case-insensitive (but not normalizing 
in any way - the case-insensitivity is literally an upcase lookup table, so 
you do "upcase(c1) == upcase(c2)" for each UCS-2 character, no combining or 
decomposition).

But the fs volume can specify if it's a case-sensitive volume or not. And 
the volume will also actually contain the "upcase[]" array that defines 
the case-sensitivity, so exactly *which* characters are equivalent isn't 
actually defined by any external entity, it's defined by the particular 
filesystem instance itself!

There's a default upcase table which is probably the one almost everybody 
uses.
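
To make the table lookup concrete, here is a rough sketch of that kind of 
comparison. It is not the kernel's NTFS code: the names "upcase_init_ascii" 
and "names_equal_nocase" are made up for illustration, and the table built 
here only folds ASCII a-z, which is an assumption standing in for whatever 
table a real volume carries.

```c
#include <stddef.h>
#include <stdint.h>

#define UPCASE_LEN 65536	/* one entry per 16-bit UCS-2 character */

/*
 * Hypothetical stand-in for a volume's upcase[] table: identity
 * everywhere except ASCII a-z. A real volume ships its own table.
 */
static void upcase_init_ascii(uint16_t *upcase)
{
	uint32_t c;

	for (c = 0; c < UPCASE_LEN; c++)
		upcase[c] = (uint16_t)c;
	for (c = 'a'; c <= 'z'; c++)
		upcase[c] = (uint16_t)(c - 'a' + 'A');
}

/*
 * Return 1 if the two names match under the given table, 0 otherwise.
 * Strictly one table lookup per 16-bit unit: no normalization, no
 * combining or decomposition, no surrogate handling.
 */
static int names_equal_nocase(const uint16_t *a, size_t alen,
			      const uint16_t *b, size_t blen,
			      const uint16_t *upcase)
{
	size_t i;

	if (alen != blen)
		return 0;
	for (i = 0; i < alen; i++)
		if (upcase[a[i]] != upcase[b[i]])
			return 0;
	return 1;
}
```

Since the table is an argument, two volumes with different tables can 
disagree about whether the same two names collide. And a precomposed 
character never matches its decomposed form, because the lengths already 
differ.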

Caveat: I've never used NTFS myself, so I don't have any personal 
knowledge. I can see the sources, and what it thinks it is doing, but 
whether it works that way or not I'll leave to others.

Also, note: at least as far as Linux is concerned, NTFS is pure UCS-2. Not 
UTF-16. 

> For Mac filesystems, you have the Unicode character decomposition
> issues to deal with.
> 
> For Windows, you have UTF-16 filename support.

.. and for pretty much all unixes, you also have potentially Latin1 or any 
other local convention (eg EUC-JP or EUC-KR). 

Sometimes you'd have to guess from the name itself what it is (ie there 
might be a mixture). In those cases, it's probably best to *not* even try 
to convert to unicode.

> There are two basic usages for file/directory names: passing the name
> to the Operating System; getting the name from the Operating System.
> Therefore, you have:
> 
>    os_to_git_path( const NATIVECHAR * ospath, strbuf * gitpath );
>    git_to_os_path( const char * gitpath, const NATIVECHAR * ospath, int oslen );

It's not going to be that simple. And if you want type safety, it's the 
"ospath" that needs to be "char *", since that's what you get from the OS 
and it's really the "index form" that you want to protect from being 
passed unconverted by mistake to "lstat()" and friends.

And it's going to be really quite painful even with the compiler pointing 
out each point where you use an "index name" with an operation that wants 
an "OS name".

It would be interesting to see how painful it is (make the actual 
"conversion" be a no-op at first, just casting the pointer), but I suspect 
the answer is "very".
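
A minimal sketch of what that experiment might look like, assuming the 
index form gets wrapped in its own struct type. None of these names 
("struct index_name", "index_to_os", "os_to_index", "index_lstat") exist 
in git; they are purely hypothetical, and the conversion is the no-op 
described above.

```c
#include <sys/stat.h>

/*
 * Hypothetical wrapper for a name in "index form". Because it is a
 * distinct struct type, the compiler rejects any attempt to hand it
 * to lstat() and friends without an explicit conversion.
 */
struct index_name {
	const char *s;
};

/* No-op conversions for now: same bytes, different type. */
static inline const char *index_to_os(struct index_name name)
{
	return name.s;
}

static inline struct index_name os_to_index(const char *ospath)
{
	struct index_name name = { ospath };
	return name;
}

/* The only way an index name reaches the OS is through a conversion. */
static int index_lstat(struct index_name name, struct stat *st)
{
	return lstat(index_to_os(name), st);
}
```

With this, "lstat(name, &st)" on a "struct index_name" fails to compile, 
while OS paths stay plain "char *". The pain is in every existing call 
site that would have to pick one type or the other.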

			Linus