Re: git on MacOSX and files with decomposed utf-8 file names

Kevin Ballard <kevin@xxxxxx> · Wed, 16 Jan 2008 23:07:25 -0500

On Jan 16, 2008, at 8:41 PM, Linus Torvalds wrote:

On Thu, 17 Jan 2008, Johannes Schindelin wrote:
On Thu, 17 Jan 2008, Wincent Colaiuta wrote:

On 17/1/2008, at 1:40, Pedro Melo wrote:

That's the point I'm making. The fact that I need to set LANG across
all users of a project is insane...

FWIW if you use another filesystem, such as reiserfs or ext[2-4], the
filenames will be _unaffected_ by your particular setting of LANG. They
will be stored byte-wise exactly like asked for. That's why I call them
"sane".

One of the advantages (the biggest one, in fact, apart from the obvious
US-ASCII down-compatibility and the fact that you can do C-compatible
NUL-terminated strings) of UTF-8 is that it's locale-independent, and
doesn't care about LANG, because it's valid in all languages.
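The byte-level properties mentioned above (US-ASCII compatibility, no embedded NUL bytes, no dependence on LANG) can be checked with a small sketch. The helper name `utf8_properties` and the sample path are made up for illustration and are not from this thread.

```python
def utf8_properties(name: str) -> dict:
    """Report byte-level properties of a name once encoded as UTF-8.

    Hypothetical helper for illustration only.
    """
    encoded = name.encode("utf-8")
    ascii_part = "".join(c for c in name if ord(c) < 128)
    non_ascii_part = "".join(c for c in name if ord(c) >= 128)
    return {
        # No NUL byte appears unless the name itself contains U+0000,
        # so C-style NUL-terminated strings keep working.
        "contains_nul": 0 in encoded,
        # US-ASCII characters encode to the same single bytes as in ASCII.
        "ascii_bytes_unchanged": (
            ascii_part.encode("utf-8") == ascii_part.encode("ascii")
        ),
        # Every byte of a non-ASCII character has the high bit set, so it
        # can never be mistaken for '/', '.', or any other ASCII byte.
        "non_ascii_bytes_all_high": all(
            b >= 0x80 for b in non_ascii_part.encode("utf-8")
        ),
    }


# A made-up path mixing languages across components:
# "docs/" + Japanese directory + "/" + Russian file name + ".txt"
mixed_path = (
    "docs/\u65e5\u672c\u8a9e/"
    "\u0447\u0438\u0442\u0430\u0442\u044c.txt"
)
print(utf8_properties(mixed_path))
```

The encoding step takes no locale argument at all, which is the sense in which LANG cannot matter for the stored bytes.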

And that's really important. It's important for a very simple reason:
there is almost never such a thing as "a locale" except for US-ASCII. Once
you move away from US-ASCII, it actually tends to be much more common that
you have a *mixture* of locales - often in the same "document" - than to
have one single locale.

It very much happens even in filenames - people "mix" locales in trivial
ways even within a single pathname component (non-US-ASCII filename, but
with a regular file extension), but much more interestingly they do so
within a directory tree (ie you can have translation subdirectories where
the filenames themselves are in another language, and you can have full
pathnames where different components are in different languages, for
example).

And UTF-8 is _wonderful_ for this, because LANG doesn't matter, and
cannot matter, and thus mixing isn't a problem.

Of course, you can screw it up. Locales still can change things like sort
order and capitalization etc, so even if you use UTF-8, you sure can get
into trouble with LANG and thinking that a per-session locale makes sense.

So choosing UTF-8 for the filesystem isn't wrong per se. It's a fine
choice, and has no issues with LANG in itself. Limiting it to strictly
valid UTF-8 encodings is also fine. Limiting it (further) to only
character normalized UTF-8 is also fine.

Most Linux filesystems don't limit it in any way, so you can make
filenames that aren't valid UTF-8 at all, much less normalizing
multi-character sequences.

I personally think that's the best option, but I probably do so mostly
because I know some people still use Latin1 as their only locale (and I
suspect Asia will take decades before it has converted to UTF-8 and will
also have cases where they use other non-UTF locales).

But enforcing clean UTF-8 is not a bad idea per se. Not allowing byte
sequences that aren't a valid UTF-8 encoding (eg \xc0\xc0 is not a valid
UTF-8 character) is fine.
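As a rough illustration of what "not allowing invalid byte sequences" would mean, here is a minimal sketch of a strict validity check. The function name `is_valid_utf8` is hypothetical; this is not how any particular filesystem implements the check.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw bytes decode as strict UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The example from the text: 0xC0 can never start a valid sequence.
print(is_valid_utf8(b"\xc0\xc0"))

# A Latin1-encoded name (e-acute is the single byte 0xE9) is rejected too,
# because 0xE9 announces a three-byte sequence that never arrives.
print(is_valid_utf8("caf\u00e9".encode("latin-1")))

# The same name encoded as UTF-8 passes.
print(is_valid_utf8("caf\u00e9".encode("utf-8")))
```

A filesystem that enforced this would refuse the first two names outright, which is restrictive but at least fails loudly instead of renaming anything behind your back.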

I wouldn't call people crazy for doing that, although it does mean that
you cannot, for example, decide to write a Latin1 filename (which is not
necessarily a *good* idea in this day and age, but I think there's a
difference between "that's not a good idea" and "you cannot do that").

And even limiting the UTF-8 charset further to only the minimal
representation of one particular glyph (ie not allowing multi-character
sequences that can be represented more simply) may be even *more*
big-brother, but would at least not cause the technical aliasing issues. I
personally think that's so controlling as to be stupid (and has no real
advantage), but hey, at least it doesn't *corrupt* anything silently.

So I think that using UTF-8 as a character encoding is a *good* thing to
do, and that automatically means that LANG shouldn't matter for filenames,
but within that choice of UTF-8 there are still mistakes that you can
make. Notably multi-character normalization and case-insensitivity.

			Linus

Alright, you've made your point, and I'm willing to concede at least
some of what you've said. So perhaps we can now move on to the more
relevant and practical issue: HFS+, despite how stupid it may or
may not be, normalizes filenames (and is case-insensitive, which is a
related issue). This causes a problem with git. How can this be solved?
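To make the problem concrete, here is a small sketch of the mismatch. It assumes standard NFD as a stand-in for whatever decomposed form the filesystem hands back (the exact form HFS+ uses is not spelled out in this thread), and the helper name `same_name_after_normalization` is made up for illustration. It is not a proposal for how git should do it.

```python
import unicodedata

# The name as a user might have added it: precomposed e-acute (U+00E9).
composed = "caf\u00e9.txt"

# The name as a normalizing filesystem might return it: 'e' followed by
# a combining acute accent (U+0301). Standard NFD is an assumption here.
decomposed = unicodedata.normalize("NFD", composed)

print(composed.encode("utf-8"))    # b'caf\xc3\xa9.txt'
print(decomposed.encode("utf-8"))  # b'cafe\xcc\x81.txt'

# A byte-wise comparison treats these as two different files.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))


def same_name_after_normalization(a: str, b: str) -> bool:
    """Compare two names after mapping both to one normalization form.

    Hypothetical helper; NFC is an arbitrary choice of common form.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_name_after_normalization(composed, decomposed))
```

Anything that compares names byte-wise sees the first form in its records and the second on disk, so the file can look both missing and untracked at once.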

I'm more than willing to do work to solve it; my biggest issue is that I
don't believe I have the free time to learn the git internals
well enough to do proper work on what I would assume is a
fairly performance-critical section of git's code. However, I would be
happy to work with others who are perhaps more knowledgeable in this
area.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com
