Re: [PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]

"Rafael Garcia-Suarez" <rgarciasuarez@xxxxxxxxx> · Tue, 22 Jan 2008 10:13:12 +0100

On 22/01/2008, Junio C Hamano wrote:
> If the project uses UTF-8-NFC, we would need to adjust check-in
> and check-out codepath like Linus's readdir(3) hack suggested,
> but that needs to be done only on HFS+.  Of course, the project
> participants need to be careful not to create files that HFS+
> cannot handle (two paths that happen to be equivalent strings
> should not be created), but I do not think that is such a big
> issue as some people seem to make a big deal out of.  If you

Right, I don't see that as a big issue -- for new files. But we can have
files that were created in the past as non-handleable by HFS+, and later
renamed to something more portable.
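
To make that concrete, here is a rough Python sketch (purely my own
illustration, nothing that exists in git) of how one could scan the paths
ever recorded in history for names that only differ by Unicode
normalization. It assumes plain NFD is a good enough stand-in for what
HFS+ does, which is only approximately true; the function name and the
sample list are hypothetical.

```python
import unicodedata
from collections import defaultdict

def normalization_collisions(paths):
    """Return groups of paths that become identical after NFD normalization."""
    groups = defaultdict(list)
    for p in paths:
        # Assumption: NFD approximates the decomposed form HFS+ stores.
        groups[unicodedata.normalize("NFD", p)].append(p)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical list of paths seen anywhere in the project's history.
history = ["M\u00e4rchen", "Ma\u0308rchen", "README"]
collisions = normalization_collisions(history)
```

Both spellings of the first name land in the same group, so an old
revision containing them side by side could not be checked out faithfully
on such a filesystem, even if the tip of the project is clean.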

More generally, the consensus encoding might change over time. We can
imagine a project which contains, say, a test file with a latin-1 name
that later gets renamed to a UTF-8 name (due to a project policy
change), making it necessary to adjust said test. A checkout of the
earlier version would have that test failing. (But maybe I'm just
handwaving towards a non-existent problem here. I'd consider the issue
as minor anyway.)
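
A small Python sketch of what I mean, under the assumption that the test
looks the file up by its raw bytes (the helper name is made up for the
example): the same visible name is a different byte string in each
encoding, and the latin-1 bytes are not even valid UTF-8.

```python
name = "M\u00e4rchen"

# The same visible name, as stored under each encoding policy.
latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")

def decodes_as_utf8(raw):
    """Hypothetical helper: True if raw is a valid UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

old_name_is_utf8 = decodes_as_utf8(latin1_bytes)
new_name_is_utf8 = decodes_as_utf8(utf8_bytes)
```

So a test adjusted to expect the UTF-8 bytes would not find the file in a
checkout of the earlier revision, and vice versa.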

> want to be interoperable with different filesystems, you should
> not create two paths that are different only in case, and if
> there are participants who are on such a filesystem, the mistake
> is quickly spotted and corrected.  It happened in git.git to a
> file other than that infamous Märchen.  It's exactly the same
> issue [*1*].
>
> In short, initially I did not like Linus's readdir(3) hack very
> much, but the more I think about it, I like it the better.
>
> We pick a reasonable default (i.e. "no conversion") at the
> technical level, and recommend (but do not pay for the overhead
> of enforcing) a reasonable normalization as the BCP at the human
> level.  Only on filesystems that mangle the pathnames, or if you
> want legacy encodings on the filesystem, we would need to pay
> overhead for conversion and help people with actual code to do
> so.
>
> To support the above scenarios, I think each instance of
> repository needs to be able to say "this path (specified with a
> matching pattern in the filename encoding) should be converted
> this way coming in, and that way going out."  UTF-8 only project
> would have NFC<->NFD on HFS+ partition, and nothing on
> everywhere else.  EUC-JP project that checks out as-is would
> specify nothing either, but people on Shift_JIS platforms would
> locally specify that EUC-JP <-> Shift_JIS conversion to be made.

Sounds sane, except maybe the part where you specify paths with a
pattern. Do you really need this layer of complexity? Pattern matching
in different encodings has proven to be troublesome. Usually that's
where UTF-8 normalisation rules and locale-specific behaviours kick in,
esp. when you're starting to use \w or \d character classes, or case
insensitivity. For example, if you want to do it correctly, "I" will
match /i/ case-insensitively, except in Turkish locales... (Sorry, I'm
just handwaving again here...)
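
For what it's worth, a short Python sketch of both pitfalls. It uses
Python's fnmatch module as a stand-in for whatever pattern syntax the
attribute would end up using, which is an assumption on my part; the
variable names are only illustrative.

```python
import fnmatch
import unicodedata

# Same visible name in composed (NFC) and decomposed (NFD) form.
nfc = "M\u00e4rchen"
nfd = unicodedata.normalize("NFD", nfc)

# "?" matches exactly one code point, so the result depends on the form.
pattern = "M?rchen"
nfc_matches = fnmatch.fnmatchcase(nfc, pattern)
nfd_matches = fnmatch.fnmatchcase(nfd, pattern)

# Locale-independent case mapping sends both dotted "i" and Turkish
# dotless "i" (U+0131) to the same capital "I".
dotted_upper = "i".upper()
dotless_upper = "\u0131".upper()
```

The pattern matches the NFC spelling but not the NFD one, and a
case-insensitive comparison cannot tell which lowercase letter a capital
"I" was meant to pair with unless it knows the locale.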