Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

On 04/29/2014 05:23 AM, Jeff King wrote:
On Mon, Apr 28, 2014 at 10:49:30PM +0200, Torsten Bögershausen wrote:

OK, thanks for the description.
In theory we can make Git "composition ignoring" by changing
index_file_exists() in name-hash.c.
(Both names must be precomposed first and then compared.)
Yeah, we could perhaps get away without storing the extra precomposed
form if we just stored the entries under their precomposed hash, and
then taught same_name to use a slower precompose-aware comparison. But I
also see that we do binary searches in index_name_pos (called by
index_name_is_other). I don't know if we'd have to deal with this
problem there, too.
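
To make the idea above concrete, here is a minimal sketch in Python rather than Git's C. It assumes "precomposed" means Unicode NFC; NameIndex, add() and exists() are hypothetical names for illustration and do not correspond to Git's name-hash.c. Entries are stored under the hash of their precomposed form, and the lookup uses a precompose-aware comparison. It does not address the sorted binary search in index_name_pos.

```python
import unicodedata


def precompose(name: str) -> str:
    """Return the NFC (precomposed) form of a file name."""
    return unicodedata.normalize("NFC", name)


class NameIndex:
    """Toy name table (hypothetical, not Git's actual data structure)."""

    def __init__(self):
        # Precomposed key -> list of names exactly as they were stored.
        self._by_key = {}

    def add(self, name: str) -> None:
        # The stored name is kept untouched; only the hash key is precomposed.
        self._by_key.setdefault(precompose(name), []).append(name)

    def exists(self, name: str) -> bool:
        # Hash on the precomposed form, then compare precomposed forms.
        key = precompose(name)
        return any(precompose(stored) == key
                   for stored in self._by_key.get(key, []))
```

With this, a name added in decomposed form ("Mu\u0308ller.txt") is found when looked up in precomposed form ("M\u00fcller.txt"), while "Muller.txt" is still a different name.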
Just thinking out loud:
We precompose whenever we read file names from disk; that's done in readdir().
We precompose whenever we get an argv into Git; that's done in precompose_argv().
We would also need to precompose every time we read file names from the index file on disk(?) into memory.
That we don't do today, and my suggestion to hack name-hash.c isn't a good one.

Probably we need to go into read-cache.c, or more places. I'm not an expert
who knows all of Git's internal details.
Basically all places where strings containing file names are put into memory are affected,
and I wouldn't be too concerned about CPU cycles.

I don't know how many people are using Git versions before 1.7.12 (the
first version supporting precomposed unicode).

Could we simply ask them to upgrade?
I'm not sure what you mean here. Upgrading won't help, because the
values are baked into existing history created with the old versions
forever. Any time I "git checkout v1.0" on the broken project, a modern
git will complain about the ghost untracked file.
It depends on whether all file names in a certain repo are stored decomposed
(in this case everybody can set core.precomposeunicode false),
or whether there is a mixture of precomposed and decomposed names
in different commits/directories...
In this case we can normalize the master branch.
For older commits the users need to wait for an updated Git version;
until then they need to live with the ghosts as well as they can.
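
To tell which of the two cases a given repo is in, the names can be classified by normalization form. The following is a rough sketch in Python, assuming "precomposed" means NFC and "decomposed" means NFD (Mac OS X's on-disk form is only close to NFD); classify() is a hypothetical helper, not part of Git.

```python
import unicodedata


def classify(names):
    """Count file names by Unicode normalization form."""
    counts = {"unaffected": 0, "precomposed": 0, "decomposed": 0, "other": 0}
    for name in names:
        nfc = unicodedata.normalize("NFC", name)
        nfd = unicodedata.normalize("NFD", name)
        if nfc == nfd:
            # Plain ASCII and similar: both forms are identical.
            counts["unaffected"] += 1
        elif name == nfc:
            counts["precomposed"] += 1
        elif name == nfd:
            counts["decomposed"] += 1
        else:
            # Partly composed, partly decomposed within one name.
            counts["other"] += 1
    return counts
```

If "precomposed" is zero, everybody can set core.precomposeunicode false; if both "precomposed" and "decomposed" are non-zero, the repo has the mixture described above.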


The next problem is that people need to agree if the repo should store
names in pre- or decomposed form.
(My vote is for precomposed.)
Unfortunately the core.precomposeunicode setting is repo-local, so everybody
needs to "agree globally" and "configure locally".
Yeah, that was sort of my "point 1" from the patch. I'm a bit worried
that it creates problems for people on other systems, though. Linux
people do not need to care about precomposed/decomposed at all, and any
attempt we make to automatically handle it is going to run afoul of
non-utf8 encodings. Not to mention that it does not solve the "git
checkout v1.0" problem above.
I'm not sure what is meant by non-utf8 encodings.
Mac OS X can only handle Unicode filenames,
and a single ISO-8859-1 byte will be returned as "%XY" from readdir().
So if you want to share a repo with Mac OS X (and/or Windows),
Unicode should be used.
Are you saying that there is a Linux machine feeding in file names in e.g. 8859-1 or EUC?
My experience/knowledge is that you cannot do that (in a useful way).
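
To illustrate why such names do not survive: a non-ASCII ISO-8859-1 byte is not valid UTF-8 on its own. The sketch below is Python; is_valid_utf8() and percent_escape_invalid() are hypothetical helpers, and the escaping only mimics the "%XY" behaviour described above. It is not taken from the Mac OS X sources.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw file name bytes decode as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def percent_escape_invalid(raw: bytes) -> str:
    """Replace every byte that is not valid UTF-8 with %XY (hex)."""
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            out.append("%{:02X}".format(code - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)
```

For example, "Müller" encoded as ISO-8859-1 is the bytes M, 0xFC, l, l, e, r; the 0xFC byte is invalid as UTF-8 and comes out as "M%FCller", whereas the UTF-8 encoding of the same name passes through unchanged.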


Side note:
I wish we had this config variable travelling with the repo, like .gitattributes does
for text dealing with CRLF-LF.
Yeah, I guess if we wanted to enforce it everywhere, you would have to
use .gitattributes since we cannot safely turn on such a feature by
default.

I don't know how many reports you have. Reading all this, it feels as if the affected users
could "normalize" their repos and run "git config core.precomposeunicode true", followed
by "git config --global core.precomposeunicode true".
Does that sound like a possible way forward?
I have just a handful of reports. Maybe 3-4? I didn't dig them all up
today, as it would be a non-trivial effort. But I have no idea how good
a sampling that is.
Maybe the following could help:
git -c core.quotepath=false ls-files | iconv -f UTF-8-MAC -t UTF-8 >expected
git -c core.quotepath=false ls-files >actual
diff expected actual
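
The UTF-8-MAC encoding is, as far as I know, only available in the iconv shipped with Mac OS X. On other systems roughly the same check can be sketched in Python, assuming NFC is close enough to what the UTF-8-MAC conversion produces (the two differ for a few character ranges). names_needing_precompose() is a hypothetical helper; it takes the raw output of "git ls-files -z".

```python
import unicodedata


def names_needing_precompose(ls_files_z: bytes):
    """Return the names from `git ls-files -z` output that are not in NFC."""
    found = []
    for raw in ls_files_z.split(b"\0"):
        if not raw:
            continue  # trailing NUL leaves an empty last field
        try:
            name = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # non-UTF-8 names are out of scope for this check
        if unicodedata.normalize("NFC", name) != name:
            found.append(name)
    return found


# Possible use inside a work tree (not executed here):
#   import subprocess
#   out = subprocess.run(["git", "ls-files", "-z"],
#                        capture_output=True, check=True).stdout
#   for name in names_needing_precompose(out):
#       print(name)
```

An empty result corresponds to an empty diff in the shell version above.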

-Peff
