On 22/01/2008, Junio C Hamano wrote:
> If the project uses UTF-8-NFC, we would need to adjust check-in
> and check-out codepath like Linus's readdir(3) hack suggested,
> but that needs to be done only on HFS+. Of course, the project
> participants need to be careful not to create files that HFS+
> cannot handle (two paths that happen to be equivalent strings
> should not be created), but I do not think that is such a big
> issue as some people seem to make a big deal out of. If you

Right, I don't see that as a big issue -- for new files. But we can
have files that were created in the past with names that HFS+ cannot
handle, and later renamed to something more portable.

More generally, the consensus encoding might change over time. We can
imagine a project which contains, say, a test file with a latin-1
name that later gets renamed to a UTF-8 name (due to a project policy
change), making it necessary to adjust the said test. A checkout of
the earlier version would have that test failing.

(But maybe I'm just handwaving towards a non-existent problem here.
I'd consider the issue as minor anyway.)

> want to be interoperable with different filesystems, you should
> not create two paths that are different only in case, and if
> there are participants who are on such a filesystem, the mistake
> is quickly spotted and corrected. It happened in git.git to a
> file other than that infamous Märchen. It's exactly the same
> issue [*1*].
>
> In short, initially I did not like Linus's readdir(3) hack very
> much, but the more I think about it, I like it the better.
>
> We pick a reasonable default (i.e. "no conversion") at the
> technical level, and recommend (but do not pay for the overhead
> of enforcing) a reasonable normalization as the BCP at the human
> level. Only on filesystems that mangle the pathnames, or if you
> want legacy encodings on the filesystem, we would need to pay
> overhead for conversion and help people with actual code to do
> so.
>
> To support the above scenarios, I think each instance of
> repository needs to be able to say "this path (specified with a
> matching pattern in the filename encoding) should be converted
> this way coming in, and that way going out." UTF-8 only project
> would have NKC<->NKD on HFS+ partition, and nothing on
> everywhere else. EUC-JP project that checks out as-is would
> specify nothing either, but people on Shift_JIS platforms would
> locally specify that EUC-JP <-> Shift_JIS conversion to be made.

Sounds sane, except maybe the part where you specify paths with a
pattern. Do you really need this layer of complexity?

Pattern matching in different encodings has proven to be troublesome.
Usually that's where UTF-8 normalisation rules and locale-specific
behaviours kick in, esp. when you start using \w or \d character
classes, or case insensitivity. For example, if you want to do it
correctly, "I" will match /i/ case-insensitively, except in Turkish
locales, where the lowercase of "I" is the dotless "ı"...

(Sorry, I'm just handwaving again here...)
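
To make the two pitfalls discussed above a bit more concrete, here is
a minimal Python sketch. It is only an illustration, not anything git
does: the helper name same_after_normalization is hypothetical, and
Python's locale-independent str.lower()/str.upper() stand in for
whatever case folding a pattern matcher would actually use.

```python
import unicodedata

# Hypothetical helper, not part of git: compare two path names after
# normalizing both to the same Unicode normalization form.
def same_after_normalization(a: str, b: str, form: str = "NFC") -> bool:
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

nfc_name = "M\u00e4rchen"   # precomposed a-umlaut (NFC spelling)
nfd_name = "Ma\u0308rchen"  # "a" + combining diaeresis (NFD spelling)

# The two spellings are different strings until they are normalized.
print(nfc_name == nfd_name)                          # False
print(same_after_normalization(nfc_name, nfd_name))  # True

# Python's default case mapping ignores the locale, so it cannot
# express the Turkish rule that "I" lowercases to dotless "ı".
print("I".lower() == "i")              # True
print("\u0131".upper() == "I")         # True (dotless ı uppercases to I)
print("\u0130".lower() == "i\u0307")   # True (İ becomes i + combining dot)
```

The last three lines show why a case-insensitive pattern is hard to
get right across locales: a locale-independent mapping treats "I" and
"i" as a pair, which is exactly what a Turkish user would not expect.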