Junio C Hamano <gitster@xxxxxxxxx> writes:

> Sam Vilain <sam.vilain@xxxxxxxxxxxxxxx> writes:
>
>> Some projects may like to enforce a particular encoding is used for
>> all filenames in the repository. Within the UTF-8 encoding, there are
>> four normal forms (see http://unicode.org/reports/tr15/), any of which
>> may be a reasonable repository format choice. Additionally, some
>> filesystems may have a single encoding that they support when writing
>> local filenames. To support this, iconv and a normalization library
>> must have the information they need to perform the correct conversion.
>
> Isn't there a chicken-and-egg problem? The attributes are by
> nature per-path, and you need to match the pathname string with
> a pattern to decide which attribute definition to apply to a
> given path. Before knowing what encoding the pathname you have
> just read from readdir(3), how would you match that pathname
> with the pattern in the gitattributes file?
>
> I can buy the .git/config (and an in-tree .git-encoding,
> perhaps), though.

I admit that the Documentação/ja/お読み下さい example was contrived
(the last component is README-in-Japanese), and if anybody still
wanted to have such a tree sanely, the only practical cross-platform
and multi-language way to do so is to have everything in UTF-8 at
the repository level.

In that sense, the project does not need to specify anything, other
than marking that "all of the pathnames in tree objects are in UTF-8
(we could go stronger, and say which kind of normalization we
want)". As there is no other practical choice than UTF-8-NFC if you
want to be cross-platform, compatible, and multi-language, the
project can just declare that is what it uses and does not have to
mark it specially.

A particular clone of such a project may want to check everything
out as-is to get a UTF-8-only tree (I'll mention HFS+ shortly).
Another clone may want to get mixed legacy encodings by running
mkdir(utf8_to_latin1("Documentação")) and
creat(utf8_to_eucjp("お読み下さい")), but that is purely a local
matter and should not be controlled by anything in-tree, be it
.gitattributes or .git-encoding.

On the other hand, it is not so unusual to see a legacy encoding
used in the pathnames, especially if your project does not need to
deal with multi-language issues. In such a repository, I do not want
to enforce that all the paths in tree objects MUST be UTF-8. If all
the project participants agree to work with EUC-JP pathnames in tree
objects, we should not make the users always go through double
conversion going from readdir(3) to index, and coming from index
back to open(2) or creat(2). Again, that is done by agreement among
project participants, so there is nothing that needs to be specified
in-tree.

If the project uses UTF-8-NFC, we would need to adjust the check-in
and check-out codepaths as Linus's readdir(3) hack suggested, but
that needs to be done only on HFS+.

Of course, the project participants need to be careful not to create
files that HFS+ cannot handle (two paths that happen to be
equivalent strings under Unicode normalization should not be
created), but I do not think that is as big an issue as some people
make it out to be. If you want to be interoperable with different
filesystems, you should not create two paths that are different only
in case, and if there are participants who are on such a filesystem,
the mistake is quickly spotted and corrected. It happened in git.git
to a file other than that infamous Märchen. It's exactly the same
issue [*1*].

In short, initially I did not like Linus's readdir(3) hack very
much, but the more I think about it, the better I like it. We pick a
reasonable default (i.e. "no conversion") at the technical level,
and recommend (but do not pay for the overhead of enforcing) a
reasonable normalization as the BCP at the human level.
Only on filesystems that mangle the pathnames, or if you want legacy
encodings on the filesystem, would we need to pay the overhead for
conversion and help people with actual code to do so.

To support the above scenarios, I think each instance of a
repository needs to be able to say "this path (specified with a
matching pattern in the filename encoding) should be converted this
way coming in, and that way going out." A UTF-8-only project would
have NFC<->NFD on an HFS+ partition, and nothing everywhere else. An
EUC-JP project that checks out as-is would specify nothing either,
but people on Shift_JIS platforms would locally specify that an
EUC-JP <-> Shift_JIS conversion be made.

[Footnote]

*1* This is an important point, especially since the breakage was
about tests that used files "a" and "A". No pathname enforcement in
git-as-scm would have done anything to avoid the breakage. But there
are humans involved in the project and they are an integral part of
ensuring interoperability.
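
To make the "converted this way coming in, and that way going out"
idea concrete, here is a rough sketch in Python. It is not code from
git: the rule names, the CONVERSIONS table, and the helper functions
are all hypothetical, and it assumes pathnames are handled as raw
bytes. Pattern matching on paths is left out. Note also that HFS+
uses its own variant of decomposition, so plain NFD is only an
approximation of what that filesystem does.

```python
import unicodedata

# Hypothetical per-clone conversion rules (none of these names exist
# in git). Each rule is a pair: (going out to the filesystem,
# coming in from the filesystem). Pathnames are raw bytes.

def identity(path: bytes) -> bytes:
    """Default: no conversion at all."""
    return path

def nfc_to_nfd(path: bytes) -> bytes:
    """UTF-8 NFC in the repository -> UTF-8 NFD on disk (HFS+-like)."""
    text = path.decode("utf-8")
    return unicodedata.normalize("NFD", text).encode("utf-8")

def nfd_to_nfc(path: bytes) -> bytes:
    """UTF-8 NFD read back from disk -> UTF-8 NFC for the repository."""
    text = path.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")

def eucjp_to_sjis(path: bytes) -> bytes:
    """EUC-JP in the repository -> Shift_JIS on the local filesystem."""
    return path.decode("euc_jp").encode("shift_jis")

def sjis_to_eucjp(path: bytes) -> bytes:
    """Shift_JIS on the local filesystem -> EUC-JP for the repository."""
    return path.decode("shift_jis").encode("euc_jp")

CONVERSIONS = {
    "as-is": (identity, identity),
    "utf8-on-hfs+": (nfc_to_nfd, nfd_to_nfc),
    "eucjp-on-sjis": (eucjp_to_sjis, sjis_to_eucjp),
}

def checkout_name(rule: str, repo_path: bytes) -> bytes:
    """Name to pass to open(2)/creat(2) for a path stored in a tree."""
    return CONVERSIONS[rule][0](repo_path)

def checkin_name(rule: str, fs_path: bytes) -> bytes:
    """Name to record in the index for a path read via readdir(3)."""
    return CONVERSIONS[rule][1](fs_path)

if __name__ == "__main__":
    repo = unicodedata.normalize("NFC", "Märchen").encode("utf-8")
    on_disk = checkout_name("utf8-on-hfs+", repo)
    # The on-disk bytes differ, but the round trip restores them.
    print(repo != on_disk, checkin_name("utf8-on-hfs+", on_disk) == repo)
```

The "as-is" rule is the default and costs nothing; only a clone that
picks one of the other rules pays for the conversion, and the choice
lives in that clone rather than in the tree.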