Hi guys,

As part of trying to fix the problems in MSysGit around tree encodings, I would like to start a discussion on mitigating the backwards-compatibility problems caused by tree path encodings being unspecified.

## History

For those unfamiliar with the issue, here is a quick refresher. Git has traditionally not specified the string encoding of paths inside the tree object: whatever strings the OS returned from the readdir syscall were used verbatim to write out tree objects. For most operating systems, this was UTF-8 (though even on certain POSIX OSs there are caveats around Unicode normalization, such as on OS X). However, on Windows until *very* recently (and on non-Unicode Linux locales), the strings returned by the OS are in a locale-specific code page (e.g. Shift-JIS, Windows-1252, etc.) and *not* Unicode. Repositories created this way are currently misinterpreted on other OSs (or even on the same OS with a different locale configured). Note that *blob* (i.e. content) encoding is a separate issue and is out of scope at the moment.

This will become a bigger problem in the near future, because MSysGit is seeking to fix this mistake on Windows by explicitly writing all tree objects in UTF-8. While this is great for new repositories, it creates a compatibility problem: people who upgrade Git on their local machine will now have issues with their existing repos.

## Proposed Mitigation

As an initial mitigation, I'd like to propose that git clone or git checkout print a warning to the user when invalid UTF-8 strings are detected in tree paths. However, without an actionable solution, the warning is not much help beyond suggesting that they downgrade to an earlier version of Git. Possible solutions that we've discussed are:

* Add a git-config setting to explicitly set the code page, defaulting to UTF-8. With this, the warning could instruct users to set this config locally.
  This has the additional benefit of letting Linux users work with these existing Windows repositories.
* Create a conversion utility to rewrite all trees to use UTF-8. This is problematic for obvious reasons: the result will be incompatible with the original repo, and, more importantly, it may be non-trivial to detect which encoding the strings were originally written in. libicu (http://site.icu-project.org/) has code to do this.

-- 
Paul Betts <paul@xxxxxxxxxxxxx>
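
P.S. To make the detection step and the config fallback a bit more concrete, here is a rough Python sketch. It is only an illustration of the idea, not Git code: the function names and the `fallback_codepage` parameter are hypothetical and do not correspond to any existing Git setting. It assumes tree-entry names are available as raw bytes. Note also that a strict UTF-8 check can only flag names that are *invalid* UTF-8; a legacy-encoded name that happens to form valid UTF-8 would pass unnoticed.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if a raw tree-entry name decodes as strict UTF-8."""
    try:
        name.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def find_suspect_paths(names):
    """Return the entry names that are not valid UTF-8 (candidates for a warning)."""
    return [n for n in names if not is_valid_utf8(n)]


def decode_path(name: bytes, fallback_codepage: str = "utf-8") -> str:
    """Decode an entry name as UTF-8, falling back to a user-configured code page.

    `fallback_codepage` stands in for the proposed (hypothetical) config setting.
    With the default of UTF-8, an invalid name still raises UnicodeDecodeError.
    """
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        return name.decode(fallback_codepage)


if __name__ == "__main__":
    # Simulated tree entries: plain ASCII, Shift-JIS, and Windows-1252 names.
    entries = [
        b"README",
        "日本語.txt".encode("shift_jis"),
        "café.txt".encode("cp1252"),
    ]
    for raw in find_suspect_paths(entries):
        print("warning: path is not valid UTF-8:", raw)
```

With the fallback set to the code page the repository was written under (for example `decode_path(raw, "shift_jis")`), the legacy names decode correctly, while names that are already UTF-8 are unaffected.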