Johannes Sixt:
I don't think that this assumption is valid.
Depends on where you are coming from. For the files stored in the Git repositories, I believe all file names are supposed to be UTF-8 encoded (just like commit messages and user names are). That's the assumption I started working from.
Users will always have some code page set that is not UTF-8.
Indeed. And as long as the char-pointer interfaces in stdio and elsewhere work on that assumption, we have a problem.
For example, if the user specifies a file name on the command line, then it will not enter git in UTF-8, but in the current "ANSI" or "OEM code page" encoding.
That problem is already solved, as we do have a wchar_t command line available. If you pass a file name on the command line that is not representable in the current "ANSI" codepage, it will come out as garbage in the char* version, but will be correct in the wchar_t* version. Thus we need to convert the wchar_t* version to UTF-8 and use that instead.
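To make the conversion step concrete, here is a minimal, portable sketch of turning UTF-16 code units into UTF-8. The function name utf16_to_utf8 and its signature are hypothetical, not anything from git or the RFC patch. On Windows the input would presumably come from GetCommandLineW() (split with CommandLineToArgvW()), and WideCharToMultiByte() with CP_UTF8 could do the same job; the hand-written version only shows what the conversion involves, including rejecting unpaired surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of git: convert `len` UTF-16 code units
 * to UTF-8. Writes at most `cap` bytes including the terminating NUL.
 * Returns the number of bytes written (excluding the NUL), or -1 on an
 * unpaired surrogate or if the output buffer is too small.
 */
long utf16_to_utf8(const uint16_t *in, size_t len, char *out, size_t cap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);

    while (i < len) {
        uint32_t cp = in[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i >= len || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i++] - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1; /* low surrogate without a preceding high one */
        }

        if (cp < 0x80) {
            if (o + 1 >= cap)
                return -1;
            out[o++] = (char)cp;
        } else if (cp < 0x800) {
            if (o + 2 >= cap)
                return -1;
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            if (o + 3 >= cap)
                return -1;
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            if (o + 4 >= cap)
                return -1;
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }

    if (o >= cap)
        return -1; /* no room for the terminating NUL */
    out[o] = '\0';
    return (long)o;
}
```

Each argument would be converted once at startup, so the rest of the program only ever sees UTF-8 char* strings.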
If git prints a file name under the assumption that it is UTF-8 encoded, then it will be displayed incorrectly because the system uses a different encoding.
Here setting the local codepage to UTF-8 *might* work, although I haven't tested that. Or always use the wchar_t versions of printf and friends.
I think you are grossly underestimating the venture that you want to undertake here.
I've done this before with other software, so, yes, I know it is quite a big undertaking. That is also why I started out with a minimal RFC patch to see if there was any interest in working with this.
Please come up with a plan for how you are going to deal with the various issues. File names enter and leave the system through different channels:
- the command line and terminal window
GetCommandLineW() as described above.
- object database (tree objects)
Those file names are supposedly always UTF-8.
- opendir/readdir; opening files or directories for reading or writing
Wrap file open and directory read to use the wchar_t versions, converting that to UTF-8 strings at the API level.
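As a rough sketch of what such a wrapper for file open could look like: the name utf8_fopen and the fixed buffer sizes are my own assumptions for illustration, not code from the patch. The Windows branch relies on MultiByteToWideChar() and _wfopen(); on other platforms it simply falls through to fopen().

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#include <wchar.h>
#endif

/*
 * Hypothetical wrapper, not git's actual code: callers always pass a
 * UTF-8 path. On Windows the path is converted to UTF-16 and handed to
 * the wide-character API; elsewhere it is passed through unchanged.
 */
FILE *utf8_fopen(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);
#ifdef _WIN32
    wchar_t wpath[MAX_PATH], wmode[16];

    /* MB_ERR_INVALID_CHARS makes the conversion fail on ill-formed UTF-8. */
    if (!MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                             path, -1, wpath, MAX_PATH))
        return NULL;
    if (!MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                             mode, -1, wmode, 16))
        return NULL;
    return _wfopen(wpath, wmode);
#else
    return fopen(path, mode);
#endif
}
```

Directory reading would need the mirror image: enumerate with the wide-character functions and convert each name back to UTF-8 before handing it to the caller.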
And there is probably some more... How do you treat encodings in these channels? What if the file names are not valid UTF-8? Etc.
Ill-formed UTF-8 should just be rejected. Invalid UTF-8 is worse. I'm not sure what the Linux version does when running in a UTF-8 locale. Does it allow ill-formed or illegal UTF-8 sequences?
NTFS allows almost any sequence of wchar_t's, it doesn't even have to be valid UTF-16.
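For reference, rejecting ill-formed UTF-8 amounts to a check along these lines. This is a minimal sketch; utf8_is_well_formed is a hypothetical name, not an existing git function. It rejects stray continuation bytes, truncated sequences, overlong encodings, encoded surrogates, and code points above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of git: returns 1 if the `len` bytes at
 * `s` are well-formed UTF-8, 0 otherwise.
 */
int utf8_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t n, k;

        if (c < 0x80) {
            i++; /* plain ASCII */
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0; /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= n)
            return 0; /* sequence is truncated */

        for (k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0; /* expected a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0; /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* UTF-16 surrogate encoded as UTF-8 */
        if (cp > 0x10FFFF)
            return 0; /* beyond the Unicode range */

        i += n + 1;
    }
    return 1;
}
```

A file name in a legacy 8-bit encoding, such as "caf" followed by the single byte 0xE9, fails this check, which is how mixed encodings in existing repositories would show up.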
The biggest obstacle will be that git does not have a notion of "file name encoding" - it simply treats a file name as a stream of bytes.
Yeah, that is one of the major bugs in its design, IMHO. But almost everyone seems to assume that file names are UTF-8 strings anyway, so in the absence of any other information, it's as good an assumption as any to make.
If the byte streams are regarded as having an encoding, then you can have ambiguities, mixed encodings, or invalid characters. You would have to deal with this in some way.
We already see problems with file names that cannot properly be represented on some file systems (case-only differences in the Linux kernel when checked out on Windows, Mac OS' built-in Unicode normalization of file names, etc.).
Windows 9x is already out of the loop.
Good.
--
\\// Peter - http://www.softwolves.pp.se/