Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.

Johannes Sixt:

I don't think that this assumption is valid.

Depends on where you are coming from. For the files stored in the Git repositories, I believe all file names are supposed to be UTF-8 encoded (just like commit messages and user names are). That's the assumption I started working from.

Users will always have some code page set that is not UTF-8.

Indeed. And as long as the char-pointer interfaces in stdio and elsewhere work on that assumption, we have a problem.

For example, if the user specifies a file name on the command line, then
it will not enter git in UTF-8, but in the current "ANSI" or "OEM code
page" encoding.

That problem is already solved, as we do have a wchar_t command line available. If you pass a file name on the command line that is not representable in the current "ANSI" codepage, it will come out as garbage in the char* version, but will be correct in the wchar_t* version. Thus we need to convert that to UTF-8 and use that instead.
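To illustrate the conversion step, here is a minimal sketch in portable C. The function name utf16_to_utf8 and its buffer convention are made up for this example and are not part of the patch; on Windows itself, WideCharToMultiByte() with CP_UTF8 could do the same job on the strings obtained from the wide command line.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert n UTF-16 code units to UTF-8.
 * Writes at most outsize bytes, including the terminating NUL.
 * Returns the number of bytes written (excluding the NUL), or -1 on
 * an unpaired surrogate or if the output buffer is too small.
 */
long utf16_to_utf8(const uint16_t *in, size_t n, char *out, size_t outsize)
{
    size_t i = 0, o = 0;

    while (i < n) {
        uint32_t c = in[i++];

        if (c >= 0xD800 && c <= 0xDBFF) {
            /* high surrogate: must be followed by a low surrogate */
            if (i >= n || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return -1;
            c = 0x10000 + ((c - 0xD800) << 10) + (in[i++] - 0xDC00);
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return -1; /* lone low surrogate */
        }

        if (c < 0x80) {
            if (o + 1 >= outsize)
                return -1;
            out[o++] = (char)c;
        } else if (c < 0x800) {
            if (o + 2 >= outsize)
                return -1;
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            if (o + 3 >= outsize)
                return -1;
            out[o++] = (char)(0xE0 | (c >> 12));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {
            if (o + 4 >= outsize)
                return -1;
            out[o++] = (char)(0xF0 | (c >> 18));
            out[o++] = (char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    if (o >= outsize)
        return -1;
    out[o] = '\0';
    return (long)o;
}
```

Failing on unpaired surrogates is only one possible policy; the sketch picks it so that the output is always well-formed UTF-8.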

If git prints a file name under the assumption that it is UTF-8 encoded, then it will be displayed incorrectly because the system uses a different encoding.

Here setting the local codepage to UTF-8 *might* work, although I haven't tested that. Or always use the wchar_t versions of printf and friends.

I think you are grossly underestimating the venture that you want to undertake here.

I've done this before with other software, so, yes, I know it is quite a big undertaking. That is also why I started out with a minimal RFC patch to see if there was any interest in working with this.

Please come up with a plan how you are going to deal with the various
issues. File names enter and leave the system through different channels:

- the command line and terminal window

GetCommandLineW() as described above.

- object database (tree objects)

Those file names are supposedly always UTF-8.

- opendir/readdir; opening files or directories for reading or writing

Wrap file open and directory read to use the wchar_t versions, converting that to UTF-8 strings at the API level.
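As a rough sketch of what such a wrapper could look like for file open: the name utf8_fopen and the fixed-size buffers are assumptions made for this example, not code from the patch. The Windows branch uses MultiByteToWideChar() and _wfopen(); the other branch just passes the bytes through.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>

/* Convert a NUL-terminated UTF-8 string to wchar_t; returns 0 on failure. */
static int utf8_to_wide(const char *utf8, wchar_t *out, int outlen)
{
    /* MB_ERR_INVALID_CHARS makes the call fail on ill-formed input */
    return MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                               utf8, -1, out, outlen);
}

/* Hypothetical wrapper: takes a UTF-8 path, calls the wchar_t API. */
FILE *utf8_fopen(const char *path, const char *mode)
{
    wchar_t wpath[MAX_PATH], wmode[16];

    if (!utf8_to_wide(path, wpath, MAX_PATH) ||
        !utf8_to_wide(mode, wmode, 16))
        return NULL;
    return _wfopen(wpath, wmode);
}
#else
/* Elsewhere the file name is handed to the C library unchanged. */
FILE *utf8_fopen(const char *path, const char *mode)
{
    return fopen(path, mode);
}
#endif
```

A directory-read wrapper would go the other way, converting the wchar_t names returned by the system to UTF-8 before handing them to the rest of the code.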

And there is probably some more... How do you treat encodings in these channels? What if the file names are not valid UTF-8? Etc.

Ill-formed UTF-8 should just be rejected. Invalid UTF-8 is worse. I'm not sure what the Linux version does, when running in a UTF-8 locale. Does it allow ill-formed or illegal UTF-8 sequences?

NTFS allows almost any sequence of wchar_t's, it doesn't even have to be valid UTF-16.
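For reference, rejecting bad file names needs a check along the following lines. This is a minimal sketch; the function name is_wellformed_utf8 is hypothetical, and treating overlong forms, encoded surrogates and values above U+10FFFF as errors is the choice made here, following the usual definition of well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if s[0..len) is well-formed UTF-8, else 0. */
int is_wellformed_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) {
            i++; /* plain ASCII */
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0; /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= need)
            return 0; /* sequence is truncated */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0; /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min)
            return 0; /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* encoded surrogate */
        if (cp > 0x10FFFF)
            return 0; /* beyond the Unicode range */
        i += need + 1;
    }
    return 1;
}
```

A name in a legacy single-byte encoding, such as a lone 0xE5 byte for "å" in Latin-1, fails this check because the expected continuation bytes are missing.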

The biggest obstacle will be that git does not have a notion of "file name encoding" - it simply treats a file name as a stream of bytes.

Yeah, that is one of the major bugs in its design, IMHO. But almost everyone seems to assume that file names are UTF-8 strings anyway, so in the absence of any other information, it's as good an assumption as any to make.

If the byte streams are regarded as having an encoding, then you can have ambiguities, mixed encodings, or invalid characters. You would have to deal with this in some way.

We already see problems with file names that cannot properly be represented on some file systems (case-only differences in the Linux kernel when checked out on Windows; Mac OS' built-in Unicode normalization of file names, etc.).

Windows 9x is already out of the loop.

Good.

--
\\// Peter - http://www.softwolves.pp.se/