Re: [RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.

Peter Krefting <peter@xxxxxxxxxxxxxxxx> · Mon, 02 Mar 2009 11:46:47 +0100 (CET)

Johannes Sixt:

I don't think that this assumption is valid.

Depends on where you are coming from. For the files stored in the Git 
repositories, I believe all file names are supposed to be UTF-8 encoded 
(just like commit messages and user names are). That's the assumption I 
started working from.

Users will always have some code page set that is not UTF-8.

Indeed. And as long as the char-pointer interfaces in stdio and elsewhere 
work on that assumption, we have a problem.

For example, if the user specifies a file name on the command line, then
it will not enter git in UTF-8, but in the current "ANSI" or "OEM code
page" encoding.

That problem is already solved as we do have a wchar_t command line 
available. If you pass a file name that is not representable in the current 
"ANSI" codepage on the command line, it will come out as garbage in the 
char* version, but will be correct in the wchar_t* version. Thus we need to 
convert the wchar_t* version to UTF-8 and use that instead.
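
As a rough sketch of that conversion step, assuming the wide command line
has already been split into NUL-terminated UTF-16 arguments: the helper
name utf16_to_utf8 is made up for illustration and is not part of git. On
Windows itself, WideCharToMultiByte() with CP_UTF8 could do the same job;
the portable version below only shows what the conversion has to handle,
including unpaired surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of git): convert a NUL-terminated UTF-16
 * string to UTF-8. Returns the number of bytes written, not counting the
 * terminating NUL, or -1 if the input contains an unpaired surrogate or
 * the output buffer is too small.
 */
static long utf16_to_utf8(const uint16_t *in, char *out, size_t outlen)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outlen == 0)
        return -1;

    while (*in) {
        uint32_t cp = *in++;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (*in < 0xDC00 || *in > 0xDFFF)
                return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1; /* low surrogate without a preceding high one */
        }

        /* Each branch keeps one byte free for the terminating NUL. */
        if (cp < 0x80) {
            if (n + 1 >= outlen)
                return -1;
            out[n++] = (char)cp;
        } else if (cp < 0x800) {
            if (n + 2 >= outlen)
                return -1;
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            if (n + 3 >= outlen)
                return -1;
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            if (n + 4 >= outlen)
                return -1;
            out[n++] = (char)(0xF0 | (cp >> 18));
            out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[n] = '\0';
    return (long)n;
}
```

Rejecting unpaired surrogates here is a design choice for the sketch; a
real implementation would have to decide what to do with such arguments.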

If git prints a file name under the assumption that it is UTF-8 encoded, 
then it will be displayed incorrectly because the system uses a different 
encoding.

Here setting the local codepage to UTF-8 *might* work, although I haven't 
tested that. Or always use the wchar_t versions of printf and friends.

I think you are grossly underestimating the venture that you want to 
undertake here.

I've done this before with other software, so, yes, I know it is quite a big 
undertaking. That is also why I started out with a minimal RFC patch to see 
if there was any interest in working with this.

Please come up with a plan how you are going to deal with the various
issues. File names enter and leave the system through different channels:

- the command line and terminal window

GetCommandLineW() as described above.

- object database (tree objects)

Those file names are supposedly always UTF-8.

- opendir/readdir; opening files or directories for reading or writing

Wrap file open and directory read to use the wchar_t versions, converting 
that to UTF-8 strings at the API level.
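
A minimal sketch of what such a wrapper could look like for file open. The
name utf8_fopen is hypothetical, and the fixed MAX_PATH buffer is a
simplification. MultiByteToWideChar() and _wfopen() are the existing
Windows calls; on other platforms the sketch just passes the bytes
through.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>

/*
 * Hypothetical wrapper (not part of git): interpret `path` as UTF-8 and
 * open the file through the wide-character API. Returns NULL if the name
 * is not valid UTF-8, does not fit in MAX_PATH, or the open fails.
 */
static FILE *utf8_fopen(const char *path, const char *mode)
{
    wchar_t wpath[MAX_PATH];
    wchar_t wmode[16];

    assert(path != NULL && mode != NULL);

    /* MB_ERR_INVALID_CHARS makes the conversion fail on bad UTF-8. */
    if (!MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                             path, -1, wpath, MAX_PATH))
        return NULL;
    if (!MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                             mode, -1, wmode, 16))
        return NULL;

    return _wfopen(wpath, wmode);
}
#else
/* Elsewhere the byte string is handed to the C library unchanged. */
static FILE *utf8_fopen(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);
    return fopen(path, mode);
}
#endif
```

Callers would then use utf8_fopen(name, "rb") wherever they call fopen()
today; opendir/readdir would need the same treatment in the opposite
direction, converting the wide names returned by the system to UTF-8.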

And there is probably some more... How do you treat encodings in these 
channels? What if the file names are not valid UTF-8? Etc.

Ill-formed UTF-8 should just be rejected. Invalid UTF-8 is worse. I'm not 
sure what the Linux version does, when running in a UTF-8 locale. Does it 
allow ill-formed or illegal UTF-8 sequences?
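
To make "rejected" concrete, here is a sketch of a check that could run
before a name is accepted. The function name is made up, and treating
overlong forms, encoded surrogates and values above U+10FFFF all as errors
is my assumption about where the line should be drawn.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of git): return 1 if the `len` bytes at
 * `s` form well-formed UTF-8, 0 otherwise.
 */
static int is_wellformed_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) {            /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;              /* stray continuation or invalid lead byte */
        }

        if (len - i <= extra)
            return 0;              /* sequence is cut off */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;          /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;              /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;              /* encoded surrogate */
        if (cp > 0x10FFFF)
            return 0;              /* beyond the Unicode range */

        i += extra + 1;
    }
    return 1;
}
```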

NTFS allows almost any sequence of wchar_t's, it doesn't even have to be 
valid UTF-16.

The biggest obstacle will be that git does not have a notion of "file name 
encoding" - it simply treats a file name as a stream of bytes.

Yeah, that is one of the major bugs in its design, IMHO. But almost everyone 
seems to assume that file names are UTF-8 strings anyway, so in the absence 
of any other information, it's as good an assumption as any to make.

If the byte streams are regarded as having an encoding, then you can have 
ambiguities, mixed encodings, or invalid characters. You would have to 
deal with this in some way.

We already see problems with file names that cannot properly be 
represented on some file systems (case-only differences in the Linux kernel 
when checked out on Windows; Mac OS' built-in Unicode normalization of file 
names, etc.).

Windows 9x is already out of the loop.

Good.

--
\\// Peter - http://www.softwolves.pp.se/