Re: [PATCH/RFC v3 6/8] Add case insensitivity support when using git ls-files

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Mon, 4 Oct 2010 17:53:48 +0000

On Mon, Oct 4, 2010 at 16:49, Joshua Jensen <jjensen@xxxxxxxxxxxxxxxxx> wrote:
>> Is anyone thinking "unicode" around here?
>
> On Windows, Unicode filenames are 16-bit wide characters. The current code
> doesn't handle them at all.
>
> I do not know about other file systems and what Git actually handles. I was
> under the impression it didn't handle Unicode filenames well in general... ?

The only sane way of doing this sort of thing is to have a defined
*internal* encoding that gets converted to whatever the native
encoding is at the input/output points.

So Git could use Unicode represented by UTF-8, UTF-16 (whatever's
convenient) internally, but when you check out files those checked out
files can be in whatever encoding you choose.

So you could have a UTF-8 repository but check out UTF-16 filenames on
Windows. I.e. internally we'd have the file:

    æab

Represented by UTF-8:

    c3 a6 61 62 \0

But would check out UTF-16:

    ff fe e6 00 61 00 62 00

Then when you add a new file it'll know it's in UTF-16 and convert it
to UTF-8 before writing to the repository. All invisible to the user.
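
As a rough sketch of that round trip (this is not anything Git does
today; to_internal() and to_native() are made-up names for
illustration, and the choice of UTF-8 as the internal encoding is just
the assumption from the example above):

```python
import codecs

# Assumed internal encoding, as in the example above.
INTERNAL = "utf-8"

def to_internal(native: bytes, native_encoding: str) -> bytes:
    """Input point: convert a native-encoded filename to the internal encoding."""
    return native.decode(native_encoding).encode(INTERNAL)

def to_native(internal: bytes, native_encoding: str) -> bytes:
    """Output point: convert an internally stored filename to the native encoding."""
    return internal.decode(INTERNAL).encode(native_encoding)

# What the repository would store for the file "æab".
stored = "æab".encode(INTERNAL)
assert stored.hex(" ") == "c3 a6 61 62"

# Checkout: little-endian UTF-16 with an explicit byte order mark.
checked_out = codecs.BOM_UTF16_LE + to_native(stored, "utf-16-le")
assert checked_out.hex(" ") == "ff fe e6 00 61 00 62 00"

# Adding the file back: the "utf-16" codec reads and strips the BOM.
assert to_internal(checked_out, "utf-16") == stored
```

The only thing the two helpers need is the name of the native
encoding; everything between the input and output points sees only the
internal form.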

Perl handles encoding issues like this, and it's awesome. The only
thing you have to do is make sure that the system knows the encoding
of data going into it, and what encoding you want out of it.

But any implementation of this is far off, and just storing raw byte
streams is Good Enough now that almost everyone uses UTF-8 anyway, so
nobody's seriously worked on this.