Re: non-US-ASCII file names (e.g. Hiragana) on Windows

On Tuesday 01 December 2009 17:26:27 you wrote:
> Johannes Sixt <j.sixt@xxxxxxxxxxxxx> wrote:
> > Thomas Singer schrieb:
> > > To be more precise: Who is interpreting the bytes in the file names as
> > > characters? Windows, Git or Java?
> >
> > In the case of git: Windows does it, using the console's codepage to
> > convert between bytes and Unicode.
> >
> > I don't know about Java, but I guess that no conversion is necessary
> > because Java is Unicode-aware.
>
> Actually, conversion is necessary, and it's something that is proving
> to be really painful within JGit.
>
> The Java IO APIs use UTF-16 for file names.  However we are reading
> a stream of unknown bytes from the index file and tree objects.
> Thus JGit must convert a stream of bytes into UTF-16 just to get
> to the OS.
>
> The JVM then turns around and converts from UTF-16 to some other
> encoding for the filesystem.
>
> On Win32 I suspect the JVM uses the native UTF-16 file APIs, so
> this translation is lossless.
>
> On POSIX, I suspect the JVM uses $LANG or some other related
> environment variable to guess the user's preferred encoding, and
> then converts from UTF-16 to bytes in that encoding.  And I have
> no idea how they handle normalization of composed code points.
>
> All of these layers make for a *very* confusing situation for us
> within JGit:
>
>   git tree
>   +---------+
>   | bytes   | -+
>   +---------+   \
>                  \             +--------+            +---------+
>                   +-- JGit --> | UTF-16 | -- JVM --> | OS call |
>   .git/index     /             +--------+            +---------+
>   +---------+   /
>   | bytes   | -+
>   +---------+
>
> It's impossible for us to do what C git does, which is just use the
> bytes used by the OS call within the git data structure.  Which of
> course also isn't always portable, e.g. the Mac OS X HFS+ mess.

We can decode the index any way we like, but not file names coming from
the file system. On Windows, any sane name will be readable by JGit (Windows
does allow invalid UTF-16 too, but...). On a UTF-8 POSIX system that may not
be so if the file name is actually Latin-1 encoded. In that case the Java
runtime will return a decoded file name containing an "invalid" code point,
and any attempt to access the file from Java will fail. I can see some
horribly expensive ways to work around that, but...
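
To make the failure concrete, here is a minimal sketch of the decoding step
only. It assumes the runtime decodes leniently as UTF-8, which is what
new String(bytes, UTF_8) does; what the JVM really does inside File.list()
is not shown here and may differ. The class and method names are made up
for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyNameDemo {
    // Lenient decode: malformed input is replaced with U+FFFD.
    static String decodeLenient(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // True if decoding as UTF-8 and encoding again gives back the same bytes.
    static boolean roundTrips(byte[] raw) {
        byte[] again = decodeLenient(raw).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, again);
    }

    public static void main(String[] args) {
        // "é.txt" as written by a Latin-1 tool: a lone 0xE9 is not valid UTF-8.
        byte[] latin1Name = {(byte) 0xE9, '.', 't', 'x', 't'};
        // The same name written as UTF-8: 0xC3 0xA9.
        byte[] utf8Name = {(byte) 0xC3, (byte) 0xA9, '.', 't', 'x', 't'};

        System.out.println(roundTrips(utf8Name));    // true
        System.out.println(roundTrips(latin1Name));  // false
        // Position of the replacement character in the decoded Latin-1 name.
        System.out.println(decodeLenient(latin1Name).indexOf('\uFFFD')); // 0
    }
}
```

Once the original byte has been replaced by U+FFFD, the string no longer
encodes back to the bytes on disk, so a later open by that name cannot find
the file.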

As for the more sane cases, I have a compare routine that works on mixed
encodings and may help to solve some of the problems. Ideally it would not
only compare file names with unknown encodings but also handle case folding
and composing characters in one go. I guess one could make it fall back to
an encoding other than Latin-1, though with less certainty, and it will
definitely not work with an arbitrary set of encodings. You'll have to
choose, so it's only a legacy workaround, as opposed to a solution.
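
Roughly the idea, as a sketch rather than the actual routine: try strict
UTF-8 first, fall back to Latin-1 when the bytes are malformed, then compare
on a normalized key. The names below are hypothetical, and toLowerCase() is
only a crude stand-in for real case folding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

class MixedEncodingNames {
    // Strict UTF-8 first; on malformed input assume Latin-1, which maps
    // every byte to a character and therefore never fails.
    static String decode(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    // Comparison key: NFC-composed, optionally lower-cased.
    static String key(byte[] raw, boolean ignoreCase) {
        String s = Normalizer.normalize(decode(raw), Normalizer.Form.NFC);
        return ignoreCase ? s.toLowerCase(Locale.ROOT) : s;
    }

    static boolean sameName(byte[] a, byte[] b, boolean ignoreCase) {
        return key(a, ignoreCase).equals(key(b, ignoreCase));
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};                   // é in Latin-1
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};        // é in UTF-8
        byte[] decomposed = {'e', (byte) 0xCC, (byte) 0x81}; // e + U+0301
        System.out.println(sameName(latin1, utf8, false));     // true
        System.out.println(sameName(decomposed, utf8, false)); // true
    }
}
```

The weak spot is the guess itself: a Latin-1 name whose bytes happen to form
valid UTF-8 (for example 0xC3 0xA9, which is "Ã©" in Latin-1) gets read as
UTF-8, which is why this stays a workaround.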

-- robin

