Re: non-US-ASCII file names (e.g. Hiragana) on Windows

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Tue, 1 Dec 2009 08:26:27 -0800

Johannes Sixt <j.sixt@xxxxxxxxxxxxx> wrote:
> Thomas Singer schrieb:
> > To be more precise: Who is interpreting the bytes in the file names as
> > characters? Windows, Git or Java?
> 
> In the case of git: Windows does it, using the console's codepage to
> convert between bytes and Unicode.
> 
> I don't know about Java, but I guess that no conversion is necessary
> because Java is Unicode-aware.

Actually, conversion is necessary, and its something that is proving
to be really painful within JGit.

The Java IO APIs use UTF-16 for file names.  However we are reading
a stream of unknown bytes from the index file and tree objects.
Thus JGit must convert a stream of bytes into UTF-16 just to get
to the OS.

The JVM then turns around and converts from UTF-16 to some other
encoding for the filesystem.

On Win32 I suspect the JVM uses the native UTF-16 file APIs, so
this translation is lossless.

On POSIX, I suspect the JVM uses $LANG or some other related
environment variable to guess the user's preferred encoding, and
then converts from UTF-16 to bytes in that encoding.  And I have
no idea how they handle normalization of composed code points.

All of these layers make for a *very* confusing situation for us
within JGit:

  git tree
  +---------+
  | bytes   | -+
  +---------+   \
                 \             +--------+            +---------+
                  +-- JGit --> | UTF-16 | -- JVM --> | OS call |
  .git/index     /             +--------+            +---------+
  +---------+   /
  | bytes   | -+
  +---------+

Its impossible for us to do what C git does, which is just use the
bytes used by the OS call within the git datastructure.  Which of
course also isn't always portable, e.g. the Mac OS X HFS+ mess.

:-)

-- 
Shawn.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html