Johannes Sixt <j.sixt@xxxxxxxxxxxxx> wrote:
> Thomas Singer schrieb:
> > To be more precise: Who is interpreting the bytes in the file names as
> > characters? Windows, Git or Java?
>
> In the case of git: Windows does it, using the console's codepage to
> convert between bytes and Unicode.
>
> I don't know about Java, but I guess that no conversion is necessary
> because Java is Unicode-aware.

Actually, conversion is necessary, and it's something that is proving
to be really painful within JGit.

The Java IO APIs use UTF-16 for file names.  However, we are reading a
stream of bytes in an unknown encoding from the index file and tree
objects.  Thus JGit must convert a stream of bytes into UTF-16 just to
get to the OS.

The JVM then turns around and converts from UTF-16 to some other
encoding for the filesystem.  On Win32 I suspect the JVM uses the
native UTF-16 file APIs, so this translation is lossless.  On POSIX,
I suspect the JVM uses $LANG or some other related environment
variable to guess the user's preferred encoding, and then converts
from UTF-16 to bytes in that encoding.  And I have no idea how they
handle normalization of composed code points.

All of these layers make for a *very* confusing situation for us
within JGit:

    git tree
   +---------+
   |  bytes  | -+
   +---------+   \
                  \              +--------+             +---------+
                   +-- JGit -->  | UTF-16 |  -- JVM --> | OS call |
    .git/index    /              +--------+             +---------+
   +---------+   /
   |  bytes  | -+
   +---------+

It's impossible for us to do what C git does, which is to store in the
git data structure exactly the bytes that are passed to the OS call.
Which of course also isn't always portable, e.g. the Mac OS X HFS+
mess.  :-)

--
Shawn.
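
To make the bytes-to-UTF-16 step above concrete, here is a minimal sketch of why it can lose information. It is not JGit code; the class and method names (PathBytesDemo, roundTrips, decodeStrict) are made up for illustration, and the sample name is assumed to have been written by a tool using ISO-8859-1. It relies only on the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PathBytesDemo {
    /** Decode raw path bytes strictly; throws if they are not valid in cs. */
    public static String decodeStrict(byte[] raw, Charset cs)
            throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    /** True if bytes -> String (UTF-16) -> bytes yields the original bytes. */
    public static boolean roundTrips(byte[] raw, Charset cs) {
        // new String(byte[], Charset) silently replaces malformed input
        // with U+FFFD, so the original bytes may not be recoverable.
        String s = new String(raw, cs);
        return Arrays.equals(raw, s.getBytes(cs));
    }

    public static void main(String[] args) {
        // Hypothetical path name: "f", u-umlaut, "r" encoded as ISO-8859-1.
        // The byte 0xFC is not a valid UTF-8 sequence.
        byte[] latin1Name = { 'f', (byte) 0xFC, 'r' };

        // Guessing UTF-8 destroys the name; guessing ISO-8859-1 preserves it.
        System.out.println(roundTrips(latin1Name, StandardCharsets.UTF_8));      // false
        System.out.println(roundTrips(latin1Name, StandardCharsets.ISO_8859_1)); // true

        try {
            decodeStrict(latin1Name, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            // A strict decoder at least reports the problem instead of hiding it.
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

The point of the sketch is only that the result depends on which charset the reader guesses. The bytes in a tree or index entry carry no encoding label, so any single guess is wrong for some repository.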