Re: git archive --format zip utf-8 issues

René Scharfe <rene.scharfe@xxxxxxxxxxxxxx> · Sat, 11 Aug 2012 22:53:29 +0200

On 11.08.2012 00:47, Junio C Hamano wrote:
Sven Strickroth <sven.strickroth@xxxxxxxxxxxxxxx> writes:

when I create a git repository, add a file containing utf-8 characters
or umlauts (like öäü.txt), commit and then export the HEAD revision to a
zip archive using "git archive --format zip -o 1.zip HEAD", the zip file
contains incorrect filenames:

My reading of archive-zip.c seems to suggest that we write out
whatever pathname you have in the tree, so a pathname encoded in
UTF-8 will be literally written out in the resulting zip archive.
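
The effect can be reproduced without git. A minimal Python sketch,
assuming the extracting program interprets the name bytes as code page
437 (one plausible behaviour, not the only one seen in practice):

```python
# Simulate a reader that assumes CP437 for name bytes that are really UTF-8.
name = "öäü.txt"
raw = name.encode("utf-8")    # bytes as stored in the tree and copied into the zip
shown = raw.decode("cp437")   # what a CP437-assuming reader would display

print(raw)
print(shown)
```

Each umlaut takes two bytes in UTF-8, so such a reader shows two
unrelated characters per umlaut instead of one.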

Sorry for my imperialistic attitude of "ASCII filenames should be enough 
for everybody".  Laziness..

Do you know in what encoding the pathnames are _expected_ to be
stored in zip archives?  Random documentation seems to suggest that
there is no standard encoding, e.g. http://docs.python.org/library/zipfile.html
says:

     There is no official file name encoding for ZIP files. If you
     have unicode file names, you must convert them to byte strings
     in your desired encoding before passing them to write(). WinZip
     interprets all file names as encoded in CP437, also known as DOS
     Latin.

which may explain it.

http://www.pkware.com/documents/casestudies/APPNOTE.TXT is the standard 
document, as Sven noted, and it says that filenames are encoded in code 
page 437, or optionally UTF-8 (a later addition).  Discussions like 
http://stackoverflow.com/questions/106367/ seem to indicate that at 
least some archivers use the local code page as well.

It may not be a bad idea for "git archive --format=zip" to

  (1) check if the pathname is valid UTF-8; and
  (2) check if it can be reencoded to latin-1

and if (and only if) both are true, automatically re-encode the path
to latin-1.
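
The two checks could look roughly like the following Python sketch. The
function name reencode_path is made up for illustration; this is not
code from archive-zip.c, just the proposed rule spelled out:

```python
def reencode_path(raw: bytes) -> bytes:
    """Re-encode raw as Latin-1 if both checks pass; otherwise return it unchanged.

    Hypothetical helper illustrating the proposal, not git's implementation.
    """
    try:
        text = raw.decode("utf-8")       # (1) is the pathname valid UTF-8?
        return text.encode("latin-1")    # (2) can it be re-encoded to Latin-1?
    except UnicodeError:
        # Either check failed: leave the original bytes alone.
        return raw


print(reencode_path("öäü.txt".encode("utf-8")))
```

Pure ASCII names pass both checks and come out byte-for-byte identical,
so only names with non-ASCII characters are affected.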

The standard says we need to convert to CP437, or to UTF-8, or provide 
both versions. A more interesting question is: What's supported by which 
programs?

The ZIP functionality built into Windows 7 doesn't seem to work with
UTF-8 encoded filenames (except for those that only use the ASCII
subset), and it seems to ignore the UTF-8 part if both are given.
Handling umlauts should be possible anyway, because they are in code
page 437, but for other characters we'd have to aim for compatibility
with other programs like Info-ZIP and 7-Zip.
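
One way to combine the two encodings, sketched in Python: try code page
437 first and fall back to UTF-8 only when a character doesn't fit. As I
read APPNOTE.TXT, UTF-8 names are signalled by bit 11 of the general
purpose bit flag. The names zip_name and UTF8_FLAG are invented for this
sketch, and whether a given archiver honours the flag is exactly the
open compatibility question:

```python
from typing import Tuple

UTF8_FLAG = 0x0800  # general purpose bit 11: file name and comment are UTF-8


def zip_name(path: str) -> Tuple[bytes, int]:
    """Return (name bytes, flag bits) for a zip entry.

    Hypothetical helper: prefer CP437, fall back to UTF-8 plus bit 11.
    """
    try:
        return path.encode("cp437"), 0
    except UnicodeEncodeError:
        return path.encode("utf-8"), UTF8_FLAG


print(zip_name("öäü.txt"))
print(zip_name("日本語.txt"))
```

With this rule the umlaut example is stored as three single CP437 bytes
with no flag set, while names outside code page 437 get UTF-8 bytes and
the flag.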

How do we know which encoding was used for a filename?

Of course, "git archive --format=zip --path-reencode=utf8-to-latin1"
would be the most generic way to do this.

I really hope we can make do without additional options.

René
