Re: git archive --format zip utf-8 issues

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 31.08.2012 00:26, schrieb Jeff King:
> Ping on this stalled discussion.

Sorry, I got distracted by other stuff again.  I did some experiments,
though, and here's a preliminary result.

> It seems like there are two separate issues here:
> 
>    1. Knowing the encoding of pathnames in the repository.
> 
>    2. Setting the right flags in zip output.
> 
> A full solution would handle both parts, but let's ignore (1) for a
> moment, and assume we have utf-8 (or can massage into utf-8 from an
> encoding specified by the user).

Yes, good thinking.  Re-encoding may be beneficial for tar files as well,
but we can ignore that point for the moment.

> It seems like just setting the magic utf-8 flag would be the only thing
> we need to do, according to the standard. But according to discussions
> referenced elsewhere in this thread, that flag was invented only in
> 2007, so we may be dealing with older implementations (I have no idea
> how common they would be; that may be the problem with Windows 7's zip
> you are seeing). We could re-encode to cp437, which the standard
> specifies, but apparently some implementations do not respect that
> (and use a local code page instead). And it cannot represent all utf-8
> characters, anyway.

Yes, we could do that, plus adding an extra field with a UTF-8 version of
the path.  That's the legacy method invented by Info-ZIP.  They switched
to using the new flag on Linux at least, though.

> It sounds like 7-zip has figured out a more portable solution. Can you
> show us a sample of 7-zip's output with utf-8 characters to compare to
> what git generates? I wonder if it is using a combination of methods.

I'm not so sure they produce more portable files.  I created an archive
with files named jaя.txt, smørrebrød.txt, süd.txt and €uro.txt with 7-Zip
on Windows 7 and while unzip on Ubuntu 12.04 managed to recreate the
cyrillic character and the Euro symbol, it mangled the slashed o and the
umlaut.

With the following patch I could create archives with git on Linux and
msysgit and extract them flawlessly on Windows with 7-Zip and with
Info-ZIP unzip on Linux, but not with unzip on Windows, where it mangled
all non-ASCII characters.

This gets confusing; it would help to have a compatibility matrix for all
intersting extractors and character classes -- for each proposed solution
or archiver we'd like to imitate.

But now for the patch, which is a bit confusing as well.  I'm curious to
hear about results for more platforms, extractors and character classes.
Based on that we can see if we need to generate the extra fields instead
of relying on the new flag.

-- >8 --
Subject: [PATCH] archive-zip: support UTF-8 paths

Set general purpose flag 11 if we encounter a path that contains
non-ASCII characters.  We assume that all paths are given as UTF-8; no
conversion is done.

The flag seems to be ignored by unzip unless we also mark the archive
entry as coming from a Unix system.  This is done by setting the field
creator_version ("version made by" in the standard[1]) to 0x03NN.

The NN part represents the version of the standard supported by us, and
this patch sets it to 3f (for version 6.3) for Unix paths.  We keep
creator_version set to 0 (FAT filesystem, standard version 0) in the
non-special cases, as before.

But when we declare a file to have a Unix path, then we have to set the
file mode as well, or unzip will extract the files with the permission
set 0000, i.e. inaccessible by all.

[1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Signed-off-by: Rene Scharfe <rene.scharfe@xxxxxxxxxxxxxx>
---
 archive-zip.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/archive-zip.c b/archive-zip.c
index f5af81f..928da1d 100644
--- a/archive-zip.c
+++ b/archive-zip.c
@@ -4,6 +4,8 @@
 #include "cache.h"
 #include "archive.h"
 #include "streaming.h"
+#include "commit.h"
+#include "utf8.h"
 
 static int zip_date;
 static int zip_time;
@@ -16,7 +18,8 @@ static unsigned int zip_dir_offset;
 static unsigned int zip_dir_entries;
 
 #define ZIP_DIRECTORY_MIN_SIZE	(1024 * 1024)
-#define ZIP_STREAM (8)
+#define ZIP_STREAM	(1 <<  3)
+#define ZIP_UTF8	(1 << 11)
 
 struct zip_local_header {
 	unsigned char magic[4];
@@ -173,7 +176,8 @@ static int write_zip_entry(struct archiver_args *args,
 {
 	struct zip_local_header header;
 	struct zip_dir_header dirent;
-	unsigned long attr2;
+	unsigned int creator_version = 0;
+	unsigned long attr2 = 0;
 	unsigned long compressed_size;
 	unsigned long crc;
 	unsigned long direntsize;
@@ -187,6 +191,13 @@ static int write_zip_entry(struct archiver_args *args,
 
 	crc = crc32(0, NULL, 0);
 
+	if (has_non_ascii(path)) {
+		if (is_utf8(path))
+			flags |= ZIP_UTF8;
+		else
+			warning("Path is not valid UTF-8: %s", path);
+	}
+
 	if (pathlen > 0xffff) {
 		return error("path too long (%d chars, SHA1: %s): %s",
 				(int)pathlen, sha1_to_hex(sha1), path);
@@ -204,10 +215,15 @@ static int write_zip_entry(struct archiver_args *args,
 		enum object_type type = sha1_object_info(sha1, &size);
 
 		method = 0;
-		attr2 = S_ISLNK(mode) ? ((mode | 0777) << 16) :
-			(mode & 0111) ? ((mode) << 16) : 0;
 		if (S_ISREG(mode) && args->compression_level != 0 && size > 0)
 			method = 8;
+		if (S_ISLNK(mode) || (mode & 0111) || (flags & ZIP_UTF8)) {
+			creator_version = 0x033f;
+			attr2 = mode;
+			if (S_ISLNK(mode))
+				attr2 |= 0777;
+			attr2 <<= 16;
+		}
 		compressed_size = size;
 
 		if (S_ISREG(mode) && type == OBJ_BLOB && !args->convert &&
@@ -254,8 +270,7 @@ static int write_zip_entry(struct archiver_args *args,
 	}
 
 	copy_le32(dirent.magic, 0x02014b50);
-	copy_le16(dirent.creator_version,
-		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
+	copy_le16(dirent.creator_version, creator_version);
 	copy_le16(dirent.version, 10);
 	copy_le16(dirent.flags, flags);
 	copy_le16(dirent.compression_method, method);
-- 
1.7.11.msysgit.1.1.gbf71771


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]