Re: git archive --format zip utf-8 issues

On Sat, Aug 11, 2012 at 11:37:05PM +0200, Sven Strickroth wrote:

> On 11.08.2012 at 22:53, René Scharfe wrote:
> > The standard says we need to convert to CP437, or to UTF-8, or provide 
> > both versions. A more interesting question is: What's supported by which 
> > programs?
> > 
> > The ZIP functionality built into Windows 7 doesn't seem to work with 
> > UTF-8 encoded filenames (except for those that only use the ASCII 
> > subset), and to ignore the UTF-8 part if both are given.
> 
> I played a bit with the git source code and found that
> 
> diff --git a/archive-zip.c b/archive-zip.c
> index f5af81f..e0ccb4f 100644
> --- a/archive-zip.c
> +++ b/archive-zip.c
> @@ -257,7 +257,7 @@ static int write_zip_entry(struct archiver_args *args,
>  	copy_le16(dirent.creator_version,
>  		S_ISLNK(mode) || (S_ISREG(mode) && (mode & 0111)) ? 0x0317 : 0);
>  	copy_le16(dirent.version, 10);
> -	copy_le16(dirent.flags, flags);
> +	copy_le16(dirent.flags, flags+2048);
>  	copy_le16(dirent.compression_method, method);
>  	copy_le16(dirent.mtime, zip_time);
>  	copy_le16(dirent.mdate, zip_date);
> --
> works with 7-zip, but not with the Windows 7 built-in zip.
> 
> If I create a zip file with 7-zip which contains umlauts and other
> Unicode characters (such as 國立1-кккк.txt), the Windows 7 built-in zip
> displays them correctly, too.

Ping on this stalled discussion.

It seems like there are two separate issues here:

  1. Knowing the encoding of pathnames in the repository.

  2. Setting the right flags in zip output.

A full solution would handle both parts, but let's ignore (1) for a
moment, and assume we have utf-8 (or can massage into utf-8 from an
encoding specified by the user).

It seems like just setting the magic utf-8 flag would be the only thing
we need to do, according to the standard. But according to discussions
referenced elsewhere in this thread, that flag was invented only in
2007, so we may be dealing with older implementations (I have no idea
how common they would be; that may be the problem with Windows 7's zip
you are seeing). We could re-encode to cp437, which the standard
specifies, but apparently some implementations do not respect that
(and use a local code page instead). And cp437 cannot represent all
utf-8 characters anyway.
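
To make the "magic flag" concrete: it is bit 11 (0x0800, i.e. the 2048
in the patch above) of the general purpose bit flag. Here is a minimal
sketch, not git's actual code, of deciding when to set it. The helper
names has_non_ascii() and zip_name_flags() are hypothetical, and setting
the bit only for non-ASCII names is my assumption (pure-ASCII names
decode the same way with or without it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 11 of the ZIP general purpose bit flag: name and comment are UTF-8. */
#define ZIP_UTF8_FLAG 0x0800

/* Hypothetical helper: return 1 if any byte of the name is outside ASCII. */
int has_non_ascii(const char *name, size_t len)
{
	size_t i;

	assert(name != NULL || len == 0);
	for (i = 0; i < len; i++)
		if ((unsigned char)name[i] & 0x80)
			return 1;
	return 0;
}

/*
 * Hypothetical helper: return the flags to write for an entry, assuming
 * the name is already valid UTF-8. Uses OR rather than addition so that
 * an already-set bit is left alone instead of carrying into bit 12.
 */
uint16_t zip_name_flags(uint16_t flags, const char *name, size_t len)
{
	if (has_non_ascii(name, len))
		flags |= ZIP_UTF8_FLAG;
	return flags;
}
```

With something like this, the patched line would read
copy_le16(dirent.flags, zip_name_flags(flags, path, pathlen)), where
path and pathlen stand for whatever holds the entry name at that point.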

It sounds like 7-zip has figured out a more portable solution. Can you
show us a sample of 7-zip's output with utf-8 characters to compare to
what git generates? I wonder if it is using a combination of methods.
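
For the comparison, something along these lines could pull the
interesting fields out of a local file header. It is only a sketch:
the struct and function names are made up for illustration, it reads
just the local header (the central directory carries its own copy of
the flags, which this does not look at), and whether 7-zip writes any
particular extra field is exactly the open question, so the lookup
takes the header ID as a parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size part of a ZIP local file header, before name and extra data. */
#define ZIP_LOCAL_HEADER_SIZE 30

/* Hypothetical container for the fields relevant to filename encoding. */
struct zip_local_info {
	uint16_t flags;             /* general purpose bit flag */
	uint16_t name_len;          /* length of the raw filename in bytes */
	uint16_t extra_len;         /* length of the extra field area */
	const unsigned char *name;  /* raw filename bytes, not NUL-terminated */
	const unsigned char *extra; /* start of the extra field area */
};

static uint16_t get_le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Parse one local file header at buf; return 0 on success, -1 otherwise. */
int parse_zip_local_header(const unsigned char *buf, size_t len,
			   struct zip_local_info *out)
{
	assert(buf != NULL && out != NULL);
	if (len < ZIP_LOCAL_HEADER_SIZE || memcmp(buf, "PK\003\004", 4))
		return -1;
	out->flags = get_le16(buf + 6);
	out->name_len = get_le16(buf + 26);
	out->extra_len = get_le16(buf + 28);
	if (len - ZIP_LOCAL_HEADER_SIZE < (size_t)out->name_len + out->extra_len)
		return -1;
	out->name = buf + ZIP_LOCAL_HEADER_SIZE;
	out->extra = out->name + out->name_len;
	return 0;
}

/* Return 1 if the extra area holds a record with the given header ID. */
int zip_has_extra_field(const struct zip_local_info *info, uint16_t id)
{
	size_t pos = 0;

	while (pos + 4 <= info->extra_len) {
		uint16_t rec_id = get_le16(info->extra + pos);
		uint16_t rec_len = get_le16(info->extra + pos + 2);

		if (rec_id == id)
			return 1;
		pos += 4 + (size_t)rec_len;
	}
	return 0;
}
```

Running this over the first entry of both archives and comparing
(info.flags & 0x0800), the raw name bytes, and the extra field IDs
present should show whether 7-zip sets the flag, stores a second copy
of the name (for instance in the Info-ZIP Unicode Path field, ID
0x7075), or both.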

-Peff