Re: git archive --format zip utf-8 issues

Hello again,

so two weeks have passed, and I've moved at a glacial pace towards a method for measuring the compatibility of our generated ZIP files. Sorry, I just keep getting distracted.

Anyway, the idea is to have a bunch of files with names using different scripts, zip them with several packers (including git archive), unzip them and compare the result with the original files.
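
A minimal sketch of one such round trip, for illustration only. It assumes Info-ZIP's zip and unzip are on PATH and skips the run otherwise; the directory and archive names are made up here, and these are not the exact commands used for the measurements.

```shell
#!/bin/sh
# Round-trip sketch: pack a directory, unpack it elsewhere, compare.
# Assumption: Info-ZIP "zip" and "unzip" stand in for the packer and
# unpacker under test; all paths below are hypothetical.
work=$(mktemp -d) || exit 1
trap 'rm -rf "$work"' EXIT

mkdir "$work/orig" "$work/out"
echo English >"$work/orig/The quick brown fox"
echo German >"$work/orig/Falsches Üben von Xylophonmusik quält"

if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1
then
	# Pack from inside the directory so the archive holds bare file names.
	(cd "$work/orig" && zip -q -r ../test.zip .) || exit 1
	# Unpack into a fresh directory; unzip may warn about odd names.
	(cd "$work/out" && unzip -q ../test.zip)
	# Count the lines diff prints; 0 means names and contents survived.
	lines=$(diff -ru "$work/orig" "$work/out" | wc -l | tr -d ' ')
else
	lines=skipped
fi
echo "diff lines: $lines"
```

Swapping the zip and unzip calls for another packer or unpacker gives one cell of a compatibility matrix per run.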

As a test corpus I used files named after the pangrams on this UTF-8 sampler page; the exact commands are attached:

   http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox

The numbers below are how many lines the output of diff -ru contains for each pair of packer (rows) and unpacker (columns). There are 37 files, so the worst result is 74 lines of difference ("Only in [...]" for both sides), while 0 indicates a perfect score.
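
To illustrate why one lost name costs two lines, here is a small sketch that fakes an unpacker's output by hand. The mangled name and the directory layout are assumptions for the example, not results from any real tool.

```shell
#!/bin/sh
# Sketch of the diff-line metric on a hand-made "unpacked" tree.
work=$(mktemp -d) || exit 1
trap 'rm -rf "$work"' EXIT
mkdir "$work/orig" "$work/unpacked"

echo Norwegian >"$work/orig/Blåbærsyltetøy"
echo English >"$work/orig/The quick brown fox"

# Simulate an unpacker that keeps the ASCII name but mangles the other.
echo Norwegian >"$work/unpacked/Bl_b_rsyltet_y"
echo English >"$work/unpacked/The quick brown fox"

# One "Only in" line per side for the mangled name, nothing for the
# file whose name and content match.
lines=$(diff -ru "$work/orig" "$work/unpacked" | wc -l | tr -d ' ')
echo "$lines"
```

This prints 2: the original name is reported as missing on one side and the mangled name as extra on the other.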

Hmm, come to think of it, an empty directory would show up as 37, so this metric is not ideal. A better one would be to simply give one point for each correctly unpacked file.
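
Such a per-file score could look roughly like the sketch below. It again uses a hand-made stand-in for the unpacker's output, and the variable names are only illustrative.

```shell
#!/bin/sh
# Sketch of the alternative metric: one point per correctly unpacked file.
work=$(mktemp -d) || exit 1
trap 'rm -rf "$work"' EXIT
mkdir "$work/orig" "$work/unpacked"

echo English >"$work/orig/The quick brown fox"
echo Norwegian >"$work/orig/Blåbærsyltetøy"
echo Finnish >"$work/orig/Törkylempijävongahdus"

# Simulated unpacker output: two names survive, one is mangled.
echo English >"$work/unpacked/The quick brown fox"
echo Norwegian >"$work/unpacked/Bl_b_rsyltet_y"
echo Finnish >"$work/unpacked/Törkylempijävongahdus"

score=0
total=0
for f in "$work/orig"/*
do
	name=${f##*/}
	total=$((total + 1))
	# A file counts only if the same name exists with identical content.
	if [ -f "$work/unpacked/$name" ] && cmp -s "$f" "$work/unpacked/$name"
	then
		score=$((score + 1))
	fi
done
echo "$score of $total files unpacked correctly"
```

With this metric an empty output directory scores 0 instead of landing in the middle of the scale.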

                                         Windows    Info-ZIP unzip
                            7-Zip PeaZip builtin Linux msysgit Windows
7-Zip 9.20                      0      0      46    26      43      43
PeaZip 4.7.1 win64              0      0      46    26      42      42
Info-ZIP zip 3.0 Linux          0      0      72     0      43      43
Info-ZIP zip 3.0 Windows       45     45     n/a     0      43      43
git-master                     72     72      72    60      72      72
git-master-patch1               0      0      72    60      72      72
git-master-patch2               0      0      72     0      72      72
git-v1.7.11.msysgit.1          72     72      72    60      72      72
git-v1.7.11.msysgit.1-patch1    0      0      72    60      72      72
git-v1.7.11.msysgit.1-patch2    0      0      72     0      72      72

Info-ZIP's programs don't work too well on Windows. The built-in unzipper of Windows 7 even refuses to open the file created by the Windows version of zip. Speaking of which, the Windows built-in unzipper is the worst of the unpackers.

With the two patches applied, we can say "use 7-Zip or PeaZip on Windows and unzip on Linux" and filenames with all tested characters will be preserved. I was surprised to see this working fine with msysgit like that, even though no reencoding is introduced by the patches.

I wonder what 7-Zip and PeaZip do that gives them a slightly nicer score with the Windows-internal unzipper. Umlauts, Nordic characters and accents are preserved by that combination. It seems that unzip on Linux fails to unpack exactly these names, so perhaps they employ a dirty trick like using the local encoding in the ZIP file, which makes it unportable.
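
As a small illustration of why a locally encoded name is not portable, the sketch below prints the bytes of the same name in UTF-8 and in ISO-8859-1. ISO-8859-1 is chosen purely as an example of a local encoding (for these letters its bytes coincide with Windows-1252); which encoding 7-Zip or PeaZip actually write was not checked here.

```shell
#!/bin/sh
# Same file name, two encodings: the byte sequences differ, so an
# unpacker that assumes the wrong one cannot recover the name.
name='Üben quält'

# UTF-8 bytes, taken from the literal above (this file is UTF-8).
utf8_hex=$(printf '%s' "$name" | od -An -tx1 | tr -d ' \n')

# ISO-8859-1 bytes written by hand: U-umlaut is octal 334 (0xDC),
# a-umlaut is octal 344 (0xE4).
latin1_hex=$(printf '\334ben qu\344lt' | od -An -tx1 | tr -d ' \n')

echo "UTF-8:      $utf8_hex"
echo "ISO-8859-1: $latin1_hex"
```

The lone bytes dc and e4 in the second form are not valid UTF-8, which would fit an unpacker expecting UTF-8 failing on exactly the names with umlauts and accents.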

I'll reply with the two patches, which contain basically the same code as the previous patch, only split up. The second one declares that filenames with UTF-8 encoding came from Unix (instead of FAT), which makes unzip happy. This, however, implies that we store Unix permissions for these entries, which is a bit ugly.

René
#!/bin/sh
(
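	# Stop if the directory cannot be created or entered, so the test
	# files never land in the current directory by accident.
	mkdir pangrams &&
	cd pangrams || exit 1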

	echo English >"The quick brown fox jumps over the lazy dog"
	echo Irish 1 >"An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall"
	echo Irish 2 >"lena ṗóg éada ó ṡlí do leasa ṫú"
	echo Irish 3 >"D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór"
	echo Irish 4 >"Éava agus Áḋaiṁ"
	echo Dutch >"Pa's wijze lynx bezag vroom het fikse aquaduct"
	echo German 1 >"Falsches Üben von Xylophonmusik quält"
	echo German 2 >"jeden größeren Zwerg"
	echo Norwegian >"Blåbærsyltetøy"
	echo Danish >"Høj bly gom vandt fræk sexquiz på wc"
	echo Swedish >"Flygande bäckasiner söka strax hwila på mjuka tuvor"
	echo Icelandic >"Sævör grét áðan því úlpan var ónýt"
	echo Finnish >"Törkylempijävongahdus"
	echo Polish >"Pchnąć w tę łódź jeża lub osiem skrzyń fig"
	echo Czech >"Příliš žluťoučký kůň úpěl ďábelské kódy"
	echo Slovak 1 >"Starý kôň na hŕbe kníh žuje tíško povädnuté ruže"
	echo Slovak 2 >"na stĺpe sa ďateľ učí kvákať novú ódu o živote"
	echo monotonic Greek >"ξεσκεπάζω την ψυχοφθόρα βδελυγμία"
	echo polytonic Greek >"ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία"
	echo Russian >"Съешь же ещё этих мягких французских булок да выпей чаю"
	echo Bulgarian 1 >"Жълтата дюля беше щастлива"
	echo Bulgarian 2 >"че пухът, който цъфна, замръзна като гьон"
	echo Northern Sami >"Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža"
	echo Hungarian >"Árvíztűrő tükörfúrógép"
	echo Spanish 1 >"El pingüino Wenceslao hizo kilómetros bajo exhaustiva"
	echo Spanish 2 >"lluvia y frío añoraba a su querido cachorro"
	echo Portuguese 1 >"O próximo vôo à noite sobre o Atlântico"
	echo Portuguese 2 >"põe freqüentemente o único médico"
	echo French 1 >"Les naïfs ægithales hâtifs pondant à Noël où il gèle"
	echo French 2 >"sont sûrs d'être déçus en voyant leurs drôles"
	echo French 3 >"d'œufs abîmés"
	echo Esperanto >"Eĥoŝanĝo ĉiuĵaŭde"
	echo Hebrew >"זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן"
	echo Hiragana 1 >"いろはにほへど　ちりぬるを"
	echo Hiragana 2 >"わがよたれぞ　つねならむ"
	echo Hiragana 3 >"うゐのおくやま　けふこえて"
	echo Hiragana 4 >"あさきゆめみじ　ゑひもせず"
)
