Hello again,
so two weeks have passed, and I've moved at a glacial pace towards a
method for measuring the compatibility of our generated ZIP files. Sorry,
I just keep getting distracted.
Anyway, the idea is to have a bunch of files with names using different
scripts, zip them with several packers (including git archive), unzip
them and compare the result with the original files.
As a test corpus I used files named after the pangrams on this UTF-8
sampler page; the exact commands are attached:
http://www.columbia.edu/~fdc/utf8/index.html#quickbrownfox
The numbers below are how many lines the output of diff -ru contains for
each pair of packer and unpacker. There are 37 files, so the worst
result is 74 lines of difference ("Only in [...]" for both sides), while
0 indicates a perfect score.
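
A rough sketch of how one cell of such a table could be computed, not
the attached commands themselves. The names roundtrip_score,
pack_infozip and unpack_infozip are made up for illustration, the
zip/unzip options are just one plausible invocation, and mktemp -d is
assumed to be available:

```shell
# Illustrative helpers (assumptions, not the attached commands).
# $1 = source dir, $2 = absolute path of the archive to create
pack_infozip()   { (cd "$1" && zip -qr "$2" .); }
# $1 = archive, $2 = directory to unpack into
unpack_infozip() { unzip -q "$1" -d "$2"; }

# roundtrip_score PACK_FN UNPACK_FN SRC_DIR
# Packs SRC_DIR, unpacks the result into a scratch directory and prints
# the number of lines "diff -ru" reports; 0 means nothing was lost.
roundtrip_score() {
    work=$(mktemp -d) || return 1
    mkdir "$work/out"
    "$1" "$3" "$work/test.zip" >/dev/null 2>&1
    "$2" "$work/test.zip" "$work/out" >/dev/null 2>&1
    # tr strips the padding some wc implementations add
    diff -ru "$3" "$work/out" | wc -l | tr -d ' '
    rm -rf "$work"
}
```

Called as "roundtrip_score pack_infozip unpack_infozip pangrams", it
prints 0 for a lossless round trip. A pair that unpacks nothing at all
prints one line per original file.
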
Hmm, come to think of it, an empty directory would show up as 37, so
this metric is not ideal. A better one would be to simply give one
point for each correctly unpacked file.
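
The per-file metric could look roughly like this. count_ok is a
hypothetical helper, and it assumes a flat directory like the pangram
corpus:

```shell
# count_ok ORIG_DIR UNPACKED_DIR
# Prints how many regular files in ORIG_DIR also exist in UNPACKED_DIR
# under the same name and with identical content.
count_ok() {
    ok=0
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # cmp -s fails both for differing and for missing files
        if cmp -s "$f" "$2/${f##*/}" 2>/dev/null; then
            ok=$((ok + 1))
        fi
    done
    echo "$ok"
}
```

With this, an empty output directory scores 0 and a perfect round trip
of the corpus scores 37.
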

(rows: packer, columns: unpacker)               Windows        Info-ZIP unzip
                                7-Zip   PeaZip  builtin    Linux  msysgit  Windows
7-Zip 9.20                          0        0       46       26       43       43
PeaZip 4.7.1 win64                  0        0       46       26       42       42
Info-ZIP zip 3.0 Linux              0        0       72        0       43       43
Info-ZIP zip 3.0 Windows           45       45      n/a        0       43       43
git-master                         72       72       72       60       72       72
git-master-patch1                   0        0       72       60       72       72
git-master-patch2                   0        0       72        0       72       72
git-v1.7.11.msysgit.1              72       72       72       60       72       72
git-v1.7.11.msysgit.1-patch1        0        0       72       60       72       72
git-v1.7.11.msysgit.1-patch2        0        0       72        0       72       72

Info-ZIP's programs don't work too well on Windows. The built-in
unzipper of Windows 7 even refuses to open the file created by the
Windows version of zip. Speaking of which, the built-in unzipper is the
worst of the unpackers tested.
With the two patches applied, we can say "use 7-Zip or PeaZip on Windows
and unzip on Linux" and filenames with all tested characters will be
preserved. I was surprised to see this working fine with msysgit like
that, even though no reencoding is introduced by the patches.
I wonder what 7-Zip and PeaZip do that gives them a slightly nicer score
with the Windows-internal unzipper. Umlauts, Nordic characters and
accents are preserved by that combination. It seems that unzip on Linux
fails to unpack exactly these names, so perhaps they employ a dirty
trick like using the local encoding in the ZIP file, which makes it
non-portable.
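
One way to probe that guess would be to look at general-purpose flag
bit 11 of an entry, which the ZIP specification defines as the marker
for UTF-8 file names. The sketch below checks only the first local file
header. zip_first_entry_utf8 is a made-up name, and the function
assumes the archive starts directly with a local file header:

```shell
# zip_first_entry_utf8 ARCHIVE
# Exit status 0 if bit 11 (0x0800) is set in the first local file
# header, 1 if it is clear, 2 if ARCHIVE does not start with one.
zip_first_entry_utf8() {
    sig=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d '[:space:]')
    [ "$sig" = "504b0304" ] || return 2
    # The flags are a little-endian 16-bit field at offset 6, so
    # bit 11 is bit 3 (value 8) of the byte at offset 7.
    hi=$(od -An -tu1 -j7 -N1 "$1" 2>/dev/null | tr -d '[:space:]')
    [ -n "$hi" ] || return 2
    [ $((hi & 8)) -ne 0 ]
}
```

A clear flag alone would not prove the local-encoding theory, since a
packer can also store a UTF-8 name in an extra field, but it would
narrow things down.
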
I'll reply with the two patches, which contain basically the same code
as the previous patch, only split up. The second one declares that
filenames with UTF-8 encoding came from Unix (instead of FAT), which
makes unzip happy. This, however, implies that these entries carry Unix
permissions, which is a bit ugly.
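
For reference, in the ZIP format the "came from" information is the
upper byte of the "version made by" field of a central directory
header: 0 stands for MS-DOS/FAT and 3 for Unix. A sketch for reading it
from the first such header follows. zip_first_host is a hypothetical
helper, and its naive signature scan could be fooled by entry data that
happens to contain the same four bytes:

```shell
# zip_first_host ARCHIVE
# Prints the host system byte (two hex digits, e.g. 00 = FAT, 03 = Unix)
# of the first central directory header found in ARCHIVE.
zip_first_host() {
    od -An -v -tx1 "$1" | tr -d ' \n' | awk '{
        n = length($0)
        # Step two hex digits at a time so matches stay byte-aligned.
        for (i = 1; i + 11 <= n; i += 2)
            if (substr($0, i, 8) == "504b0102") {
                # Byte 4 is the spec version, byte 5 the host system.
                print substr($0, i + 10, 2)
                exit
            }
    }'
}
```

This is only meant for small test archives like the ones above, since
it turns the whole file into one long hex string.
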
René
#!/bin/sh
# Create the test corpus: one small file per pangram line, named after it.
(
mkdir -p pangrams && cd pangrams || exit 1
echo English >"The quick brown fox jumps over the lazy dog"
echo Irish 1 >"An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall"
echo Irish 2 >"lena ṗóg éada ó ṡlí do leasa ṫú"
echo Irish 3 >"D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór"
echo Irish 4 >"Éava agus Áḋaiṁ"
echo Dutch >"Pa's wijze lynx bezag vroom het fikse aquaduct"
echo German 1 >"Falsches Üben von Xylophonmusik quält"
echo German 2 >"jeden größeren Zwerg"
echo Norwegian >"Blåbærsyltetøy"
echo Danish >"Høj bly gom vandt fræk sexquiz på wc"
echo Swedish >"Flygande bäckasiner söka strax hwila på mjuka tuvor"
echo Icelandic >"Sævör grét áðan því úlpan var ónýt"
echo Finnish >"Törkylempijävongahdus"
echo Polish >"Pchnąć w tę łódź jeża lub osiem skrzyń fig"
echo Czech >"Příliš žluťoučký kůň úpěl ďábelské kódy"
echo Slovak 1 >"Starý kôň na hŕbe kníh žuje tíško povädnuté ruže"
echo Slovak 2 >"na stĺpe sa ďateľ učí kvákať novú ódu o živote"
echo monotonic Greek >"ξεσκεπάζω την ψυχοφθόρα βδελυγμία"
echo polytonic Greek >"ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία"
echo Russian >"Съешь же ещё этих мягких французских булок да выпей чаю"
echo Bulgarian 1 >"Жълтата дюля беше щастлива"
echo Bulgarian 2 >"че пухът, който цъфна, замръзна като гьон"
echo Northern Sami >"Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža"
echo Hungarian >"Árvíztűrő tükörfúrógép"
echo Spanish 1 >"El pingüino Wenceslao hizo kilómetros bajo exhaustiva"
echo Spanish 2 >"lluvia y frío añoraba a su querido cachorro"
echo Portuguese 1 >"O próximo vôo à noite sobre o Atlântico"
echo Portuguese 2 >"põe freqüentemente o único médico"
echo French 1 >"Les naïfs ægithales hâtifs pondant à Noël où il gèle"
echo French 2 >"sont sûrs d'être déçus en voyant leurs drôles"
echo French 3 >"d'œufs abîmés"
echo Esperanto >"Eĥoŝanĝo ĉiuĵaŭde"
echo Hebrew >"זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן"
echo Hiragana 1 >"いろはにほへど　ちりぬるを"
echo Hiragana 2 >"わがよたれぞ　つねならむ"
echo Hiragana 3 >"うゐのおくやま　けふこえて"
echo Hiragana 4 >"あさきゆめみじ　ゑひもせず"
)