Re: [PATCH 5/5] archive-zip: stream large blobs into zip file

René Scharfe <rene.scharfe@xxxxxxxxxxxxxx> · Tue, 01 May 2012 00:54:01 +0200

Am 30.04.2012 21:12, schrieb Junio C Hamano:
Nguyễn Thái Ngọc Duy<pclouds@xxxxxxxxx>  writes:

A large blob will be read twice. One for calculating crc32, one for
actual writing.

Is that because you need the checksum before the payload?  That is
unfortunate.  It would be nice (read: not a rejection of this patch---it
is a good first step to do it stupid but correct way before trying to
optimize it) to avoid it when the output is seekable, especially because
we are talking about a *large* payload.

The ZIP format optionally allows writing the CRC and the sizes after the 
data.  This adds a data descriptor with a size of 16 bytes to each 
streamed entry.  Seeking back and correcting these values in an output 
file would avoid that.

diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index fe47554..458fdde 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -138,4 +138,8 @@ test_expect_success 'tar achiving' '
  	git archive --format=tar HEAD>/dev/null
  '

+test_expect_success 'zip achiving' '
+	git archive --format=zip HEAD>/dev/null
+'

Can't we do better than "we only check if it finishes without barfing; we
cannot afford to check the correctness of the output"?  The same comment
applies to all the tests you added to this file in the past 3 months.

Streaming to tar can be tested by setting core.big_file_threshold big 
enough, creating a non-streamed version of the archive and comparing it 
to the streamed one.  With the seek trick, this would work for ZIP as 
well.  For streaming with an added data descriptor we'd need to actually 
unpack the ZIP file, though.

René
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html