On Wednesday 13 August 2008, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@xxxxxxxxx> wrote:
> > Nicolas Pitre <nico@xxxxxxx> writes:
> > > On Tue, 12 Aug 2008, Geert Bosch wrote:
> > > > One nice optimization we could do for those pesky binary large
> > > > objects (like PDF, JPG and GZIP-ed data), is to detect such
> > > > files and revert to compression level 0. This should be
> > > > especially beneficial since already compressed data takes most
> > > > time to compress again.
> > >
> > > That would be a good thing indeed.
> >
> > Perhaps take a sample of some given size and calculate entropy in
> > it? Or just simply add gitattribute for per file compression
> > ratio...
>
> Estimating the entropy would make it "just magic". Most of Git is
> "just magic" so that's a good direction to take. I'm not familiar
> enough with the PDF/JPG/GZIP/ZIP stream formats to know what the
> first 4-8k looks like to know if it would give a good indication
> of being already compressed.
>
> Though I'd imagine looking at the first 4k should be sufficient
> for any compressed file. Having a header composed of 4k of _text_
> before binary compressed data would be nuts. Or a git-bundle with
> a large refs listing. ;-)

As for how to estimate entropy, isn't that just a matter of feeding it
through zlib and comparing the output size to the input size?
Especially if we're already about to feed it through zlib anyway...

In other words, feed (an initial part of) the data through zlib, and if
the compression ratio so far looks good, keep going and write out the
compressed object; otherwise abort zlib and write out the original
object with compression level 0.

> Hence, "just magic" is probably the better route.

Agreed.

Have fun!

...Johan

--
Johan Herland, <johan@xxxxxxxxxxx>
www.herland.net
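
To make the prefix test above concrete, here is a minimal sketch against
the plain zlib API (not git's own object-writing code). The 4 KB sample
size, the "saved less than ~10%" threshold, and the looks_compressed()
helper name are all illustrative assumptions, not anything git actually
implements:

/*
 * Sketch of the prefix test: deflate the first few KB of a buffer and
 * see whether zlib actually gains anything.  Sample size and threshold
 * are made-up values for illustration only.
 */
#include <string.h>
#include <zlib.h>

/* Return 1 if the buffer already looks compressed (PDF, JPG, gzip, ...). */
static int looks_compressed(const unsigned char *buf, size_t len)
{
	unsigned char out[8192];	/* plenty for a 4k sample, even if deflate expands it */
	z_stream s;
	size_t sample = len < 4096 ? len : 4096;

	memset(&s, 0, sizeof(s));
	if (deflateInit(&s, Z_BEST_SPEED) != Z_OK)
		return 0;		/* cannot tell; assume it is compressible */

	s.next_in = (Bytef *)buf;
	s.avail_in = sample;
	s.next_out = out;
	s.avail_out = sizeof(out);
	if (deflate(&s, Z_FINISH) != Z_STREAM_END) {
		deflateEnd(&s);
		return 0;
	}
	deflateEnd(&s);

	/* If deflate saved less than ~10% on the sample, call it incompressible. */
	return s.total_out * 10 > sample * 9;
}

A caller would run this on the first few kilobytes of a blob and, when
it returns 1, do the real deflate pass with Z_NO_COMPRESSION (level 0)
rather than aborting and restarting zlib mid-stream; if the probe itself
fails, it simply falls back to compressing as usual.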