On Wed, Aug 13, 2008 at 4:04 PM, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote:
> Jakub Narebski <jnareb@xxxxxxxxx> wrote:
>> Nicolas Pitre <nico@xxxxxxx> writes:
>> > On Tue, 12 Aug 2008, Geert Bosch wrote:
>> >
>> > > One nice optimization we could do for those pesky binary large objects
>> > > (like PDF, JPG and GZIP-ed data), is to detect such files and revert
>> > > to compression level 0. This should be especially beneficial
>> > > since already compressed data takes most time to compress again.
>> >
>> > That would be a good thing indeed.
>>
>> Perhaps take a sample of some given size and calculate entropy in it?
>> Or just simply add gitattribute for per file compression ratio...
>
> Estimating the entropy would make it "just magic". Most of Git is
> "just magic" so that's a good direction to take. I'm not familiar
> enough with the PDF/JPG/GZIP/ZIP stream formats to know what the
> first 4-8k looks like to know if it would give a good indication
> of being already compressed.
>
> Though I'd imagine looking at the first 4k should be sufficient
> for any compressed file. Having a header composed of 4k of _text_
> before binary compressed data would be nuts. Or a git-bundle with
> a large refs listing. ;-)

FWIW, the PDF format is a mix of sections of uncompressed higher-level
ASCII notation and sections of compressed actual glyph/location data
for individual pages, and I don't think the rules are very strict
about what goes where. Looking at some academic papers, some contain
compressed data within the first hundred characters, whilst I've got a
couple where the first compressed byte is at offset 1968 and 12304
respectively; I'm sure if I had a longer PDF to look at I'd find one
where compressed data first occurred even later. I leave discussions
of whether this is nuts to others ;-) .

JPG is pretty much guaranteed to contain compressed data after a
couple of metadata lines.
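
To make the sampling idea concrete, here is a rough sketch in Python of
what "estimate the entropy of the first 4k" could look like, and why a
late-starting compressed section defeats it. This is not anything from
Git itself: the function names, the 4096-byte window and the 7.5
bits/byte threshold are all assumptions picked for illustration, and
the test data is synthetic rather than a real PDF.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def looks_compressed(data: bytes, sample_size: int = 4096,
                     threshold: float = 7.5) -> bool:
    """Guess from the first sample_size bytes only (threshold is an assumption)."""
    return shannon_entropy(data[:sample_size]) >= threshold


def compressed_fraction(data: bytes, window: int = 4096,
                        threshold: float = 7.5) -> float:
    """Fraction of consecutive windows whose entropy reaches the threshold."""
    windows = [data[i:i + window] for i in range(0, len(data), window)]
    if not windows:
        return 0.0
    hits = sum(shannon_entropy(w) >= threshold for w in windows)
    return hits / len(windows)


# Synthetic stand-ins: repetitive ASCII for an uncompressed header, and a
# uniform byte pattern standing in for compressed data.
header = b"%PDF-1.4 plain ASCII header text\n" * 400
payload = bytes(range(256)) * 32
mixed = header[:12304] + payload  # "compressed" data starts at offset 12304

print(looks_compressed(mixed))      # judged on the text-only first 4k
print(compressed_fraction(mixed))   # later windows are taken into account
```

A byte histogram is a crude measure: the stand-in payload above scores
as maximally random even though it is trivially compressible, so a real
heuristic would probably need something better than this, or simply a
per-file attribute.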
--
cheers, dave tweed__________________________
david.tweed@xxxxxxxxx
Rm 124, School of Systems Engineering, University of Reading.
"while having code so boring anyone can maintain it, use Python." --
attempted insult seen on slashdot