Re: pack operation is thrashing my server

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Aug 13, 2008 at 4:04 PM, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote:
> Jakub Narebski <jnareb@xxxxxxxxx> wrote:
>> Nicolas Pitre <nico@xxxxxxx> writes:
>> > On Tue, 12 Aug 2008, Geert Bosch wrote:
>> >
>> > > One nice optimization we could do for those pesky binary large objects
>> > > (like PDF, JPG and GZIP-ed data), is to detect such files and revert
>> > > to compression level 0. This should be especially beneficial
>> > > since already compressed data takes most time to compress again.
>> >
>> > That would be a good thing indeed.
>>
>> Perhaps take a sample of some given size and calculate entropy in it?
>> Or just simply add gitattribute for per file compression ratio...
>
> Estimating the entropy would make it "just magic".  Most of Git is
> "just magic" so that's a good direction to take.  I'm not familiar
> enough with the PDF/JPG/GZIP/ZIP stream formats to know what the
> first 4-8k looks like to know if it would give a good indication
> of being already compressed.
>
> Though I'd imagine looking at the first 4k should be sufficient
> for any compressed file.  Having a header composed of 4k of _text_
> before binary compressed data would be nuts.  Or a git-bundle with
> a large refs listing.  ;-)

FWIW, PDF format is a mix of sections of uncompressed higher level
ASCII notation and sections of compressed actual glyph/location data
for individual pages, and I don't think the rules are very strict
about what goes where. Looking at some academic papers some contain
compressed data within the first hundred characters whilst I've got a
couple with the first compressed byte 1968 and 12304; I'm sure if I
had a longer pdf to look at I'd find one where compression data first
occurred even later. I leave discussions of whether this is nuts to
others ;-) .

JPG is pretty much guaranteed to contain compressed data after a
couple of metadata lines.

-- 
cheers, dave tweed__________________________
david.tweed@xxxxxxxxx
Rm 124, School of Systems Engineering, University of Reading.
"while having code so boring anyone can maintain it, use Python." --
attempted insult seen on slashdot
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux