Re: Unresolved issues #2 (shallow clone again)

On Sun, 7 May 2006, Jeff King wrote:
>
>   - Total savings by going shallow: 10.7%
> 
> So basically, trees and commits DON'T compress as well as historical
> blobs (potentially because git-pack-objects isn't currently optimized
> for this -- I haven't checked). As a result, we're saving only 10% by
> going shallow instead of a potential 50%.

The biggest size savers from packing are (in rough order of relevance, if 
I recall the rough statistics I did):

 - avoiding block boundaries. 
 - delta packing of blobs
 - delta packing of trees
 - regular compression

The block boundaries are huge: we have tons of small objects, and that was 
one of the primary reasons for packing. I'd suspect that this is a 3:1 
factor for a lot of things for many "common" filesystem setups. You 
probably didn't even account for the size of inodes in your "du" setup.
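
To make the block-boundary effect concrete, here is a rough Python sketch. 
The object sizes and the 4096-byte block size are made-up assumptions, not 
measurements from the kernel repository, and the sketch ignores inodes and 
pack index overhead entirely.

```python
def allocated_bytes(size, block_size=4096):
    """Bytes occupied on disk when `size` is rounded up to whole blocks."""
    blocks = -(-size // block_size)  # ceiling division
    return blocks * block_size

# Hypothetical compressed sizes (in bytes) of seven loose objects.
object_sizes = [180, 250, 400, 900, 1500, 2300, 5000]

# One file per object: every object pays for its own partial last block.
loose_total = sum(allocated_bytes(s) for s in object_sizes)

# One pack file: the objects sit back to back, so only the end is rounded up.
packed_total = allocated_bytes(sum(object_sizes))

print("loose: ", loose_total)
print("packed:", packed_total)
print("ratio:  %.2f" % (loose_total / packed_total))
```

With these invented sizes the loose objects take 32768 bytes of blocks and 
the single packed file takes 12288, which is in the same ballpark as the 
3:1 guess. Real ratios depend on the actual size distribution.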

And blobs with history generally delta very well (_much_ better than 
regular compression).
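
A small Python sketch of why history helps so much. This is NOT git's 
delta algorithm: it uses zlib's preset-dictionary feature as a stand-in 
for "encode the new version relative to the old one", and the file 
contents are generated, hypothetical data.

```python
import hashlib
import zlib

def make_line(i):
    # Hex digests make each line distinct and only moderately compressible.
    digest = hashlib.sha1(str(i).encode()).hexdigest()
    return ("entry %04d %s\n" % (i, digest)).encode()

old = b"".join(make_line(i) for i in range(200))
# New version: identical except for one changed line in the middle.
new = b"".join(make_line(i if i != 100 else 9999) for i in range(200))

# Regular compression: the new version is compressed on its own.
alone = zlib.compress(new, 9)

# Delta-like: the old version is supplied as a preset dictionary, so the
# compressor can refer back to it instead of storing the text again.
comp = zlib.compressobj(level=9, zdict=old)
against_old = comp.compress(new) + comp.flush()

# Round-trip check: the new version is recoverable given the old one.
decomp = zlib.decompressobj(zdict=old)
assert decomp.decompress(against_old) + decomp.flush() == new

print(len(new), len(alone), len(against_old))
```

The version encoded against its predecessor should come out many times 
smaller than the version compressed alone, because almost all of it is 
expressed as references to the old text.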

Trees should _delta_ very well, but they basically don't compress, 
especially after deltaing. The SHA1s are totally incompressible (in a 
tree they aren't even ASCII), and as a delta, the names won't compress much 
either because they are short.
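
The incompressibility of binary SHA1s is easy to check with a Python 
sketch. The digests below are generated stand-ins, not real object names 
from any repository; the point is only the contrast between the 20-byte 
binary form that trees use and a 40-character hex form.

```python
import hashlib
import zlib

# 1000 deterministic stand-ins for object names.
digests = [hashlib.sha1(b"object %d" % i).digest() for i in range(1000)]

# 20 binary bytes per name, as stored in a tree.
raw = b"".join(digests)
# 40 ASCII hex characters per name, for comparison.
hexed = b"".join(d.hex().encode() for d in digests)

raw_z = zlib.compress(raw, 9)
hex_z = zlib.compress(hexed, 9)

print("raw: %d -> %d" % (len(raw), len(raw_z)))
print("hex: %d -> %d" % (len(hexed), len(hex_z)))
```

The hex form shrinks a lot because each character only carries four bits 
of information, while the binary form has no such slack and stays at 
roughly its original size.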

Commits are fairly small, shouldn't delta all that much, and they don't 
even compress _that_ well either (they're normal text and often have some 
redundancy with the committer and author being the same, but they are 
short and have some fairly incompressible elements, so..)

The thing with trees in particular is that they are very common for the 
kernel (and probably not so much for many other projects). A single commit 
ends up quite commonly being just one commit object, one blob (that deltas 
really well), and three or four trees. Merges often have no new blobs at 
all, just several new trees and the commit object.
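
The "three or four trees" figure follows from directory depth: every 
directory between a changed file and the root needs a new tree object, 
because each tree records the object names of its entries. A minimal 
Python sketch, where `new_trees_for` is a hypothetical helper and the 
path is only an illustrative example:

```python
def new_trees_for(changed_paths):
    """Directories (as tree objects) that must be rewritten for the given files.

    Every directory on the way from a changed file up to the root gets a
    new tree. The root tree is represented by the empty string.
    """
    trees = set()
    for path in changed_paths:
        parts = path.split("/")[:-1]          # drop the file name
        for depth in range(len(parts) + 1):   # depth 0 is the root tree
            trees.add("/".join(parts[:depth]))
    return trees

# Hypothetical one-file change, three directories deep.
trees = new_trees_for(["drivers/net/e1000/e1000_main.c"])
print(len(trees), sorted(trees))
```

For this path the result is four trees (the root plus three 
subdirectories) for one changed blob and one commit object.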

So a huge amount of the wins from packing come from the file _history_, 
the part that a shallow clone (on purpose) leaves behind. 

The regular compression will pick up a fair amount of slack with the 
blobs, but it's a much smaller factor than the delta compression for 
something that has a long history.

It's somewhat interesting to note that over the year that we've used git, 
the kernel pack-size hasn't even increased all that much. I forget exactly 
what it was when we started packing, but it was on the order of ~75M. It 
is now 115M for me. And the old linux-history thing (full BK history over 
three years) is 177M - not much more than twice the size of just a few 
kernel versions - with some higher packing ratios..

Exactly because blobs delta so incredibly well.

		Linus