Re: Huge win, compressing a window of delta runs as a unit

On Fri, 18 Aug 2006, Jon Smirl wrote:

> On 8/18/06, Nicolas Pitre <nico@xxxxxxx> wrote:
> > On Fri, 18 Aug 2006, Jon Smirl wrote:
> >
> > > I attached Shawn's code. He is gone until Monday and can't defend it.
> >
> > I will have a look at it next week as I'll be gone for the weekend as
> > well.
> 
> I looked at it some and couldn't see anything obviously wrong with it,
> but it wasn't a detailed inspection.

I looked at it too and the code looks OK.

This doesn't mean there is no problem at a higher level though.  The
deltification process in the import code is extremely crude and I think
this is the cause of the original pack size.

For example, last April we discovered that a small change in the
heuristics used to pick delta base objects in git-pack-objects could
cause a pack size regression of up to 4x compared with the same pack
created before the change.

It is also possible for a denser delta stream to end up larger, once
deflated, than a less dense one.

All this to say that many tweaks and heuristics have been implemented
and studied in git-pack-objects for over a year now in order to get the
really small packs we have today.  And a really subtle and
innocent-looking change can break it size-wise.

So what I think is happening with the fastimport code is that the delta
selection is not really good.  It is certainly much better than no delta
at all but still not optimal, which smells like deja vu to me.  Then, by
deflating them all together, the redundant information that the bad delta
set still carries along is eliminated -- thanks to zlib sort of
mitigating the real issue.
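
To make that effect concrete, here is a minimal Python sketch.  It is
only an illustration, not code from fastimport or git-pack-objects: the
sample "deltas" are made-up byte strings that share most of their
content, standing in for a poorly chosen delta set, and the helper names
individual_size and grouped_size are hypothetical.

```python
import zlib

def individual_size(chunks):
    """Total bytes when every chunk gets its own zlib stream."""
    return sum(len(zlib.compress(c)) for c in chunks)

def grouped_size(chunks):
    """Bytes when all chunks are deflated together as one zlib stream."""
    return len(zlib.compress(b"".join(chunks)))

# Hypothetical stand-ins for poorly chosen deltas: each one repeats the
# same long run of text and differs only in a short suffix.
shared = (b"static int handle_event(struct widget *w, struct event *e) "
          b"{ return dispatch(w, e); }\n")
deltas = [shared + b"/* revision %d */\n" % i for i in range(20)]

print("individual:", individual_size(deltas))
print("grouped:   ", grouped_size(deltas))
```

On data like this the single grouped stream comes out smaller, because
zlib can refer back to the repeated text across delta boundaries, which
it cannot do when every delta starts a fresh stream.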

But... as my recent experiments show, grouping related deltas into a
single zlib stream doesn't produce significant improvements when
implemented directly in git-pack-objects.  Certainly not enough to be
worth the inconveniences and costs it brings along.  I even think that
if you ran git-repack -a -f on the pack produced by the import process,
with each delta deflated individually just like it was originally, then
the repacked pack would _also_ shrink significantly.  Most probably by
around 4x, just like you observed with the grouping of deltas in the
same zlib stream.

Not only would git-repack make it much smaller, but it also provides a
much better layout where all objects for recent commits are stored
together at the beginning of the pack.  The fastimport code instead
scatters them all over the pack for every commit by placing all
revisions of each file next to each other, which will cause horrible
access patterns and really bad IO.

So I think that trying to make fastimport too clever is wrong.  It
should instead focus on creating an initial pack as fast as possible and
then rely on a final git-repack pass to produce the smaller pack.  I
really doubt the import code could ever do a better job than
git-pack-objects does.

If I can make a suggestion, you should forget about this idea of
multiple deltas in one zlib stream for now and focus on making the
import process work all the way to tree and commit objects instead.
Then, and only then, if git-repack -a -f doesn't produce a satisfactory
pack size, we could look at a better pack encoding.  And so far the
grouping of related deltas in one zlib stream is _not_ a better
encoding, given the rather small improvement over unmodified
git-pack-objects versus the inconveniences and cost it brings with it.

> As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB.
> The 295MB git pack number does not have commits and trees in it, it is
> revisions only.

Running git-repack -a -f from a recent GIT on the Mozilla repo converted
through cvsps and friends produces a pack smaller than 500MB.  I even
brought it down to 430MB by using a non-default delta window and depth.
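
For reference, a non-default window and depth can be passed to the
repack roughly as sketched below.  This is a sketch only: --window and
--depth are options of git repack in current Git, but the values used
here are placeholders, not the ones behind the 430MB result, and the
helper names repack_command and run_repack are made up for illustration.

```python
import subprocess

def repack_command(window, depth):
    """Build the argv for a full repack that recomputes all deltas."""
    return [
        "git", "repack",
        "-a",                   # pack everything into a single pack
        "-f",                   # do not reuse existing deltas
        f"--window={window}",   # candidate objects considered per delta
        f"--depth={depth}",     # maximum delta chain length
    ]

def run_repack(window, depth, repo_dir="."):
    """Run the repack in repo_dir (assumed to be a git repository)."""
    subprocess.run(repack_command(window, depth), cwd=repo_dir, check=True)

# Placeholder values chosen only to show the shape of the command.
print(" ".join(repack_command(window=50, depth=50)))
```

Larger values generally trade more CPU time and memory during the
repack for a chance at a smaller pack.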


Nicolas