Re: git repack: --depth=100000 causing larger not smaller pack file?

On Mon, 23 Mar 2009, Kjetil Barvik wrote:

> Nicolas Pitre <nico@xxxxxxx> writes:
> 
> > On Tue, 17 Mar 2009, Kjetil Barvik wrote:
> >
> >>   aloha!
> >> 
> >>   Yesterday I ran the following command on the updated GIT repository:
> >> 
> >>     git repack -adf --window=250000 --depth=100000
> >> 
> >>   After 280 minutes or so it finished, but the strange thing was that
> >>   the resulting pack-file was larger than before.  I had expected that
> >>   it should be smaller, or at least the same size as before.
>   [snip]
> >>   I can think of one thing which is special with the "--depth=100000"
> >>   number, and that is that it is now larger than the total number of
> >>   objects in the pack, which is around 96000 to 97000, or so.
> >
> > No, the depth should have zero negative influence on the pack size.  
> > For tight compression, the larger the better.  What this will impact 
> > though is runtime access to the pack data afterward.  The deeper a 
> > given object is, the slower its access will be.  But since the object 
> > recency order tends to put newer objects at the top of a delta chain, 
> > this should impact older objects more than recent ones.
> 
>   I have done some more tests, and have copied the whole git/ directory
>   to a new directory (such that I do not accidentally add or delete any
>   objects/commits), and have made the following table:
> 
>   All pack file sizes, F, below were computed with the following git
>   command:
> 
>       git repack -adf --window=250000 --depth=D
> 
>      D   |     F      | (F - F_prev) / (D - D_prev)
>   -------|------------|----------------------------
>     5000 |  19129934  |
>    10000 |  19128956  |    -978 /  5000 =  -0.1956
>    15000 |  19126077  |   -2879 /  5000 =  -0.5758
>    20000 |  19126077  |       0 /  5000 =   0
>    25000 |  19126077  |       0 /  5000 =   0
>    30000 |  19197575  |   71498 /  5000 =  14.2996
>    45000 |  19312240  |  114665 / 15000 =   7.6443
>    60000 |  19560083  |  247843 / 15000 =  16.5229
>    75000 |  19803043  |  242960 / 15000 =  16.1973
>    90000 |  19669923  | -133120 / 15000 =  -8.8747
>    95000 |  20463780  |  793857 /  5000 = 158.7714
> 
>   From the table it seems that you get the smallest pack file (for this
>   particular repository) when --depth value is somewhere between 15000
>   and 25000.  And, when the --depth value was 95000 the resulting pack
>   file was (- 20463780 19126077) = 1 337 703 bytes, 1.28 MiB, or 7%
>   larger than this.

This is a bit intriguing.

Of course, before going any further, you must realize that a depth of 
15000 is rather excessive.  With a delta chain that deep, reaching the 
object at the end of the chain requires that 14999 other objects be 
accessed before the 15000th one is retrieved.  That makes for horrible 
runtime performance in exchange for something like a 10% reduction in 
the best cases, which is probably not a good tradeoff.
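
To make the access cost concrete, here is a minimal sketch.  The struct 
chain_entry type and the bases_to_resolve() helper are hypothetical 
names made up for illustration and are not taken from the git sources; 
the real pack code is considerably more involved.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a packed object: either a full
 * object (base == NULL) or a delta stored against another entry. */
struct chain_entry {
    const struct chain_entry *base;
};

/* Count how many other objects must be read and applied before the
 * content of `e` itself can be reconstructed. */
size_t bases_to_resolve(const struct chain_entry *e)
{
    size_t n = 0;

    while (e->base != NULL) {
        e = e->base;
        n++;
    }
    return n;
}
```

For a chain of 15000 entries where each one is a delta against the 
previous one, the last entry has 14999 bases to walk, while the first 
has none.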

This being said, I still stand by my assertion that, in theory, greater 
delta depth should not make the pack bigger.  And your table appears to 
confirm that, even to the point of reaching a stable size as one would 
expect, until a breaking point is reached after which results tend to 
become rather random.

What I'm suspecting in that case is some computation overflow in 
try_delta().  Consider for instance this piece:

    max_size = max_size * (max_depth - src->depth) /
                                            (max_depth - ref_depth + 1);

[ This is the threshold slope I was talking about, but contrary to
  what I said before, it is affected by the depth not the window size. ]

In this case, if you have a max_depth of 95000, then any object larger 
than 90461 bytes will cause a multiplication overflow, and the resulting 
max_size will wrap around to a smaller, essentially arbitrary value that 
depends on the bits that remain.  For example, suppose max_size = 45211, 
max_depth = 95000 and src->depth = 0: max_size should then still be 
equal to 45211, but instead it becomes 0 and no delta is attempted at 
all.  The number of deltas reported at the end of the repack process 
probably reflects that.
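
The arithmetic in that example can be checked with a small standalone 
sketch.  It assumes the multiplication is carried out in 32-bit unsigned 
arithmetic and that ref_depth is 1; neither is spelled out above, and 
the function names scale_max_size_32() and scale_max_size_64() are 
hypothetical, not part of git.  The 64-bit variant only shows one 
possible way to avoid the wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* The scaling step from try_delta(), evaluated in 32-bit unsigned
 * arithmetic: the product may wrap modulo 2^32 before the division. */
uint32_t scale_max_size_32(uint32_t max_size, uint32_t max_depth,
                           uint32_t src_depth, uint32_t ref_depth)
{
    uint32_t product;

    /* Keep the subtractions and the divisor well defined. */
    assert(src_depth <= max_depth && ref_depth <= max_depth);

    product = max_size * (max_depth - src_depth);   /* may wrap */
    return product / (max_depth - ref_depth + 1);
}

/* Same step with the product widened to 64 bits so it cannot wrap. */
uint32_t scale_max_size_64(uint32_t max_size, uint32_t max_depth,
                           uint32_t src_depth, uint32_t ref_depth)
{
    uint64_t product;

    assert(src_depth <= max_depth && ref_depth <= max_depth);

    product = (uint64_t)max_size * (max_depth - src_depth);
    return (uint32_t)(product / (max_depth - ref_depth + 1));
}
```

With max_size = 45211, max_depth = 95000, src_depth = 0 and 
ref_depth = 1, the true product is 4295045000, which is 77704 after 
wrapping at 2^32, so the 32-bit version returns 77704 / 95000 = 0 while 
the 64-bit version returns 45211.  With max_size = 45210 the product 
still fits in 32 bits and both versions agree.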

> > I doubt there is anything to debug.  In this case the window size is 
> > used to evaluate a threshold slope for matching objects in the delta 
> > search.  What we want is a broader delta tree more than a deep one in 
> > order to have more deltas with a lower depth limit.  Therefore a size 
> > threshold is applied, based on the object distance in the delta search 
> > window (see commit c83f032e and the other ones referenced therein).
> >
> > By providing a big window value, the threshold slope becomes rather flat 
> > and ineffective, and this changes the delta match outcome.  While delta 
> > selection is based on the uncompressed delta result, the compressed size 
> > of different deltas with the same size may vary.  I suspect you might 
> > have been unlucky in that regard and this could explain the negative 
> > effect on the pack size.
> 
>   From the table above it seems that I have been unlucky with _all_
>   --depth values above 25000 or so.

See explanation (and self correction) above.

>   Question: is there some low level GIT command I can run to compare 2
>   pack files to maybe be able to see the reason behind the above table?
>   Maybe to see some details about how many deltas, how big each are,
>   total sizes, etc.

Yes -- see the -v option of 'git verify-pack'.
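
If you want to tally the results for two packs, a rough sketch of a 
line parser follows.  It assumes the per-object line format given in the 
git-verify-pack documentation (SHA-1, type, size, size in pack file, 
offset in pack file, and for deltified objects also the depth and the 
base SHA-1) with 40-character SHA-1 names.  The struct pack_line type 
and parse_verify_pack_line() are hypothetical names for illustration.

```c
#include <stdio.h>
#include <string.h>

/* One object line of 'git verify-pack -v' output (assumed format). */
struct pack_line {
    char sha1[41];
    char type[16];
    unsigned long size;
    unsigned long size_in_pack;
    unsigned long offset;
    unsigned int depth;     /* delta depth, 0 when not deltified */
    int is_delta;           /* 1 when depth and base SHA-1 are present */
};

/* Parse one line; return 1 for an object line, 0 for anything else
 * (summary lines, the final "ok" line, and so on). */
int parse_verify_pack_line(const char *line, struct pack_line *out)
{
    char base[41];
    int n;

    n = sscanf(line, "%40s %15s %lu %lu %lu %u %40s",
               out->sha1, out->type, &out->size,
               &out->size_in_pack, &out->offset, &out->depth, base);
    if (n < 5 || strlen(out->sha1) != 40)
        return 0;

    out->is_delta = (n == 7);
    if (!out->is_delta)
        out->depth = 0;
    return 1;
}
```

Feeding each output line through this and summing size_in_pack per 
depth value, separately for the two packs, should show where the extra 
bytes come from.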


Nicolas