On Mon, 28 Aug 2006, Shawn Pearce wrote:

> Nicolas Pitre <nico@xxxxxxx> wrote:
> > On Sun, 27 Aug 2006, Shawn Pearce wrote:
> >
> > > I'm going to try to get tree deltas written to the pack sometime this
> > > week.  That should compact this intermediate pack down to something
> > > that git-pack-objects would be able to successfully mmap into a
> > > 32 bit address space.  A complete repack with no delta reuse will
> > > hopefully generate a pack closer to 400 MB in size.  But I know
> > > Jon would like to get that pack even smaller.  :)
> >
> > One thing to consider in your code (if you didn't implement that
> > already) is to _not_ attempt any delta on any object whose size is
> > smaller than 50 bytes, and then limit the maximum delta size to
> > object_size/2 - 20 (use that for the last argument to diff_delta()
> > and store the undeltified object when diff_delta() returns NULL).
> > This way you'll avoid creating delta objects that are most likely
> > to end up being _larger_ than the undeltified object.
>
> So I added Nico's suggestions to fast-import and ran it on a small
> subset of the Mozilla repository (3424 blobs):
>
>   naive always delta:  6652 KiB
>   Nico's suggestion:   6842 KiB

Hmmm...

> So Nico's suggestion of limiting delta size to (orig_len/2)-20 and
> not using deltas on blobs < 50 bytes actually added 190 KiB to the
> output pack.  Since this sample is probably fairly representative
> of the rest of the repository's blobs, I'm thinking we may see a
> 2.8% increase in size over the current 930 MB blob pack.  That's
> another 26 MB in our intermediate pack.  I don't think this
> suggestion is really worth including in fast-import right now...

The above is based on the assumption that an undeltified blob usually
deflates to 50% of its undeflated size or more, and that pure object
data deflates better than delta data.  Then there is the 20-byte base
object reference overhead for any delta.  The 20 bytes is a hard fact.
The 50% factor is a wild guess.

What I forgot to consider in the above formula is the fact that the
delta data gets deflated as well, so the /2 divisor is probably a bit
too much (you could try orig_len*2/3 - 20, or orig_len - 20, and
adjust the initial threshold so the limit value doesn't go negative).

If you are IO bound (I recall Jon mentioning something to that effect)
then you could probably use some CPU cycles to always deflate the
object, deflate the resulting delta, and pick the smaller of the two
(don't forget the additional 20 bytes in the delta case).  Maybe the
increased CPU usage won't justify this solution, though.


Nicolas
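
For concreteness, a minimal C sketch of the size-cap heuristic
described above.  The diff_delta() declaration matches the interface
in git's delta.h; the try_delta() wrapper, its name, and the exact
threshold handling are illustrative assumptions, not code from this
thread.

  /* Interface as declared in git's delta.h: returns a malloc'd delta,
   * or NULL if no delta not exceeding max_delta_size can be made. */
  extern void *diff_delta(const void *src_buf, unsigned long src_size,
                          const void *trg_buf, unsigned long trg_size,
                          unsigned long *delta_size,
                          unsigned long max_delta_size);

  /* Hypothetical wrapper: NULL means "store the object undeltified". */
  void *try_delta(const void *base, unsigned long base_len,
                  const void *obj, unsigned long obj_len,
                  unsigned long *delta_len)
  {
          unsigned long max_size;

          /* Skip tiny objects; 50 bytes is the threshold from the
           * mail.  With the obj_len/2 - 20 cap this also keeps the
           * limit positive (50/2 - 20 == 5). */
          if (obj_len < 50)
                  return NULL;

          /* Original cap: obj_len/2 - 20.  The revised suggestion
           * is obj_len*2/3 - 20 (or obj_len - 20), with the 50-byte
           * threshold above adjusted accordingly so the cap never
           * goes negative. */
          max_size = obj_len / 2 - 20;

          return diff_delta(base, base_len, obj, obj_len,
                            delta_len, max_size);
  }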
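
And a sketch of the deflate-both-and-compare idea, assuming plain
zlib (compressBound() and compress2()); the deflated_size() and
delta_wins() helpers are made-up names for illustration.

  #include <stdlib.h>
  #include <zlib.h>

  /* Hypothetical helper: deflated size of buf, or 0 on failure. */
  static unsigned long deflated_size(const void *buf, unsigned long len)
  {
          uLongf out_len = compressBound(len);
          Bytef *out = malloc(out_len);
          unsigned long result = 0;

          if (out && compress2(out, &out_len, buf, len,
                               Z_DEFAULT_COMPRESSION) == Z_OK)
                  result = out_len;
          free(out);
          return result;
  }

  /* Use the delta only when its deflated size plus the 20-byte base
   * object reference beats the deflated undeltified object. */
  static int delta_wins(const void *obj, unsigned long obj_len,
                        const void *delta, unsigned long delta_len)
  {
          unsigned long zobj = deflated_size(obj, obj_len);
          unsigned long zdelta = deflated_size(delta, delta_len);

          return zobj && zdelta && zdelta + 20 < zobj;
  }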