On Thu, 20 Apr 2006, Shawn Pearce wrote: > Junio C Hamano <junkio@xxxxxxx> wrote: > > > > This _might_ improve things: > > > > diff --git a/pack-objects.c b/pack-objects.c > > index 09f4f2c..0c6abe9 100644 > > --- a/pack-objects.c > > +++ b/pack-objects.c > > @@ -1037,7 +1039,7 @@ static int try_delta(struct unpacked *cu > > sizediff = oldsize > size ? oldsize - size : size - oldsize; > > > > if (size < 50) > > - return -1; > > + return 0; > > if (old_entry->depth >= max_depth) > > return 0; > > > > @@ -1052,7 +1054,7 @@ static int try_delta(struct unpacked *cu > > if (cur_entry->delta) > > max_size = cur_entry->delta_size-1; > > if (sizediff >= max_size) > > - return -1; > > + return 0; > > delta_buf = diff_delta(old->data, oldsize, > > cur->data, size, &delta_size, max_size); > > if (!delta_buf) > > Holy cow, it did: > > Total 46391, written 46391 (delta 8391), reused 37774 (delta 0) > 46M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > That's the smallest packing I've seen yet. And it doesn't have a > negative affect on repacking GIT either. I think I know what's going on, and why your bisection claimed it was the re-hashing change that was the problem, even though it really wasn't. That if (sizediff >= max_size) return -1; check is actually fairly _old_. It's from June 2005, ie it's from pretty much the first two days of that packing thing existing in the first place. (The initial repacking was done June 25th, with a lot of tweaking over the next few days. That sizediff thing was part of the fairly early tweaking). The thing is, that check made sense back then. Why? Because we sorted things in decreasing size order back then (I think this was before we even did any name-based heuristic sorting at all), so that when we tried the delta algorithm, and the size diff was bigger than the last delta size, we pretty much _knew_ the new delta would be bigger still, and there was no point in continuing with try_delta. HOWEVER. We have since changed the sorting to sort according to name before it sorts according to size, so that old heuristic that depended on the size being monotonically increasing simply doesn't make any sense any more. So I think at that second "return -1" really _should_ be changed to a "return 0", and not just because it helps your particular case. It's literally a bug these days, because the assumptions that caused it to return -1 simply aren't true any more. (It wasn't _strictly_ true even originally: even if the sizediff is huge, the _delta_ may not be huge, since we can delete data with a small delta. So it's quite likely that we should compare the "old is bigger than new" and "new is bigger than old" separately and have different heuristics for them. Again, that was simply not much of an issue back when we sorted just by size). So even the "return 0" might not be completely right. We might actually want to look at how big the delta is, and return only once that fails. Linus - : send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html