Re: 1.3.0 creating bigger packs than 1.2.3

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On Thu, 20 Apr 2006, Shawn Pearce wrote:

> Junio C Hamano <junkio@xxxxxxx> wrote:
> > 
> > This _might_ improve things:
> > 
> > diff --git a/pack-objects.c b/pack-objects.c
> > index 09f4f2c..0c6abe9 100644
> > --- a/pack-objects.c
> > +++ b/pack-objects.c
> > @@ -1037,7 +1039,7 @@ static int try_delta(struct unpacked *cu
> >  	sizediff = oldsize > size ? oldsize - size : size - oldsize;
> >  
> >  	if (size < 50)
> > -		return -1;
> > +		return 0;
> >  	if (old_entry->depth >= max_depth)
> >  		return 0;
> >  
> > @@ -1052,7 +1054,7 @@ static int try_delta(struct unpacked *cu
> >  	if (cur_entry->delta)
> >  		max_size = cur_entry->delta_size-1;
> >  	if (sizediff >= max_size)
> > -		return -1;
> > +		return 0;
> >  	delta_buf = diff_delta(old->data, oldsize,
> >  			       cur->data, size, &delta_size, max_size);
> >  	if (!delta_buf)
> 
> Holy cow, it did:
> 
>   Total 46391, written 46391 (delta 8391), reused 37774 (delta 0)
>    46M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack
> 
> That's the smallest packing I've seen yet.  And it doesn't have a
> negative affect on repacking GIT either.

I think I know what's going on, and why your bisection claimed it was the 
re-hashing change that was the problem, even though it really wasn't.

That

	if (sizediff >= max_size)
		return -1;

check is actually fairly _old_. It's from June 2005, ie it's from pretty 
much the first two days of that packing thing existing in the first place.

(The initial repacking was done June 25th, with a lot of tweaking over the 
next few days. That sizediff thing was part of the fairly early tweaking).

The thing is, that check made sense back then. Why? Because we sorted 
things in decreasing size order back then (I think this was before we even 
did any name-based heuristic sorting at all), so that when we tried the 
delta algorithm, and the size diff was bigger than the last delta size, we 
pretty much _knew_ the new delta would be bigger still, and there was no 
point in continuing with try_delta.

HOWEVER. We have since changed the sorting to sort according to name 
before it sorts according to size, so that old heuristic that depended on 
the size being monotonically increasing simply doesn't make any sense any 
more.

So I think at that second "return -1" really _should_ be changed to a 
"return 0", and not just because it helps your particular case. It's 
literally a bug these days, because the assumptions that caused it to 
return -1 simply aren't true any more.

(It wasn't _strictly_ true even originally: even if the sizediff is huge, 
the _delta_ may not be huge, since we can delete data with a small delta. 
So it's quite likely that we should compare the "old is bigger than new" 
and "new is bigger than old" separately and have different heuristics for 
them. Again, that was simply not much of an issue back when we sorted just 
by size).

So even the "return 0" might not be completely right. We might actually 
want to look at how big the delta is, and return only once that fails.

			Linus
-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]