Re: [RFC] Cache negative delta pairs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 28 Jun 2006, Jeff King wrote:

> >From repack to repack, we end up trying to delta many of the same object
> pairs, which is computationally expensive.  This patch makes a
> persistent cache, $GIT_DIR/delta-cache, which contains pairs of sha1
> hashes (the presence of a pair indicates that it's not worth trying to
> delta those two objects).

I think this could be done differently to be even more space efficient 
as I suggested earlier.

> Here are some of my thoughts:
> 
>  - speed. The implementation is quite fast. The sha1 pairs are stored
>    sorted, and we mmap and binary search them. Certainly the extra time
>    spent in lookup is justified by avoiding the delta attempts.

You do that lookup for every delta match attempt.  Instead it could be 
done once for the whole window attempt, potentially reducing the cache 
size by a factor of 20, and it might be faster too.

>  - size. The cache is a packed sequence of binary sha1 pairs. I was
>    concerned that it would grow too large (obviously for n blobs you can
>    end up with n^2/2 entries), but it doesn't seem unreasonable for most
>    repos (either you don't have a lot of files, or if you do, they delta
>    reasonably well). My test repo's cache is only 144K. The git cache is
>    about 2.7M. The linux-2.6 cache is 22M.

See my previous email for comments about this.

>    Theoretically, I could bound the cache size and boot old entries.
>    However, that means storing age information, which increases the
>    required size. I think keeping it simple is best.

You could simply recreate the cache on each run.  Or just keep a bitmap 
of referenced cache entries, and a list of new entries.  At the end if 
the new entry list is empty and the bitmap is all set (or clear) then 
you just keep the current cache.  ONce, say, more than 10% of cache 
entries are not used anymore then you regenerate the cache.  But since 
you need to regenerate it when new entries are added then the 10% unused 
entry criteria might not be used that often anyway, unless packs shrink 
with time which might not be the case really often.

>  - correctness. Currently the code uses the output of try_delta for
>    negative caching. Should the cache checking be moved inside try_delta
>    instead? This would give more control over which reasons to mark a
>    delta negative (I believe right now hitting the depth limit will
>    cause a negative mark; we should perhaps only do so if the content
>    itself makes the delta unpalatable).

Like I said earlier I think this should be moved up a bit i.e. outside 
the delta loop.  In find_deltas() right before the ...

		j = window;
		while (--j > 0) {

loop, just insert another loop that compute the sha1 of the current 
target object and window:

		SHA1_Init(&c);
		j = window;
		while (j-- > 0) { /* postdec include the current object as well */
			unsigned int other_idx = idx + j;
			struct unpacked *m;
			if (other_idx >= window)
				other_idx -= window;
			m = array + other_idx;
			if (!m->entry)
				break;
			SHA1_Update(&c, m->entry.sha1, 20);
		}
		SHA1_Final(negative_delta_hash, &c);			

Then you can skip over the whole of the delta loop right away if that 
negative_delta_hash matches one entry in the cache since you already 
know that this object with this window doesn't produce any delta result.

Otherwise, after the delta loop has proceeded, if there is no delta 
found then you can add negative_delta_hash to the cache.

>  - optionalness. Currently the delta-cache is always used. Since it is a
>    space-time tradeoff, maybe it should be optional (it will have
>    negligible performance and horrible size impact on a repo that
>    consists of many very small but unrelated objects).  Possible methods
>    include:
>      - enable cache saves only if .git/delta-cache is present; turn it
>        on initially with 'touch .git/delta-cache'
>      - config variable

First, I think it should be ignored (but still created) when 
--no-reuse-delta is passed.  Then, it should not be created (but still 
looked up if it exists and --no-reuse-delta is not provided) when the 
pack index file is also not created.  I don't think it is worth making 
this further configurable, and given the suggested strategy above the 
cache should remain fairly small.


Nicolas
-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]