On Wed, 28 Jun 2006, Jeff King wrote: > >From repack to repack, we end up trying to delta many of the same object > pairs, which is computationally expensive. This patch makes a > persistent cache, $GIT_DIR/delta-cache, which contains pairs of sha1 > hashes (the presence of a pair indicates that it's not worth trying to > delta those two objects). I think this could be done differently to be even more space efficient as I suggested earlier. > Here are some of my thoughts: > > - speed. The implementation is quite fast. The sha1 pairs are stored > sorted, and we mmap and binary search them. Certainly the extra time > spent in lookup is justified by avoiding the delta attempts. You do that lookup for every delta match attempt. Instead it could be done once for the whole window attempt, potentially reducing the cache size by a factor of 20, and it might be faster too. > - size. The cache is a packed sequence of binary sha1 pairs. I was > concerned that it would grow too large (obviously for n blobs you can > end up with n^2/2 entries), but it doesn't seem unreasonable for most > repos (either you don't have a lot of files, or if you do, they delta > reasonably well). My test repo's cache is only 144K. The git cache is > about 2.7M. The linux-2.6 cache is 22M. See my previous email for comments about this. > Theoretically, I could bound the cache size and boot old entries. > However, that means storing age information, which increases the > required size. I think keeping it simple is best. You could simply recreate the cache on each run. Or just keep a bitmap of referenced cache entries, and a list of new entries. At the end if the new entry list is empty and the bitmap is all set (or clear) then you just keep the current cache. ONce, say, more than 10% of cache entries are not used anymore then you regenerate the cache. But since you need to regenerate it when new entries are added then the 10% unused entry criteria might not be used that often anyway, unless packs shrink with time which might not be the case really often. > - correctness. Currently the code uses the output of try_delta for > negative caching. Should the cache checking be moved inside try_delta > instead? This would give more control over which reasons to mark a > delta negative (I believe right now hitting the depth limit will > cause a negative mark; we should perhaps only do so if the content > itself makes the delta unpalatable). Like I said earlier I think this should be moved up a bit i.e. outside the delta loop. In find_deltas() right before the ... j = window; while (--j > 0) { loop, just insert another loop that compute the sha1 of the current target object and window: SHA1_Init(&c); j = window; while (j-- > 0) { /* postdec include the current object as well */ unsigned int other_idx = idx + j; struct unpacked *m; if (other_idx >= window) other_idx -= window; m = array + other_idx; if (!m->entry) break; SHA1_Update(&c, m->entry.sha1, 20); } SHA1_Final(negative_delta_hash, &c); Then you can skip over the whole of the delta loop right away if that negative_delta_hash matches one entry in the cache since you already know that this object with this window doesn't produce any delta result. Otherwise, after the delta loop has proceeded, if there is no delta found then you can add negative_delta_hash to the cache. > - optionalness. Currently the delta-cache is always used. Since it is a > space-time tradeoff, maybe it should be optional (it will have > negligible performance and horrible size impact on a repo that > consists of many very small but unrelated objects). Possible methods > include: > - enable cache saves only if .git/delta-cache is present; turn it > on initially with 'touch .git/delta-cache' > - config variable First, I think it should be ignored (but still created) when --no-reuse-delta is passed. Then, it should not be created (but still looked up if it exists and --no-reuse-delta is not provided) when the pack index file is also not created. I don't think it is worth making this further configurable, and given the suggested strategy above the cache should remain fairly small. Nicolas - : send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html