Junio C Hamano <gitster@xxxxxxxxx> writes:

> Martin Koegler <mkoegler@xxxxxxxxxxxxxxxxx> writes:
>
>> Keeping the last used delta base object, if it would be dropped,
>> supports creating smaller packs with shorter delta chains.
>
> Instead of treating the "the last used one happened to be on the
> horizon -- try not to drop it" special case, I wonder if it makes
> sense to float the last used delta base object early in the window
> _after_ it is used.  Wouldn't we keep more than one very recently
> used delta base objects in the window that way?

Well, given the little amount of spare time I have for personal
projects, I should not go flaunting them too much, but anyway...

I am drafting an implementation of a delta-diff size estimator: for
every hash value it stores not hash chains or pointers to memory
images, but just a bit map of at most 32 delta window files (possibly
64 where unsigned long delivers that many bits, possibly just 16 when
one restricts oneself to 16-element windows in order to save memory),
telling which of the files contain that hash value anywhere.

When a file is considered for deltaing, it is run once against this
estimator, which does not even look at the candidate files again.
This delivers a lower bound for the size a delta against each of the
delta candidates can take.  The candidates are then sorted by
increasing estimated delta size and considered in turn.  Once a
computed delta is smaller than the lower bound of every remaining
candidate, no further candidates need to be considered.  The
calculation of a delta can also be aborted once it exceeds the size
of a previously found delta.  Somewhat more complex and error-prone
would be to keep correcting the estimated lower bound while a delta
is being created, dropping out as soon as it becomes impossible to
beat a previous delta.

The nice thing about this is that one can start out with a simplistic
estimator, as long as it is guaranteed never to deliver an estimate
that is too high.  As long as that is the case, the packs will never
get worse than they are now: a bad estimator just means that more
elements of the window need to be considered before the conclusion is
reached.

Since the estimator data structure does not point into any
memory-mapped files or hash chains scattered through memory, it
should be comparatively light on page thrashing (in comparison to an
actual delta run), which appears to me to be one of the main
performance problems, while delivering a reasonably useful lower
bound for all deltas in the window at once.

Its size needs to be large enough that the bit maps are dominated by
zeros: hash collisions that do not correspond to actually matching
data lead to underestimates, and if those become too many, the
pruning of comparisons becomes ineffective.  A match implies that at
least the next 2*RABIN_SIZE-1 (31) bytes could possibly be expressed
as a delta, and every further match after another RABIN_SIZE bytes
implies that another RABIN_SIZE bytes might get folded into the
previous copy, so false positives seriously distort the estimate.
There is still a good chance that a good candidate will land early in
the sort order even with such a distorted estimate, and will thus cut
off the deltaing of other candidates early, but of course it is far
more effective if their deltaing does not even have to start.

-- 
David Kastrup
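
Below is a minimal sketch of the estimator described above; it is not
existing git code.  RABIN_WINDOW, ESTIMATOR_BITS, window_hash() and
the other names are assumptions made for illustration, and a real
implementation would reuse the Rabin fingerprint from diff-delta.c
rather than the stand-in hash used here.

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	#define RABIN_WINDOW   16          /* bytes hashed per fingerprint (assumed) */
	#define ESTIMATOR_BITS 18          /* 2^18 hash buckets (assumed) */
	#define ESTIMATOR_SIZE (1u << ESTIMATOR_BITS)
	#define MAX_CANDIDATES 32          /* one bit per delta window slot */

	struct estimator {
		/* bit i of bucket h: candidate i contains a block hashing to h */
		uint32_t seen[ESTIMATOR_SIZE];
	};

	static void estimator_init(struct estimator *e)
	{
		memset(e, 0, sizeof(*e));
	}

	/* stand-in for the Rabin fingerprint over RABIN_WINDOW bytes */
	static uint32_t window_hash(const unsigned char *p)
	{
		uint32_t h = 5381;
		for (int i = 0; i < RABIN_WINDOW; i++)
			h = h * 33 + p[i];
		return h & (ESTIMATOR_SIZE - 1);
	}

	/* record every RABIN_WINDOW-sized block of candidate 'slot' (slot < 32) */
	static void estimator_add(struct estimator *e,
				  const unsigned char *buf, size_t len,
				  unsigned slot)
	{
		for (size_t i = 0; i + RABIN_WINDOW <= len; i += RABIN_WINDOW)
			e->seen[window_hash(buf + i)] |= 1u << slot;
	}

	/*
	 * One pass over the target: for each candidate, optimistically assume
	 * every hash hit lets a copy cover the next 2*RABIN_WINDOW-1 bytes.
	 * The estimate can only err on the low side, never the high side,
	 * so it remains a valid lower bound for a delta against that candidate.
	 */
	static void estimator_lower_bounds(const struct estimator *e,
					   const unsigned char *buf, size_t len,
					   unsigned ncand, size_t bound[])
	{
		size_t covered[MAX_CANDIDATES] = { 0 };
		size_t covered_to[MAX_CANDIDATES] = { 0 };

		for (size_t i = 0; i + RABIN_WINDOW <= len; i++) {
			uint32_t hits = e->seen[window_hash(buf + i)];
			while (hits) {
				unsigned c = (unsigned)__builtin_ctz(hits);
				size_t end = i + 2 * RABIN_WINDOW - 1;
				size_t from = i > covered_to[c] ? i : covered_to[c];

				hits &= hits - 1;
				if (end > from) {
					covered[c] += end - from;
					covered_to[c] = end;
				}
			}
		}
		for (unsigned c = 0; c < ncand; c++)
			bound[c] = covered[c] < len ? len - covered[c] : 0;
	}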
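
And a sketch of how such bounds might drive the window scan, this
time written against git's existing create_delta() interface from
delta.h, whose max_delta_size argument already gives the "abort once
it exceeds the size of a previous delta" behaviour.  struct candidate
and best_delta() are hypothetical names, not code that exists in
builtin-pack-objects.c.

	#include <stdlib.h>
	#include "delta.h"

	struct candidate {
		struct delta_index *index;   /* delta index of the base object */
		size_t lower_bound;          /* from estimator_lower_bounds() */
	};

	static int by_bound(const void *a, const void *b)
	{
		const struct candidate *x = a, *y = b;
		return (x->lower_bound > y->lower_bound) -
		       (x->lower_bound < y->lower_bound);
	}

	static void *best_delta(struct candidate *cand, unsigned ncand,
				const void *buf, unsigned long size,
				unsigned long *best_size)
	{
		void *best = NULL;

		*best_size = size;	/* never accept a delta larger than the data */
		qsort(cand, ncand, sizeof(*cand), by_bound);

		for (unsigned i = 0; i < ncand; i++) {
			unsigned long delta_size;
			void *delta;

			/* no remaining candidate can beat what we already have */
			if (*best_size <= cand[i].lower_bound)
				break;
			delta = create_delta(cand[i].index, buf, size,
					     &delta_size, *best_size);
			if (!delta)
				continue;	/* grew past the limit, aborted */
			free(best);
			best = delta;
			*best_size = delta_size;
		}
		return best;
	}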