On Fri, 24 Feb 2006, Junio C Hamano wrote:

> I haven't looked at Nico's original or updated code closely at
> all, but two things come to mind.
>
> (1) if we could tell the particular data is intrinsically
>     diff_delta unfriendly and diff_delta would waste too much
>     time when tried to delta against almost _any_ other blob,
>     then it might help to give an interface in diff-delta.c for
>     the caller to check for such a blob without even trying
>     diff_delta.
>
> (2) otherwise, if diff_delta could detect it would spend too
>     many cycles to finish its work for a particular input early
>     on, we might want it to bail out instead of trying a
>     complete job.

I have a patch that implements a hybrid approach.

Currently, diff-delta takes blocks of data from the reference file and
hashes them.  When the target file is scanned, it uses that hash table to
match blocks from the target file with blocks from the reference file.
If blocks hash evenly, the cost of producing a delta is at most O(n+m),
where n and m are the sizes of the reference and target files
respectively.  In other words, with a good data set the cost is linear.

But if many blocks from the reference buffer hash to the same bucket,
then each block from the target file has to be tested against all the
blocks in that bucket, and the cost tends towards O(n*m), which is
quadratic.

The solution I'm investigating is to put a limit on the number of entries
in the same hash bucket so as to bring the cost back to something closer
to linear.  That means the delta might miss better matches whose blocks
were not hashed, but it still benefits from the limited set.  Experience
seems to show that the time to deltify the first two blobs you found to
be problematic can be reduced by two orders of magnitude with only about
a 10% increase in the resulting delta size, which is still nearly 40%
smaller than what the current delta code produces.

The question is how to determine the best limit on the number of entries
in the same hash bucket.


Nicolas
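
To make the bucket-limit idea concrete, here is a rough, self-contained
sketch of a block index with a capped bucket depth.  It is only an
illustration of the technique, not the actual diff-delta.c code: the
16-byte blocks, the table size, the MAX_BUCKET_ENTRIES constant, the toy
hash, and the index_reference()/find_match() helpers are all invented for
this example, and the real code is considerably more involved.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE          16
#define HASH_BUCKETS        (1u << 14)
#define MAX_BUCKET_ENTRIES  64   /* the tunable limit discussed above */

struct index_entry {
	const unsigned char *ptr;   /* start of a block in the reference */
	struct index_entry *next;
};

struct block_index {
	struct index_entry *bucket[HASH_BUCKETS];
	unsigned int depth[HASH_BUCKETS];   /* current chain length */
};

/* Toy hash over one block; a stand-in for whatever diff-delta really uses. */
static unsigned int hash_block(const unsigned char *p)
{
	unsigned int h = 5381;
	int i;
	for (i = 0; i < BLOCK_SIZE; i++)
		h = h * 33 + p[i];
	return h % HASH_BUCKETS;
}

/* Index the reference buffer, dropping blocks once a bucket is full. */
static struct block_index *index_reference(const unsigned char *buf, size_t len)
{
	struct block_index *idx = calloc(1, sizeof(*idx));
	size_t off;

	if (!idx)
		return NULL;
	for (off = 0; off + BLOCK_SIZE <= len; off += BLOCK_SIZE) {
		unsigned int h = hash_block(buf + off);
		struct index_entry *e;

		if (idx->depth[h] >= MAX_BUCKET_ENTRIES)
			continue;   /* cap reached: accept a possibly worse delta */
		e = malloc(sizeof(*e));
		if (!e)
			continue;
		e->ptr = buf + off;
		e->next = idx->bucket[h];
		idx->bucket[h] = e;
		idx->depth[h]++;
	}
	return idx;
}

/*
 * Look up one target block: at most MAX_BUCKET_ENTRIES candidates are
 * compared, so scanning the whole target stays linear in its size even
 * when the reference data hashes badly.
 */
static const unsigned char *find_match(struct block_index *idx,
				       const unsigned char *target)
{
	struct index_entry *e = idx->bucket[hash_block(target)];

	for (; e; e = e->next)
		if (!memcmp(e->ptr, target, BLOCK_SIZE))
			return e->ptr;
	return NULL;
}

int main(void)
{
	const unsigned char ref[] =
		"the quick brown fox jumps over the lazy dog, twice over";
	struct block_index *idx = index_reference(ref, sizeof(ref) - 1);
	const unsigned char *hit = idx ? find_match(idx, ref + BLOCK_SIZE) : NULL;

	printf("match %s\n", hit ? "found" : "not found");
	return 0;   /* allocations are deliberately leaked in this toy */
}

Raising MAX_BUCKET_ENTRIES trades time for delta quality: each target
lookup touches at most that many candidates, so the cost stays bounded no
matter how unevenly the reference data hashes, which is exactly the
trade-off the limit above is meant to tune.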