mkoegler@xxxxxxxxxxxxxxxxx (Martin Koegler) writes: > On Mon, Aug 27, 2007 at 12:07:58PM +0200, David Kastrup wrote: >> Well, given the little amount of spare time I have for personal >> projects, I should not go flaunting them too much, but anyway... >> >> I am just drafting implementing a delta-diff size estimator: for every >> hash value it does not store hash chains or pointers to memory images, >> but just a bit map of at most 32 (possibly 64 where ulong delivers it, >> possibly even just 16 bits when one restricts oneself to 16 element >> windows in order to save memory) delta window files, telling which of >> the files show such a hash value anywhere. > > How do you want to select these few hash values? Hm? Are you familiar with diff-delta.c? I was not going to change the hash value computation in there. It is a 32bit CRC over 16byte passages: the delta source is checksummed with a stride of 16 bytes and the resulting CRC values are masked to a convenient number of bits and used as index into a hash table with disambiguation to actual code passages using linked lists. Then the destination delta is checksummed in the same manner but with a stride of 1, and the sums are looked up in the hash until a matching prefix is found. So I was going to do the same calculation, but look up a bit mask rather than a linked list (namely calculating under the assumption that a hash hit implies an actual match). > I eg. dump database tables as SQL statements in per table files and > commit all changes. All lines in each file share a common prefix > (INSERT INTO tablename (COL1, COL2, ...) VALUES). Statistics showed, > that up to 50% of the hash elements of the delta index were dropped, > because they had equal hash values. Yes, I am aware of that. One somewhat crazy consequence of that is that git's deltaing works better on Linux style indentation than on GNU style indentation. > With you apropach: > > If rare used values are selected as the 32 hash values, you will > probably regard files, which contain different tables (and therefore > a different prefix), as "similar", although the files have not very > much similarity otherwise. It's statistics. Random matches will tend to level out. > If you select the hash values of the prefixes, you will create for > each table one cluster (if the table count is < 32/64). This already > happens by using the hash value of the path name in the sorting. > > I don't belive, that your idea will improve very much in this case. > I fear, that it will be even worser than the current algorithm. Huh? I am not replacing the current algorithm. I am doing some upfront estimates for getting a good order for doing the current algorithm, so that I might abort parts of it early when I know they can't improve things. If the estimates are completely random and/or useless, it will mean that you get a slowdown by a comparatively small constant factor. >> When a file is considered for deltaing, it is run once against this >> delta-diff estimator which does not even look at the files again. >> This delivers a lower bound for the size a delta towards each of >> the delta candidates can take. The candidates are then sorted in >> increasing size of expected delta size and are considered in turn. >> Once a delta has a lower size than the lowest bound for further >> delta candidates, no further candidates need to get considered. > > So this means, that we have to hash the target entry too (which > could be merged with creating the delta index). Huh? I don't see what you are getting at. > We currently only hash an entry, if it is really needed as source in > a delta-diff. Why would you want to hash a non-candidate? >> Also the calculation of a delta can get aborted once it exceeds the >> size of a previous delta. > > This already happens. Ah yes. Overlooked this. Presorting the candidates should be helpful right away then. > Do you want to do your estimations only against a small window of > objects or all objects? The estimates are made against the candidates in the window. When a candidate leaves the window, its bit in the bit masks gets reused for the next candidate entering the window. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html