On Mon, 27 Feb 2006, Junio C Hamano wrote: > Although I do not mean to rain on your effort and substantial > improvement you made with this patch, we need to admit that > improving pathological corner case has quite a diminishing > effect in the overall picture. It does, unfortunately. The problem is that the code in diff_delta() is highly solicited during a pack generation. Just the simple hashing change from this patch that involves two shifts instead of only one increased the CPU time to generate a pack quite appreciably. This is just to say that this code is pretty sensitive to even small details. > But this definitely is an improvement nevertheless. I should > try this on my wife's AMD64 (Cygwin). The same datasets I used > in my previous complaint still seem to take a couple of seconds > (or more) each on my Duron 750 X-<. There is no miracle. To prevent the extreme cases additional tests have to be performed in all cases, and therefore performance for all cases is affected. > A handful additional ideas. > > * Lower the hash limit a bit more (obvious). That might have > detrimental effect in the general case. It does. The current parameters are what I think is the best compromize between cost and compression. And still 99% of the cases don't need a lower hash limit since their hash lists are already way below the limit. For the cases where it makes a difference, well lowering the hash limit also increase the cost associated with the reworking of the hash list so there comes a point where you actually increase the resulting delta size while the CPU time stays constant. > * Study the hash bucket distribution for the pathological case > and see if we can cheaply detect a pattern. I suspect these > cases have relatively few but heavily collided buckets with > mostly sparse other entries. If there is such an easily > detectable pattern, maybe we can look for such a pattern at > runtime, and cull hash entries more aggressively in such a > case? Same issue as above. And "looking for such a pattern" really does increase the CPU time in _all_ cases. So that looking for patological cases has to be as cheap as possible which I think is the case right now (and it is still too costly for my taste already). And yet I don't think there is really a need for further reduction of the hash list at this point since the patological cases are really handled gracefully with this patch even with a nice and pretty packed delta. > * Try checking the target buffer to see if it would have many > hits on the heavily collided hash entries from the source > (maybe hash_count the target buffer as well). Again that'd add another significant CPU cost to all cases, even those that don't need it at all. The problem is not about those patological cases anymore since I think they are well under control now. It is the overhead those few patological cases impose on the other 180000 good behaving objects that is a problem. > * Have pack-object detect a pathological blob (the test patch I > sent you previously uses the eye-candy timer for this > purpose, but we could getrusage() if we want to be more > precise) by observing how much time is spent for a single > round, and mark the blob as such, so we do not delta against > it with other blobs in find_deltas, when we are packing many > objects. It does not really matter in the big picture if we > choose not to delta the pathological ones tightly, as long as > they are relatively few. That is one solution, but that doesn't handle the root of the problem which is the cost of detecting those cases in the first place. > * Also in pack-object, have an optional backing-store to write > out deltified representations for results that took more than > certain amount of time to produce in find_deltas(), and reuse > them in write_object(). The pack reusing code is pretty effective in doing so already, isn't it? Since using git-repack -f should not be the common case then those patological cases (now taking one second instead of 60 or more) should be reused most of the time. > I tried an experimental patch to cull collided hash buckets > very aggressively. I haven't applied your last "reuse index" > patch, though -- I think that is orthogonal and I'd like to > leave that to the next round. It is indeed orthogonal and I think you could apply it to the next branch without the other patches (it should apply with little problems). This is an obvious and undisputable gain, even more if pack-objects is reworked to reduce memory usage by keeping only one live index for multiple consecutive deltaattempts. > With the same dataset: resulting pack is 9651096 vs 9664546 > (your patch) final pack size, with wallclock 2m45s (user 2m31). > Still not good enough, and at the same time I wonder why it gets > _smaller_ results than yours. It really becomes hard to find the best balance especially when the resulting delta is then fed through zlib. Sometimes a larger delta will compress better, sometimes not. My test bench was the whole git repository and the kernel repository, and with such large number of objects it seems that smaller deltas always translate into smaller packs. But it might not necessarily always be the case. > I'd appreciate it if you can test it on the 20MB blobs and see > what happens if you have time. Before your patch: user 0m9.713s, delta size = 4910988 bytes, or 1744110 compressed. With your patch: user 0m3.948s, delta size = 6753517 bytes, or 1978803 once compressed. BTW there is another potential for improvement in the delta code (but I have real work to do now so)... Let's suppose the reference buffer has: ***********************************************************************/ This is common with some comment block styles. Now that line will end up with multiple blocks that will hash to the same thing and linked successively in the same hash bucket except for the last ones where the '/' is involved in the hashing. Now if the target buffer also contains a line similar to the above but one '*' character longer. The first "***" will be hashed to x, then x will be looked up in the reference index, and the first entry corresponding to the first character of the line above will be returned. At that point a search forward is started to find out how much matches. In this case it will match up to the '/' in the reference buffer while the target will have another '*'. The length will be recorded as the best match, so far so good. Now the next entry from the same hash bucket will be tested in case it might provide a starting point for a longer match. But we know in this case that the location of the second '*' in the reference buffer will be returned. And we obviously know that this cannot match more data than the previous attempt, in fact it'll match one byte less. And the hash entry to follow will return the location of the third '*' for another byte less to match. Then the fourth '*' for yet another shorter match. And so on repeating those useless string comparisons multiple times. One improvement might consist of counting the number of consecutive identical bytes when starting a compare, and manage to skip as many hash entries (minus the block size) before looping again with more entries in the same hash bucket. The other case to distinguish is when the target buffer has instead a _shorter_ line, say 5 fewer '*''s before the final '/'. Here we would loop 4 times matching only a bunch of '*' before we finally match the '/' as well on the fifth attempt becoming the best match. In this case we could test if the repeated byte we noticed from the start is present further away in the reference buffer, and if so simply skip hash entries while searching forward in the reference buffer. When the reference buffer doesn't match the repeated character then the comparison is resumed with the target buffer, and in this case the '/' would match right away, avoiding those four extra loops recomparing all those '*' needlessly. Such improvements added to the search algorithm might make it significantly faster, even in the presence of multiple hash entries all located in the same bucket. Or maybe not if the data set has few repeated characters. And this is probably worthwhile only on top of my small block patch. Only experimentation could tell. Someone willing to try? Nicolas - : send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html