Re: [PATCH] diff-delta: bound hash list length to avoid O(m*n) behavior

Nicolas Pitre <nico@xxxxxxx> writes:

> The diff-delta code can exhibit O(m*n) behavior with some pathological 
> data sets where most hash entries end up in the same hash bucket.

> Other data points from the Linux kernel repository:
>
> blob 9af06ba723df75fed49f7ccae5b6c9c34bc5115f ->
> blob dfc9cd58dc065d17030d875d3fea6e7862ede143
> size (491102 -> 496045)
> delta size (16 byte blocks): 248899 in less than 0.1 sec
> delta size (3 byte blocks): 136000 in 11.8 secs
> delta size (3 byte blocks + this patch): 171688 in 0.79 sec

These numbers are misleading.

The 36K-object pack I used in my previous tests gives a final
pack size of 9971223 (from the "next" branch) vs 9664546 (with
this patch).  The wallclock time on my machine is 1m35s vs
3m30s.  I doubt many people are willing to pay more than 100%
extra waiting time for a 3% tighter pack.

Although I do not mean to rain on your effort and the
substantial improvement you made with this patch, we need to
admit that improving a pathological corner case has a rather
diminishing effect on the overall picture.

This definitely is an improvement nevertheless.  I should
try this on my wife's AMD64 (Cygwin).  The same datasets I used
in my previous complaint still seem to take a couple of seconds
(or more) each on my Duron 750 X-<.

A handful of additional ideas.

 * Lower the hash limit a bit more (obvious).  That might have a
   detrimental effect in the general case, though.

 * Study the hash bucket distribution for the pathological case
   and see if we can cheaply detect a pattern.  I suspect these
   cases have relatively few but heavily collided buckets, with
   the other entries mostly sparse.  If there is such an easily
   detectable pattern, maybe we can look for it at runtime and
   cull hash entries more aggressively in that case (a rough,
   untested sketch follows this list).

 * Try checking the target buffer to see if it would have many
   hits on the heavily collided hash entries from the source
   (maybe hash_count the target buffer as well).

 * Have pack-objects detect a pathological blob (the test patch
   I sent you previously uses the eye-candy timer for this
   purpose, but we could use getrusage() if we want to be more
   precise) by observing how much time is spent on a single
   round, and mark the blob as such, so that we do not delta
   other blobs against it in find_deltas() when we are packing
   many objects (see the second sketch after this list).  It
   does not really matter in the big picture if we choose not to
   delta the pathological ones tightly, as long as they are
   relatively few.

 * Also in pack-objects, have an optional backing store to write
   out deltified representations for results that took more than
   a certain amount of time to produce in find_deltas(), and
   reuse them in write_object().
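
For the bucket-distribution idea, something along these lines is
what I have in mind.  Completely untested, and the names
(index_entry, HASH_CULL_FACTOR, cull_collided_buckets) are made
up for illustration, not what diff-delta.c actually uses:

#include <stddef.h>

#define HASH_CULL_FACTOR 10	/* made-up tuning knob */

struct index_entry {
	const unsigned char *ptr;
	struct index_entry *next;
};

static void cull_collided_buckets(struct index_entry **hash,
				  unsigned int hsize)
{
	unsigned int i, used = 0;
	unsigned long total = 0, limit;

	/* First pass: measure the chain length distribution. */
	for (i = 0; i < hsize; i++) {
		struct index_entry *e;
		unsigned int len = 0;
		for (e = hash[i]; e; e = e->next)
			len++;
		if (len) {
			used++;
			total += len;
		}
	}
	if (!used)
		return;
	limit = HASH_CULL_FACTOR * (total / used);

	/*
	 * Second pass: truncate chains far above the average.
	 * The dropped entries are not freed here; this is only
	 * an illustration of the culling, not memory management.
	 */
	for (i = 0; i < hsize; i++) {
		struct index_entry *e = hash[i];
		unsigned long len = 0;
		while (e && ++len < limit)
			e = e->next;
		if (e)
			e->next = NULL;	/* drop the over-deep tail */
	}
}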
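
And for the getrusage() idea, roughly like this.  Again
untested; everything other than getrusage() itself (the
candidate structure, the try_delta callback, PATHOLOGICAL_LIMIT)
is a made-up name, since the real pack-objects structures look
different:

#include <sys/time.h>
#include <sys/resource.h>

/* CPU time (user + system) used so far by this process, in seconds. */
static double cpu_seconds(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru))
		return 0.0;
	return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6 +
	       ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

#define PATHOLOGICAL_LIMIT 2.0	/* seconds per round; arbitrary */

/* Hypothetical per-object state; stands in for the real thing. */
struct candidate {
	unsigned pathological : 1;
};

/*
 * One round of delta attempts against a single source object.
 * try_delta() stands in for whatever find_deltas() really calls.
 */
static void delta_round(struct candidate *src,
			struct candidate **window, int n,
			int (*try_delta)(struct candidate *,
					 struct candidate *))
{
	double start = cpu_seconds();
	int i;

	if (src->pathological)
		return;		/* known to be expensive; skip it */

	for (i = 0; i < n; i++)
		try_delta(src, window[i]);

	if (cpu_seconds() - start > PATHOLOGICAL_LIMIT)
		src->pathological = 1;	/* never delta against it again */
}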

If anybody is interested, here are some other interesting pairs
from the Linux 2.6 repositories.

        2645d6239b74 cb968e7a0fcd
        357980942d73 5bb837052ef1
        5412dcb4b6d0 ac07e18abb1d
        5bb837052ef1 1f8744ad9cde
        63d827d7da07 5bb837052ef1
        9af06ba723df 1896a0eea7e5
        cb968e7a0fcd 97b5578ae9b1
        d8c5a8dbd086 97b5578ae9b1
        d8c5a8dbd086 cb968e7a0fcd
        dfc9cd58dc06 1896a0eea7e5
        fa0f82ec9fa0 b611e27470e1

