Re: [PATCH] diff-delta: bound hash list length to avoid O(m*n) behavior

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Nicolas Pitre <nico@xxxxxxx> writes:

>> I tried an experimental patch to cull collided hash buckets
>> very aggressively.  I haven't applied your last "reuse index"
>> patch, though -- I think that is orthogonal and I'd like to
>> leave that to the next round.
>
> It is indeed orthogonal and I think you could apply it to the next 
> branch without the other patches (it should apply with little problems).  
> This is an obvious and undisputable gain, even more if pack-objects is 
> reworked to reduce memory usage by keeping only one live index for 
> multiple consecutive deltaattempts.

Umm.  The hash-index is rather huge, isn't it?  I did not
realize it was two-pointer structure for every byte in the
source material, and we typically delta from larger to smaller,
so we will keep about 10x the unpacked source.  Until we swap
the windowing around, that means about 100x the unpacked source
with the default window size.

Also, I am not sure which one is more costly: hash-index
building or use of that to search inside target.  I somehow got
an impression that the former is relatively cheap, and that is
what is being cached here.

> Let's suppose the reference buffer has:
>  
> ***********************************************************************/
>...
> One improvement might consist of counting the number of consecutive 
> identical bytes when starting a compare, and manage to skip as many hash 
> entries (minus the block size) before looping again with more entries in 
> the same hash bucket.

Umm, again.  Consecutive identical bytes (BTW, I think "* * *"
and "** ** **" patterns have the same collision issues without
being consecutive bytes, so such an optimization may be trickier
and cost more), when emitted as literals, would compress well,
wouldn't they?  At the end of the day, I think what matters is
the size of deflated delta, since going to disk to read it out
is more expensive than deflating and applying.  I think you made
a suggestion along the same line, capping the max delta used by
try_delta() more precisely by taking the deflated size into
account.


-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]