Re: [PATCH] diff-delta: produce optimal pack data

Junio C Hamano <junkio@xxxxxxx> · Fri, 24 Feb 2006 00:49:13 -0800

Nicolas Pitre <nico@xxxxxxx> writes:

> Indexing based on adler32 has a match precision based on the block size 
> (currently 16).  Lowering the block size would produce smaller deltas 
> but the indexing memory and computing cost increases significantly.

Indeed.

I have had this patch in my personal tree for a while, and I
was wondering why the progress indication during the
"Deltifying" stage sometimes stops for literally several
seconds, or more.

In the Linux 2.6 repository, these object pairs take forever to
deltify:

        blob 9af06ba723df75fed49f7ccae5b6c9c34bc5115f -> 
        blob dfc9cd58dc065d17030d875d3fea6e7862ede143
        size (491102 -> 496045)
        58 seconds

        blob 4917ec509720a42846d513addc11cbd25e0e3c4f -> 
        blob dfc9cd58dc065d17030d875d3fea6e7862ede143
        size (495831 -> 496045)
        64 seconds

Admittedly, these are *BAD* input samples (a binary firmware
blob with many similar looking ", 0x" sequences).  I can see
that trying to reuse source materials really hard would take
significant computation.

However, this is simply unacceptable.

The new algorithm takes 58 seconds to produce 136000 bytes of
delta, while the old one takes 0.25 seconds to produce 248899
bytes (I am using the test-delta program in the git.git
distribution).  The compression ratio is significantly better,
but this is unusable even for offline archival use (remember,
pack delta selection needs to do window=10 such deltification
trials to come up with the best delta, so you are spending 10
minutes to save 100k from one oddball blob), let alone
on-the-fly pack generation for network transfer.
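
To spell out the back-of-the-envelope numbers, here is a small
sketch.  The constants are the figures quoted above; the helper
names are made up for illustration, and it assumes every one of
the 10 trials in the window costs about as much as the 58-second
one, which is only a rough guess.

```c
#include <assert.h>

/* Figures quoted in this message. */
#define NEW_SECONDS 58.0     /* new algorithm, one deltification trial */
#define NEW_BYTES   136000L  /* delta size from the new algorithm */
#define OLD_BYTES   248899L  /* delta size from the old algorithm */
#define WINDOW      10       /* trials per object; equal cost is an assumption */

/* Minutes spent if each trial in the window costs NEW_SECONDS. */
static double window_minutes(void)
{
	return WINDOW * NEW_SECONDS / 60.0;
}

/* Bytes saved by the new algorithm on this one blob. */
static long bytes_saved(void)
{
	assert(OLD_BYTES > NEW_BYTES);
	return OLD_BYTES - NEW_BYTES;
}
```

That comes to 580 seconds (a bit under 10 minutes) for a saving
of 112899 bytes.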

Maybe we would want two implementations next to each other, and
internally check whether the expensive one is taking too many
cycles compared to the input size, then switch to the cheaper
version?
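
Something along these lines is what I have in mind.  This is
only a rough sketch, not code from git.git: every name here
(delta_budget, budget_charge, choose_method, STEPS_PER_BYTE) is
hypothetical, and the per-position probe count is passed in as a
stand-in for whatever the real matcher would report.  The idea
is to give the thorough matcher a work budget proportional to
the input size and fall back once it is used up.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; none of them exist in git.git. */

enum delta_method { DELTA_THOROUGH, DELTA_CHEAP };

/* Assumed tuning knob: match-probe steps allowed per input byte. */
#define STEPS_PER_BYTE 64

struct delta_budget {
	unsigned long steps_left;
};

/* The budget scales with the combined size of source and target. */
static void budget_init(struct delta_budget *b, size_t src_size, size_t dst_size)
{
	assert(b != NULL);
	b->steps_left = (unsigned long)(src_size + dst_size) * STEPS_PER_BYTE;
}

/* Charge `steps` units of work; return 0 once the budget is exhausted. */
static int budget_charge(struct delta_budget *b, unsigned long steps)
{
	if (steps > b->steps_left) {
		b->steps_left = 0;
		return 0;
	}
	b->steps_left -= steps;
	return 1;
}

/*
 * Simulate the thorough matcher walking the target buffer and
 * spending `probes_per_position` steps at each position.  If it
 * runs out of budget before reaching the end, report that the
 * caller should switch to the cheap implementation.
 */
static enum delta_method choose_method(size_t src_size, size_t dst_size,
				       unsigned long probes_per_position)
{
	struct delta_budget b;
	size_t pos;

	budget_init(&b, src_size, dst_size);
	for (pos = 0; pos < dst_size; pos++)
		if (!budget_charge(&b, probes_per_position))
			return DELTA_CHEAP;
	return DELTA_THOROUGH;
}
```

With well-behaved input the thorough matcher finishes within
budget; with something like the firmware blob above, where every
position has a huge number of candidate matches, the budget runs
out early and we would redo the pair with the cheap version.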
