Nicolas Pitre wrote:
> Martin Koegler noted that create_delta() performs a new hash lookup
> after every block copy encoding, which is currently limited to 64KB.
>
> In the case of larger identical blocks, the next hash lookup would normally
> point to the next 64KB block in the reference buffer, and multiple block
> copy operations will be consecutively encoded.
>
> It is, however, possible for the reference buffer to be sparsely indexed if
> hash buckets have been trimmed down in create_delta_index() when hashing
> of the reference buffer isn't well balanced.  In that case the hash
> lookup following a block copy might fail to match anything, and the fact
> that the reference buffer still matches beyond the previous 64KB block
> will be missed.
>
> Let's rework the code so that buffer comparison isn't bounded to 64KB
> anymore.  The match size should be determined as large as possible up front,
> and only then should multiple block copies be encoded to cover it all.
> Also, fewer hash lookups will be performed in the end.
>
> According to Martin, this patch should reduce his 92MB pack down to 75MB
> with the dataset he has.
>
> Tests performed on the Linux kernel repo show a slightly smaller pack and
> a slightly faster repack.
>
> Acked-by: Martin Koegler <mkoegler@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Nicolas Pitre <nico@xxxxxxx>

---

The patch results in a 75 MB pack file for my repository and is faster:

Total 6452 (delta 4581), reused 1522 (delta 0)
10073.11user 5200.33system 4:14:36elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1371504760minor)pagefaults 0swaps

Regards,
Martin Kögler
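
For illustration, here is a minimal sketch in C of the approach described in the quoted commit message: extend the match against the reference buffer as far as the data allows first, then emit as many copy operations (each capped at 64KB) as needed to cover it, instead of re-doing a hash lookup after every 64KB copy. The helper names (emit_copy, encode_long_match) and the overall structure are hypothetical and are not the actual diff-delta.c code.

    #include <stddef.h>
    #include <stdio.h>

    /* 64KB limit of a single copy operation, as in the delta format */
    #define MAX_COPY_SIZE 0x10000

    /* hypothetical helper: stands in for emitting one copy opcode */
    static void emit_copy(size_t ref_offset, size_t len)
    {
        printf("copy: offset=%zu len=%zu\n", ref_offset, len);
    }

    /*
     * Extend a match between ref[ref_pos..] and src[src_pos..] as far
     * as possible, then encode it as consecutive copy operations.
     * Returns the total matched length.
     */
    static size_t encode_long_match(const unsigned char *ref, size_t ref_size,
                                    size_t ref_pos,
                                    const unsigned char *src, size_t src_size,
                                    size_t src_pos)
    {
        size_t len = 0;

        /* 1. determine the full match length up front */
        while (ref_pos + len < ref_size && src_pos + len < src_size &&
               ref[ref_pos + len] == src[src_pos + len])
            len++;

        /* 2. cover it with multiple copy ops of at most 64KB each */
        size_t remaining = len, offset = ref_pos;
        while (remaining) {
            size_t chunk = remaining < MAX_COPY_SIZE ? remaining : MAX_COPY_SIZE;
            emit_copy(offset, chunk);
            offset += chunk;
            remaining -= chunk;
        }
        return len;
    }

    int main(void)
    {
        const unsigned char ref[] = "abcdefghij";
        const unsigned char src[] = "xxabcdefghij";

        /* the common run "abcdefghij" starts at ref[0] and src[2] */
        size_t n = encode_long_match(ref, sizeof(ref) - 1, 0,
                                     src, sizeof(src) - 1, 2);
        printf("matched %zu bytes\n", n);
        return 0;
    }

With one long match found up front, the hash lookup happens once per match rather than once per 64KB chunk, which is where the reduced lookup count and the smaller pack come from when the index is sparse.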