Hi David,

On Fri, Feb 26, 2021 at 03:12:03PM +0100, David Sterba wrote:
> On Fri, Feb 26, 2021 at 07:28:54PM +0800, Gao Xiang wrote:
> > On Fri, Feb 26, 2021 at 10:36:53AM +0100, David Sterba wrote:
> > > On Thu, Feb 25, 2021 at 10:50:56AM -0800, Eric Biggers wrote:
> > >
> > > ZLIB and ZSTD can have a separate dictionary and don't need the input
> > > chunks to be contiguous. This brings some additional overhead like
> > > copying parts of the input to the dictionary and additional memory for
> > > temporary structures, but with higher compression ratios.
> > >
> > > IIRC the biggest problem for LZ4 was the cost of setting up each 4K
> > > chunk, the work memory had to be zeroed. The size of the work memory is
> > > tunable but trading off compression ratio. Either way it was either too
> > > slow or too bad.
> >
> > May I ask why LZ4 needs to zero the work memory (if you mean the dest
> > buffer and LZ4_decompress_safe), just out of curiosity... I didn't
> > see that restriction before. Thanks!
>
> Not the destination buffer, but the work memory, or state as it can
> also be called. This is from my initial interest in lz4 in 2012, and I
> got that from Yann himself. There was a tradeoff to either expect
> zeroed work memory or add more conditionals.
>
> At the time he got some benchmark results and the conditionals came out
> worse. And I see the memset is still there (see below), so there's been
> no change.
>
> For example in f2fs sources there is:
>
>   lz4_compress_pages
>     LZ4_compress_default (cc->private is the work memory)
>       LZ4_compress_fast
>         LZ4_compress_fast_extState
>           LZ4_resetStream
>             memset
>
> Where the state size LZ4_MEM_COMPRESS is hidden in the maze of defines:
>
>   #define LZ4_MEM_COMPRESS	LZ4_STREAMSIZE
>   #define LZ4_STREAMSIZE	(LZ4_STREAMSIZE_U64 * sizeof(unsigned long long))
>   #define LZ4_STREAMSIZE_U64	((1 << (LZ4_MEMORY_USAGE - 3)) + 4)
>   /*
>    * LZ4_MEMORY_USAGE :
>    * Memory usage formula : N->2^N Bytes
>    * (examples : 10 -> 1KB; 12 -> 4KB ; 16 -> 64KB; 20 -> 1MB; etc.)
>    * Increasing memory usage improves compression ratio
>    * Reduced memory usage can improve speed, due to cache effect
>    * Default value is 14, for 16KB, which nicely fits into Intel x86 L1 cache
>    */
>   #define LZ4_MEMORY_USAGE 14
>
> So it's 16K by default in Linux. Now imagine doing memset(16K) just to
> compress 4K, and do that 32 times to compress the whole 128K chunk.
> That's not a negligible overhead.

Ok, thanks for your reply & info. I get the concern now. That is mainly
an internal hash table for the LZ4 match finder (most LZ algorithms use
a hash table to find a previous match) to get the length-distance pair
from the history sliding window.

I just did a quick glance. If I understand correctly, I think LZO has a
similar table (though I'm not quite sure since I haven't looked into it
much), see:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/lzo/lzodefs.h#n68

	#define D_BITS		13
	#define D_SIZE		(1u << D_BITS)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/lzo/lzo1x_compress.c#n108

		t = ((dv * 0x1824429d) >> (32 - D_BITS)) & D_MASK;
		m_pos = in + dict[t];
		dict[t] = (lzo_dict_t) (ip - in);
		if (unlikely(dv != get_unaligned_le32(m_pos)))
			goto literal;

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/lzo/lzo1x_compress.c#n334

	memset(wrkmem, 0, D_SIZE * sizeof(lzo_dict_t));

LZO uses 13 bits but LZ4 uses 14, so in principle fewer collisions.
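To put rough numbers on the overhead you described (just a quick
back-of-the-envelope in plain userspace C that replays the defines
quoted above; illustration only, not actual kernel code):

	#include <stdio.h>

	/* the LZ4 defaults quoted above */
	#define LZ4_MEMORY_USAGE	14
	#define LZ4_STREAMSIZE_U64	((1 << (LZ4_MEMORY_USAGE - 3)) + 4)
	#define LZ4_STREAMSIZE		(LZ4_STREAMSIZE_U64 * sizeof(unsigned long long))

	int main(void)
	{
		/* (1 << 11) + 4 = 2052 u64 slots -> 16416 bytes, ~16K */
		printf("LZ4 state size: %zu bytes\n", LZ4_STREAMSIZE);

		/*
		 * A 128K extent compressed as 32 independent 4K chunks
		 * means 32 state resets, i.e. >500K zeroed per extent.
		 */
		printf("zeroed per 128K extent: %zu bytes\n",
		       32 * LZ4_STREAMSIZE);
		return 0;
	}

So per 4K of input, the state reset alone touches roughly four times as
much memory as the data actually being compressed.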
But it could be made tunable, especially considering btrfs' fixed 4KB
input compression; it seems to be double the memset though (16KB vs
8KB)...

Thanks,
Gao Xiang