Hi Igor,
Thanks for the insightful details on how to interpret the compression
data. I'm still a bit confused about why compression doesn't work
better in my case, so I decided to run a test. I created a 16 GiB
cephfs file that is just the 4 characters 'abcd' repeated roughly 4
billion times - purposely extremely compressible data. I'm writing
this file into an empty pool that is erasure coded 6 data + 3 parity.
Just to verify the data's compressibility, here is the file before
and after running gzip on it - it compresses nearly 700-fold:
-rw-r--r-- 1 root root 17179869184 Feb 19 08:08 compressible_file
-rw-r--r-- 1 root root 24998233 Feb 19 08:34 compressible_file.gz
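For reference, a file like this can be generated with a few lines of
Python (the mount path below is just a placeholder):

# Write 16 GiB of the 4-byte pattern 'abcd' in 4 MiB chunks.
CHUNK = b"abcd" * (4 * 1024 * 1024 // 4)                 # one 4 MiB buffer
with open("/mnt/cephfs/compressible_file", "wb") as f:   # placeholder path
    for _ in range(16 * 1024 // 4):                      # 4096 x 4 MiB = 16 GiB
        f.write(CHUNK)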
First with compression set to 'none' on the pool (ceph df detail):
POOL                ID  STORED  OBJECTS  USED    %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
cephfs_data_ec63_c   5  16 GiB    4.10k  25 GiB      0    5.4 PiB            N/A          N/A  4.10k         0 B          0 B
This is all good - 4096 objects, each 4 MiB (the default cephfs
chunking), 16 GiB stored, and with the 1.5x erasure-coding overhead,
24 GiB should be used (OK, it says 25 GiB, probably some rounding).
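As a quick sanity check of that expectation:

# Expected raw usage with no compression: 6+3 EC stores 9 shards
# for every 6 data shards, i.e. a 1.5x overhead.
stored_gib = 16
ec_overhead = (6 + 3) / 6
print(stored_gib * ec_overhead)   # 24.0 GiB expected; ceph reports 25 GiB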
Now with compression set to 'aggressive', same extremely compressible file:
POOL                ID  STORED  OBJECTS  USED    %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
cephfs_data_ec63_c   5  16 GiB    4.10k  14 GiB      0    5.4 PiB            N/A          N/A  4.10k      11 GiB       22 GiB
Still 4096 objects of 4 MiB each, 16 GiB before compression. Now the
stats say it compressed 22 GiB down to 11 GiB and stored 14 GiB as a
result. I would have expected it to compress all 24 GiB of data, and
much better than just 2:1. With 4 MiB objects, 6+3 erasure code and
the 64 KiB bluestore allocation size, one chunk is about 4 MiB / 6,
which spans around 11 bluestore allocations, so I don't think the
problem is the rounding up to the nearest 64K here.
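The rough numbers behind that reasoning, in case I'm miscounting:

# Size of one EC data chunk per 4 MiB object, and how many 64 KiB
# bluestore allocation units it spans.
object_size = 4 * 1024 * 1024     # 4 MiB cephfs objects
k = 6                             # data chunks in the 6+3 profile
min_alloc = 64 * 1024             # default bluestore allocation size on HDD
shard = object_size / k
print(shard / 1024)               # ~682.7 KiB per data chunk
print(shard / min_alloc)          # ~10.7 allocation units per chunk

So rounding each chunk up to 11 allocation units should only waste a
few percent, nowhere near a factor of two.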
I've tried increasing the cephfs object size to 16 MiB. That gives
fewer objects (1024 instead of 4096), but the compression stats are
exactly the same. I've also tried other compression algorithms besides
the default snappy (lz4 and zlib) - exactly the same result. Any ideas?
Andras
On 2/17/20 3:59 AM, Igor Fedotov wrote:
Hi Andras,
please find my answers inline.
On 2/15/2020 12:27 AM, Andras Pataki wrote:
We're considering using bluestore compression for some of our data,
and I'm not entirely sure how to interpret compression results. As
an example, one of the osd perf dump results shows:
"bluestore_compressed": 28089935,
"bluestore_compressed_allocated": 115539968,
"bluestore_compressed_original": 231079936,
Am I right to interpret this as "Originally 231079936 bytes (231GB)
were compressed to 28089935 bytes (28GB) and were stored in 115539968
bytes (115GB)". If so, why is there such a huge discrepancy between
the compressed_allocated and compressed numbers (115GB vs 28GB)?
That's right, except those are MBs, not GBs.
The discrepancy is most probably caused by write block sizes and
BlueStore's allocation granularity. For spinning drives the minimal
allocation size is 64K by default. E.g., for a 128K written block that
takes 24K after compression, the allocated disk chunk would be 64K.
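A quick sketch of that rounding (assuming the default 64K min_alloc_size):

# Allocated space is rounded up to min_alloc_size, so 24K of
# compressed data still occupies a full 64K chunk on disk.
def allocated(compressed_bytes, min_alloc=64 * 1024):
    return -(-compressed_bytes // min_alloc) * min_alloc  # ceiling division

print(allocated(24 * 1024))       # 65536 -> 64K allocated for 24K of data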
Also, 'ceph df detail' shows some compression stats:
POOL                ID  STORED   OBJECTS  USED     %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY    USED COMPR  UNDER COMPR
cephfs_data_ec63_c   5  1.2 TiB  311.93k  1.7 TiB   0.02    5.4 PiB            N/A          N/A  311.93k     116 GiB      232 GiB
Would this indicate that the compressed data takes up exactly half
the space (which coincides with what the OSDs are saying)? It is odd
that the allocated numbers are always exactly half the original
numbers ...
The USED COMPR column shows the amount of space allocated for
compressed data, i.e. it includes the compressed data plus all the
allocation, replication and erasure coding overhead.
UNDER COMPR is the amount of data passed through compression (summed
over all replicas) that was beneficial enough to be stored in
compressed form.
Here is another example from a 'ceph df detail' sample, where a 500K
object was compressed only partially (448K of it, due to the unaligned
object size). The 384K under USED COMPR comes from the 3x128K
allocation units that hold the compressed data on 3 replicas. The
1.31M under UNDER COMPR denotes the 3x448K of user data that was
compressed across the 3 replicas. In total this required (see USED)
576K = 3x128K + 3x64K, where 64K = ROUND_UP_TO(500K - 448K, 64K).
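The same arithmetic spelled out (3x replication and 64K/128K
allocation granularity as above):

# 500K object, of which the 448K aligned part compresses; the 52K
# tail is stored uncompressed and rounded up to 64K per replica.
K = 1024
replicas = 3
compressed_part = 448 * K           # user data that went through compression
compressed_alloc = 128 * K          # allocation units holding it, per replica
tail_alloc = 64 * K                 # ROUND_UP_TO(500K - 448K, 64K)
print(replicas * compressed_alloc // K)                  # 384  -> USED COMPR (KiB)
print(replicas * compressed_part / (1024 * K))           # ~1.31 -> UNDER COMPR (MiB)
print(replicas * (compressed_alloc + tail_alloc) // K)   # 576  -> USED (KiB)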
This is ceph Nautilus with all the defaults for compression (except
turning it on for the pool in question). Any insights here would be
appreciated.
Thanks,
Andras
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx