Hi Andras,
I think you're missing at least one more important aspect in your
calculations: the write block size. BlueStore compresses each written
block independently. The Ceph object size in your case is presumably
4 MiB, which (as far as I understand EC functioning) is split into 9
parts, so the per-OSD object size becomes around 466K and the write
size is no larger than that. Hence BlueStore compresses blocks of
466 KiB (in fact even less, as it operates on blocks aligned with the
minimal allocation size - 4K / 64K for SSD/HDD respectively) instead of
4M. This drops the compression ratio significantly. The compressed data
is then rounded up to the allocation size, and the tail part is not
compressed but is rounded up to the allocation size as well.
E.g. for HDD (min alloc size = 64 KiB) you'll have 448 KiB as the block
under compression and an 18 KiB tail. If the 448K compress down to 100K,
it still takes 128K (rounded up to the 64K boundary) to keep that
compressed block on disk, plus an additional 64K to keep the
uncompressed tail. Hence 466 KiB needs 192 KiB. Moreover, with a 64 KiB
alloc size the target size would never drop below 128 KiB for this
input, since the head and the tail each need at least one allocation
unit.
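To spell out that rounding arithmetic (a quick shell check, sizes in
KiB, 64 KiB allocation units):

   alloc=64
   echo $(( (100 + alloc - 1) / alloc * alloc ))  # compressed 448K head -> 100K -> 128K on disk
   echo $(( (18 + alloc - 1) / alloc * alloc ))   # uncompressed 18K tail        ->  64K on disk
   # 128K + 64K = 192K on disk for 466K of logical data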
That is much, much worse than the near-ideal compression you get when
compressing the huge 16 GiB file as a whole. And if for some reason
writes come in with even shorter sizes, the ratio drops further still.
Generally I suggest simplifying the test scenario if you want to track
what's happening: use a replicated pool and short compressible objects
with a known compression ratio (e.g. 64K, 256K or 4M in length; then try
sizes unaligned to the allocation size, etc.). And use the rados CLI
tool to put such an object - that definitely performs a single write
with a block size equal to the object size. Then inspect the resulting
stats.
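For example, something along these lines (pool and object names here
are just placeholders):

   dd if=/dev/zero of=/tmp/obj_4m bs=4M count=1  # 4 MiB of zeros - trivially compressible test data
   rados -p test_repl put testobj /tmp/obj_4m    # a single write with block size == object size
   ceph df detail                                # then check USED COMPR / UNDER COMPR for the pool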
If your cluster is free of any other load, you might also want to reset
the OSD performance counters before putting the object and then gather
some additional info from them, e.g. big (i.e. alloc-size aligned) vs.
small write counts/volumes. Or at least I'll be able to explain step by
step what happened for this single object... This also requires
locating the target OSDs for each specific object name, though.
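Roughly like this (placeholder names again; N is an OSD id from the
object's acting set, and 'ceph daemon' has to run on that OSD's host):

   ceph osd map test_repl testobj      # shows which PG / OSDs (acting set) hold this object
   ceph daemon osd.N perf reset all    # reset the counters before the write
   rados -p test_repl put testobj /tmp/obj_4m
   ceph daemon osd.N perf dump | grep -E 'bluestore_write|bluestore_compressed'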
Hope this helps,
Igor.
On 2/19/2020 8:02 PM, Andras Pataki wrote:
Hi Igor,
Thanks for the insightful details on how to interpret the compression
data. I'm still a bit confused about why compression doesn't work
better in my case, so I've decided to try a test. I created a 16 GiB
cephfs file which is just the 4 characters 'abcd' repeated essentially
4 billion times - purposefully extremely compressible data. I'm
writing this file into an empty pool that is erasure coded with 6 data
+ 3 parity chunks.
Just to verify data compressibility, here is the file before and after
running gzip on it - it compresses nearly 700-fold:
-rw-r--r-- 1 root root 17179869184 Feb 19 08:08 compressible_file
-rw-r--r-- 1 root root 24998233 Feb 19 08:34 compressible_file.gz
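For anyone who wants to reproduce this, an equivalent file can be
generated with something like:

   yes abcd | tr -d '\n' | head -c 17179869184 > compressible_file   # 16 GiB of repeated 'abcd'
   gzip -k compressible_file                                         # keep the original next to the .gz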
First with compression set to 'none' on the pool (ceph df detail):
POOL                 ID  STORED  OBJECTS  USED    %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
cephfs_data_ec63_c    5  16 GiB    4.10k  25 GiB      0    5.4 PiB            N/A          N/A  4.10k         0 B          0 B
This is all good - 4096 objects, each 4MiB (default cephfs chunking),
16GiB stored, with a 1.5x overhead, so 24GiB should be used (ok, it
says 25GiB, probably some rounding).
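The extra GiB seems consistent with per-chunk allocation rounding,
assuming a 64 KiB min_alloc_size: each 4 MiB object becomes 9 EC chunks
of 4 MiB / 6 ≈ 683 KiB, and each chunk would round up to 11 x 64 KiB =
704 KiB on disk:

   echo $(( 4096 * 9 * 704 ))   # KiB allocated -> 25952256 KiB ≈ 24.75 GiB, which df shows as 25 GiB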
Now with compression set to 'aggressive', same extremely compressible
file:
POOL                 ID  STORED  OBJECTS  USED    %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
cephfs_data_ec63_c    5  16 GiB    4.10k  14 GiB      0    5.4 PiB            N/A          N/A  4.10k      11 GiB       22 GiB
Still 4096 objects of 4 MiB each, 16 GiB before compression. Now the
stats say it compressed 22 GiB to 11 GiB and stored 14 GiB as a result.
I would have expected it to compress all 24 GiB of data, and much
better than just 2:1. With 4 MiB objects, the 6+3 erasure code and the
64 KiB bluestore allocation size, one chunk is about 4 MiB / 6, which
is around 11 bluestore allocations, so I don't think the problem is the
rounding up to the nearest 64K here.
I've tried increasing the cephfs object size to 16M. Then, there are
fewer objects (1024 instead of 4096), but the compression stats are
exactly the same. Also, I've tried other compression algorithms
besides the default snappy (lz4 and zlib) - exactly the same result.
Any ideas?
Andras
On 2/17/20 3:59 AM, Igor Fedotov wrote:
Hi Andras,
please find my answers inline.
On 2/15/2020 12:27 AM, Andras Pataki wrote:
We're considering using bluestore compression for some of our data,
and I'm not entirely sure how to interpret compression results. As
an example, one of the osd perf dump results shows:
"bluestore_compressed": 28089935,
"bluestore_compressed_allocated": 115539968,
"bluestore_compressed_original": 231079936,
Am I right to interpret this as "Originally 231079936 bytes (231GB)
were compressed to 28089935 bytes (28GB) and were stored in
115539968 bytes (115GB)". If so, why is there such a huge
discrepancy between the compressed_allocated and compressed numbers
(115GB vs 28GB)?
That's right, except it's MBs, not GBs.
The discrepancy is most probably caused by the write block sizes and
BlueStore's allocation granularity. For spinning drives the minimal
allocation size is 64K by default.
E.g. a 128K written block that takes 24K after compression is still
allocated a 64K disk chunk.
Also, ceph df detail shows some compression stats:
POOL                 ID  STORED   OBJECTS  USED     %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY    USED COMPR  UNDER COMPR
cephfs_data_ec63_c    5  1.2 TiB  311.93k  1.7 TiB   0.02    5.4 PiB            N/A          N/A  311.93k     116 GiB      232 GiB
Would this indicate that the compressed data takes up exactly half the
space (which coincides with what the OSDs are saying)? It is odd that
the allocated numbers are always exactly half the original numbers ...
The USED COMPR column shows the amount of space allocated for
compressed data, i.e. it includes the compressed data plus all the
allocation, replication and erasure-coding overhead.
UNDER COMPR is the amount of data passed through compression (summed
over all replicas) that was beneficial enough to be stored in
compressed form. Here is another 'ceph df detail' example, where a 500K
object in a 3x replicated pool was compressed (only partially, due to
the unaligned object size - just the first 448K). The 384K in USED
COMPR comes from the 3 x 128K allocation units keeping the compressed
data on the 3 replicas. The 1.31M in the UNDER COMPR column is the
3 x 448K blocks of user data compressed across the 3 replicas. In total
this required (see USED) 576K = 3 x 128K + 3 x 64K, where 64K =
ROUND_UP_TO(500K - 448K, 64K).
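Spelled out as arithmetic (sizes in KiB, 3 replicas, 64K allocation
units):

   echo $(( 3 * 128 ))          # USED COMPR:  3 x 128K allocated for the compressed 448K -> 384K
   echo $(( 3 * 448 ))          # UNDER COMPR: 3 x 448K of user data fed to the compressor -> 1344K ≈ 1.31M
   echo $(( 3 * (128 + 64) ))   # USED: compressed part plus the rounded-up 52K tail, per replica -> 576K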
This is ceph Nautilus with all the defaults for compression (except
turning it on for the pool in question). Any insights here would be
appreciated.
Thanks,
Andras
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx