Re: Ceph Erasure Coding - Stored vs used

Kristof Coucke writes:
> I have an issue on my Ceph cluster.
> For one of my pools I have 107TiB STORED and 298TiB USED.
> This is strange, since I've configured erasure coding (6 data chunks, 3
> coding chunks).
> So, in an ideal world this should result in approx. 160.5TiB USED.

> The question now is why this is the case...
> There are 473+M objects stored. Lots of these files are pretty small
> (read: 150 KB files), though not all of them.
> I am running Nautilus version 14.2.4.

> I suspect that the stripe size is related to this issue.

I think the relevant factor is "bluestore min alloc size".

473 million objects at 6+3 EC correspond to 4.257 billion chunks (or
"shards", in EC terms) that have to be stored on the OSDs.

If you use Bluestore with the default min alloc size of 64 KiB (the HDD
default), each of those chunks occupies at least one allocation unit, so
they will take up at least 253.7 TiB.  This pretty well matches the
occupancy you see, assuming that you have some objects larger than
384 KiB (6 data chunks x 64 KiB, i.e. more than one allocation unit per
chunk) that can explain the additional ~44 TiB.
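
For illustration, here is that back-of-the-envelope arithmetic as a
small Python sketch (my own numbers; it simply assumes every chunk
costs at least one allocation unit):

    objects   = 473_000_000        # reported object count in the pool
    k, m      = 6, 3               # EC profile: 6 data + 3 coding chunks
    min_alloc = 64 * 1024          # default bluestore min alloc size (HDD), bytes

    chunks      = objects * (k + m)    # ~4.257 billion chunks on the OSDs
    floor_bytes = chunks * min_alloc   # each chunk takes at least one alloc unit

    print(f"{chunks / 1e9:.3f} billion chunks")
    print(f"at least {floor_bytes / 2**40:.1f} TiB used")   # ~253.7 TiB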

If I understand correctly, there are ongoing discussions about making
the default min alloc size smaller, at least for read-only or
append-only RADOS objects.  Presumably you could also change the default
globally, but that might have adverse effects if you serve RBD images.
Also, as far as I know the setting only takes effect when an OSD is
(re)created, so it won't change anything for already-existing data.

> This is still the default (4MB), but I am not sure.  Before BlueFS it
> was easy to check the size of the chunks on the disk...  With BlueFS
> this is another story.

> I have the following questions:
> 1. How can I check this to be sure that this is the case? I actually
> want to drill down starting from an object I've sent to the Ceph
> cluster through the RGW. I would like to see where the chunks are
> stored and what size is allocated for them on the disks.

Yeah, I'd be curious about that as well.

> 2. If it is related to the stripe size, can I safely adapt this
> parameter, and will it apply to new objects only or also retroactively
> to existing ones?

See above, I think it's more about (Bluestore) min alloc size.  I don't
have any actual experience with changing this default, and there's
always some risk that you might get into less well-tested territory.
But if your cluster is only used for RadosGW, it might be worth reducing
bluestore min alloc size from the default (64 KiB) to something smaller.
In the discussions I have seen a proposal to reduce it to 4096 bytes (at
least for read-only/append-only objects).  If you are worried about a
possible explosion in the number of write operations, you can try a
value that is higher than that, but still lower than 64 KiB.  If most of
your objects are 150 KiB or larger, then a min alloc size of 16 KiB or
even 32 KiB should be enough to cut your small-object storage overhead
in half.
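
To put a rough number on it, here is a small Python sketch (my own
illustration, ignoring per-object metadata and EC stripe-unit details)
of what a single 150 KiB object costs in RAW space under 6+3 EC for a
few min alloc sizes:

    import math

    def raw_used(obj_bytes, k=6, m=3, min_alloc=64 * 1024):
        chunk = math.ceil(obj_bytes / k)                      # data/coding chunk size
        per_chunk = math.ceil(chunk / min_alloc) * min_alloc  # rounded up to alloc units
        return per_chunk * (k + m)

    obj = 150 * 1024
    ideal = obj * (6 + 3) // 6                                # 225 KiB with no rounding
    for alloc_kib in (64, 32, 16, 4):
        used = raw_used(obj, min_alloc=alloc_kib * 1024)
        print(f"min_alloc {alloc_kib:>2} KiB -> {used // 1024} KiB used "
              f"(ideal: {ideal // 1024} KiB)")

With 64 KiB that comes out to 576 KiB per object, with 32 or 16 KiB to
288 KiB, and with 4 KiB to 252 KiB, versus 225 KiB ideal.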

Hope this helps,
-- 
Simon.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


