I started my experimental 1-host/8-HDDs setup in 2018 with Luminous,
and I read https://ceph.io/community/new-luminous-erasure-coding-rbd-cephfs/ ,
which got me interested in using BlueStore and EC pools with overwrites enabled for RBD data.
I have about 22 TiB of raw storage, and ceph df shows this:
--- RAW STORAGE ---
CLASS  SIZE    AVAIL    USED    RAW USED  %RAW USED
hdd    22 TiB  2.7 TiB  19 TiB  19 TiB        87.78
TOTAL  22 TiB  2.7 TiB  19 TiB  19 TiB        87.78

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
jerasure21              1  256  9.0 TiB    2.32M   13 TiB  97.06    276 GiB
libvirt                 2  128  1.5 TiB  413.60k  4.5 TiB  91.77    140 GiB
rbd                     3   32  798 KiB        5  2.7 MiB      0    138 GiB
iso                     4   32  2.3 MiB       10  8.0 MiB      0    138 GiB
device_health_metrics   5    1   31 MiB        9   94 MiB   0.02    138 GiB
If I add up USED for libvirt and jerasure21, I get 17.5 TiB, and RAW STORAGE/AVAIL shows 2.7 TiB.
The sum of POOLS/MAX AVAIL is about 830 GiB, so where are the other 2.7 - 0.83 =~ 1.87 TiB?
Or, in other words, where are the (RAW STORAGE/RAW USED) - SUM(POOLS/USED) = 19 - 17.5 = 1.5 TiB?
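A rough sanity check of these numbers in Python (the 1.5x factor assumes jerasure21 is the 2+1 EC pool mentioned further down; the 3x factor for libvirt is only my guess from its 4.5/1.5 USED:STORED ratio):

# Figures copied from the `ceph df` output above; the redundancy factors
# below are inferred, not queried from the cluster.
pool_stored_tib = {"jerasure21": 9.0, "libvirt": 1.5}    # STORED column
pool_used_tib   = {"jerasure21": 13.0, "libvirt": 4.5}   # USED column
raw_used_tib    = 19.0

raw_factor = {"jerasure21": 3 / 2,   # EC 2+1 -> (k+m)/k = 1.5x (assumed)
              "libvirt":    3.0}     # replicated size=3 (guessed from 4.5/1.5)

for pool, stored in pool_stored_tib.items():
    print(f"{pool}: expect ~{stored * raw_factor[pool]:.1f} TiB raw, "
          f"ceph df reports {pool_used_tib[pool]:.1f} TiB")

attributed = sum(pool_used_tib.values())            # 17.5 TiB
print(f"attributed to pools: {attributed:.1f} TiB")
print(f"not attributed:      {raw_used_tib - attributed:.1f} TiB")
# My understanding is that RAW USED also counts space the OSDs consume
# outside of pool data (BlueFS/RocksDB metadata, WAL, allocation rounding),
# which `ceph df` does not attribute back to any pool.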
As it does not seem I will get any more hosts for this setup,
I am seriously considering tearing this Ceph cluster down
and instead setting up Btrfs with qcow2 images served over iSCSI,
which looks simpler to me for a single-host situation.
Josh Baergen wrote:
Hey Wladimir,
I actually don't know where this is referenced in the docs, if anywhere. Googling around shows many people discovering this overhead the hard way on ceph-users.
I also don't know the rbd journaling mechanism in enough depth to comment on whether it could be causing this issue for you. Are you seeing a high
allocated:stored ratio on your cluster?
Josh
On Sun, Jul 4, 2021 at 6:52 AM Wladimir Mutel <mwg@xxxxxxxxx> wrote:
Dear Mr Baergen,
thanks a lot for your very concise explanation.
However, I would like to learn more about why the default BlueStore allocation size causes such a big storage overhead,
and where in the Ceph docs it is explained what to watch for to avoid hitting this phenomenon again and again.
I have a feeling this is what I am getting on my experimental Ceph setup with its simple jerasure 2+1 data pool.
Could it be caused by journaled RBD writes to the EC data pool?
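To illustrate the effect I suspect: with the default 64 KiB bluestore_min_alloc_size_hdd, every chunk of an object on a 2+1 EC pool is rounded up to 64 KiB on each of the three OSDs it touches, so small objects (as journal entries might well be, if that is indeed the cause) allocate far more than they store. A minimal sketch of that rounding, purely illustrative:

import math

MIN_ALLOC_KIB = 64   # default bluestore_min_alloc_size_hdd up to Octopus
K, M = 2, 1          # jerasure 2+1 profile

def allocated_kib(object_kib: float) -> int:
    """Worst-case allocation for one object on an EC k+m pool: the object
    is striped into k data chunks plus m coding chunks, and each chunk is
    rounded up to min_alloc_size on its OSD."""
    per_chunk = math.ceil(object_kib / K / MIN_ALLOC_KIB) * MIN_ALLOC_KIB
    return per_chunk * (K + M)

for size in (4, 16, 64, 1024, 4096):   # object sizes in KiB
    alloc = allocated_kib(size)
    print(f"{size:>5} KiB object -> {alloc:>5} KiB allocated ({alloc / size:.1f}:1)")

# The 1.5:1 floor is the inherent 2+1 redundancy; everything above it is
# min_alloc_size padding, which only matters for small objects.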
Josh Baergen wrote:
> Hey Arkadiy,
>
> If the OSDs are on HDDs and were created with the default
> bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus, then in
> effect data will be allocated from the pool in 640KiB chunks (64KiB *
> (k+m)). 5.36M objects taking up 501GiB is an average object size of 98KiB
> which results in a ratio of 6.53:1 allocated:stored, which is pretty close
> to the 7:1 observed.
>
> If my assumption about your configuration is correct, then the only way to
> fix this is to adjust bluestore_min_alloc_size_hdd and recreate all your
> OSDs, which will take a while...
>
> Josh
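Spelling out Josh's arithmetic in Python (figures from the ceph df output quoted further down; the 64 KiB default is an assumption, as he notes):

MIN_ALLOC_KIB = 64        # assumed default bluestore_min_alloc_size_hdd
K, M = 6, 4               # from the EC_RGW_HOST profile quoted below

stored_gib = 501
objects    = 5.36e6
used_tib   = 3.5

avg_object_kib = stored_gib * 1024 * 1024 / objects          # ~98 KiB
min_alloc_per_object_kib = MIN_ALLOC_KIB * (K + M)           # 640 KiB
# 640 KiB per object holds as long as each of the k data shards stores
# <= 64 KiB of the object, i.e. for objects up to k * 64 KiB = 384 KiB.
predicted = min_alloc_per_object_kib / avg_object_kib        # ~6.5:1
observed  = used_tib * 1024 / stored_gib                     # ~7.2:1

print(f"average object size : {avg_object_kib:.0f} KiB")
print(f"min alloc per object: {min_alloc_per_object_kib} KiB")
print(f"predicted ratio     : {predicted:.2f}:1 vs observed {observed:.2f}:1")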
>
> On Tue, Jun 29, 2021 at 3:07 PM Arkadiy Kulev <eth@xxxxxxxxxxxx> wrote:
>
>> The pool default.rgw.buckets.data has 501 GiB stored, but USED shows
>> 3.5 TiB (7 times higher!):
>>
>> root@ceph-01:~# ceph df
>> --- RAW STORAGE ---
>> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
>> hdd    196 TiB  193 TiB  3.5 TiB  3.6 TiB        1.85
>> TOTAL  196 TiB  193 TiB  3.5 TiB  3.6 TiB        1.85
>>
>> --- POOLS ---
>> POOL                       ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
>> device_health_metrics       1    1   19 KiB       12   56 KiB      0     61 TiB
>> .rgw.root                   2   32  2.6 KiB        6  1.1 MiB      0     61 TiB
>> default.rgw.log             3   32  168 KiB      210   13 MiB      0     61 TiB
>> default.rgw.control         4   32      0 B        8      0 B      0     61 TiB
>> default.rgw.meta            5    8  4.8 KiB       11  1.9 MiB      0     61 TiB
>> default.rgw.buckets.index   6    8  1.6 GiB      211  4.7 GiB      0     61 TiB
>> default.rgw.buckets.data   10  128  501 GiB    5.36M  3.5 TiB   1.90    110 TiB
>>
>> The default.rgw.buckets.data pool is using erasure coding:
>>
>> root@ceph-01:~# ceph osd erasure-code-profile get EC_RGW_HOST
>> crush-device-class=hdd
>> crush-failure-domain=host
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=6
>> m=4
>> plugin=jerasure
>> technique=reed_sol_van
>> w=8
>>
>> If anyone could help explain why it's using up 7 times more space, it would
>> help a lot. Versioning is disabled. ceph version 15.2.13 (octopus stable).
>>
>> Sincerely,
>> Ark.
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx