Re: ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool

Oh, I just read your message again, and I see that I didn't answer your
question. :D I admit I don't know how MAX AVAIL is calculated, and whether
it takes things like imbalance into account (it might).

Josh

On Tue, Jul 6, 2021 at 7:41 AM Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx>
wrote:

> Hey Wladimir,
>
> That output looks like it's from Nautilus or later. My understanding is
> that the USED column is in raw bytes, whereas STORED is "user" bytes. If
> you're using EC 2:1 for all of those pools, I would expect USED to be at
> least 1.5x STORED, which looks to be the case for jerasure21. Perhaps your
> libvirt pool is 3x replicated, in which case the numbers add up as well.
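A minimal sketch of that bookkeeping in Python, ignoring allocation overhead and
metadata; the helper name is just for illustration, and the EC 2+1 and 3x-replica
figures are the assumptions discussed above, not values read from the cluster:

    # Minimum raw bytes needed to store `stored` user bytes in a pool (illustrative helper).
    def expected_raw_used(stored, k=None, m=None, replicas=None):
        if replicas is not None:
            return stored * replicas            # replicated pool
        return stored * (k + m) / k             # EC pool: data + parity chunks

    TiB = 2**40
    # jerasure21, assumed EC 2+1: at least 1.5x STORED
    print(expected_raw_used(9.0 * TiB, k=2, m=1) / TiB)    # ~13.5 vs 13 TiB shown
    # libvirt, assumed 3x replicated
    print(expected_raw_used(1.5 * TiB, replicas=3) / TiB)  # 4.5 vs 4.5 TiB shown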
>
> Josh
>
> On Tue, Jul 6, 2021 at 5:51 AM Wladimir Mutel <mwg@xxxxxxxxx> wrote:
>
>>         I started my experimental 1-host/8-HDD setup in 2018 with Luminous,
>>         and I read
>> https://ceph.io/community/new-luminous-erasure-coding-rbd-cephfs/ ,
>>         which got me interested in using Bluestore and rewritable EC
>>         pools for RBD data.
>>         I have about 22 TiB of raw storage, and ceph df shows this:
>>
>> --- RAW STORAGE ---
>> CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
>> hdd    22 TiB  2.7 TiB  19 TiB    19 TiB      87.78
>> TOTAL  22 TiB  2.7 TiB  19 TiB    19 TiB      87.78
>>
>> --- POOLS ---
>> POOL                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>> jerasure21              1  256  9.0 TiB    2.32M   13 TiB  97.06    276 GiB
>> libvirt                 2  128  1.5 TiB  413.60k  4.5 TiB  91.77    140 GiB
>> rbd                     3   32  798 KiB        5  2.7 MiB      0    138 GiB
>> iso                     4   32  2.3 MiB       10  8.0 MiB      0    138 GiB
>> device_health_metrics   5    1   31 MiB        9   94 MiB   0.02    138 GiB
>>
>>         If I add USED for libvirt and jerasure21, I get 17.5 TiB, and
>>         2.7 TiB is shown at RAW STORAGE/AVAIL.
>>         The sum of POOLS/MAX AVAIL is about 840 GiB, so where is my other
>>         2.7 - 0.840 =~ 1.86 TiB ???
>>         Or, in other words, where is my (RAW STORAGE/RAW USED) -
>>         (SUM(POOLS/USED)) = 19 - 17.5 = 1.5 TiB ?
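Purely for reference, the same arithmetic as a tiny Python sketch (the variable
names are just for illustration; the figures are copied from the ceph df output
above):

    raw_avail     = 2.7       # RAW STORAGE / AVAIL, in TiB
    sum_max_avail = 0.840     # approximate sum of POOLS / MAX AVAIL, in TiB
    raw_used      = 19.0      # RAW STORAGE / RAW USED, in TiB
    sum_pool_used = 17.5      # jerasure21 USED + libvirt USED, in TiB

    print(raw_avail - sum_max_avail)   # ~1.86 TiB of "available" space not shown per pool
    print(raw_used - sum_pool_used)    # ~1.5 TiB of raw usage not attributed to any pool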
>>
>>         As it does not seem I will get any more hosts for this setup,
>>         I am seriously thinking of bringing down this Ceph cluster
>>         and instead setting up Btrfs storing qcow2 images served over
>>         iSCSI, which looks simpler to me for a single-host situation.
>>
>> Josh Baergen wrote:
>> > Hey Wladimir,
>> >
>> > I actually don't know where this is referenced in the docs, if anywhere.
>> > Googling around shows many people discovering this overhead the hard way
>> > on ceph-users.
>> >
>> > I also don't know the rbd journaling mechanism in enough depth to comment
>> > on whether it could be causing this issue for you. Are you seeing a high
>> > allocated:stored ratio on your cluster?
>> >
>> > Josh
>> >
>> > On Sun, Jul 4, 2021 at 6:52 AM Wladimir Mutel <mwg@xxxxxxxxx> wrote:
>> >
>> >     Dear Mr Baergen,
>> >
>> >     thanks a lot for your very concise explanation,
>> >     however I would like to learn more about why the default Bluestore
>> >     allocation size causes such a big storage overhead,
>> >     and where in the Ceph docs it is explained what to watch for to
>> >     avoid hitting this phenomenon again and again.
>> >     I have a feeling this is what I get on my experimental Ceph setup
>> >     with the simplest JErasure 2+1 data pool.
>> >     Could it be caused by journaled RBD writes to the EC data pool?
>> >
>> >     Josh Baergen wrote:
>> >      > Hey Arkadiy,
>> >      >
>> >      > If the OSDs are on HDDs and were created with the default
>> >      > bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus,
>> >      > then in effect data will be allocated from the pool in 640KiB
>> >      > chunks (64KiB * (k+m)). 5.36M objects taking up 501GiB is an
>> >      > average object size of 98KiB, which results in a ratio of 6.53:1
>> >      > allocated:stored, which is pretty close to the 7:1 observed.
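The arithmetic behind those figures, as a small Python sketch (the 64 KiB default
and the k=6/m=4 profile are taken from this thread; the estimate works from the
average object size, so it is only approximate):

    import math

    KiB, GiB, TiB = 2**10, 2**30, 2**40

    min_alloc = 64 * KiB      # bluestore_min_alloc_size_hdd default in Octopus
    k, m      = 6, 4          # EC profile from the original message
    stored    = 501 * GiB
    objects   = 5.36e6

    avg_obj = stored / objects                          # ~98 KiB per object
    # Each object is split into k data shards plus m parity shards, and every
    # shard is rounded up to a whole number of min_alloc chunks on its OSD.
    chunks_per_shard = math.ceil(avg_obj / k / min_alloc)
    allocated = chunks_per_shard * min_alloc * (k + m) * objects

    print(avg_obj / KiB)        # ~98
    print(allocated / TiB)      # ~3.2 TiB allocated
    print(allocated / stored)   # ~6.5:1, close to the observed 7:1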
>> >      >
>> >      > If my assumption about your configuration is correct, then the
>> >      > only way to fix this is to adjust bluestore_min_alloc_size_hdd
>> >      > and recreate all your OSDs, which will take a while...
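For comparison, a sketch of the same estimate with a smaller allocation unit
(4 KiB is used here purely as an illustrative value; check the defaults for your
release before changing anything):

    import math

    KiB, GiB = 2**10, 2**30
    k, m     = 6, 4
    objects  = 5.36e6
    stored   = 501 * GiB
    avg_obj  = stored / objects

    def est_allocated(min_alloc):
        # Round each of the k+m shards up to whole min_alloc chunks.
        chunks_per_shard = math.ceil(avg_obj / k / min_alloc)
        return chunks_per_shard * min_alloc * (k + m) * objects

    for min_alloc in (64 * KiB, 4 * KiB):
        ratio = est_allocated(min_alloc) / stored
        print(f"min_alloc = {min_alloc // KiB:>2} KiB -> allocated:stored ~ {ratio:.2f}:1")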
>> >      >
>> >      > Josh
>> >      >
>> >      > On Tue, Jun 29, 2021 at 3:07 PM Arkadiy Kulev <eth@xxxxxxxxxxxx> wrote:
>> >      >
>> >      >> The pool *default.rgw.buckets.data* has *501 GiB* stored, but
>> >      >> USED shows *3.5 TiB* (7 times higher!):
>> >      >>
>> >      >> root@ceph-01:~# ceph df
>> >      >> --- RAW STORAGE ---
>> >      >> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
>> >      >> hdd    196 TiB  193 TiB  3.5 TiB   3.6 TiB       1.85
>> >      >> TOTAL  196 TiB  193 TiB  3.5 TiB   3.6 TiB       1.85
>> >      >>
>> >      >> --- POOLS ---
>> >      >> POOL                       ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
>> >      >> device_health_metrics       1    1   19 KiB       12   56 KiB      0     61 TiB
>> >      >> .rgw.root                   2   32  2.6 KiB        6  1.1 MiB      0     61 TiB
>> >      >> default.rgw.log             3   32  168 KiB      210   13 MiB      0     61 TiB
>> >      >> default.rgw.control         4   32      0 B        8      0 B      0     61 TiB
>> >      >> default.rgw.meta            5    8  4.8 KiB       11  1.9 MiB      0     61 TiB
>> >      >> default.rgw.buckets.index   6    8  1.6 GiB      211  4.7 GiB      0     61 TiB
>> >      >>
>> >      >> default.rgw.buckets.data   10  128  501 GiB    5.36M  3.5 TiB   1.90    110 TiB
>> >      >>
>> >      >> The *default.rgw.buckets.data* pool is using erasure coding:
>> >      >>
>> >      >> root@ceph-01:~# ceph osd erasure-code-profile get EC_RGW_HOST
>> >      >> crush-device-class=hdd
>> >      >> crush-failure-domain=host
>> >      >> crush-root=default
>> >      >> jerasure-per-chunk-alignment=false
>> >      >> k=6
>> >      >> m=4
>> >      >> plugin=jerasure
>> >      >> technique=reed_sol_van
>> >      >> w=8
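For reference, the nominal overhead of that k=6, m=4 profile by itself, before
any allocation-size effects, is only about 1.67x (a quick sketch):

    # Raw bytes = stored * (k + m) / k for an EC pool, ignoring allocation overhead.
    k, m = 6, 4
    print((k + m) / k)   # ~1.67x, far below the ~7x seen for this pool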
>> >      >>
>> >      >> If anyone could help explain why it's using up 7 times more
>> >      >> space, it would help a lot. Versioning is disabled. ceph version
>> >      >> 15.2.13 (octopus stable).
>> >      >>
>> >      >> Sincerely,
>> >      >> Ark.
>> >
>>
>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


