Re: BlueFS spillover detected, why, what?

Igor Fedotov <ifedotov@xxxxxxx> · Thu, 20 Aug 2020 15:54:58 +0300

Hi Seena,

this parameter isn't intended to be adjusted in production environments;
the default behavior is supposed to cover all regular customers' needs.

The issue, though, is that the default setting is invalid. It should be
'use_some_extra'. Gonna fix that shortly...

Thanks,

Igor

On 8/20/2020 1:44 PM, Seena Fallah wrote:
Hi Igor.

Could you please tell me why this config is in LEVEL_DEV
(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As documented in Ceph, we can't use LEVEL_DEV options in production
environments!

Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

    Hi Simon,

    starting with Nautilus v14.2.10, BlueStore is able to use 'wasted' space
    on the DB volume.

    see this PR: https://github.com/ceph/ceph/pull/29687

    A nice overview of the overall BlueFS/RocksDB design can be found here:

    https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

    It also includes some overview of (as well as additional concerns about)
    the changes brought by the above-mentioned PR.

    Thanks,

    Igor

    On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
    > Hi Michael,
    >
    > thanks for the explanation! So if I understand correctly, we waste 93
    > GB per OSD on unused NVME space, because only 30GB is actually used...?
    >
    > And to improve the space for rocksdb, we need to plan for 300GB per
    > rocksdb partition in order to benefit from this advantage....
    >
    > Reducing the number of small files is something we always ask of our
    > users, but reality is what it is ;-)
    >
    > I'll have to look into how I can get an informative view on these
    > metrics... The amount of information coming out of the ceph cluster is
    > pretty overwhelming, even when you look only superficially...
    >
    > Cheers,
    >
    > /Simon
    >
    > On 20/08/2020 10:16, Michael Bisig wrote:
    >> Hi Simon
    >>
    >> As far as I know, RocksDB only uses "leveled" space on the NVME
    >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Any
    >> DB space above such a limit will automatically end up on slow devices.
    >> In your setup, where you have 123GB per OSD, that means you only use
    >> 30GB of the fast device. The part of the DB that spills over this limit
    >> is offloaded to the HDD, and accordingly it slows down requests and
    >> compactions.
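
The arithmetic in the paragraph above can be sketched in a few lines of Python. This is only a simplified model of the behavior described in this thread, using the quoted limits (300MB, 3GB, 30GB, 300GB) as hard steps; the names LEVEL_LIMITS_GB and usable_fast_space_gb are made up for illustration, and the real thresholds depend on the RocksDB settings of the cluster.

```python
# Simplified model: the limits quoted in the thread, in GB.
# These are assumptions taken from the discussion, not values read from Ceph.
LEVEL_LIMITS_GB = [0.3, 3, 30, 300]


def usable_fast_space_gb(partition_gb):
    """Return the largest quoted limit that fits in the DB partition (0 if none)."""
    fitting = [limit for limit in LEVEL_LIMITS_GB if limit <= partition_gb]
    return max(fitting) if fitting else 0


partition_gb = 123  # DB logical volume size per OSD mentioned in the thread
used_gb = usable_fast_space_gb(partition_gb)
wasted_gb = partition_gb - used_gb
print(used_gb, wasted_gb)  # 30 93
```

Under this model a 123GB partition behaves like a 30GB one, which matches the 93GB of unused space discussed in the follow-up.
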
    >>
    >> You can check what your OSD currently consumes with:
    >>    ceph daemon osd.X perf dump
    >>
    >> Informative values are `db_total_bytes`, `db_used_bytes` and
    >> `slow_used_bytes`. These change regularly because of the ongoing
    >> compactions, but the Prometheus mgr module exports these values so that
    >> you can track them.
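
As a rough illustration of how these three counters could be read, here is a minimal Python sketch. It assumes the JSON printed by the perf dump command keeps the counters under a "bluefs" section; the function name spillover_summary and the sample numbers are invented for the example and are not output from a real OSD.

```python
import json


def spillover_summary(perf_dump):
    """Summarize BlueFS DB usage from a parsed perf dump dictionary.

    Assumes the counters sit under a "bluefs" key; missing keys count as 0.
    """
    bluefs = perf_dump.get("bluefs", {})
    total = bluefs.get("db_total_bytes", 0)
    used = bluefs.get("db_used_bytes", 0)
    slow = bluefs.get("slow_used_bytes", 0)
    return {
        "db_used_fraction": used / total if total else 0.0,
        "slow_used_bytes": slow,
        # Any DB data on the slow device means the OSD has spilled over.
        "spilled_over": slow > 0,
    }


# Made-up sample standing in for the JSON printed by the perf dump command.
SAMPLE = '{"bluefs": {"db_total_bytes": 1000, "db_used_bytes": 250, "slow_used_bytes": 50}}'
summary = spillover_summary(json.loads(SAMPLE))
print(summary)
```

To use it on a live OSD, the sample string could be replaced by the saved output of the perf dump command shown above.
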
    >>
    >> Small files generally lead to a bigger RocksDB, especially when you
    >> use EC, but this depends on the actual number of files and their sizes.
    >>
    >> I hope this helps.
    >> Regards,
    >> Michael
    >>
    >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
    >>
    >>      Hi
    >>
    >>      Recently our ceph cluster (nautilus) has been experiencing bluefs
    >>      spillovers, on just 2 osd's, and I disabled the warning for these osds.
    >>      (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
    >>
    >>      I'm wondering what causes this and how this can be prevented.
    >>
    >>      As I understand it, the rocksdb for the OSD needs to store more than
    >>      fits on the NVME logical volume (123G for 12T OSD). A way to fix it
    >>      could be to increase the logical volume on the nvme (if there was
    >>      space on the nvme, which there isn't at the moment).
    >>
    >>      This is the current size of the cluster and how much is free:
    >>
    >>      [root@cephmon1 ~]# ceph df
    >>      RAW STORAGE:
    >>           CLASS           SIZE       AVAIL        USED    RAW USED   %RAW USED
    >>           hdd          1.8 PiB     842 TiB     974 TiB     974 TiB       53.63
    >>           TOTAL        1.8 PiB     842 TiB     974 TiB     974 TiB       53.63
    >>
    >>      POOLS:
    >>           POOL                    ID      STORED     OBJECTS        USED     %USED     MAX AVAIL
    >>           cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
    >>           cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
    >>           cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
    >>           cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
    >>           rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
    >>           .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
    >>           default.rgw.control     16         0 B           8         0 B         0       167 TiB
    >>           default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
    >>           default.rgw.log         18         0 B         207         0 B         0       167 TiB
    >>           cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
    >>
    >>      The amount used can still grow a bit before we need to add nodes, but
    >>      apparently we are running into the limits of our rocksdb partitions.
    >>
    >>      Did we choose a parameter (e.g. minimal object size) too small, so we
    >>      have too many objects on these spillover OSDs? Or is it that too many
    >>      small files are stored on the cephfs filesystems?
    >>
    >>      When we expand the cluster, we can choose larger nvme devices to allow
    >>      larger rocksdb partitions, but is that the right way to deal with this,
    >>      or should we adjust some parameters on the cluster that will reduce the
    >>      rocksdb size?
    >>
    >>      Cheers
    >>
    >>      /Simon
    >>      _______________________________________________
    >>      ceph-users mailing list -- ceph-users@xxxxxxx
    >>      To unsubscribe send an email to ceph-users-leave@xxxxxxx
    >>
