Re: BlueFS spillover detected, why, what?

Michael Bisig <michael.bisig@xxxxxxxxx> · Thu, 20 Aug 2020 08:16:22 +0000

Hi Simon

As far as I know, RocksDB only uses space on the NVMe partition in whole "levels". The level sizes are set to 300MB, 3GB, 30GB and 300GB, and any DB data beyond the largest level that fits will automatically end up on the slow device.
In your setup with 123GB per OSD, that means you only use 30GB of the fast device. The part of the DB that spills over this limit is offloaded to the HDD, which slows down requests and compactions.
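
To make the arithmetic concrete, here is a minimal Python sketch of the rule as I described it. The limits are simply the figures from this mail (taken as decimal units), not values read from Ceph or RocksDB, and the function names are made up for illustration:

```python
# Level limits as quoted in this mail (assumed decimal units).
# These are illustrative constants, not values queried from Ceph.
MB = 1000**2
GB = 1000**3
LEVEL_LIMITS = [300 * MB, 3 * GB, 30 * GB, 300 * GB]


def usable_db_bytes(partition_bytes, limits=LEVEL_LIMITS):
    """Return the largest limit that fits entirely on the partition, or 0."""
    fitting = [limit for limit in limits if limit <= partition_bytes]
    return max(fitting) if fitting else 0


def unused_db_bytes(partition_bytes, limits=LEVEL_LIMITS):
    """Return the partition space beyond the largest fitting limit."""
    return partition_bytes - usable_db_bytes(partition_bytes, limits)


if __name__ == "__main__":
    size = 123 * GB
    print(usable_db_bytes(size) // GB, "GB usable on the fast device")
    print(unused_db_bytes(size) // GB, "GB of the partition beyond that limit")
```

Under this simplified model, a 123GB partition behaves like a 30GB one, and the partition would have to reach 300GB before the next limit fits.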

You can check what your OSD currently consumes with:
  ceph daemon osd.X perf dump

The informative values are `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`. They change regularly because of the ongoing compactions, but the Prometheus mgr module exports them, so you can track them over time.
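
If you want to pull these three counters out of the perf dump by script, something like the following Python sketch could work. It assumes the dump is JSON and that the counters sit under a top-level `bluefs` section; please verify that against the output of your own OSDs. The function name and the sample numbers are invented for illustration:

```python
import json


def bluefs_usage(perf_dump_json):
    """Extract the BlueFS counters mentioned above from a perf dump JSON string.

    Assumes the counters live under a top-level "bluefs" key; missing
    counters are reported as 0 rather than raising an error.
    """
    perf = json.loads(perf_dump_json)
    bluefs = perf.get("bluefs", {})
    db_total = bluefs.get("db_total_bytes", 0)
    db_used = bluefs.get("db_used_bytes", 0)
    slow_used = bluefs.get("slow_used_bytes", 0)
    return {
        "db_total_bytes": db_total,
        "db_used_bytes": db_used,
        "slow_used_bytes": slow_used,
        # Fraction of the fast DB device that is in use.
        "db_used_ratio": db_used / db_total if db_total else 0.0,
        # Any bytes on the slow device indicate spillover.
        "spillover": slow_used > 0,
    }


# Invented sample data, only to show the expected shape of the input.
GIB = 1024**3
SAMPLE_DUMP = json.dumps({
    "bluefs": {
        "db_total_bytes": 123 * GIB,
        "db_used_bytes": 30 * GIB,
        "slow_used_bytes": 2 * GIB,
    }
})

sample_usage = bluefs_usage(SAMPLE_DUMP)
print(sample_usage)
```

In practice you would pass the text printed by `ceph daemon osd.X perf dump` to `bluefs_usage()` instead of the sample string.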

Small files generally lead to a bigger RocksDB, especially when you use EC, but this depends on the actual number of files and their sizes.

I hope this helps.
Regards,
Michael

On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:

    Hi

    Recently our ceph cluster (nautilus) has been experiencing bluefs spillovers
    on just 2 OSDs, and I disabled the warning for these OSDs.
    (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)

    I'm wondering what causes this and how this can be prevented.

    As I understand it the rocksdb for the OSD needs to store more than fits 
    on the NVME logical volume (123G for 12T OSD). A way to fix it could be 
    to increase the logical volume on the nvme (if there was space on the 
    nvme, which there isn't at the moment).

    This is the current size of the cluster and how much is free:

    [root@cephmon1 ~]# ceph df
    RAW STORAGE:
         CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
         hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
         TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63

    POOLS:
         POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
         cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
         cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
         cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
         cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
         rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
         .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
         default.rgw.control     16         0 B           8         0 B         0       167 TiB
         default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
         default.rgw.log         18         0 B         207         0 B         0       167 TiB
         cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB

    The amount used can still grow a bit before we need to add nodes, but 
    apparently we are running into the limits of our rocksdb partitions.

    Did we choose a parameter (e.g. minimal object size) too small, so we 
    have too many objects on these spillover OSDs? Or is it that too many 
    small files are stored on the cephfs filesystems?

    When we expand the cluster, we can choose larger nvme devices to allow 
    larger rocksdb partitions, but is that the right way to deal with this, 
    or should we adjust some parameters on the cluster that will reduce the 
    rocksdb size?

    Cheers

    /Simon
    _______________________________________________
    ceph-users mailing list -- ceph-users@xxxxxxx
    To unsubscribe send an email to ceph-users-leave@xxxxxxx
