Re: BlueFS spillover detected, why, what?

Igor Fedotov <ifedotov@xxxxxxx> · Thu, 20 Aug 2020 12:27:37 +0300

Hi Simon,

Starting with Nautilus v14.2.10, BlueStore is able to use the 'wasted' 
space on the DB volume.

See this PR: https://github.com/ceph/ceph/pull/29687

A nice overview of the overall BlueFS/RocksDB design can be found here:

https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

It also includes an overview of (as well as some additional concerns 
about) the changes brought by the above-mentioned PR.

Thanks,

Igor

On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
Hi Michael,

thanks for the explanation! So if I understand correctly, we waste 93 
GB per OSD on unused NVME space, because only 30GB is actually used...?

And to improve the space for rocksdb, we need to plan for 300GB per 
rocksdb partition in order to benefit from this advantage....

Reducing the number of small files is something we always ask of our 
users, but reality is what it is ;-)

I'll have to look into how I can get an informative view on these 
metrics... It's pretty overwhelming the amount of information coming 
out of the ceph cluster, even when you look only superficially...

Cheers,

/Simon

On 20/08/2020 10:16, Michael Bisig wrote:
Hi Simon

As far as I know, RocksDB only uses "leveled" space on the NVMe 
partition. The level limits are set to 300MB, 3GB, 30GB and 300GB. Any 
DB data that exceeds the largest limit fitting on the partition 
automatically ends up on the slow device.
In your setup, with 123GB per OSD, that means you only use 30GB of the 
fast device. The part of the DB that spills over this limit is 
offloaded to the HDD and, accordingly, slows down requests and 
compactions.
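
The arithmetic above can be sketched in a few lines of Python. This is 
only a rough illustration: it takes the level limits quoted in this 
message at face value and assumes that only the largest limit fitting 
entirely on the partition is used. The function name is hypothetical, 
and the real limits depend on the RocksDB settings of the cluster.

```python
# Level limits as quoted in this message, in GB (assumed, not read from a cluster).
LEVEL_LIMITS_GB = [0.3, 3, 30, 300]


def usable_db_space_gb(partition_gb, limits=LEVEL_LIMITS_GB):
    """Return the largest level limit that fits on the partition (0 if none fits)."""
    fitting = [limit for limit in limits if limit <= partition_gb]
    return max(fitting) if fitting else 0


partition_gb = 123  # DB partition size per OSD, from the original question
used_gb = usable_db_space_gb(partition_gb)
unused_gb = partition_gb - used_gb
print(f"usable: {used_gb} GB, unused: {unused_gb} GB")
# prints "usable: 30 GB, unused: 93 GB"
```

Under the same assumption, a partition would have to reach the next 
limit (300GB) before any additional fast space gets used.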

You can check what your OSD currently consumes with:
   ceph daemon osd.X perf dump

Informative values are `db_total_bytes`, `db_used_bytes` and 
`slow_used_bytes`. These values change regularly because of the ongoing 
compactions, but the Prometheus mgr module exports them so that you can 
track them over time.
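
As a minimal sketch of how these counters could be pulled out of the 
perf dump output, the Python below parses the JSON text and reports 
whether anything sits on the slow device. It assumes the counters are 
found under a `bluefs` section of the dump; the helper names and the 
sample numbers are made up for illustration.

```python
import json


def bluefs_usage(perf_dump_json):
    """Extract the three BlueFS counters named above from perf dump JSON text."""
    bluefs = json.loads(perf_dump_json).get("bluefs", {})
    keys = ("db_total_bytes", "db_used_bytes", "slow_used_bytes")
    return {key: bluefs.get(key, 0) for key in keys}


def has_spilled_over(usage):
    """Treat any DB data on the slow device as spillover."""
    return usage["slow_used_bytes"] > 0


# Made-up sample for illustration; real output contains many more sections.
sample = (
    '{"bluefs": {"db_total_bytes": 132070244352, '
    '"db_used_bytes": 32212254720, "slow_used_bytes": 1073741824}}'
)
usage = bluefs_usage(sample)
print(usage, has_spilled_over(usage))
```

In practice the JSON text would come from the output of the 
`ceph daemon osd.X perf dump` command shown above, run on the host of 
that OSD.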

Small files generally lead to a bigger RocksDB, especially when you 
use EC, but this depends on the actual number and sizes of the files.

I hope this helps.
Regards,
Michael

On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:

     Hi

     Recently our ceph cluster (nautilus) is experiencing bluefs spillovers,
     just 2 osd's and I disabled the warning for these osds.
     (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)

     I'm wondering what causes this and how this can be prevented.

     As I understand it the rocksdb for the OSD needs to store more than fits
     on the NVME logical volume (123G for 12T OSD). A way to fix it could be
     to increase the logical volume on the nvme (if there was space on the
     nvme, which there isn't at the moment).

     This is the current size of the cluster and how much is free:

     [root@cephmon1 ~]# ceph df
     RAW STORAGE:
          CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
          hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
          TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63

     POOLS:
          POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
          cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
          cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
          cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
          cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
          rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
          .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
          default.rgw.control     16         0 B           8         0 B         0       167 TiB
          default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
          default.rgw.log         18         0 B         207         0 B         0       167 TiB
          cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB

     The amount used can still grow a bit before we need to add nodes, but
     apparently we are running into the limits of our rocksdb partitions.

     Did we choose a parameter (e.g. minimal object size) too small, so we
     have too many objects on these spillover OSDs? Or is it that too many
     small files are stored on the cephfs filesystems?

     When we expand the cluster, we can choose larger nvme devices to allow
     larger rocksdb partitions, but is that the right way to deal with this,
     or should we adjust some parameters on the cluster that will reduce the
     rocksdb size?

     Cheers

     /Simon
     _______________________________________________
     ceph-users mailing list -- ceph-users@xxxxxxx
     To unsubscribe send an email to ceph-users-leave@xxxxxxx
