Hi Igor. Could you please tell us why this config is at LEVEL_DEV
(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As documented in Ceph, we can't use LEVEL_DEV settings in production
environments! Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
> Hi Simon,
>
> Starting with Nautilus v14.2.10, BlueStore is able to use the 'wasted'
> space at the DB volume.
>
> See this PR: https://github.com/ceph/ceph/pull/29687
>
> A nice overview of the overall BlueFS/RocksDB design can be found here:
>
> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>
> It also includes an overview of (as well as some additional concerns
> about) the changes brought by the above-mentioned PR.
>
> Thanks,
>
> Igor
>
> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> > Hi Michael,
> >
> > Thanks for the explanation! So if I understand correctly, we waste 93
> > GB per OSD on unused NVMe space, because only 30 GB is actually
> > used...?
> >
> > And to improve the space for RocksDB, we need to plan for 300 GB per
> > RocksDB partition in order to benefit from this advantage...
> >
> > Reducing the number of small files is something we always ask of our
> > users, but reality is what it is ;-)
> >
> > I'll have to look into how I can get an informative view on these
> > metrics... The amount of information coming out of the Ceph cluster is
> > pretty overwhelming, even when you only look at it superficially...
> >
> > Cheers,
> >
> > /Simon
> >
> > On 20/08/2020 10:16, Michael Bisig wrote:
> >> Hi Simon
> >>
> >> As far as I know, RocksDB only uses "leveled" space on the NVMe
> >> partition. The level sizes are 300 MB, 3 GB, 30 GB and 300 GB. Any
> >> DB space above such a limit will automatically end up on the slow
> >> devices. In your setup, where you have 123 GB per OSD, that means you
> >> only use 30 GB of the fast device. The part of the DB that spills
> >> over this limit will be offloaded to the HDD, which accordingly slows
> >> down requests and compactions.
> >>
> >> You can check what your OSD currently consumes with:
> >> ceph daemon osd.X perf dump
> >>
> >> Informative values are `db_total_bytes`, `db_used_bytes` and
> >> `slow_used_bytes`. These change regularly because of the ongoing
> >> compactions, but the Prometheus mgr module exports them so that you
> >> can track them over time.
> >>
> >> Small files generally lead to a bigger RocksDB, especially when you
> >> use EC, but this depends on the actual number and sizes of the files.
> >>
> >> I hope this helps.
> >> Regards,
> >> Michael
> >>
> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
> >>
> >>     Hi
> >>
> >>     Recently our Ceph cluster (Nautilus) has been experiencing BlueFS
> >>     spillovers, on just 2 OSDs, and I disabled the warning for these
> >>     OSDs:
> >>     (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
> >>
> >>     I'm wondering what causes this and how it can be prevented.
> >>
> >>     As I understand it, the RocksDB for the OSD needs to store more
> >>     than fits on the NVMe logical volume (123 GB for a 12 TB OSD). A
> >>     way to fix it could be to increase the logical volume on the NVMe
> >>     (if there was space on the NVMe, which there isn't at the moment).
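For reference, assuming jq is available on the host where osd.125 (the example OSD from this thread) runs, the counters Michael mentions above can be pulled out of the perf dump roughly like this; on Nautilus they should sit under the "bluefs" section:

    ceph daemon osd.125 perf dump | \
        jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'

    # A non-zero slow_used_bytes means part of the DB has already spilled
    # onto the slow (HDD) device. Cluster-wide, the matching health
    # warning is BLUEFS_SPILLOVER, unless bluestore_warn_on_bluefs_spillover
    # has been turned off for the affected OSDs as above.
    ceph health detail | grep -i spillover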
> >>     This is the current size of the cluster and how much is free:
> >>
> >>     [root@cephmon1 ~]# ceph df
> >>     RAW STORAGE:
> >>         CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
> >>         hdd       1.8 PiB     842 TiB     974 TiB     974 TiB          53.63
> >>         TOTAL     1.8 PiB     842 TiB     974 TiB     974 TiB          53.63
> >>
> >>     POOLS:
> >>         POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
> >>         cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
> >>         cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
> >>         cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
> >>         cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
> >>         rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
> >>         .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
> >>         default.rgw.control     16         0 B           8         0 B         0       167 TiB
> >>         default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
> >>         default.rgw.log         18         0 B         207         0 B         0       167 TiB
> >>         cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
> >>
> >>     The amount used can still grow a bit before we need to add nodes,
> >>     but apparently we are running into the limits of our RocksDB
> >>     partitions.
> >>
> >>     Did we choose a parameter (e.g. minimal object size) too small, so
> >>     we have too many objects on these spillover OSDs? Or is it that
> >>     too many small files are stored on the CephFS filesystems?
> >>
> >>     When we expand the cluster, we can choose larger NVMe devices to
> >>     allow larger RocksDB partitions, but is that the right way to deal
> >>     with this, or should we adjust some parameters on the cluster that
> >>     will reduce the RocksDB size?
> >>
> >>     Cheers
> >>
> >>     /Simon
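Tying the numbers in this thread together: assuming the 0.3/3/30/300 GB level progression Michael describes (i.e. default RocksDB settings, on a release without the change from the PR above), a rough estimate of how much of a DB partition is actually usable before spillover could look like this; the db_gb value is just Simon's 123 GB example:

    # Illustrative only, not an official Ceph calculation: find the largest
    # level target (in GB) that still fits entirely on the DB partition.
    # Anything beyond that target ends up on the slow device.
    db_gb=123
    awk -v db="$db_gb" 'BEGIN {
        n = split("0.3 3 30 300", lvl, " ");
        fit = 0;
        for (i = 1; i <= n; i++)
            if (lvl[i] + 0 <= db + 0) fit = lvl[i];
        printf "%s GB DB partition -> roughly %s GB usable before spillover\n", db, fit;
    }'
    # -> 123 GB DB partition -> roughly 30 GB usable before spillover

Which matches the figures above: a 123 GB partition effectively uses only the 30 GB level, and planning for roughly 300 GB per RocksDB partition is what would let the next level fit entirely on the fast device.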