Re: BlueFS spillover detected, why, what?

Seena Fallah <seenafallah@xxxxxxxxx> · Fri, 21 Aug 2020 02:21:52 +0430

OK, thanks. Also, as mentioned in the CloudFerro doc you shared, is it
not a good idea to change `write_buffer_size` for BlueStore's RocksDB to fit
our DB?

On Fri, Aug 21, 2020 at 1:46 AM Igor Fedotov <ifedotov@xxxxxxx> wrote:

> Honestly I don't have any perfect solution for now.
>
> If this is urgent, you are probably better off enabling the new DB
> space management feature.
>
> But please do that gradually: modify 1-2 OSDs at first and test
> them for some period (maybe a week or two).
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 5:36 PM, Seena Fallah wrote:
>
> So what do you suggest as a short-term solution? (I assume you won't
> backport it to Nautilus for at least about 6 months.)
>
> Changing the DB size is too expensive because I would have to buy new NVMe
> devices of double the size and also redeploy all my OSDs.
> Manual compaction will still have an impact on performance, and doing it
> for a month doesn't look very good!
>
> On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>
>> Correct.
>> On 8/20/2020 5:15 PM, Seena Fallah wrote:
>>
>> So you won't backport it to Nautilus until it has been the default in
>> master for a while?
>>
>> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>
>>> From a technical/developer's point of view I don't see any issues with
>>> tuning this option. But for now I wouldn't recommend enabling it in
>>> production, as it partially bypassed our regular development cycle. Being
>>> enabled by default in master for a while allows more developers to use/try
>>> a feature before release. This can be considered an additional
>>> implicit QA process. But as we just discovered, this hasn't happened.
>>>
>>> Hence you can definitely try it, but this exposes your cluster(s) to some
>>> risk, as with any new (and incompletely tested) feature...
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>>>
>>> Great, thanks.
>>>
>>> Is it safe to change it manually in ceph.conf until the next Nautilus
>>> release, or should I wait for the next Nautilus release for this change? I
>>> mean, has QA been run with this value for this config so that we can trust
>>> it and change it, or should we wait until the next Nautilus release, for
>>> which QA will have run with this value?
>>>
>>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>
>>>> Hi Seena,
>>>>
>>>> this parameter isn't intended to be adjusted in production environments;
>>>> the default behavior is supposed to cover all regular customers' needs.
>>>>
>>>> The issue, though, is that the current default setting is wrong. It should
>>>> be 'use_some_extra'. I'm going to fix that shortly...
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>>
>>>>
>>>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>>>
>>>> Hi Igor.
>>>>
>>>> Could you please tell me why this config option is at LEVEL_DEV (
>>>> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
>>>> According to the Ceph documentation, LEVEL_DEV options can't be used in
>>>> production environments!
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>>
>>>>> Hi Simon,
>>>>>
>>>>>
>>>>> starting with Nautilus v14.2.10, BlueStore is able to use 'wasted' space
>>>>> on the DB volume.
>>>>>
>>>>> see this PR: https://github.com/ceph/ceph/pull/29687
>>>>>
>>>>> A nice overview of the overall BlueFS/RocksDB design can be found here:
>>>>>
>>>>>
>>>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>>>>>
>>>>> Which also includes some overview (as well as additional concerns) for
>>>>> changes brought by the above-mentioned PR.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>>
>>>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>>>>> > Hi Michael,
>>>>> >
>>>>> > thanks for the explanation! So if I understand correctly, we waste 93
>>>>> > GB per OSD on unused NVME space, because only 30GB is actually used...?
>>>>> >
>>>>> > And to improve the space for rocksdb, we need to plan for 300GB per
>>>>> > rocksdb partition in order to benefit from this advantage....
>>>>> >
>>>>> > Reducing the number of small files is something we always ask of our
>>>>> > users, but reality is what it is ;-)
>>>>> >
>>>>> > I'll have to look into how I can get an informative view on these
>>>>> > metrics... It's pretty overwhelming the amount of information coming
>>>>> > out of the ceph cluster, even when you look only superficially...
>>>>> >
>>>>> > Cheers,
>>>>> >
>>>>> > /Simon
>>>>> >
>>>>> > On 20/08/2020 10:16, Michael Bisig wrote:
>>>>> >> Hi Simon
>>>>> >>
>>>>> >> As far as I know, RocksDB only uses "leveled" space on the NVME
>>>>> >> partition. The level sizes are set to 300MB, 3GB, 30GB and 300GB. Any
>>>>> >> DB space above such a limit automatically ends up on the slow device.
>>>>> >> In your setup, where you have 123GB per OSD, that means you only use
>>>>> >> 30GB of the fast device. The part of the DB that spills over this limit
>>>>> >> is offloaded to the HDD and, accordingly, slows down requests and
>>>>> >> compactions.
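
The level arithmetic described here can be sketched in a few lines of Python. This is a minimal sketch of the simplified model from this thread only (levels of 300MB, 3GB, 30GB and 300GB, with only the largest level that fits counted as usable). The function names are hypothetical, and real sizing also depends on WAL, L0 and compaction headroom, which are ignored here.

```python
# Simplified model taken from the thread: RocksDB level sizes of roughly
# 300 MB, 3 GB, 30 GB and 300 GB. Assumption: only the largest level that
# fits entirely on the fast partition counts as usable DB space.
LEVEL_SIZES_GB = [0.3, 3, 30, 300]


def usable_db_space_gb(partition_gb):
    """Return the largest level size (in GB) that fits in the partition."""
    fitting = [size for size in LEVEL_SIZES_GB if size <= partition_gb]
    return max(fitting) if fitting else 0.0


def wasted_space_gb(partition_gb):
    """Return the part of the partition that this model leaves unused."""
    return partition_gb - usable_db_space_gb(partition_gb)


if __name__ == "__main__":
    for partition in (123, 300):
        print(partition, usable_db_space_gb(partition), wasted_space_gb(partition))
```

For the 123GB partition discussed in this thread, the model gives 30GB usable and 93GB unused, which matches the numbers quoted in the replies.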
>>>>> >>
>>>>> >> You can check what your OSD currently consumes with:
>>>>> >>    ceph daemon osd.X perf dump
>>>>> >>
>>>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>>>>> >> `slow_used_bytes`. These change regularly because of the ongoing
>>>>> >> compactions, but the Prometheus mgr module exports these values so
>>>>> >> that you can track them.
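
As a rough illustration of how these three counters could be read, here is a minimal Python sketch. It assumes the JSON printed by `ceph daemon osd.X perf dump` has already been saved and parsed, and that the counters sit under a `bluefs` section; that layout is an assumption to verify against your own output. The function name and the sample numbers are made up for illustration.

```python
import json


def bluefs_spillover(perf_dump):
    """Summarize DB usage from a parsed perf dump.

    Assumption: the counters named in this thread live under the
    "bluefs" key of the perf dump JSON.
    """
    bluefs = perf_dump["bluefs"]
    db_total = bluefs["db_total_bytes"]
    db_used = bluefs["db_used_bytes"]
    slow_used = bluefs["slow_used_bytes"]
    return {
        # Fraction of the fast DB volume currently in use.
        "db_used_ratio": db_used / db_total if db_total else 0.0,
        "slow_used_bytes": slow_used,
        # Any DB bytes on the slow device indicate spillover.
        "spillover": slow_used > 0,
    }


if __name__ == "__main__":
    # Made-up sample values, not taken from a real cluster.
    sample = json.loads(
        '{"bluefs": {"db_total_bytes": 1000, '
        '"db_used_bytes": 250, "slow_used_bytes": 40}}'
    )
    print(bluefs_spillover(sample))
```

Because the values move with compactions, a single reading is only a snapshot; tracking them over time, for example through the Prometheus mgr module mentioned here, gives a more reliable picture.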
>>>>> >>
>>>>> >> Small files generally lead to a bigger RocksDB, especially when you
>>>>> >> use EC, but this depends on the actual number of files and their sizes.
>>>>> >>
>>>>> >> I hope this helps.
>>>>> >> Regards,
>>>>> >> Michael
>>>>> >>
>>>>> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
>>>>> >>
>>>>> >>      Hi
>>>>> >>
>>>>> >>      Recently our ceph cluster (nautilus) is experiencing bluefs
>>>>> >>      spillovers, just 2 osd's and I disabled the warning for these osds.
>>>>> >>      (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>>>>> >>
>>>>> >>      I'm wondering what causes this and how this can be prevented.
>>>>> >>
>>>>> >>      As I understand it the rocksdb for the OSD needs to store more than
>>>>> >>      fits on the NVME logical volume (123G for 12T OSD). A way to fix it
>>>>> >>      could be to increase the logical volume on the nvme (if there was
>>>>> >>      space on the nvme, which there isn't at the moment).
>>>>> >>
>>>>> >>      This is the current size of the cluster and how much is free:
>>>>> >>
>>>>> >>      [root@cephmon1 ~]# ceph df
>>>>> >>      RAW STORAGE:
>>>>> >>          CLASS           SIZE       AVAIL        USED    RAW USED    %RAW USED
>>>>> >>          hdd          1.8 PiB     842 TiB     974 TiB     974 TiB        53.63
>>>>> >>          TOTAL        1.8 PiB     842 TiB     974 TiB     974 TiB        53.63
>>>>> >>
>>>>> >>      POOLS:
>>>>> >>          POOL                    ID      STORED     OBJECTS        USED     %USED     MAX AVAIL
>>>>> >>          cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
>>>>> >>          cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
>>>>> >>          cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
>>>>> >>          cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
>>>>> >>          rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
>>>>> >>          .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
>>>>> >>          default.rgw.control     16         0 B           8         0 B         0       167 TiB
>>>>> >>          default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
>>>>> >>          default.rgw.log         18         0 B         207         0 B         0       167 TiB
>>>>> >>          cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
>>>>> >>
>>>>> >>      The amount used can still grow a bit before we need to add nodes, but
>>>>> >>      apparently we are running into the limits of our rocksdb partitions.
>>>>> >>
>>>>> >>      Did we choose a parameter (e.g. minimal object size) too small, so we
>>>>> >>      have too many objects on these spillover OSDs? Or is it that too many
>>>>> >>      small files are stored on the cephfs filesystems?
>>>>> >>
>>>>> >>      When we expand the cluster, we can choose larger nvme devices to allow
>>>>> >>      larger rocksdb partitions, but is that the right way to deal with this,
>>>>> >>      or should we adjust some parameters on the cluster that will reduce the
>>>>> >>      rocksdb size?
>>>>> >>
>>>>> >>      Cheers
>>>>> >>
>>>>> >>      /Simon
>>>>> >>      _______________________________________________
>>>>> >>      ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> >>      To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>> >>
>>>>>
>>>>