I can't say anything about "write_buffer_size" tuning - I've never tried that.
But I presume the relevant params are "max_bytes_for_level_base" and
"max_bytes_for_level_multiplier", which should rather be tuned to modify
RocksDB level granularity.
I have no idea how safe that is in a production environment, though.
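Purely for illustration (the numbers below are made up, not recommendations):
BlueStore passes RocksDB tunables through the single bluestore_rocksdb_options
string, so level sizing would be changed along these lines. Note that setting
the option replaces the whole default string, so the existing defaults have to
be carried over:

    # keep the existing defaults in the string and only override/append the two
    # level-sizing options; 512 MiB base and multiplier 8 are made-up values
    ceph config set osd bluestore_rocksdb_options '<existing defaults>,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8'
    # the string is read when RocksDB is opened, so the OSDs need a restart

Again, I can't vouch for the effect of non-default level sizing on a
production cluster.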
Thanks,
Igor
On 8/21/2020 12:51 AM, Seena Fallah wrote:
OK, thanks. Also, as you mentioned in the doc you shared from
CloudFerro - is it not a good idea to change `write_buffer_size` for the
BlueStore RocksDB so that it fits our DB?
On Fri, Aug 21, 2020 at 1:46 AM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Honestly, I don't have a perfect solution for now.
If this is urgent, you'd probably better proceed with enabling the
new DB space management feature.
But please do that gradually: modify 1-2 OSDs at the first stage
and test them for some period (maybe a week or two).
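As a rough sketch of what that staged enablement could look like - assuming
the feature is switched via the bluestore_volume_selection_policy option from
the PR referenced below (please verify the exact option name and accepted
values for your release before touching production):

    # enable the new DB space management on a single OSD only
    ceph config set osd.0 bluestore_volume_selection_policy use_some_extra
    systemctl restart ceph-osd@0        # the policy is picked up at OSD startup
    # then watch DB usage / spillover on that OSD for a week or two
    ceph daemon osd.0 perf dump | grep -E 'db_used_bytes|slow_used_bytes'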
Thanks,
Igor
On 8/20/2020 5:36 PM, Seena Fallah wrote:
So what do you suggest as a short-term solution? (I assume you
won't backport it to Nautilus for at least about 6 months.)
Changing the DB size is too expensive, because I would have to buy new NVMe
devices with double the size and also redeploy all my OSDs.
Manual compaction will still have an impact on performance, and
doing it for a month doesn't look very good!
On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Correct.
On 8/20/2020 5:15 PM, Seena Fallah wrote:
So you won't backport it to Nautilus until it has been the
default in master for a while?
On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
From a technical/developer's point of view, I don't see any
issues with tuning this option. But for now I wouldn't
recommend enabling it in production, as it has partially
bypassed our regular development cycle. Being enabled by
default in master for a while allows more developers to
use/try the feature before release. This can be
considered an additional implicit QA process. But as
we just discovered, this hasn't happened.
Hence you can definitely try it, but this exposes your
cluster(s) to some risk, as with any new (and incompletely
tested) feature...
Thanks,
Igor
On 8/20/2020 4:06 PM, Seena Fallah wrote:
Great, thanks.
Is it safe to change it manually in ceph.conf
until the next Nautilus release, or should I wait for that
release? I mean, has QA been run with this value for this
config, so that we can trust it and change it now, or should
we wait until the next Nautilus release where QA has covered it?
On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Hi Seena,
this parameter isn't intended to be adjusted in
production environments - the default behavior is supposed
to cover all regular customers' needs.
The issue, though, is that the current default setting is
invalid. It should be 'use_some_extra'. Going to fix
that shortly...
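For anyone following along: to see which value a running OSD is actually
using (assuming the parameter in question is bluestore_volume_selection_policy
introduced by the PR below), something like this should do:

    # query the effective value via the OSD's admin socket (run on the OSD host)
    ceph daemon osd.0 config get bluestore_volume_selection_policy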
Thanks,
Igor
On 8/20/2020 1:44 PM, Seena Fallah wrote:
Hi Igor.
Could you please tell me why this config is at
LEVEL_DEV
(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As documented in Ceph, we can't use LEVEL_DEV options
in production environments!
Thanks
On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Hi Simon,
starting with Nautilus v14.2.10, BlueStore is able to
use the 'wasted' space on the DB volume.
See this PR:
https://github.com/ceph/ceph/pull/29687
A nice overview of the overall BlueFS/RocksDB
design can be found here:
https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
It also includes an overview of (as well as
additional concerns about) the changes brought by the above-mentioned PR.
Thanks,
Igor
On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand correctly, we waste 93
> GB per OSD on unused NVMe space, because only 30 GB is actually used...?
>
> And to improve the space for RocksDB, we need to plan for 300 GB per
> RocksDB partition in order to benefit from this advantage...
>
> Reducing the number of small files is something we always ask of our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an informative view of these
> metrics... The amount of information coming out of the Ceph cluster is
> pretty overwhelming, even when you only look at it superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as I know, RocksDB only uses "leveled" space on the NVMe
>> partition. The level sizes are 300 MB, 3 GB, 30 GB and 300 GB. Any
>> DB space above such a limit will automatically end up on the slow device.
>> In your setup, where you have 123 GB per OSD, that means you only use
>> 30 GB of the fast device. The DB that spills over this limit will be
>> offloaded to the HDD and, accordingly, it slows down requests and
>> compactions.
>>
>> You can check what your OSD currently consumes with:
>> ceph daemon osd.X perf dump
>>
>> Informative values are `db_total_bytes`, `db_used_bytes` and
>> `slow_used_bytes`. These change regularly because of the ongoing
>> compactions, but the Prometheus mgr module exports them so that
>> you can track them over time.
>>
>> Small files generally lead to a bigger RocksDB, especially when you
>> use EC, but this depends on the actual amount and file sizes.
>>
>> I hope this helps.
>> Regards,
>> Michael
>>
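To make the level arithmetic above concrete: with those level targets, roughly
0.3 GB + 3 GB + 30 GB, i.e. about 33 GB of DB, can sit entirely on the fast
device; the next level would need ~300 GB, which a 123 GB partition cannot
hold, so roughly 90 GB of the NVMe stays unused and anything beyond ~33 GB
spills to the HDD. A quick way to pull the counters Michael mentions from one
OSD (the jq path is an assumption - the counters live in the "bluefs" section
of the perf dump):

    # run on the OSD's host; requires jq
    ceph daemon osd.125 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'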
>> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
>>
>> Hi
>>
>> Recently our Ceph cluster (Nautilus) has been experiencing BlueFS
>> spillovers on just 2 OSDs, and I disabled the warning for those OSDs
>> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false).
>>
>> I'm wondering what causes this and how it can be prevented.
>>
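By the way, rather than silencing the warning per OSD, the affected OSDs can
be listed directly; on Nautilus the spillover details should show up in the
health output, so something along these lines works as a quick check:

    # list which OSDs are currently spilling over and by how much
    ceph health detail | grep -i spillover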
>> As I understand it, the RocksDB for the OSD needs to store more than fits
>> on the NVMe logical volume (123 G for a 12 T OSD). A way to fix it could be
>> to increase the logical volume on the NVMe (if there were space on the
>> NVMe, which there isn't at the moment).
>>
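If there were free space on the NVMe, growing the DB volume can be done in
place. A minimal sketch, assuming an LVM-backed block.db and OSD 125 - the
VG/LV names and the size are purely illustrative, check your own layout first:

    # stop the OSD, grow the LV backing block.db, then let BlueFS claim the space
    systemctl stop ceph-osd@125
    lvextend -L +100G /dev/ceph-db-vg/osd-125-db     # hypothetical VG/LV names
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-125
    systemctl start ceph-osd@125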
>> This is the current size of the cluster and how much is free:
>>
>> [root@cephmon1 ~]# ceph df
>> RAW STORAGE:
>>     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
>>     hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
>>     TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
>>
>> POOLS:
>>     POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
>>     cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
>>     cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
>>     cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
>>     cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
>>     rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
>>     .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
>>     default.rgw.control     16         0 B           8         0 B         0       167 TiB
>>     default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
>>     default.rgw.log         18         0 B         207         0 B         0       167 TiB
>>     cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
>>
>> The amount used can still grow a bit before we need to add nodes, but
>> apparently we are running into the limits of our RocksDB partitions.
>>
>> Did we choose a parameter (e.g. minimal object size) too small, so we
>> have too many objects on these spillover OSDs? Or is it that too many
>> small files are stored on the CephFS filesystems?
>>
>> When we expand the cluster, we can choose larger NVMe devices to allow
>> larger RocksDB partitions, but is that the right way to deal with this,
>> or should we adjust some parameters on the cluster that will reduce the
>> RocksDB size?
>>
>> Cheers
>>
>> /Simon
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx