Re: BlueFS spillover detected, why, what?

Igor Fedotov <ifedotov@xxxxxxx> · Thu, 20 Aug 2020 17:22:50 +0300

Correct.

On 8/20/2020 5:15 PM, Seena Fallah wrote:
So you won't backport it to Nautilus until it has been the default in
master for a while?

On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

    From a technical/developer's point of view I don't see any issues
    with tuning this option. But for now I wouldn't recommend enabling
    it in production, as it partially bypassed our regular development
    cycle. Being enabled by default in master for a while allows more
    developers to use/try the feature before release. This can be
    considered an additional implicit QA process. But as we just
    discovered, this hasn't happened.

    Hence you can definitely try it, but this exposes your cluster(s)
    to some risk, as with any new (and incompletely tested) feature.

    Thanks,

    Igor

    On 8/20/2020 4:06 PM, Seena Fallah wrote:
    Great, thanks.

    Is it safe to change it manually in ceph.conf until the next Nautilus
    release, or should I wait for that release to bring this change? I
    mean, has QA been run with this value for this option so that we can
    trust it, or should we wait for the next Nautilus release, where QA
    will have run with this value?

    On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

        Hi Seena,

        this parameter isn't intended to be adjusted in production
        environments - the default behavior is supposed to cover all
        regular customers' needs.

        The issue though is that the default setting is invalid. It
        should be 'use_some_extra'. Gonna fix that shortly...

        Thanks,

        Igor

        On 8/20/2020 1:44 PM, Seena Fallah wrote:
        Hi Igor.

        Could you please tell me why this option is at LEVEL_DEV
        (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
        According to the Ceph documentation, LEVEL_DEV options should
        not be used in production environments!

        Thanks

        On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

            Hi Simon,

            starting with Nautilus v14.2.10, BlueStore is able to use
            'wasted' space on the DB volume.

            see this PR: https://github.com/ceph/ceph/pull/29687

            A nice overview of the overall BlueFS/RocksDB design can
            be found here:

            https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

            It also includes some overview of (as well as additional
            concerns about) the changes brought by the above-mentioned PR.

            Thanks,

            Igor

            On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
            > Hi Michael,
            >
            > thanks for the explanation! So if I understand correctly, we waste 93
            > GB per OSD on unused NVME space, because only 30GB is actually used...?
            >
            > And to improve the space for rocksdb, we need to plan for 300GB per
            > rocksdb partition in order to benefit from this advantage....
            >
            > Reducing the number of small files is something we always ask of our
            > users, but reality is what it is ;-)
            >
            > I'll have to look into how I can get an informative view on these
            > metrics... It's pretty overwhelming the amount of information coming
            > out of the ceph cluster, even when you look only superficially...
            >
            > Cheers,
            >
            > /Simon
            >
            > On 20/08/2020 10:16, Michael Bisig wrote:
            >> Hi Simon
            >>
            >> As far as I know, RocksDB only uses "leveled" space on the NVME
            >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Any
            >> DB data above such a limit will automatically end up on slow devices.
            >> In your setup where you have 123GB per OSD that means you only use
            >> 30GB of the fast device. The DB data which spills over this limit will
            >> be offloaded to the HDD and accordingly, it slows down requests and
            >> compactions.
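
The arithmetic in the paragraph above can be sketched as follows. This is only a minimal illustration of the simplified model described in this thread (a DB partition is effectively used up to the largest "level" limit that fits), not of how BlueFS or RocksDB actually place data. The function names are hypothetical, and the level limits are the values quoted above, which in practice depend on the RocksDB settings.

```python
# Simplified model from the thread: only a whole RocksDB "level" is kept
# on the fast device. The limits below are the values quoted in the
# thread (assumption: decimal units), not values read from a cluster.
MB = 10**6
GB = 10**9
LEVEL_LIMITS = [300 * MB, 3 * GB, 30 * GB, 300 * GB]


def usable_fast_bytes(partition_bytes):
    """Return the largest level limit that fits in the DB partition (0 if none)."""
    fitting = [limit for limit in LEVEL_LIMITS if limit <= partition_bytes]
    return max(fitting, default=0)


def wasted_fast_bytes(partition_bytes):
    """Return the part of the DB partition that this model leaves unused."""
    return partition_bytes - usable_fast_bytes(partition_bytes)


if __name__ == "__main__":
    size = 123 * GB  # DB partition size per OSD mentioned in the thread
    print(usable_fast_bytes(size) // GB, "GB usable")
    print(wasted_fast_bytes(size) // GB, "GB unused")
```

Under this model a 123GB partition behaves like a 30GB one, which is where the 93GB of unused NVME space per OSD mentioned elsewhere in the thread comes from.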
            >>
            >> You can check what your OSD currently consumes with:
            >>    ceph daemon osd.X perf dump
            >>
            >> Informative values are `db_total_bytes`, `db_used_bytes` and
            >> `slow_used_bytes`. These change regularly because of the ongoing
            >> compactions, but the Prometheus mgr module exports these values so
            >> that you can track them.
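
As a rough sketch of how these counters could be read, the snippet below summarises a parsed `perf dump` document. It assumes the three counters sit under a `bluefs` section of the JSON output; the function name and the sample numbers are hypothetical and are not taken from any cluster in this thread.

```python
import json


def bluefs_usage(perf_dump):
    """Summarise BlueFS DB usage from a parsed `perf dump` document.

    Assumption: the counters live under the "bluefs" key.
    """
    bluefs = perf_dump["bluefs"]
    db_total = bluefs["db_total_bytes"]
    db_used = bluefs["db_used_bytes"]
    slow_used = bluefs["slow_used_bytes"]
    return {
        # Fraction of the fast DB volume currently in use.
        "db_used_fraction": db_used / db_total if db_total else 0.0,
        # Bytes of DB data that ended up on the slow device.
        "slow_used_bytes": slow_used,
        # Any DB data on the slow device counts as spillover here.
        "spillover": slow_used > 0,
    }


# Hypothetical sample document, for illustration only.
SAMPLE = json.loads(
    '{"bluefs": {"db_total_bytes": 132070244352,'
    ' "db_used_bytes": 31138512896,'
    ' "slow_used_bytes": 5368709120}}'
)

if __name__ == "__main__":
    print(bluefs_usage(SAMPLE))
```

In practice the input could come from saving the command output to a file (for example `ceph daemon osd.X perf dump > dump.json`) and loading it with `json.load`.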
            >>
            >> Small files generally lead to a bigger RocksDB, especially when you
            >> use EC, but this depends on the actual number and size of the files.
            >>
            >> I hope this helps.
            >> Regards,
            >> Michael
            >>
            >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
            >>
            >>      Hi
            >>
            >>      Recently our ceph cluster (nautilus) is experiencing bluefs spillovers,
            >>      just 2 osd's and I disabled the warning for these osds.
            >>      (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
            >>
            >>      I'm wondering what causes this and how this can be prevented.
            >>
            >>      As I understand it the rocksdb for the OSD needs to store more than fits
            >>      on the NVME logical volume (123G for 12T OSD). A way to fix it could be
            >>      to increase the logical volume on the nvme (if there was space on the
            >>      nvme, which there isn't at the moment).
            >>
            >>      This is the current size of the cluster and how much is free:
            >>
            >>      [root@cephmon1 ~]# ceph df
            >>      RAW STORAGE:
            >>           CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
            >>           hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
            >>           TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
            >>
            >>      POOLS:
            >>           POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
            >>           cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
            >>           cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
            >>           cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
            >>           cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
            >>           rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
            >>           .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
            >>           default.rgw.control     16         0 B           8         0 B         0       167 TiB
            >>           default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
            >>           default.rgw.log         18         0 B         207         0 B         0       167 TiB
            >>           cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
            >>
            >>      The amount used can still grow a bit before we need to add nodes, but
            >>      apparently we are running into the limits of our rocksdb partitions.
            >>
            >>      Did we choose a parameter (e.g. minimal object size) too small, so we
            >>      have too many objects on these spillover OSDs? Or is it that too many
            >>      small files are stored on the cephfs filesystems?
            >>
            >>      When we expand the cluster, we can choose larger nvme devices to allow
            >>      larger rocksdb partitions, but is that the right way to deal with this,
            >>      or should we adjust some parameters on the cluster that will reduce the
            >>      rocksdb size?
            >>
            >>      Cheers
            >>
            >>      /Simon
            >> _______________________________________________
            >>      ceph-users mailing list -- ceph-users@xxxxxxx
            >>      To unsubscribe send an email to ceph-users-leave@xxxxxxx
            >>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx