Re: BlueFS spillover detected, why, what?

Honestly, I don't have a perfect solution for now.

If this is urgent, you are probably better off enabling the new DB space management feature.

But please do that gradually: modify 1-2 OSDs at the first stage and test them for some period (maybe a week or two).
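
For example, trying it on a single OSD could look roughly like this (a minimal sketch only - it assumes the option added by the PR linked below is bluestore_volume_selection_policy and that the OSD has to be restarted to pick it up; please verify both against your 14.2.10+ build, and osd.123 is just an example id):

    # enable the new space management policy on one test OSD only
    ceph config set osd.123 bluestore_volume_selection_policy use_some_extra
    systemctl restart ceph-osd@123

Then watch that OSD for a week or two before touching the rest.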


Thanks,

Igor


On 8/20/2020 5:36 PM, Seena Fallah wrote:
So what do you suggest as a short-term solution? (I think you won't backport it to Nautilus for at least about 6 months.)

Changing the DB size is too expensive, because I would have to buy new NVMe devices with double the size and also redeploy all my OSDs. Manual compaction will still have an impact on performance, and doing it for a month doesn't look very good!

On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

    Correct.

    On 8/20/2020 5:15 PM, Seena Fallah wrote:
    So you won't backport it to Nautilus until it has been the default in
    master for a while?

    On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

        From a technical/developer's point of view I don't see any
        issues with tuning this option. But for now I wouldn't
        recommend enabling it in production, as that partially bypasses
        our regular development cycle. Being enabled in master for a
        while by default allows more developers to use/try the feature
        before release. This can be considered an additional
        implicit QA process. But as we just discovered, that hasn't
        happened.

        Hence you can definitely try it, but this exposes your
        cluster(s) to some risk, as with any new (and incompletely
        tested) feature...


        Thanks,

        Igor


        On 8/20/2020 4:06 PM, Seena Fallah wrote:
        Great, thanks.

        Is it safe to change it manually in ceph.conf before the next
        Nautilus release, or should I wait for that release? In other
        words, has QA already run with this value for this config, so
        that we can trust it and change it, or should we wait until the
        next Nautilus release where QA has run with this value?
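
        (For reference, the manual ceph.conf change in question would
        look roughly like this - a minimal sketch, assuming the option
        introduced by the PR below is bluestore_volume_selection_policy:

            [osd]
            # example only; OSDs need a restart to pick this up
            bluestore_volume_selection_policy = use_some_extra
        )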

        On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

            Hi Seena,

            this parameter isn't intended to be adjusted in
            production environments - the assumption is that the default
            behavior covers all regular customers' needs.

            The issue, though, is that the default setting is invalid. It
            should be 'use_some_extra'. Going to fix that shortly...
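
            If you want to see what a running OSD currently uses,
            something like this should work (a sketch, assuming the
            option name is bluestore_volume_selection_policy; adjust
            the OSD id):

                ceph daemon osd.0 config get bluestore_volume_selection_policy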


            Thanks,

            Igor




            On 8/20/2020 1:44 PM, Seena Fallah wrote:
            Hi Igor.

            Could you please tell me why this config is at LEVEL_DEV
            (https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
            As documented in Ceph, we can't use LEVEL_DEV options in
            production environments!

            Thanks

            On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

                Hi Simon,


                starting with Nautilus v14.2.10, BlueStore is able to use
                the 'wasted' space at the DB volume.

                see this PR: https://github.com/ceph/ceph/pull/29687

                A nice overview of the overall BlueFS/RocksDB design
                can be found here:

                https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

                It also includes an overview of (as well as some
                additional concerns about) the changes brought by the
                above-mentioned PR.


                Thanks,

                Igor


                On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
                > Hi Michael,
                >
                > thanks for the explanation! So if I understand correctly,
                > we waste 93 GB per OSD on unused NVMe space, because only
                > 30 GB is actually used...?
                >
                > And to improve the space for rocksdb, we need to plan for
                > 300 GB per rocksdb partition in order to benefit from this
                > advantage....
                >
                > Reducing the number of small files is something we always
                > ask of our users, but reality is what it is ;-)
                >
                > I'll have to look into how I can get an informative view
                > on these metrics... It's pretty overwhelming, the amount
                > of information coming out of the ceph cluster, even when
                > you look only superficially...
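                >
                > For instance, since the Prometheus mgr module exports the
                > perf counters (as mentioned below), something like the
                > following PromQL might do as a first view - a rough
                > sketch, and the metric names are an assumption, so check
                > what your exporter actually publishes:
                >
                >     ceph_bluefs_slow_used_bytes > 0
                >     ceph_bluefs_db_used_bytes / ceph_bluefs_db_total_bytes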
                >
                > Cheers,
                >
                > /Simon
                >
                > On 20/08/2020 10:16, Michael Bisig wrote:
                >> Hi Simon
                >>
                >> As far as I know, RocksDB only uses "leveled" space on the
                >> NVMe partition. The level sizes are 300 MB, 3 GB, 30 GB
                >> and 300 GB. Any DB space above such a limit will
                >> automatically end up on the slow device. In your setup,
                >> where you have 123 GB per OSD, that means you only use
                >> 30 GB of the fast device. The DB that spills over this
                >> limit will be offloaded to the HDD, and accordingly it
                >> slows down requests and compactions.
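                >>
                >> Rough arithmetic with those level sizes: keeping the first
                >> three levels on the fast device needs about
                >> 300 MB + 3 GB + 30 GB ≈ 33 GB, which a 123 GB partition
                >> easily holds, but adding the next level would need roughly
                >> 333 GB, which it cannot - so effectively only the 30 GB
                >> level boundary is usable.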
                >>
                >> You can check what your OSD currently consumes with:
                >>    ceph daemon osd.X perf dump
                >>
                >> Informative values are `db_total_bytes`, `db_used_bytes`
                >> and `slow_used_bytes`. These change regularly because of
                >> the ongoing compactions, but the Prometheus mgr module
                >> exports these values so that you can track them.
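                >>
                >> For example, to pull just those counters out of the bluefs
                >> section (a minimal sketch, assuming jq is installed;
                >> adjust the OSD id):
                >>
                >>    ceph daemon osd.125 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'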
                >>
                >> Small files generally lead to a bigger RocksDB, especially
                >> when you use EC, but this depends on the actual number of
                >> files and their sizes.
                >>
                >> I hope this helps.
                >> Regards,
                >> Michael
                >>
                >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
                >>
                >>      Hi
                >>
                >>      Recently our ceph cluster (Nautilus) has been
                >>      experiencing BlueFS spillovers, on just 2 OSDs, and I
                >>      disabled the warning for these OSDs:
                >>      (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
                >>
                >>      I'm wondering what causes this and how it can be
                >>      prevented.
                >>
                >>      As I understand it, the RocksDB for the OSD needs to
                >>      store more than fits on the NVMe logical volume
                >>      (123 GB for a 12 TB OSD). A way to fix it could be to
                >>      increase the logical volume on the NVMe (if there were
                >>      space on the NVMe, which there isn't at the moment).
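                >>
                >>      If there were free space, the resize would look
                >>      roughly like this (a minimal sketch, assuming
                >>      LVM-backed DB volumes; the VG/LV names and OSD id are
                >>      made up):
                >>
                >>          systemctl stop ceph-osd@125
                >>          lvextend -L +100G /dev/ceph-db-vg/db-125
                >>          ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-125
                >>          systemctl start ceph-osd@125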
                >>
                >>      This is the current size of the cluster and how much is free:
                >>
                >>      [root@cephmon1 ~]# ceph df
                >>      RAW STORAGE:
                >>          CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
                >>          hdd       1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
                >>          TOTAL     1.8 PiB     842 TiB     974 TiB      974 TiB         53.63
                >>
                >>      POOLS:
                >>          POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
                >>          cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
                >>          cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
                >>          cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
                >>          cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
                >>          rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
                >>          .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
                >>          default.rgw.control     16         0 B           8         0 B         0       167 TiB
                >>          default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
                >>          default.rgw.log         18         0 B         207         0 B         0       167 TiB
                >>          cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
                >>
                >>      The amount used can still grow a bit before we need to
                >>      add nodes, but apparently we are already running into
                >>      the limits of our RocksDB partitions.
                >>
                >>      Did we choose a parameter (e.g. minimal object size)
                >>      too small, so that we have too many objects on these
                >>      spillover OSDs? Or is it that too many small files are
                >>      stored on the CephFS filesystems?
                >>
                >>      When we expand the cluster, we can choose larger NVMe
                >>      devices to allow larger RocksDB partitions, but is that
                >>      the right way to deal with this, or should we adjust
                >>      some parameters on the cluster that will reduce the
                >>      RocksDB size?
                >>
                >>      Cheers
                >>
                >>      /Simon

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



