Ok thanks. Also, as mentioned in the doc you shared from cloudferro, is it
not a good idea to change `write_buffer_size` for bluestore rocksdb to fit
our db?

On Fri, Aug 21, 2020 at 1:46 AM Igor Fedotov <ifedotov@xxxxxxx> wrote:

> Honestly I don't have any perfect solution for now.
>
> If this is urgent you are probably better off proceeding with enabling
> the new DB space management feature.
>
> But please do that gradually: modify 1-2 OSDs at the first stage and
> test them for some period (maybe a week or two).
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 5:36 PM, Seena Fallah wrote:
>
> So what do you suggest for a short-term solution? (I think you won't
> backport it to nautilus for at least about 6 months.)
>
> Changing the db size is too expensive because I would have to buy new
> NVME devices with double the size and also redeploy all my OSDs.
> Manual compaction will still have an impact on performance, and doing it
> for a month doesn't look very good!
>
> On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>
>> Correct.
>> On 8/20/2020 5:15 PM, Seena Fallah wrote:
>>
>> So you won't backport it to nautilus until it has been the default in
>> master for a while?
>>
>> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>
>>> From a technical/developer's point of view I don't see any issues with
>>> tuning this option. But for now I wouldn't recommend enabling it in
>>> production, as it partially bypassed our regular development cycle.
>>> Being enabled in master by default for a while allows more developers
>>> to use/try the feature before release. This can be considered an
>>> additional implicit QA process. But as we just discovered, this hasn't
>>> happened.
>>>
>>> Hence you can definitely try it, but this exposes your cluster(s) to
>>> some risk, as with any new (and incompletely tested) feature....
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>>>
>>> Great, thanks.
>>>
>>> Is it safe to change it manually in ceph.conf until the next nautilus
>>> release, or should I wait for the next nautilus release for this
>>> change? I mean, does QA run on this value for this config, so that we
>>> can trust it and change it, or should we wait until the next nautilus
>>> release, for which QA has run on this value?
>>>
>>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>
>>>> Hi Seena,
>>>>
>>>> this parameter isn't intended to be adjusted in production
>>>> environments - the default behavior is supposed to cover all regular
>>>> customers' needs.
>>>>
>>>> The issue though is that the default setting is invalid. It should be
>>>> 'use_some_extra'. Gonna fix that shortly...
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>>>
>>>> Hi Igor.
>>>>
>>>> Could you please tell me why this config is in LEVEL_DEV (
>>>> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
>>>> As it is documented in Ceph, we can't use LEVEL_DEV options in
>>>> production environments!
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>>>
>>>>> Hi Simon,
>>>>>
>>>>>
>>>>> starting with Nautilus v14.2.10, Bluestore is able to use 'wasted'
>>>>> space on the DB volume.
>>>>>
>>>>> See this PR: https://github.com/ceph/ceph/pull/29687
>>>>>
>>>>> A nice overview of the overall BlueFS/RocksDB design can be found
>>>>> here:
>>>>>
>>>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>>>>>
>>>>> It also includes some overview (as well as additional concerns) of
>>>>> the changes brought by the above-mentioned PR.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>>
>>>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>>>>> > Hi Michael,
>>>>> >
>>>>> > thanks for the explanation!
>>>>> > So if I understand correctly, we waste 93 GB per OSD on unused
>>>>> > NVME space, because only 30GB is actually used...?
>>>>> >
>>>>> > And to improve the space for rocksdb, we need to plan for 300GB per
>>>>> > rocksdb partition in order to benefit from this advantage....
>>>>> >
>>>>> > Reducing the number of small files is something we always ask of our
>>>>> > users, but reality is what it is ;-)
>>>>> >
>>>>> > I'll have to look into how I can get an informative view on these
>>>>> > metrics... The amount of information coming out of the ceph cluster
>>>>> > is pretty overwhelming, even when you look only superficially...
>>>>> >
>>>>> > Cheers,
>>>>> >
>>>>> > /Simon
>>>>> >
>>>>> > On 20/08/2020 10:16, Michael Bisig wrote:
>>>>> >> Hi Simon
>>>>> >>
>>>>> >> As far as I know, RocksDB only uses "leveled" space on the NVME
>>>>> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB.
>>>>> >> Any DB space above such a limit will automatically end up on slow
>>>>> >> devices. In your setup, where you have 123GB per OSD, that means
>>>>> >> you only use 30GB of the fast device. The DB which spills over
>>>>> >> this limit will be offloaded to the HDD and accordingly, it slows
>>>>> >> down requests and compactions.
>>>>> >>
>>>>> >> You can check what your OSD currently consumes with:
>>>>> >> ceph daemon osd.X perf dump
>>>>> >>
>>>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>>>>> >> `slow_used_bytes`. These change regularly because of the ongoing
>>>>> >> compactions, but the Prometheus mgr module exports these values so
>>>>> >> that you can track them.
>>>>> >>
>>>>> >> Small files generally lead to a bigger RocksDB, especially when you
>>>>> >> use EC, but this depends on the actual amount and file sizes.
>>>>> >>
>>>>> >> I hope this helps.
>>>>> >> Regards,
>>>>> >> Michael
>>>>> >>
>>>>> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
>>>>> >>
>>>>> >> Hi
>>>>> >>
>>>>> >> Recently our ceph cluster (nautilus) has been experiencing bluefs
>>>>> >> spillovers, on just 2 OSDs, and I disabled the warning for these
>>>>> >> OSDs.
>>>>> >> (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>>>>> >>
>>>>> >> I'm wondering what causes this and how this can be prevented.
>>>>> >>
>>>>> >> As I understand it, the rocksdb for the OSD needs to store more
>>>>> >> than fits on the NVME logical volume (123G for a 12T OSD). A way to
>>>>> >> fix it could be to increase the logical volume on the nvme (if
>>>>> >> there was space on the nvme, which there isn't at the moment).
>>>>> >>
>>>>> >> This is the current size of the cluster and how much is free:
>>>>> >>
>>>>> >> [root@cephmon1 ~]# ceph df
>>>>> >> RAW STORAGE:
>>>>> >>     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
>>>>> >>     hdd       1.8 PiB     842 TiB     974 TiB     974 TiB          53.63
>>>>> >>     TOTAL     1.8 PiB     842 TiB     974 TiB     974 TiB          53.63
>>>>> >>
>>>>> >> POOLS:
>>>>> >>     POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
>>>>> >>     cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
>>>>> >>     cephfs_metadata          2     56 GiB      5.15M       57 GiB          0       167 TiB
>>>>> >>     cephfs_data_3copy        8     201 GiB     51.68k      602 GiB      0.09       222 TiB
>>>>> >>     cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
>>>>> >>     rbd                     14     21 GiB      5.66k       64 GiB          0       222 TiB
>>>>> >>     .rgw.root               15     1.2 KiB     4           1 MiB           0       167 TiB
>>>>> >>     default.rgw.control     16     0 B         8           0 B             0       167 TiB
>>>>> >>     default.rgw.meta        17     765 B       4           1 MiB           0       167 TiB
>>>>> >>     default.rgw.log         18     0 B         207         0 B             0       167 TiB
>>>>> >>     cephfs_data_ec57        20     433 MiB     230         1.2 GiB         0       278 TiB
>>>>> >>
>>>>> >> The amount used can still grow a bit before we need to add nodes,
>>>>> >> but apparently we are
>>>>> >> running into the limits of our rocksdb partitions.
>>>>> >>
>>>>> >> Did we choose a parameter (e.g. minimal object size) too small, so
>>>>> >> we have too many objects on these spillover OSDs? Or is it that too
>>>>> >> many small files are stored on the cephfs filesystems?
>>>>> >>
>>>>> >> When we expand the cluster, we can choose larger nvme devices to
>>>>> >> allow larger rocksdb partitions, but is that the right way to deal
>>>>> >> with this, or should we adjust some parameters on the cluster that
>>>>> >> will reduce the rocksdb size?
>>>>> >>
>>>>> >> Cheers
>>>>> >>
>>>>> >> /Simon
>>>>> >> _______________________________________________
>>>>> >> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
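
For reference, the arithmetic discussed in this thread (a 123GB DB volume of which only 30GB is used, leaving 93GB idle) can be sketched in a few lines of Python. This is only an illustration of the simplified level model from Michael's message (300MB, 3GB, 30GB, 300GB), not how BlueFS actually allocates space. The function names are hypothetical, and the assumption that the three counters sit under a `bluefs` section of the `ceph daemon osd.X perf dump` JSON is mine, not stated in the thread. The sample counter values are made up.

```python
import json

# Level limits in GB as quoted in the thread (simplified model, not read from Ceph).
LEVEL_LIMITS_GB = [0.3, 3, 30, 300]


def usable_db_space_gb(partition_gb, limits=LEVEL_LIMITS_GB):
    """Largest level limit that still fits on the fast DB partition."""
    fitting = [limit for limit in limits if limit <= partition_gb]
    return max(fitting) if fitting else 0.0


def spillover_summary(perf_dump):
    """Summarise the three counters named in the thread.

    Assumes (hypothetically) that they live under a 'bluefs' section
    of the parsed perf dump output.
    """
    bluefs = perf_dump["bluefs"]
    return {
        "db_used_fraction": bluefs["db_used_bytes"] / bluefs["db_total_bytes"],
        "spilled_over": bluefs["slow_used_bytes"] > 0,
    }


if __name__ == "__main__":
    partition = 123  # GB per OSD, as in Simon's setup
    usable = usable_db_space_gb(partition)
    print(f"usable: {usable} GB, idle: {partition - usable} GB")

    # Made-up sample standing in for real `perf dump` output.
    sample = json.loads(
        '{"bluefs": {"db_total_bytes": 1000, "db_used_bytes": 250,'
        ' "slow_used_bytes": 10}}'
    )
    print(spillover_summary(sample))
```

Under this model, a partition has to reach the next limit (300GB here) before any additional space is used, which matches Simon's reading above. The 'use_some_extra' policy Igor mentions is meant to relax exactly this restriction.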