So you won't backport it to nautilus until it has been the default on
master for a while?

On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:

> From a technical/developer's point of view I don't see any issues with
> tuning this option. But as of now I wouldn't recommend enabling it in
> production, as it has partially bypassed our regular development cycle.
> Being enabled in master by default for a while allows more developers
> to use/try the feature before release. This can be considered an
> additional implicit QA process. But as we just discovered, this hasn't
> happened.
>
> Hence you can definitely try it, but this exposes your cluster(s) to
> some risk, as with any new (and incompletely tested) feature...
>
> Thanks,
>
> Igor
>
> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>
> Great, thanks.
>
> Is it safe to change it manually in ceph.conf until the next nautilus
> release, or should I wait for that release? In other words, has QA run
> with this value for the config, so that we can trust it and change it
> now, or should we wait until the next nautilus release, where QA will
> have covered it?
>
> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>
>> Hi Seena,
>>
>> this parameter isn't intended to be adjusted in production
>> environments - the default behavior is supposed to cover all regular
>> customers' needs.
>>
>> The issue, though, is that the default setting is invalid. It should
>> be 'use_some_extra'. Gonna fix that shortly...
>>
>> Thanks,
>>
>> Igor
>>
>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>
>> Hi Igor.
>>
>> Could you please tell us why this config is at LEVEL_DEV (
>> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
>> As documented in Ceph, we can't use LEVEL_DEV options in production
>> environments!
>>
>> Thanks
>>
>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
>>
>>> Hi Simon,
>>>
>>> starting with Nautilus v14.2.10, BlueStore is able to use the
>>> 'wasted' space on the DB volume.
>>>
>>> See this PR: https://github.com/ceph/ceph/pull/29687
>>>
>>> A nice overview of the overall BlueFS/RocksDB design can be found
>>> here:
>>>
>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>>>
>>> It also includes an overview of (as well as additional concerns
>>> about) the changes brought by the above-mentioned PR.
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>>> > Hi Michael,
>>> >
>>> > thanks for the explanation! So if I understand correctly, we waste
>>> > 93 GB per OSD of unused NVMe space, because only 30 GB is actually
>>> > used...?
>>> >
>>> > And to get more space for RocksDB, we would need to plan for
>>> > 300 GB per RocksDB partition in order to benefit from the next
>>> > level...
>>> >
>>> > Reducing the number of small files is something we always ask of
>>> > our users, but reality is what it is ;-)
>>> >
>>> > I'll have to look into how I can get an informative view of these
>>> > metrics... The amount of information coming out of the ceph
>>> > cluster is pretty overwhelming, even when you only look
>>> > superficially...
>>> >
>>> > Cheers,
>>> >
>>> > /Simon
>>> >
>>> > On 20/08/2020 10:16, Michael Bisig wrote:
>>> >> Hi Simon
>>> >>
>>> >> As far as I know, RocksDB only uses "leveled" space on the NVMe
>>> >> partition. The level sizes are set to 300 MB, 3 GB, 30 GB and
>>> >> 300 GB. Any DB space above such a limit will automatically end up
>>> >> on the slow device.
>>> >> In your setup, where you have 123 GB per OSD, that means you only
>>> >> use 30 GB of the fast device. The DB that spills over this limit
>>> >> will be offloaded to the HDD and, accordingly, it slows down
>>> >> requests and compactions.
>>> >>
>>> >> You can check what your OSD currently consumes with:
>>> >>
>>> >> ceph daemon osd.X perf dump
>>> >>
>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>>> >> `slow_used_bytes`. These change regularly because of the ongoing
>>> >> compactions, but the Prometheus mgr module exports them so that
>>> >> you can track them over time.
>>> >>
>>> >> Small files generally lead to a bigger RocksDB, especially when
>>> >> you use EC, but this depends on the actual number of files and
>>> >> their sizes.
>>> >>
>>> >> I hope this helps.
>>> >> Regards,
>>> >> Michael
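To make the level arithmetic Michael describes concrete, here is a rough
sketch. It assumes the roughly 10x-per-level growth he quotes; the exact
boundaries depend on RocksDB settings such as max_bytes_for_level_base,
so treat the numbers as illustrative, not authoritative:

    #!/bin/bash
    # Illustrative only: BlueFS keeps a RocksDB level on the fast device
    # only if the cumulative space for all levels up to it fits there.
    db_gb=123                      # DB volume size per OSD (GB)
    cumulative=0
    for level_gb in 0.3 3 30 300; do
      cumulative=$(echo "$cumulative + $level_gb" | bc)
      if [ "$(echo "$cumulative <= $db_gb" | bc)" -eq 1 ]; then
        echo "level ${level_gb} GB fits   (cumulative ${cumulative} GB)"
      else
        echo "level ${level_gb} GB spills (cumulative ${cumulative} GB)"
      fi
    done
    # With 123 GB, everything through the 30 GB level fits (~33 GB in
    # total); the next level would need ~333 GB, so roughly 90 GB of the
    # NVMe volume sits idle and anything beyond L3 spills to the HDD.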
>>> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx>
>>> >> wrote:
>>> >>
>>> >>     Hi
>>> >>
>>> >>     Recently our ceph cluster (nautilus) has been experiencing
>>> >>     BlueFS spillovers, on just 2 OSDs, and I disabled the warning
>>> >>     for these OSDs:
>>> >>     (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>>> >>
>>> >>     I'm wondering what causes this and how it can be prevented.
>>> >>
>>> >>     As I understand it, the RocksDB for the OSD needs to store
>>> >>     more than fits on the NVMe logical volume (123 GB for a 12 TB
>>> >>     OSD). A way to fix it could be to increase the logical volume
>>> >>     on the NVMe (if there were space on the NVMe, which there
>>> >>     isn't at the moment).
>>> >>
>>> >>     This is the current size of the cluster and how much is free:
>>> >>
>>> >>     [root@cephmon1 ~]# ceph df
>>> >>     RAW STORAGE:
>>> >>         CLASS    SIZE       AVAIL      USED       RAW USED    %RAW USED
>>> >>         hdd      1.8 PiB    842 TiB    974 TiB     974 TiB        53.63
>>> >>         TOTAL    1.8 PiB    842 TiB    974 TiB     974 TiB        53.63
>>> >>
>>> >>     POOLS:
>>> >>         POOL                   ID    STORED     OBJECTS    USED       %USED    MAX AVAIL
>>> >>         cephfs_data             1    572 MiB    121.26M    2.4 GiB        0      167 TiB
>>> >>         cephfs_metadata         2     56 GiB      5.15M     57 GiB        0      167 TiB
>>> >>         cephfs_data_3copy       8    201 GiB     51.68k    602 GiB     0.09      222 TiB
>>> >>         cephfs_data_ec83       13    643 TiB    279.75M    953 TiB    58.86      485 TiB
>>> >>         rbd                    14     21 GiB      5.66k     64 GiB        0      222 TiB
>>> >>         .rgw.root              15    1.2 KiB          4      1 MiB        0      167 TiB
>>> >>         default.rgw.control    16        0 B          8        0 B        0      167 TiB
>>> >>         default.rgw.meta       17      765 B          4      1 MiB        0      167 TiB
>>> >>         default.rgw.log        18        0 B        207        0 B        0      167 TiB
>>> >>         cephfs_data_ec57       20    433 MiB        230    1.2 GiB        0      278 TiB
>>> >>
>>> >>     The amount used can still grow a bit before we need to add
>>> >>     nodes, but apparently we are running into the limits of our
>>> >>     RocksDB partitions.
>>> >>
>>> >>     Did we choose a parameter (e.g. minimal object size) too
>>> >>     small, so that we have too many objects on these spillover
>>> >>     OSDs? Or is it that too many small files are stored on the
>>> >>     cephfs filesystems?
>>> >>
>>> >>     When we expand the cluster we can choose larger NVMe devices
>>> >>     to allow larger RocksDB partitions, but is that the right way
>>> >>     to deal with this, or should we adjust some parameters on the
>>> >>     cluster to reduce the RocksDB size?
>>> >>
>>> >>     Cheers
>>> >>
>>> >>     /Simon
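For checking the counters Michael mentions on a single host, outside of
Prometheus, something along these lines should work. It is a sketch that
assumes jq is installed and is run on the node holding the OSD's admin
socket:

    # Pull just the BlueFS usage counters for one OSD (here osd.125):
    ceph daemon osd.125 perf dump | jq '.bluefs |
        {db_total_bytes, db_used_bytes, slow_used_bytes}'
    # db_used_bytes approaching db_total_bytes, and any non-zero
    # slow_used_bytes, indicates spillover onto the slow (HDD) device,
    # which is what bluestore_warn_on_bluefs_spillover warns about.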
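And for anyone who decides to experiment despite Igor's caveats: judging
by the linked diff in PR 29687, the option under discussion appears to be
bluestore_volume_selection_policy, but that name is an inference from the
link, so verify it (and its level) against your release first. A minimal
sketch:

    # Show the option's help text, including its level
    # (LEVEL_DEV as of this thread):
    ceph config help bluestore_volume_selection_policy
    # Apply the value Igor says should have been the default, for all
    # OSDs; it takes effect at OSD startup, so restart OSDs afterwards:
    ceph config set osd bluestore_volume_selection_policy use_some_extra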
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx