Hi Igor. Could you please tell us why this config is at LEVEL_DEV
(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As documented in Ceph, we can't use LEVEL_DEV settings in production
environments! Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
> Hi Simon,
>
> Starting with Nautilus v14.2.10, BlueStore is able to use the 'wasted'
> space at the DB volume.
>
> See this PR: https://github.com/ceph/ceph/pull/29687
>
> A nice overview of the overall BlueFS/RocksDB design can be found here:
>
> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>
> It also includes an overview of (as well as some additional concerns
> about) the changes brought by the above-mentioned PR.
>
> Thanks,
>
> Igor
>
> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> > Hi Michael,
> >
> > Thanks for the explanation! So if I understand correctly, we waste 93
> > GB per OSD on unused NVMe space, because only 30 GB is actually
> > used...?
> >
> > And to improve the space for RocksDB, we need to plan for 300 GB per
> > RocksDB partition in order to benefit from this advantage...
> >
> > Reducing the number of small files is something we always ask of our
> > users, but reality is what it is ;-)
> >
> > I'll have to look into how I can get an informative view on these
> > metrics... The amount of information coming out of the Ceph cluster is
> > pretty overwhelming, even when you only look at it superficially...
> >
> > Cheers,
> >
> > /Simon
> >
> > On 20/08/2020 10:16, Michael Bisig wrote:
> >> Hi Simon
> >>
> >> As far as I know, RocksDB only uses "leveled" space on the NVMe
> >> partition. The level sizes are 300 MB, 3 GB, 30 GB and 300 GB. Any
> >> DB space above such a limit will automatically end up on the slow
> >> devices. In your setup, where you have 123 GB per OSD, that means you
> >> only use 30 GB of the fast device. The part of the DB that spills
> >> over this limit will be offloaded to the HDD, which accordingly slows
> >> down requests and compactions.
> >>
> >> You can check what your OSD currently consumes with:
> >> ceph daemon osd.X perf dump
> >>
> >> Informative values are `db_total_bytes`, `db_used_bytes` and
> >> `slow_used_bytes`. These change regularly because of the ongoing
> >> compactions, but the Prometheus mgr module exports them so that you
> >> can track them over time.
> >>
> >> Small files generally lead to a bigger RocksDB, especially when you
> >> use EC, but this depends on the actual number and sizes of the files.
> >>
> >> I hope this helps.
> >> Regards,
> >> Michael
> >>
> >> On 20.08.20, 09:10, "Simon Oosthoek" <s.oosthoek@xxxxxxxxxxxxx> wrote:
> >>
> >>     Hi
> >>
> >>     Recently our Ceph cluster (Nautilus) has been experiencing BlueFS
> >>     spillovers, on just 2 OSDs, and I disabled the warning for these
> >>     OSDs:
> >>     (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
> >>
> >>     I'm wondering what causes this and how it can be prevented.
> >>
> >>     As I understand it, the RocksDB for the OSD needs to store more
> >>     than fits on the NVMe logical volume (123 GB for a 12 TB OSD). A
> >>     way to fix it could be to increase the logical volume on the NVMe
> >>     (if there was space on the NVMe, which there isn't at the moment).
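For reference, assuming jq is available on the host where osd.125 (the example OSD from this thread) runs, the counters Michael mentions above can be pulled out of the perf dump roughly like this; on Nautilus they should sit under the "bluefs" section:

    ceph daemon osd.125 perf dump | \
        jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'

    # A non-zero slow_used_bytes means part of the DB has already spilled
    # onto the slow (HDD) device. Cluster-wide, the matching health
    # warning is BLUEFS_SPILLOVER, unless bluestore_warn_on_bluefs_spillover
    # has been turned off for the affected OSDs as above.
    ceph health detail | grep -i spillover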
> >>     This is the current size of the cluster and how much is free:
> >>
> >>     [root@cephmon1 ~]# ceph df
> >>     RAW STORAGE:
> >>         CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
> >>         hdd       1.8 PiB     842 TiB     974 TiB     974 TiB          53.63
> >>         TOTAL     1.8 PiB     842 TiB     974 TiB     974 TiB          53.63
> >>
> >>     POOLS:
> >>         POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
> >>         cephfs_data              1     572 MiB     121.26M     2.4 GiB         0       167 TiB
> >>         cephfs_metadata          2      56 GiB       5.15M      57 GiB         0       167 TiB
> >>         cephfs_data_3copy        8     201 GiB      51.68k     602 GiB      0.09       222 TiB
> >>         cephfs_data_ec83        13     643 TiB     279.75M     953 TiB     58.86       485 TiB
> >>         rbd                     14      21 GiB       5.66k      64 GiB         0       222 TiB
> >>         .rgw.root               15     1.2 KiB           4       1 MiB         0       167 TiB
> >>         default.rgw.control     16         0 B           8         0 B         0       167 TiB
> >>         default.rgw.meta        17       765 B           4       1 MiB         0       167 TiB
> >>         default.rgw.log         18         0 B         207         0 B         0       167 TiB
> >>         cephfs_data_ec57        20     433 MiB         230     1.2 GiB         0       278 TiB
> >>
> >>     The amount used can still grow a bit before we need to add nodes,
> >>     but apparently we are running into the limits of our RocksDB
> >>     partitions.
> >>
> >>     Did we choose a parameter (e.g. minimal object size) too small, so
> >>     we have too many objects on these spillover OSDs? Or is it that
> >>     too many small files are stored on the CephFS filesystems?
> >>
> >>     When we expand the cluster, we can choose larger NVMe devices to
> >>     allow larger RocksDB partitions, but is that the right way to deal
> >>     with this, or should we adjust some parameters on the cluster that
> >>     will reduce the RocksDB size?
> >>
> >>     Cheers
> >>
> >>     /Simon
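Tying the numbers in this thread together: assuming the 0.3/3/30/300 GB level progression Michael describes (i.e. default RocksDB settings, on a release without the change from the PR above), a rough estimate of how much of a DB partition is actually usable before spillover could look like this; the db_gb value is just Simon's 123 GB example:

    # Illustrative only, not an official Ceph calculation: find the largest
    # level target (in GB) that still fits entirely on the DB partition.
    # Anything beyond that target ends up on the slow device.
    db_gb=123
    awk -v db="$db_gb" 'BEGIN {
        n = split("0.3 3 30 300", lvl, " ");
        fit = 0;
        for (i = 1; i <= n; i++)
            if (lvl[i] + 0 <= db + 0) fit = lvl[i];
        printf "%s GB DB partition -> roughly %s GB usable before spillover\n", db, fit;
    }'
    # -> 123 GB DB partition -> roughly 30 GB usable before spillover

Which matches the figures above: a 123 GB partition effectively uses only the 30 GB level, and planning for roughly 300 GB per RocksDB partition is what would let the next level fit entirely on the fast device.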