Yep, we've been using RocksDB compression with Pacific for a few months. It helped a lot.

Since we're talking about overspilling... Despite using bluestore_volume_selection_policy=use_some_extra with resharded RocksDB databases, we can still observe many OSDs overspilling from time to time (approximately every month and a half).

When this happens:

- almost all OSDs overspill one after the other over 2-3 days. They all get detected and compacted thanks to a cron job, then it's completely quiet again for another month and a half, and then it comes back. This phenomenon repeats cyclically.

- 'ceph health detail' shows figures similar to those reported in [1] that [2] is supposed to have fixed (if I'm not mistaken):

=== Full health status ===
[WARN] BLUEFS_SPILLOVER: 8 OSD(s) experiencing BlueFS spillover
    osd.337 spilled over 12 GiB metadata from 'db' device (12 GiB used of 124 GiB) to slow device
    osd.352 spilled over 12 GiB metadata from 'db' device (687 MiB used of 124 GiB) to slow device
    osd.353 spilled over 12 GiB metadata from 'db' device (152 MiB used of 124 GiB) to slow device
    osd.357 spilled over 12 GiB metadata from 'db' device (960 MiB used of 124 GiB) to slow device
    osd.359 spilled over 1.9 GiB metadata from 'db' device (12 GiB used of 124 GiB) to slow device

Has anyone ever experienced this?

Cheers,
Frédéric.

[1] https://tracker.ceph.com/issues/38745
[2] https://github.com/ceph/ceph/pull/29687

----- On 12 Nov 24, at 17:36, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

> Yes, it improves the dynamic where only ~3, 30, 300, etc. GB of DB space can be
> used, and thus mitigates spillover. Previously a, say, 29GB DB device/partition
> would be like 85% unused.
>
> With recent releases one can also turn on DB compression, which should have a
> similar benefit.
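
For readers wondering where the ~3, 30, 300 GB figures come from, here is a minimal Python sketch of the arithmetic. It assumes RocksDB's stock leveled-compaction settings (max_bytes_for_level_base = 256 MiB and max_bytes_for_level_multiplier = 10); whether your OSDs run with those values depends on your bluestore_rocksdb_options. The function names are made up for this illustration, and the "a level only stays on the DB device if it fits entirely" rule is a simplification of the old behaviour, not a description of the use_some_extra policy.

```python
# Simplified model of RocksDB leveled-compaction sizing (illustration only).
# Assumed defaults: max_bytes_for_level_base = 256 MiB and
# max_bytes_for_level_multiplier = 10. Check your own bluestore_rocksdb_options.

BASE_BYTES = 256 * 1024 ** 2
MULTIPLIER = 10


def level_sizes(levels=4, base_bytes=BASE_BYTES, multiplier=MULTIPLIER):
    """Return the target size in bytes of L1..L<levels>."""
    return [base_bytes * multiplier ** i for i in range(levels)]


def usable_without_extra(db_device_bytes, levels=4):
    """Bytes of the DB device used if a level is kept on it only when the
    whole level (plus all lower levels) fits. This mimics the old behaviour
    discussed in the thread, not the use_some_extra policy."""
    used = 0
    for size in level_sizes(levels):
        if used + size > db_device_bytes:
            break
        used += size
    return used


if __name__ == "__main__":
    GB = 10 ** 9
    total = 0
    for n, size in enumerate(level_sizes(), start=1):
        total += size
        print(f"L{n}: {size / GB:7.2f} GB  cumulative: {total / GB:7.2f} GB")
    for device_gb in (29, 80):
        usable = usable_without_extra(device_gb * GB)
        print(f"{device_gb} GB device: about {usable / GB:.1f} GB usable")
```

Under this model the cumulative totals land just under 3, 30 and 300 GB, so a 29 GB device holds only the first two levels and an 80 GB device is no better off than a 30 GB one. WAL and L0 are ignored here, so treat the numbers as a ballpark.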
>> On Nov 12, 2024, at 11:25 AM, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>> wrote:
>>
>> Hi Anthony,
>>
>> Did the RocksDB sharding end up improving the overspilling situation related to
>> the level thresholds? I had only anticipated that it would reduce the impact of
>> compaction.
>>
>> We resharded our OSDs' RocksDBs a long time ago (after upgrading to Pacific IIRC)
>> and I think we could still observe overspilling at the level thresholds sometimes,
>> if I'm not mistaken.
>>
>> Cheers,
>> Frédéric.
>>
>> PS: It seems that the document you referred to is not accessible from the
>> Internet.
>>
>> ----- On 12 Nov 24, at 15:11, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>>
>>> RocksDB column sharding came a while ago. It should be enabled on your OSDs,
>>> provided they weren't built on a much older release. If they were you can
>>> update them.
>>>
>>> rocksdb_in_ceph (PDF document, 512 KB):
>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>>>
>>> IBM Storage Ceph, Administration: Resharding the RocksDB database:
>>> https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-resharding-rocksdb-database
>>>
>>>> On Nov 12, 2024, at 8:02 AM, Alexander Patrakov <patrakov@xxxxxxxxx> wrote:
>>>>
>>>> Yes, that is correct.
>>>>
>>>> On Tue, Nov 12, 2024 at 8:51 PM Frédéric Nass
>>>> <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hello Alexander,
>>>>>
>>>>> Thank you for clarifying this point. The documentation was not very clear about
>>>>> the 'improvements'.
>>>>>
>>>>> Does that mean that in the latest releases overspilling no longer occurs between
>>>>> the two thresholds of 30GB and 300GB? Meaning block.db can be 80GB in size
>>>>> without overspilling, for example?
>>>>>
>>>>> Cheers,
>>>>> Frédéric.
>>>>>
>>>>> ----- On 12 Nov 24, at 13:32, Alexander Patrakov <patrakov@xxxxxxxxx> wrote:
>>>>>
>>>>>> Hello Frédéric,
>>>>>>
>>>>>> The advice regarding 30/300 GB DB sizes is no longer valid. Since Ceph
>>>>>> 15.2.8, due to the new default (bluestore_volume_selection_policy =
>>>>>> use_some_extra), it no longer wastes the extra capacity of the DB
>>>>>> device.
>>>>>>
>>>>>> On Tue, Nov 12, 2024 at 5:52 PM Frédéric Nass
>>>>>> <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> ----- On 12 Nov 24, at 8:51, Roland Giesler <roland@xxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> On 2024/11/12 04:54, Alwin Antreich wrote:
>>>>>>>>> Hi Roland,
>>>>>>>>>
>>>>>>>>> On Mon, Nov 11, 2024, 20:16 Roland Giesler <roland@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>> I have ceph 17.2.6 on a proxmox cluster and want to replace some SSDs
>>>>>>>>>> that are end of life. I have some spinners that have their journals on
>>>>>>>>>> SSD. Each spinner has a 50GB SSD LVM partition and I want to move
>>>>>>>>>> each of those to a new corresponding partition.
>>>>>>>>>>
>>>>>>>>>> The new 4TB SSDs I have split into volumes with:
>>>>>>>>>>
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB1 -L 47.69g NodeA-nvme0
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB2 -L 47.69g NodeA-nvme0
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB3 -L 47.69g NodeA-nvme0
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB4 -L 47.69g NodeA-nvme0
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme1
>>>>>>>>>> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme0
>>>>>>>>>
>>>>>>>>> I caution against mixing DB/WAL partitions with other applications. The
>>>>>>>>> performance profile may not be suited for shared use. And depending on the
>>>>>>>>> use case the ~48GB might not be big enough to prevent DB spillover. See the
>>>>>>>>> current size when querying the OSD.
>>>>>>>>
>>>>>>>> I see a relatively small RocksDB and no WAL?
>>>>>>>>
>>>>>>>> ceph daemon osd.4 perf dump
>>>>>>>> <snip>
>>>>>>>>     "bluefs": {
>>>>>>>>         "db_total_bytes": 45025845248,
>>>>>>>>         "db_used_bytes": 2131755008,
>>>>>>>>         "wal_total_bytes": 0,
>>>>>>>>         "wal_used_bytes": 0,
>>>>>>>> </snip>
>>>>>>>>
>>>>>>>> I have been led to understand that 4% is the high end and only on very busy
>>>>>>>> systems is that reached, if ever?
>>>>>>>
>>>>>>> Hi Roland,
>>>>>>>
>>>>>>> This is generally true but it depends on what your cluster is used for.
>>>>>>>
>>>>>>> If your cluster is used for block (RBD) storage then 1%-2% should be enough. If
>>>>>>> your cluster is used for file (CephFS) and S3 (RGW) storage then you'd rather
>>>>>>> stay on the safe side and respect the 4% recommendation, as these workloads make
>>>>>>> heavy use of block.db to store metadata.
>>>>>>>
>>>>>>> Now percentage is one thing, level size is another. To avoid overspilling when
>>>>>>> block.db usage approaches 30GB, you'd better choose a block.db size of 300GB+,
>>>>>>> whatever percentage of the block device size that represents, if you don't want
>>>>>>> to play with the RocksDB level size and multiplier, which you probably don't.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Frédéric.
>>>>>>>
>>>>>>> [1] https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>>>>>>> [2] https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-sizing-considerations
>>>>>>> [3] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
>>>>>>>
>>>>>>>>>> What am I missing to get these changes to be permanent?
>>>>>>>>>
>>>>>>>>> Likely just an issue with the order of execution. But there is an easier
>>>>>>>>> way to do the move. See:
>>>>>>>>> https://docs.ceph.com/en/quincy/ceph-volume/lvm/migrate/
>>>>>>>>
>>>>>>>> Ah, excellent! I didn't find that in my searches. Will try that now.
>>>>>>>>
>>>>>>>> regards
>>>>>>>> Roland
>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Alwin
>>>>>>>>> --
>>>>>>>>> Alwin Antreich
>>>>>>>>> Head of Training and Proxmox Services
>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>> CEO: Martin Verges, Andy Muthmann - VAT-ID: DE310638492
>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>> Web: https://croit.io/
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>
>>>>>> --
>>>>>> Alexander Patrakov
>>>>
>>>> --
>>>> Alexander Patrakov
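
As a footnote to the perf dump Roland quoted earlier in the thread, here is a small Python sketch that turns the "bluefs" counters into a usage percentage. The helper name bluefs_db_usage is invented for this example, and the sample values are simply the four counters from that excerpt; a real `ceph daemon osd.N perf dump` contains many more sections and counters, which this sketch ignores.

```python
import json


def bluefs_db_usage(perf_dump):
    """Summarise block.db usage from the 'bluefs' section of a perf dump."""
    bluefs = perf_dump["bluefs"]
    total = bluefs["db_total_bytes"]
    used = bluefs["db_used_bytes"]
    return {
        "total_gib": total / 1024 ** 3,
        "used_gib": used / 1024 ** 3,
        # Guard against OSDs that have no dedicated DB device at all.
        "used_pct": 100.0 * used / total if total else 0.0,
        # A zero wal_total_bytes means no separate WAL device is reported.
        "separate_wal": bluefs["wal_total_bytes"] > 0,
    }


# The counters from the excerpt in the thread, wrapped into valid JSON.
sample = json.loads("""
{
  "bluefs": {
    "db_total_bytes": 45025845248,
    "db_used_bytes": 2131755008,
    "wal_total_bytes": 0,
    "wal_used_bytes": 0
  }
}
""")

usage = bluefs_db_usage(sample)
print(f"block.db: {usage['used_gib']:.2f} GiB used of {usage['total_gib']:.2f} GiB "
      f"({usage['used_pct']:.1f}%), separate WAL: {usage['separate_wal']}")
```

For osd.4 this works out to a bit under 5% of the DB volume in use. Note that this is the fill level of block.db itself, which is a different ratio from the 1%-4% sizing guidance (block.db size relative to the data device).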