Hi Anthony,

Did the RocksDB sharding end up improving the spillover situation related to the level thresholds? I had only anticipated that it would reduce the impact of compaction. We resharded our OSDs' RocksDBs a long time ago (after upgrading to Pacific, IIRC) and, if I'm not mistaken, we could still occasionally observe spillover at the level boundaries.

Cheers,
Frédéric.

PS: The document you referred to does not seem to be accessible from the Internet.

----- Le 12 Nov 24, à 15:11, Anthony D'Atri <anthony.datri@xxxxxxxxx> a écrit :

> RocksDB column sharding came a while ago. It should be enabled on your OSDs,
> provided they weren't built on a much older release. If they were, you can
> update them.
>
> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
> (rocksdb_in_ceph, PDF document, 512 KB)
>
> https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-resharding-rocksdb-database
> (IBM Storage Ceph 7.1 – Administration: Resharding the RocksDB database)
>
>> On Nov 12, 2024, at 8:02 AM, Alexander Patrakov <patrakov@xxxxxxxxx> wrote:
>>
>> Yes, that is correct.
>>
>> On Tue, Nov 12, 2024 at 8:51 PM Frédéric Nass
>> <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hello Alexander,
>>>
>>> Thank you for clarifying this point. The documentation was not very clear
>>> about the 'improvements'.
>>>
>>> Does that mean that in the latest releases spillover no longer occurs between
>>> the two thresholds of 30GB and 300GB? Meaning block.db can be 80GB in size
>>> without spilling over, for example?
>>>
>>> Cheers,
>>> Frédéric.
>>>
>>> ----- Le 12 Nov 24, à 13:32, Alexander Patrakov patrakov@xxxxxxxxx a écrit :
>>>
>>>> Hello Frédéric,
>>>>
>>>> The advice regarding 30/300 GB DB sizes is no longer valid. Since Ceph
>>>> 15.2.8, due to the new default (bluestore_volume_selection_policy =
>>>> use_some_extra), it no longer wastes the extra capacity of the DB device.
>>>>
>>>> On Tue, Nov 12, 2024 at 5:52 PM Frédéric Nass
>>>> <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> ----- Le 12 Nov 24, à 8:51, Roland Giesler roland@xxxxxxxxxxxxxx a écrit :
>>>>>
>>>>>> On 2024/11/12 04:54, Alwin Antreich wrote:
>>>>>>>
>>>>>>> Hi Roland,
>>>>>>>
>>>>>>> On Mon, Nov 11, 2024, 20:16 Roland Giesler <roland@xxxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> I have Ceph 17.2.6 on a Proxmox cluster and want to replace some SSDs
>>>>>>>> that are end of life. I have some spinners that have their journals on
>>>>>>>> SSD. Each spinner has a 50GB SSD LVM partition, and I want to move each
>>>>>>>> of those to a new corresponding partition.
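For reference, the current DB usage and any spillover can be checked from the cluster before and after such a move; a minimal sketch, assuming osd.4 (as in the perf dump quoted further down) and that jq is installed on the OSD host:

# ceph health detail
    (raises a BLUEFS_SPILLOVER warning for any OSD whose block.db has spilled onto the slow device)
# ceph daemon osd.4 perf dump | jq .bluefs
    (compare db_used_bytes with db_total_bytes; a non-zero slow_used_bytes, where present, also indicates spillover)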
>>>>>>>>
>>>>>>>> The new 4TB SSDs I have split into volumes with:
>>>>>>>>
>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB1 -L 47.69g NodeA-nvme0
>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB2 -L 47.69g NodeA-nvme0
>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB3 -L 47.69g NodeA-nvme0
>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB4 -L 47.69g NodeA-nvme0
>>>>>>>> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme1
>>>>>>>> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme0
>>>>>>>
>>>>>>> I'd caution against mixing DB/WAL partitions with other applications. The
>>>>>>> performance profile may not be suited for shared use. And depending on the
>>>>>>> use case, the ~48GB might not be big enough to avoid DB spillover. See the
>>>>>>> current size when querying the OSD.
>>>>>>
>>>>>> I see a relatively small RocksDB and no WAL?
>>>>>>
>>>>>> ceph daemon osd.4 perf dump
>>>>>> <snip>
>>>>>>     "bluefs": {
>>>>>>         "db_total_bytes": 45025845248,
>>>>>>         "db_used_bytes": 2131755008,
>>>>>>         "wal_total_bytes": 0,
>>>>>>         "wal_used_bytes": 0,
>>>>>> </snip>
>>>>>>
>>>>>> I have been led to understand that 4% is the high end and that it is only
>>>>>> reached, if ever, on very busy systems?
>>>>>
>>>>> Hi Roland,
>>>>>
>>>>> This is generally true, but it depends on what your cluster is used for.
>>>>> If your cluster is used for block (RBD) storage, then 1%-2% should be enough.
>>>>> If your cluster is used for file (CephFS) and S3 (RGW) storage, then you'd
>>>>> rather stay on the safe side and respect the 4% recommendation, as these
>>>>> workloads make heavy use of block.db to store metadata.
>>>>>
>>>>> Now, percentage is one thing; level size is another. To avoid spillover when
>>>>> block.db usage approaches 30GB, you'd better choose a block.db size of 300GB+,
>>>>> whatever percentage of the block size that is, unless you want to play with
>>>>> the RocksDB level size and multiplier, which you probably don't.
>>>>>
>>>>> Regards,
>>>>> Frédéric.
>>>>>
>>>>> [1] https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>>>>> [2] https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-sizing-considerations
>>>>> [3] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
>>>>>
>>>>>>>> What am I missing to get these changes to be permanent?
>>>>>>>
>>>>>>> Likely just an issue with the order of execution. But there is an easier
>>>>>>> way to do the move. See:
>>>>>>> https://docs.ceph.com/en/quincy/ceph-volume/lvm/migrate/
>>>>>>
>>>>>> Ah, excellent! I didn't find that in my searches. Will try that now.
>>>>>>
>>>>>> regards
>>>>>> Roland
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Alwin
>>>>>>>
>>>>>>> --
>>>>>>> Alwin Antreich
>>>>>>> Head of Training and Proxmox Services
>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>> CEO: Martin Verges, Andy Muthmann - VAT-ID: DE310638492
>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>> Web: https://croit.io/
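For completeness, the ceph-volume migrate approach Alwin pointed to could look roughly as follows; this is only a sketch, with the OSD ID, fsid and target LV as placeholders (the LV name is taken from the lvcreate example above), and the OSD has to be stopped first:

# systemctl stop ceph-osd@4
# ceph-volume lvm list
    (note the osd fsid and the LV currently backing block.db for osd.4)
# ceph-volume lvm migrate --osd-id 4 --osd-fsid <osd-fsid> --from db --target NodeA-nvme0/NodeA-nvme-LV-RocksDB1
# systemctl start ceph-osd@4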
>>>>
>>>> --
>>>> Alexander Patrakov
>>
>> --
>> Alexander Patrakov

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
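Two reference notes on the points discussed above, both sketches rather than authoritative procedures:

The oft-quoted ~30GB / ~300GB figures come from RocksDB's level sizing: with the usual defaults (max_bytes_for_level_base = 256MB, max_bytes_for_level_multiplier = 10), levels L1+L2+L3 sum to roughly 28GB and adding L4 brings the total to roughly 280GB, so before the use_some_extra policy a block.db sized between those sums could not be fully used. Whether that policy is active on a given OSD can be checked with, e.g.:

# ceph daemon osd.4 config get bluestore_volume_selection_policy

Whether an OSD's RocksDB is already sharded can be checked, and resharding applied, with ceph-bluestore-tool while the OSD is stopped; the sharding spec below is the default given in the resharding documentation linked above, and /var/lib/ceph/osd/ceph-4 is an example path:

# systemctl stop ceph-osd@4
# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-4 show-sharding
# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-4 \
      --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
# systemctl start ceph-osd@4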