Re: Move block.db to new ssd

Yes, it improves the dynamic where only ~3, 30, 300, etc. GB of DB space can be used, and thus mitigates spillover.  Previously a, say, 29GB DB device/partition would be like 85% unused.

With recent releases one can also turn on DB compression, which should have a similar benefit.
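
In case it helps someone reading this in the archive, a minimal sketch of checking and, if needed, setting that policy; the option name is the one Alexander cites further down, but verify against the docs for your release before changing anything:

# show the cluster-wide value (use_some_extra is the default since 15.2.8)
ceph config get osd bluestore_volume_selection_policy

# set it explicitly if your cluster still overrides it; OSDs pick it up on restart
ceph config set osd bluestore_volume_selection_policy use_some_extra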

> On Nov 12, 2024, at 11:25 AM, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
> 
> Hi Anthony,
> 
> Did the RocksDB sharding end up improving the overspilling situation related to the level thresholds? I had only anticipated that it would reduce the impact of compaction.
> 
> We resharded our OSDs' RocksDBs a long time ago (after upgrading to Pacific, IIRC) and I think we could still observe overspilling at the level thresholds sometimes, if I'm not mistaken.
> 
> Cheers,
> Frédéric.
> 
> PS: It seems that the document you referred to is not accessible from the Internet.
> 
> ----- On 12 Nov 24, at 15:11, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> RocksDB column sharding came a while ago.  It should be enabled on your OSDs, provided they weren't built on a much older release.  If they were, you can reshard them.
> 

> rocksdb_in_ceph (PDF, 512 KB):
> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
> 
> IBM Storage Ceph 7.1 – Administration: Resharding the RocksDB database:
> https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-resharding-rocksdb-database
> 
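> For reference, resharding an existing OSD in place goes roughly like this (a sketch only, with osd.1 and the default path as placeholders; the OSD must be stopped first, and the sharding spec below is the documented default, so double-check it against the doc above for your release):
> 
> systemctl stop ceph-osd@1
> # print the sharding definition currently applied to this OSD's RocksDB
> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1 show-sharding
> # apply the default sharding definition and reshard in place
> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1 \
>     --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
> systemctl start ceph-osd@1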
> 
> 
> 
> On Nov 12, 2024, at 8:02 AM, Alexander Patrakov <patrakov@xxxxxxxxx> wrote:
> 
> Yes, that is correct.
> 
> On Tue, Nov 12, 2024 at 8:51 PM Frédéric Nass
> <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
> 
> Hello Alexander,
> 
> Thank you for clarifying this point. The documentation was not very clear about the 'improvements'.
> 
> Does that mean that in the latest releases overspilling no longer occurs between the two thresholds of 30GB and 300GB? Meaning block.db can be 80GB in size without overspilling, for example?
> 
> Cheers,
> Frédéric.
> 
> ----- On 12 Nov 24, at 13:32, Alexander Patrakov <patrakov@xxxxxxxxx> wrote:
> 
> Hello Frédéric,
> 
> The advice regarding 30/300 GB DB sizes is no longer valid. Since Ceph
> 15.2.8, due to the new default (bluestore_volume_selection_policy =
> use_some_extra), it no longer wastes the extra capacity of the DB
> device.
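> 
> If anyone wants to check what a given OSD is actually running with, something along these lines should do (osd.4 just as an example):
> 
> # ask the OSD directly over its admin socket
> ceph daemon osd.4 config get bluestore_volume_selection_policy
> # any spillover shows up as a BLUEFS_SPILLOVER health warning
> ceph health detail | grep -i spillover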
> 
> On Tue, Nov 12, 2024 at 5:52 PM Frédéric Nass
> <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
> 
> 
> 
> ----- On 12 Nov 24, at 8:51, Roland Giesler <roland@xxxxxxxxxxxxxx> wrote:
> 
> On 2024/11/12 04:54, Alwin Antreich wrote:
> Hi Roland,
> 
> On Mon, Nov 11, 2024, 20:16 Roland Giesler <roland@xxxxxxxxxxxxxx> wrote:
> 
> I have Ceph 17.2.6 on a Proxmox cluster and want to replace some SSDs that
> are end of life.  I have some spinners that have their journals on SSD.
> Each spinner has a 50GB SSD LVM partition, and I want to move each of those
> to a new corresponding partition.
> 
> I have split the new 4TB SSDs into volumes with:
> 
> # lvcreate -n NodeA-nvme-LV-RocksDB1 -L 47.69g NodeA-nvme0
> # lvcreate -n NodeA-nvme-LV-RocksDB2 -L 47.69g NodeA-nvme0
> # lvcreate -n NodeA-nvme-LV-RocksDB3 -L 47.69g NodeA-nvme0
> # lvcreate -n NodeA-nvme-LV-RocksDB4 -L 47.69g NodeA-nvme0
> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme1
> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme0
> 
> I'd caution against mixing DB/WAL partitions with other applications. The
> performance profile may not be suited for shared use. And depending on the
> use case, the ~48GB might not be big enough to prevent DB spillover. Check the
> current size by querying the OSD.
> 
> I see a relatively small RocksDB and no WAL?
> 
> ceph daemon osd.4 perf dump
> <snip>
>    "bluefs": {
>        "db_total_bytes": 45025845248,
>        "db_used_bytes": 2131755008,
>        "wal_total_bytes": 0,
>        "wal_used_bytes": 0,
> </snip>
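> 
> For what it's worth, the same bluefs section also carries "slow_total_bytes" and "slow_used_bytes"; a non-zero slow_used_bytes is what indicates DB data spilling over onto the slow device. If memory serves, the dump can be limited to that section with:
> 
> ceph daemon osd.4 perf dump bluefs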
> 
> I have been led to understand that 4% is the high end and that it's only reached
> on very busy systems, if ever?
> 
> Hi Roland,
> 
> This is generally true but it depends on what your cluster is used for.
> 
> If your cluster is used for block (RBD) storage, then 1%-2% should be enough. If
> your cluster is used for file (CephFS) and S3 (RGW) storage, then you'd rather
> stay on the safe side and respect the 4% recommendation, as these workloads make
> heavy use of block.db to store metadata.
> 
> Now, percentage is one thing, level size is another. To avoid overspilling when
> block.db usage approaches 30GB, you'd better choose a block.db size of 300GB+,
> whatever percentage of the block size that is, if you don't want to play with the
> RocksDB level size and multiplier, which you probably don't.
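> 
> To put rough numbers on that with your layout (if I have the level math right): a ~48GB block.db sits between the ~30GB and ~300GB level boundaries, so with the old fixed-level accounting anything beyond about 30GB of DB data would spill onto the spinner while ~18GB of the NVMe partition sat unused, hence the advice to either size for 300GB+ or rely on the newer policy that uses the extra space.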
> 
> Regards,
> Frédéric.
> 
> [1]
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
> [2]
> https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-sizing-considerations
> [3] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
> 
> 
> What am I missing to get these changes to be permanent?
> 
> Likely just an issue with the order of execution. But there is an easier
> way to do the move. See:
> https://docs.ceph.com/en/quincy/ceph-volume/lvm/migrate/
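> 
> With that tool the move boils down to something like this (a sketch only; osd.4, the fsid placeholder and the target LV are illustrative, so check the exact syntax on the page above for your release):
> 
> systemctl stop ceph-osd@4
> # move the existing block.db onto the new LV and update the OSD's symlinks
> ceph-volume lvm migrate --osd-id 4 --osd-fsid <osd fsid> --from db --target NodeA-nvme0/NodeA-nvme-LV-RocksDB1
> systemctl start ceph-osd@4
> 
> The OSD fsid can be taken from "ceph-volume lvm list" if it isn't at hand.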
> 
> Ah, excellent!  I didn't find that in my searches.  Will try that now.
> 
> regards
> 
> Roland
> 
> 
> 
> Cheers,
> Alwin
> 
> --
> 
> Alwin Antreich
> Head of Training and Proxmox Services
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges, Andy Muthmann - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io/
> 
> 
> 
> --
> Alexander Patrakov
> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



