Hi Anthony,

Did the RocksDB sharding end up improving the spillover situation related to the level thresholds? I had only anticipated that it would reduce the impact of compaction. We resharded our OSDs' RocksDBs a long time ago (after upgrading to Pacific, IIRC) and, if I'm not mistaken, we could still occasionally observe spillover at the level boundaries.

Cheers,
Frédéric.

PS: The document you referred to does not seem to be accessible from the Internet.

----- Le 12 Nov 24, à 15:11, Anthony D'Atri <anthony.datri@xxxxxxxxx> a écrit :

> RocksDB column sharding came a while ago. It should be enabled on your OSDs,
> provided they weren't built on a much older release. If they were, you can
> update them.
>
> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
> (rocksdb_in_ceph, PDF document, 512 KB)
>
> https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-resharding-rocksdb-database
> (IBM Storage Ceph 7.1 – Administration: Resharding the RocksDB database)
>
>> On Nov 12, 2024, at 8:02 AM, Alexander Patrakov <patrakov@xxxxxxxxx> wrote:
>>
>> Yes, that is correct.
>>
>> On Tue, Nov 12, 2024 at 8:51 PM Frédéric Nass
>> <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hello Alexander,
>>>
>>> Thank you for clarifying this point. The documentation was not very clear
>>> about the 'improvements'.
>>>
>>> Does that mean that in the latest releases spillover no longer occurs between
>>> the two thresholds of 30GB and 300GB? Meaning block.db can be 80GB in size
>>> without spilling over, for example?
>>>
>>> Cheers,
>>> Frédéric.
>>>
>>> ----- Le 12 Nov 24, à 13:32, Alexander Patrakov patrakov@xxxxxxxxx a écrit :
>>>
>>>> Hello Frédéric,
>>>>
>>>> The advice regarding 30/300 GB DB sizes is no longer valid. Since Ceph
>>>> 15.2.8, due to the new default (bluestore_volume_selection_policy =
>>>> use_some_extra), it no longer wastes the extra capacity of the DB device.
>>>>
>>>> On Tue, Nov 12, 2024 at 5:52 PM Frédéric Nass
>>>> <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> ----- Le 12 Nov 24, à 8:51, Roland Giesler roland@xxxxxxxxxxxxxx a écrit :
>>>>>
>>>>>> On 2024/11/12 04:54, Alwin Antreich wrote:
>>>>>>>
>>>>>>> Hi Roland,
>>>>>>>
>>>>>>> On Mon, Nov 11, 2024, 20:16 Roland Giesler <roland@xxxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> I have Ceph 17.2.6 on a Proxmox cluster and want to replace some SSDs
>>>>>>>> that are end of life. I have some spinners that have their journals on
>>>>>>>> SSD. Each spinner has a 50GB SSD LVM partition, and I want to move each
>>>>>>>> of those to a new corresponding partition.
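For reference, the current DB usage and any spillover can be checked from the cluster before and after such a move; a minimal sketch, assuming osd.4 (as in the perf dump quoted further down) and that jq is installed on the OSD host:

# ceph health detail
    (raises a BLUEFS_SPILLOVER warning for any OSD whose block.db has spilled onto the slow device)
# ceph daemon osd.4 perf dump | jq .bluefs
    (compare db_used_bytes with db_total_bytes; a non-zero slow_used_bytes, where present, also indicates spillover)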
>>>>>>>>
>>>>>>>> The new 4TB SSDs I have split into volumes with:
>>>>>>>>
>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB1 -L 47.69g NodeA-nvme0
>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB2 -L 47.69g NodeA-nvme0
>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB3 -L 47.69g NodeA-nvme0
>>>>>>>> # lvcreate -n NodeA-nvme-LV-RocksDB4 -L 47.69g NodeA-nvme0
>>>>>>>> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme1
>>>>>>>> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme0
>>>>>>>
>>>>>>> I'd caution against mixing DB/WAL partitions with other applications. The
>>>>>>> performance profile may not be suited for shared use. And depending on the
>>>>>>> use case, the ~48GB might not be big enough to avoid DB spillover. See the
>>>>>>> current size when querying the OSD.
>>>>>>
>>>>>> I see a relatively small RocksDB and no WAL?
>>>>>>
>>>>>> ceph daemon osd.4 perf dump
>>>>>> <snip>
>>>>>>     "bluefs": {
>>>>>>         "db_total_bytes": 45025845248,
>>>>>>         "db_used_bytes": 2131755008,
>>>>>>         "wal_total_bytes": 0,
>>>>>>         "wal_used_bytes": 0,
>>>>>> </snip>
>>>>>>
>>>>>> I have been led to understand that 4% is the high end and that it is only
>>>>>> reached, if ever, on very busy systems?
>>>>>
>>>>> Hi Roland,
>>>>>
>>>>> This is generally true, but it depends on what your cluster is used for.
>>>>> If your cluster is used for block (RBD) storage, then 1%-2% should be enough.
>>>>> If your cluster is used for file (CephFS) and S3 (RGW) storage, then you'd
>>>>> rather stay on the safe side and respect the 4% recommendation, as these
>>>>> workloads make heavy use of block.db to store metadata.
>>>>>
>>>>> Now, percentage is one thing; level size is another. To avoid spillover when
>>>>> block.db usage approaches 30GB, you'd better choose a block.db size of 300GB+,
>>>>> whatever percentage of the block size that is, unless you want to play with
>>>>> the RocksDB level size and multiplier, which you probably don't.
>>>>>
>>>>> Regards,
>>>>> Frédéric.
>>>>>
>>>>> [1] https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>>>>> [2] https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-sizing-considerations
>>>>> [3] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
>>>>>
>>>>>>>> What am I missing to get these changes to be permanent?
>>>>>>>
>>>>>>> Likely just an issue with the order of execution. But there is an easier
>>>>>>> way to do the move. See:
>>>>>>> https://docs.ceph.com/en/quincy/ceph-volume/lvm/migrate/
>>>>>>
>>>>>> Ah, excellent! I didn't find that in my searches. Will try that now.
>>>>>>
>>>>>> regards
>>>>>> Roland
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Alwin
>>>>>>>
>>>>>>> --
>>>>>>> Alwin Antreich
>>>>>>> Head of Training and Proxmox Services
>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>> CEO: Martin Verges, Andy Muthmann - VAT-ID: DE310638492
>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>> Web: https://croit.io/
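For completeness, the ceph-volume migrate approach Alwin pointed to could look roughly as follows; this is only a sketch, with the OSD ID, fsid and target LV as placeholders (the LV name is taken from the lvcreate example above), and the OSD has to be stopped first:

# systemctl stop ceph-osd@4
# ceph-volume lvm list
    (note the osd fsid and the LV currently backing block.db for osd.4)
# ceph-volume lvm migrate --osd-id 4 --osd-fsid <osd-fsid> --from db --target NodeA-nvme0/NodeA-nvme-LV-RocksDB1
# systemctl start ceph-osd@4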
>>>>
>>>> --
>>>> Alexander Patrakov
>>
>> --
>> Alexander Patrakov

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
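Two reference notes on the points discussed above, both sketches rather than authoritative procedures:

The oft-quoted ~30GB / ~300GB figures come from RocksDB's level sizing: with the usual defaults (max_bytes_for_level_base = 256MB, max_bytes_for_level_multiplier = 10), levels L1+L2+L3 sum to roughly 28GB and adding L4 brings the total to roughly 280GB, so before the use_some_extra policy a block.db sized between those sums could not be fully used. Whether that policy is active on a given OSD can be checked with, e.g.:

# ceph daemon osd.4 config get bluestore_volume_selection_policy

Whether an OSD's RocksDB is already sharded can be checked, and resharding applied, with ceph-bluestore-tool while the OSD is stopped; the sharding spec below is the default given in the resharding documentation linked above, and /var/lib/ceph/osd/ceph-4 is an example path:

# systemctl stop ceph-osd@4
# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-4 show-sharding
# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-4 \
      --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
# systemctl start ceph-osd@4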