Hello Alexander,

Thank you for clarifying this point. The documentation was not very clear about the 'improvements'.

Does that mean that in the latest releases spillover no longer occurs between the two thresholds of 30GB and 300GB? Meaning block.db can be 80GB in size without spilling over, for example?

Cheers,
Frédéric.

----- On 12 Nov 24, at 13:32, Alexander Patrakov patrakov@xxxxxxxxx wrote:

> Hello Frédéric,
>
> The advice regarding 30/300 GB DB sizes is no longer valid. Since Ceph
> 15.2.8, due to the new default (bluestore_volume_selection_policy =
> use_some_extra), BlueStore no longer wastes the extra capacity of the
> DB device.
>
> On Tue, Nov 12, 2024 at 5:52 PM Frédéric Nass
> <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>>
>> ----- On 12 Nov 24, at 8:51, Roland Giesler roland@xxxxxxxxxxxxxx wrote:
>>
>> > On 2024/11/12 04:54, Alwin Antreich wrote:
>> >> Hi Roland,
>> >>
>> >> On Mon, Nov 11, 2024, 20:16 Roland Giesler <roland@xxxxxxxxxxxxxx> wrote:
>> >>
>> >>> I have Ceph 17.2.6 on a Proxmox cluster and want to replace some SSDs
>> >>> that are end of life. I have some spinners that have their journals on
>> >>> SSD. Each spinner has a 50GB SSD LVM partition, and I want to move each
>> >>> of those to new corresponding partitions.
>> >>>
>> >>> The new 4TB SSDs I have split into volumes with:
>> >>>
>> >>> # lvcreate -n NodeA-nvme-LV-RocksDB1 -L 47.69g NodeA-nvme0
>> >>> # lvcreate -n NodeA-nvme-LV-RocksDB2 -L 47.69g NodeA-nvme0
>> >>> # lvcreate -n NodeA-nvme-LV-RocksDB3 -L 47.69g NodeA-nvme0
>> >>> # lvcreate -n NodeA-nvme-LV-RocksDB4 -L 47.69g NodeA-nvme0
>> >>> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme1
>> >>> # lvcreate -n NodeA-nvme-LV-data -l 100%FREE NodeA-nvme0
>> >>>
>> >> I would caution against mixing DB/WAL partitions with other applications.
>> >> The performance profile may not be suited for shared use. And depending
>> >> on the use case, the ~48GB might not be big enough to avoid DB spillover.
>> >> See the current size when querying the OSD.
>> >
>> > I see a relatively small RocksDB and no WAL?
>> >
>> > ceph daemon osd.4 perf dump
>> > <snip>
>> >     "bluefs": {
>> >         "db_total_bytes": 45025845248,
>> >         "db_used_bytes": 2131755008,
>> >         "wal_total_bytes": 0,
>> >         "wal_used_bytes": 0,
>> > </snip>
>> >
>> > I have been led to understand that 4% is the high end and that it is only
>> > ever reached on very busy systems, if at all?
>>
>> Hi Roland,
>>
>> This is generally true, but it depends on what your cluster is used for.
>>
>> If your cluster is used for block (RBD) storage, then 1%-2% should be
>> enough. If your cluster is used for file (CephFS) and S3 (RGW) storage,
>> then you'd rather stay on the safe side and respect the 4% recommendation
>> [1][2], as these workloads make heavy use of block.db to store metadata.
>>
>> Now, percentage is one thing; level size is another. To avoid spillover
>> once block.db usage approaches 30GB, you'd better choose a block.db size
>> of 300GB+, whatever percentage of the block size that represents, if you
>> don't want to play with the RocksDB level sizes and multiplier [3], which
>> you probably don't.
>>
>> Regards,
>> Frédéric.
>>
>> [1] https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>> [2] https://www.ibm.com/docs/en/storage-ceph/7.1?topic=bluestore-sizing-considerations
>> [3] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
>>
>> >
>> >>> What am I missing to get these changes to be permanent?
>> >>>
>> >> Likely just an issue with the order of execution.
>> >> But there is an easier way to do the move. See:
>> >> https://docs.ceph.com/en/quincy/ceph-volume/lvm/migrate/
>> >
>> > Ah, excellent! I didn't find that in my searches. Will try that now.
>> >
>> > regards
>> >
>> > Roland
>> >
>> >> Cheers,
>> >> Alwin
>> >>
>> >> --
>> >> Alwin Antreich
>> >> Head of Training and Proxmox Services
>> >>
>> >> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> >> CEO: Martin Verges, Andy Muthmann - VAT-ID: DE310638492
>> >> Com. register: Amtsgericht Munich HRB 231263
>> >> Web: https://croit.io/
>
> --
> Alexander Patrakov
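
A quick way to confirm which behaviour a given OSD is actually using, and whether any BlueFS data has already spilled over to the slow device, is to query the cluster and the OSD directly. This is only a sketch against osd.4 from the thread above; bluestore_volume_selection_policy and the bluefs slow_total_bytes/slow_used_bytes counters exist in recent releases, but the exact output may vary with your version.

# cluster-wide default for the option Alexander mentions
ceph config get osd bluestore_volume_selection_policy

# value the running OSD is actually using (run on the node hosting osd.4)
ceph daemon osd.4 config get bluestore_volume_selection_policy

# spillover, if any, is reported as a BLUEFS_SPILLOVER health warning
ceph health detail | grep -i -A2 spillover

# per-OSD view: slow_used_bytes under "bluefs" should stay at 0
ceph daemon osd.4 perf dump | grep -E '"slow_(total|used)_bytes"'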
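
As for the move itself, the ceph-volume lvm migrate page Alwin links to is the tool for relocating an existing block.db onto a new LV. The following is only a sketch of the shape of that procedure: the OSD id, the <osd-fsid> placeholder and the VG/LV names are taken from the messages above, the systemd unit name assumes a standard ceph-osd@ deployment, and the exact invocation should be checked against the linked documentation before running it.

# stop the OSD whose DB is being moved
systemctl stop ceph-osd@4

# note the "osd fsid" reported for osd.4
ceph-volume lvm list

# move the BlueFS DB off the old 50GB LV onto one of the new LVs
ceph-volume lvm migrate --osd-id 4 --osd-fsid <osd-fsid> \
    --from db --target NodeA-nvme0/NodeA-nvme-LV-RocksDB1

# bring the OSD back up once the migration reports success
systemctl start ceph-osd@4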