Re: Moving devices to a different device class?




 Thank you! This is very helpful information and thanks for the specific advice for these drive types on choosing a 64KB min_alloc_size. I will do some more review as I believe they are likely at the 4KB min_alloc_size if that is the default for the `ssd` device-class.

  I will try to use a 64KB min_alloc_size, if I can set that for a new device-class, and then `destroy` each of these OSDs and recreate it with the better `min_alloc_size`. These steps could be done one by one for each OSD of this type before trying to create the new pool.

  If I cannot do that, then I would use `ceph osd crush rm-device-class osd.XX` and `ceph osd crush set-device-class qlc osd.XX` to individually reassign the drives to a new class with a simple name like `qlc`, to avoid issues with special characters in the class name. This could be done one by one while watching that the PGs rebalance to the other SSDs in the original pool.

 Thanks again,

On Tue, Oct 24, 2023 at 12:11 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
Ah, our old friend the P5316.

A few things to remember about these:

* 64KB IU means that you'll burn through endurance if you do a lot of writes smaller than that.  The firmware will try to coalesce smaller writes, especially if they're sequential.  You probably want to keep your RGW / CephFS index / metadata pools on other media.

* With Quincy or later and a reasonably recent kernel you can set bluestore_use_optimal_io_size_for_min_alloc_size to true and OSDs deployed on these should automatically be created with a 64KB min_alloc_size.  If you're writing a lot of objects smaller than, say, 256KB -- especially if using EC -- a more nuanced approach may be warranted.  ISTR that your data are large sequential files, so probably you can exploit this.  For sure you want these OSDs to not have the default 4KB min_alloc_size; that would result in lowered write performance and especially endurance burn.  The min_alloc_size cannot be changed after an OSD is created; instead one would need to destroy and recreate.
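A minimal sketch of the setting mentioned above, assuming it is applied cluster-wide to the `osd` config section before the new OSDs are created. The `CEPH` wrapper variable and the function names are illustrative helpers, not part of Ceph; by default the commands are only printed so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Dry-run by default: with CEPH unset, each command is printed, not executed.
# Set CEPH=ceph in the environment to run against a real cluster.
CEPH="${CEPH:-echo ceph}"

# Ask newly created OSDs to derive min_alloc_size from the device's optimal
# I/O size, then read the option back to confirm it was stored.
# Assumption: setting it on the "osd" section (all OSDs) is acceptable here.
enable_optimal_min_alloc() {
  $CEPH config set osd bluestore_use_optimal_io_size_for_min_alloc_size true
  $CEPH config get osd bluestore_use_optimal_io_size_for_min_alloc_size
}

# Dump the metadata of one OSD (by numeric id). On releases that report it,
# look for a bluestore_min_alloc_size field to see what the OSD was built with.
show_osd_metadata() {
  $CEPH osd metadata "$1"
}
```

Calling `enable_optimal_min_alloc` and then `show_osd_metadata 12` (12 is a placeholder id) prints the commands to review. Existing OSDs keep their old value until they are destroyed and recreated.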


On Oct 24, 2023, at 11:42, Matt Larson <larsonmattr@xxxxxxxxx> wrote:

I am looking to create a new pool that would be backed by a particular set
of drives that are larger nVME SSDs (Intel SSDPF2NV153TZ, 15TB drives).
Particularly, I am wondering about what is the best way to move devices
from one pool and to direct them to be used in a new pool to be created. In
this case, the documentation suggests I could want to assign them to a new
device-class and have a placement rule that targets that device-class in
the new pool.

If you're using cephadm / ceph orch you can craft an OSD spec that uses or ignores drives based on size or model.

Multiple pools can share OSDs, though for your use case you probably don't want that.

Currently the Ceph cluster has two device classes 'hdd' and 'ssd', and the
larger 15TB drives were automatically assigned to the 'ssd' device class
that is in use by a different pool. The `ssd` device classes are used in a
placement rule targeting that class.

The names of device classes are actually semi-arbitrary.  The above distinction is made on the basis of whether or not the kernel believes a given device to rotate.
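For the placement-rule side of this, a rough sketch of a replicated CRUSH rule restricted to a custom class, plus a pool that uses it. The root `default` and failure domain `host` are assumptions about the topology, and the pool name, rule name, and PG count in the usage note are placeholders. The `CEPH` wrapper and function name are illustrative only.

```shell
#!/usr/bin/env bash
# Dry-run by default: with CEPH unset, each command is printed, not executed.
# Set CEPH=ceph in the environment to run against a real cluster.
CEPH="${CEPH:-echo ceph}"

# create_class_pool POOL RULE CLASS PG_NUM
# Creates a replicated rule that only selects OSDs of CLASS, then a pool on it.
create_class_pool() {
  local pool="$1" rule="$2" class="$3" pg_num="$4"
  # Assumptions: CRUSH root "default", failure domain "host".
  $CEPH osd crush rule create-replicated "$rule" default host "$class"
  $CEPH osd pool create "$pool" "$pg_num" "$pg_num" replicated "$rule"
}
```

For example, `create_class_pool bigdata qlc_rule qlc 128` prints the two commands for a hypothetical pool named `bigdata`. The class must already contain OSDs before the rule can place any data.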

The documentation describes that I could set a device class for an OSD with
a command like:

`ceph osd crush set-device-class CLASS OSD_ID [OSD_ID ..]`

Class names can be arbitrary strings like 'big_nvme'.

or "qlc"

Before setting a new
device class on an OSD that already has an assigned device class, one should
first use `ceph osd crush rm-device-class osd.XX`.

Yep.  I suspect that's a guardrail to prevent inadvertently trampling.
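Put together, the remove-then-set sequence for a single OSD might look like the sketch below. The class name `qlc` and the OSD id are placeholders, and the `CEPH` wrapper and function name are illustrative helpers rather than anything Ceph provides.

```shell
#!/usr/bin/env bash
# Dry-run by default: with CEPH unset, each command is printed, not executed.
# Set CEPH=ceph in the environment to run against a real cluster.
CEPH="${CEPH:-echo ceph}"

# move_osd_class NEW_CLASS OSD_ID
# Clears the existing class on one OSD, assigns the new one, then lists the
# members of the new class so the change can be checked.
move_osd_class() {
  local new_class="$1" osd="osd.$2"
  $CEPH osd crush rm-device-class "$osd"
  $CEPH osd crush set-device-class "$new_class" "$osd"
  $CEPH osd crush class ls-osd "$new_class"
}
```

`move_osd_class qlc 12` prints the three commands for a hypothetical `osd.12`. Once run for real, any rule that targets `ssd` stops selecting that OSD, so its PGs will start moving.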

Can I proceed to directly remove these OSDs from the current device class
and assign to a new device class?

Carpe NAND!

Should they be moved one by one? What is
the way to safely protect data from the existing pool that they are mapped

Are there other SSDs in said existing pool?  If you reassign all of these, will there be enough survivors to meet replication policy and hold all the data?

One by one would be safe.  Doing more than one might be faster and more efficient, depending on your hardware and topology.  For sure you don't want to reassign more than one per CRUSH failure domain at a time (host, rack, depends on your setup).  If your topology, RAM, and clients are amenable, you could do all OSDs in a single failure domain at once, then proceed to the next only after all PGs are active+clean.
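One way to plan the "one failure domain at a time" approach is to group the OSDs by host and emit one batch per host. The sketch below only prints a plan; it assumes the failure domain is the host and that a hand-prepared list of `host osd_id` pairs is fed on stdin. The function name and input format are illustrative.

```shell
#!/usr/bin/env bash
# plan_batches NEW_CLASS
# Reads "host osd_id" pairs on stdin and prints, per host (in first-seen
# order), the commands to reclassify all of that host's OSDs in one batch.
# Nothing is executed; the output is a plan for the operator to review.
plan_batches() {
  local new_class="$1"
  awk -v cls="$new_class" '
    NF < 2 { next }                        # skip blank or malformed lines
    !($1 in osds) { order[++n] = $1 }      # remember hosts in first-seen order
    { osds[$1] = osds[$1] " osd." $2 }     # append this OSD to its host batch
    END {
      for (i = 1; i <= n; i++) {
        h = order[i]
        print "# failure domain: " h
        print "ceph osd crush rm-device-class" osds[h]
        print "ceph osd crush set-device-class " cls osds[h]
        print "# wait for all PGs to be active+clean before the next batch"
      }
    }'
}
```

For example, `printf 'hostA 1\nhostB 2\nhostA 3\n' | plan_batches qlc` puts `osd.1` and `osd.3` in the first batch and `osd.2` in the second. `ceph pg stat` or `ceph -s` can be used between batches to watch the PG states.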


Matt Larson, PhD
Madison, WI  53705 U.S.A.
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

