Hi Anthony,
On 3/11/25 16:45, Anthony D'Atri wrote:
Hi,
I want to set up a 3-node Ceph cluster with the fault domain configured to "host".
Each node should be equipped with:
6x SAS3 HDD 12TB
1x SAS3 SSD 7TB (to be extended to 2x 7TB later)
Is this existing hardware you’re stuck with? If not, don’t waste your money with SAS. SAS generally requires you to add a PCIe HBA, which often comes with expensive and brittle RAID functionality.
SAS SSDs don’t cost more than NVMe SSDs if you procure carefully. Buying NVMe-only chassis can in fact cost LESS up front than SAS-capable chassis.
With only 18 OSDs, each a large slow HDD, do you have any performance expectation at all?
Yes, I already have two servers with SAS3 backplanes, up to 12x 3.5" bays and currently 1Gbit ethernet links. Each of these servers operates on a local ZFS raidz2 storage pool of 6 SATA disks, running various KVM VMs via libvirt. The total required space (for now) is ~35TiB (including another very old fileserver whose content should be moved to CephFS). Theoretically, if one server dies, the other server has enough power to run twice the number of VMs, but as I don't have a separate storage network, I lose the VM disk images if a server dies (I have backups, but recovery takes time and they may be ~24h old).
So my idea is to extend this setup with a third server, upgrade the network to 10Gbit, use LACP to get up to 20Gbit per server, and fill the remaining six empty disk bays with SAS HDDs. After the migration, the old ZFS pools and SATA disks are removed, so I have space for extending the cluster later (and for putting some SAS SSDs into it if I need more IOPS for some pools). The hybrid solution would save me some money, but now I think it would be better to create a pool and CRUSH rule that exclusively use SSDs, to gain not only read but also write performance (rough sketch below).
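Something along these lines is what I have in mind for the device-class based rules; rule and pool names are just placeholders and the PG counts are rough guesses, so treat it as an untested sketch:

    # replicated rules restricted to one device class, failure domain "host"
    ceph osd crush rule create-replicated rule-hdd default host hdd
    ceph osd crush rule create-replicated rule-ssd default host ssd

    # example pools bound to those rules
    ceph osd pool create rbd-slow 256 256 replicated rule-hdd
    ceph osd pool create rbd-fast 64 64 replicated rule-ssd

With device classes I also wouldn't need to maintain a separate CRUSH hierarchy for HDDs and SSDs.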
35 TiB + 20% + some free space for more VMs is out of my budget with NVMe-only disks. I'm calculating with a total of ~60TB raw storage per node to have enough reserve, and an NVMe-only setup would be too expensive (it would also require three new servers).
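For what it's worth, my own back-of-the-envelope math behind the ~60TB figure (rounded freely):

    35 TiB * 1.2 (20% reserve)   ≈ 42 TiB usable needed
    42 TiB * 3 (size=3)          ≈ 126 TiB ≈ 139 TB raw in total
    139 TB / 3 nodes             ≈ 46 TB raw per node as a bare minimum
    3 * 60 TB raw / size 3       ≈ 54 TiB usable, i.e. ~12 TiB of headroom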
I don't expect any huge performance gain from switching from ZFS raidz2 to a 3-node Ceph setup; I want it because of the much better fail-over behaviour if one server dies.
The Ceph configuration should be size=3, min_size=2. All nodes are connected with 2x 10Gbit (LACP).
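For reference, this is how I would apply those values; "rbd-fast" is just an example pool name and the sketch is untested:

    # cluster-wide defaults for newly created pools
    ceph config set global osd_pool_default_size 3
    ceph config set global osd_pool_default_min_size 2

    # or explicitly per pool
    ceph osd pool set rbd-fast size 3
    ceph osd pool set rbd-fast min_size 2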
I want to use different CRUSH rules for different pools. CephFS and low-priority/low-IO VMs stored on RBD should use only HDD drives with the default replication CRUSH rule.
For high priority VMs, I want to create another RBD data pool which uses a modified CRUSH replication rule:
# Hybrid storage policy
rule hybrid {
    ruleset 2
    type replicated
    step take ssd
    step chooseleaf firstn 1 type host
    step emit
    step take hdd
    step chooseleaf firstn -1 type host
    step emit
}
For pools using this hybrid rule, PGs are stored on one SSD (primary) and two HDD (secondary) devices.
I believe the upstream docs have an example of such a CRUSH rule; I'm not sure if it's identical to what you list above. Note that you would want to ensure that primary affinity is limited to the SSD OSDs.
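Something along these lines, untested, assuming your SSD and HDD OSDs carry the usual device classes:

    # make the HDD OSDs ineligible to act as primaries,
    # so the SSD copy in each PG serves as the primary
    for id in $(ceph osd crush class ls-osd hdd); do
        ceph osd primary-affinity osd.$id 0
    done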
Do note that performance with only 3 SSD OSDs is not going to be terrific; it might even be lower than that of a pool using the HDD OSDs, which at least are more numerous. Also note that with this strategy your writes will not be any faster than with the HDD-only pool, and may well be slower, since a write is only acknowledged once all replicas, including the two HDD copies, have persisted it.
But these have different sizes in my hardware setup. What happens with the remaining disk space (12TB - 7TB = 5TB) on the secondary devices? Is it just unusable,
It will be shared with your “low priority” pools.
or will Ceph use it for other pools with the default replication rule? In any case, I don't care much about these 5TB, I just want to know how it works. For the above setup, can you recommend any important configuration settings, and should I modify the OSD weighting? Thanks.
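If it helps: once the cluster is up I would mostly judge this by watching per-OSD utilization and CRUSH weights rather than changing weights up front, e.g.:

    # per-OSD size, usage, weight and reweight, grouped by the CRUSH tree
    ceph osd df tree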
--
Best regards / Mit freundlichen Grüßen
Daniel Vogelbacher
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx