Re: Ceph with 3 nodes and hybrid storage policy: how to configure OSDs with different HDD and SSD sizes

Responses inline...

Quoting Daniel Vogelbacher <daniel@xxxxxxxxxxxxxx>:

On 3/12/25 08:32, Eugen Block wrote:
I meant that as long as your third node is offline, your PGs will stay degraded until the node comes back because there's no recovery target. Your cluster will serve I/O though, yes.

Even if another node fails, I still have one node in read-only mode while nodes 2 and 3 recover. Sure, this is not optimal, but it's okay for my use-case.

You won't have MON quorum if you lose a second node. Not sure what you mean by "one node in read-only mode": you need two intact replicas to serve I/O (min_size 2) and a MON quorum, so no, you can't lose two nodes.
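
If you want to double-check both conditions on a live cluster, something like this should do (the pool name "rbd-hybrid" is just an example):

# Show which MONs currently form the quorum:
ceph quorum_status -f json-pretty

# Show the replication settings of a given pool:
ceph osd pool get rbd-hybrid size
ceph osd pool get rbd-hybrid min_size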

Okay, after consulting the official docs again I think you are right, I/O is completely blocked, not only writes. I remember reading somewhere that when Ceph goes below min_size only writes are blocked, but I can't find a source for that anymore.

Somehow this rumour still survives; I'm not sure where it comes from, but just a few weeks ago a customer said the same, without any proof. ;-)

What would be the recovery scenario if a cluster goes below min_size? For example:

Failure scenario A: Node 3 burns down, only nodes 1+2 are alive, and during recovery (of a new node 3) a single OSD on node 1 fails because of bad blocks. What happens? Are any manual steps required to fix this additional fault, or will recovery of node 3 finish successfully so that I can replace the failed OSD on node 1 afterwards?

I guess it depends a bit on what exactly happens. If the OSD just dies immediately (no chance to rewrite the faulty sector to a healthy one), I/O will pause for the affected PGs since you dropped below min_size. You would need to set min_size temporarily to 1 to allow I/O and recovery. But never forget to set it back to 2!!!!
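
Roughly like this, per affected pool (the pool name again is just an example):

# Temporarily allow I/O and recovery with only one intact replica:
ceph osd pool set rbd-hybrid min_size 1

# ... wait for recovery/backfill to finish, watch "ceph -s" ...

# As soon as the PGs are active+clean again:
ceph osd pool set rbd-hybrid min_size 2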


Failure scenario B: Nodes 1+2 burn down, only node 3 is alive; the cluster is now below min_size=2 and all I/O is blocked. What steps are required to add two new nodes and start recovery? Do I need to temporarily set min_size=1 to allow recovery?

The cluster will be completely offline since you won't have MON quorum, assuming you have three MONs since anything else doesn't make much sense here. First you'll need to revive at least one more MON to reach quorum again in order to manage the cluster. Then you might recover with min_size 1 (temporarily!!!), but this is just the general idea. It really depends on what exactly goes wrong. That's why I suggest having at least one more node for OSDs to be able to recover.
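
As a rough sketch of the MON part (host names and paths are examples, this assumes a classic non-cephadm deployment, and you should back up the MON store before touching anything):

# Stop the surviving MON:
systemctl stop ceph-mon@node3

# Extract its current monmap:
ceph-mon -i node3 --extract-monmap /tmp/monmap

# Remove the dead MONs from the map:
monmaptool /tmp/monmap --rm node1 --rm node2

# Inject the modified map and start the MON again:
ceph-mon -i node3 --inject-monmap /tmp/monmap
systemctl start ceph-mon@node3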


Quoting Daniel Vogelbacher <daniel@xxxxxxxxxxxxxx>:

Hi Eugen,

On 3/11/25 16:48, Eugen Block wrote:
Hi Daniel,

the first thing to mention is, while min_size 2/size 3 is good, having only three nodes leaves the cluster without any options to recover in case of a node failure. So it's recommended to use at least four nodes.

What exactly do you mean by "without any options to recover"? From my understanding, with min_size=2 I can still operate the cluster with two healthy nodes in read-write mode during recovery of the third node. Even if another node fails, I still have one node in read-only mode while nodes 2 and 3 recover. Sure, this is not optimal, but it's okay for my use-case.

You have to be aware that the hybrid rule only gives you performance advantages for read requests (served from the primary OSD). A write is only completed when all replicas have acked it, so your clients will be waiting for the HDDs to ack.
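
By the way, as far as I know a rule with two "take" steps like yours can't be created with "ceph osd crush rule create-replicated", so it has to go into the CRUSH map manually, roughly like this (file and pool names are examples):

# Fetch and decompile the current CRUSH map:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# ... add the "hybrid" rule to crushmap.txt ...

# Recompile and inject it, then assign the rule to the pool:
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool set rbd-hybrid crush_rule hybrid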

The 5 TB are not wasted if you have other pools utilizing HDDs.
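
To see how the capacity is actually split across device classes and pools, you can check for example:

# Per-OSD utilization, grouped by the CRUSH tree (shows the hdd/ssd classes):
ceph osd df tree

# Per-pool usage and how much space is available to each pool:
ceph df detail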

Regards,
Eugen

Quoting Daniel Vogelbacher <daniel@xxxxxxxxxxxxxx>:

Hi,

I want to setup a 3-node Ceph cluster with fault domain configured to "host".

Each node should be equipped with:

6x SAS3 HDD 12TB
1x SAS3 SSD 7TB (to be extended to 2x 7TB later)

The Ceph configuration should be size=3, min_size=2. All nodes are connected with 2x10Gbit (LACP).

I want to use different CRUSH rules for different pools. CephFS and low-priority/low-I/O VMs stored on RBD should use only HDD drives with the default replication CRUSH rule.
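
Concretely, I plan something like this for the HDD-only pools (pool names and PG counts are just examples):

# Replicated rule restricted to the hdd device class:
ceph osd crush rule create-replicated replicated-hdd default host hdd

# Example pool for low-priority RBD images on HDDs only:
ceph osd pool create rbd-hdd 128 128 replicated replicated-hdd
ceph osd pool set rbd-hdd size 3
ceph osd pool set rbd-hdd min_size 2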

For high priority VMs, I want to create another RBD data pool which uses a modified CRUSH replication rule:

# Hybrid storage policy
rule hybrid {
    ruleset 2
    type replicated
    step take ssd
    step chooseleaf firstn 1 type host
    step emit
    step take hdd
    step chooseleaf firstn -1 type host
    step emit
}

For pools using this hybrid rule, PGs are stored on one SSD (primary) and two HDD (secondary) devices. But these have different sizes in my hardware setup. What happens with the remaining disk space (12-7=5 TB) on the secondary devices? Is it just unusable, or will Ceph use it for other pools with default replication? In any case, I don't care about these 5 TB, I just want to know how it works.

For the above setup, can you recommend any important configuration settings, and should I modify the OSD weighting?

Thanks.

--
Best regards / Mit freundlichen Grüßen
Daniel Vogelbacher


--
Best regards / Mit freundlichen Grüßen
Daniel Vogelbacher

--
Best regards / Mit freundlichen Grüßen
Daniel Vogelbacher


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



