Hi, in our production cluster we have the following setup:
- 10 nodes
- 3 drives per server (so far), a mix of SSD and HDD (in different pools), plus NVMe
- dual 10G in LACP, linked to two different switches (Cisco vPC)
- OSDs, MONs and MGRs are colocated
- A + B power feeds, 2 ATSes (each receiving A+B): ATS1 and ATS2
- 2 PDU rails, each connected to an ATS (PDU1 = ATS1, PDU2 = ATS2)
- switches have dual PSUs and are connected to both rails
- Ceph nodes have a single power supply
- odd nodes (1, 3, 5, ...) are connected to PDU1
- even nodes (2, 4, 6, ...) are connected to PDU2
... I can provide a drawing if it helps :)
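To make that concrete, here is roughly what I imagine a power-aware hierarchy
would look like in the decompiled map, using the stock 'pdu' bucket type.
The bucket ids, host names and weights below are made up - this is just a
sketch of the idea, not our actual map:

    # sketch only - illustrative ids/weights, stock 'pdu' bucket type
    pdu pdu1 {
            id -10
            alg straw2
            hash 0  # rjenkins1
            item node1 weight 10.000
            item node3 weight 10.000
            item node5 weight 10.000
            item node7 weight 10.000
            item node9 weight 10.000
    }
    pdu pdu2 {
            id -11
            alg straw2
            hash 0  # rjenkins1
            item node2 weight 10.000
            item node4 weight 10.000
            item node6 weight 10.000
            item node8 weight 10.000
            item node10 weight 10.000
    }
    root default {
            id -1
            alg straw2
            hash 0  # rjenkins1
            item pdu1 weight 50.000
            item pdu2 weight 50.000
    }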
Now, the default CRUSH map ensures that multiple copies of the same object
won't end up on the same host, which is fine. But in case of a power
failure [1] of either an ATS or a PDU, we'd be losing half the nodes in the
cluster at the same time. How would I go about tuning our map so that, for a
3-copy replicated pool, we don't end up with all copies stored on hosts,
say, 5, 7 and 9?
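From my reading of the docs, I guess the rule would then have to look
something like this (again just a sketch on top of the 'pdu' buckets above,
untested - the rule id and min/max_size are placeholders):

    rule replicated_pdu {
            id 2
            type replicated
            min_size 1
            max_size 10
            step take default
            step choose firstn 2 type pdu
            step chooseleaf firstn 2 type host
            step emit
    }

If I understand the firstn semantics correctly, with size=3 that should put
two copies on distinct hosts under one PDU and the third under the other, so
losing an entire PDU still leaves at least one copy - but I'd appreciate
confirmation.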
And what about EC pools? We currently have 5+2 SSD pools - how would we
avoid losing availability in case of a power loss that takes 50% of the
servers offline?
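For context, our EC profiles were created more or less like this (from
memory, so treat the exact name and parameters as approximate):

    ceph osd erasure-code-profile set ec52-ssd \
            k=5 m=2 \
            crush-failure-domain=host \
            crush-device-class=ssd

With 7 chunks spread over 10 hosts and only a host-level failure domain, my
understanding is that losing 5 hosts could take out up to 5 chunks, which is
well past m=2 - hence the question.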
I've gone over
https://docs.ceph.com/docs/master/rados/operations/crush-map/
but I don't believe I'm at the stage where I'd dare make changes without
incurring a huge data migration (which probably can't be avoided anyway).
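What I was planning to do before touching anything live is to dry-run a
modified map offline, something along these lines (the rule id and replica
count are placeholders):

    # grab and decompile the current map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # edit crushmap.txt (add pdu buckets / rules), then recompile and dry-run
    crushtool -c crushmap.txt -o crushmap.new
    crushtool -i crushmap.new --test --rule 2 --num-rep 3 \
            --show-mappings --show-statistics

That should at least show whether the new rule places copies the way I
expect before I inject anything into the cluster.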
Any input appreciated.
Cheers,
Phil
[1] losing both power feeds at the same time is really hard to protect
against :)