Folks,

New here - I searched the archives for this topic but could not find anything since 2018 or so, so I am starting a new thread. I am looking at the impact of node failures and found this doc:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/operations_guide/handling-a-node-failure

I have a few questions about it:

- IIUC, if a root SSD fails, there is pretty much no way to rebuild a new node with the same OSDs and avoid data shuffling - is this correct?

- If the hardware fails, I assume replacing the part and rebooting in time will bring the node back as it was - is this right?

- If the root drive fails, is there a way to bring up a new host with the same OSDs, in the same order, but with a different host name / IP address? FWIW we are using Rook, so I am wondering whether the CRUSH map can be configured with logical labels instead of host names for this purpose - is that possible? (I am evaluating whether I can bring a new node back up with the original host name itself - cloud K8s clusters, at least, make that impossible.) See sketches 1 and 2 at the end of this mail for what I have in mind.

- Assuming we use one shared SSD with partitions for WAL/metadata for the whole node: if this drive fails, I assume we have to recover the entire node - correct? I remember seeing a note that such a failure pretty much renders all the affected OSDs useless.

- Semi-related: what is the ideal ratio of WAL/metadata SSDs to OSDs? I remember seeing PDFs from Red Hat showing a 1:10 ratio, while the mailing list has references to 1:3 or 1:6. I am trying to figure out what the right number is (sketch 3 below shows the layout I am weighing).

Thanks.
Subu
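
P.S. To make the questions above a bit more concrete, here are a few untested sketches of what I have in mind. Device names, host names, OSD ids and weights are all placeholders I made up.

Sketch 1 - re-adopting the intact OSDs after the root drive is replaced and the OS reinstalled (assuming ceph.conf and the OSD keyrings have been restored first):

    # show the OSD LVs that ceph-volume can still find on the data disks
    ceph-volume lvm list
    # recreate the systemd units / tmpfs mounts and start those OSDs
    ceph-volume lvm activate --all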
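
Sketch 2 - pinning OSDs to a logical host bucket instead of the node's real hostname, based on my reading of the "crush location" option (again untested; "logical-node-a" and osd.12 are made up, and I am not sure yet how Rook would inject the ceph.conf override):

    # in ceph.conf (or the Rook-managed override) on the OSD node:
    #   [osd]
    #   crush location = host=logical-node-a root=default
    #   osd crush update on start = true

    # or move things by hand; 1.0 is a placeholder weight
    ceph osd crush add-bucket logical-node-a host
    ceph osd crush move logical-node-a root=default
    ceph osd crush create-or-move osd.12 1.0 host=logical-node-a root=default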
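
Sketch 3 - the shared WAL/DB layout I am weighing for the ratio question: one NVMe carved into DB/WAL volumes for six HDD OSDs:

    # dry run: report how ceph-volume would split the DB device
    ceph-volume lvm batch --bluestore --report \
        /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
        --db-devices /dev/nvme0n1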