Folks,

New here - I searched the archives for this topic but could not find anything since 2018 or so, so I am starting a new thread. I am looking at the impact of node failures and found this doc:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/operations_guide/handling-a-node-failure

I have a few questions about it:

- IIUC, if a root SSD fails, there is pretty much no way to rebuild a new node with the same OSDs and avoid data shuffling - is this correct?

- If the hardware fails, I assume replacing the part and rebooting in time will bring the node back as it was - is this right?

- If the root drive fails, is there a way to bring up a new host with the same OSDs, in the same order, but with a different host name / IP address? FWIW we are using Rook, so I am wondering whether the CRUSH map can be configured with logical labels instead of host names for this purpose - is that possible? (I am evaluating whether I can bring a new node back up with the original host name itself - cloud K8s clusters, at least, make that impossible.) See sketches 1 and 2 at the end of this mail for what I have in mind.

- Assuming we use one shared SSD with partitions for WAL/metadata for the whole node: if this drive fails, I assume we have to recover the entire node - correct? I remember seeing a note that such a failure pretty much renders all the affected OSDs useless.

- Semi-related: what is the ideal ratio of WAL/metadata SSDs to OSDs? I remember seeing PDFs from Red Hat showing a 1:10 ratio, while the mailing list has references to 1:3 or 1:6. I am trying to figure out what the right number is (sketch 3 below shows the layout I am weighing).

Thanks.
Subu
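
P.S. To make the questions above a bit more concrete, here are a few untested sketches of what I have in mind. Device names, host names, OSD ids and weights are all placeholders I made up.

Sketch 1 - re-adopting the intact OSDs after the root drive is replaced and the OS reinstalled (assuming ceph.conf and the OSD keyrings have been restored first):

    # show the OSD LVs that ceph-volume can still find on the data disks
    ceph-volume lvm list
    # recreate the systemd units / tmpfs mounts and start those OSDs
    ceph-volume lvm activate --all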
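
Sketch 2 - pinning OSDs to a logical host bucket instead of the node's real hostname, based on my reading of the "crush location" option (again untested; "logical-node-a" and osd.12 are made up, and I am not sure yet how Rook would inject the ceph.conf override):

    # in ceph.conf (or the Rook-managed override) on the OSD node:
    #   [osd]
    #   crush location = host=logical-node-a root=default
    #   osd crush update on start = true

    # or move things by hand; 1.0 is a placeholder weight
    ceph osd crush add-bucket logical-node-a host
    ceph osd crush move logical-node-a root=default
    ceph osd crush create-or-move osd.12 1.0 host=logical-node-a root=default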
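
Sketch 3 - the shared WAL/DB layout I am weighing for the ratio question: one NVMe carved into DB/WAL volumes for six HDD OSDs:

    # dry run: report how ceph-volume would split the DB device
    ceph-volume lvm batch --bluestore --report \
        /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
        --db-devices /dev/nvme0n1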