Hi Everyone,
We have lately been seeing a recurring pattern: when a disk fails on Ceph, the OSD gets marked down, even though the disk itself might not be fully dead yet, and the systemd ceph-osd process still shows up as running.
Trying to kill the process doesn't work, and if the machine is rebooted it takes a long time to go down; writes to the cluster stall for a good 10-15 minutes, and eventually the machine just shuts itself off. The commands below show what I typically try.
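(osd.12 is a made-up example ID here; my guess is the process is stuck in uninterruptible D state on the dying disk, which would explain why even kill -9 has no effect:)

    systemctl stop ceph-osd@12    # hangs, eventually times out
    ps aux | grep ceph-osd        # process is still there
    kill -9 <pid>                 # no effect; presumably blocked on disk I/O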
1 - My question is: do you face such conditions? Are there any best practices for handling disk maintenance without stalling writes to the Ceph cluster? Do you move the OSD out of the production CRUSH area, fix the disk, then push it back in and let everything rebalance, roughly as sketched below?
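To make the question concrete, this is the sequence I have in mind (a rough sketch only; osd.12 is again a made-up ID, and I'm assuming the noout flag is the right way to avoid needless rebalancing during a short maintenance window):

    ceph osd set noout            # keep CRUSH from marking OSDs out during maintenance
    systemctl stop ceph-osd@12    # stop the failing OSD
    # ... replace or repair the disk ...
    systemctl start ceph-osd@12   # bring the OSD back
    ceph osd unset noout          # allow normal out-marking again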
2 - Lastly, I wanted to know what happens when a machine goes down due to a forced shutdown: would some data in the Ceph journal be lost? As I understand it, a write goes to the OSD journal first, then to the OSD's data store, and from that OSD it gets replicated to the other two replicas. So if a machine is powered off in an unfriendly manner, would the data sitting in the journal partition be gone, causing data loss, or are journal writes synchronous?
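For context, this is how I've been checking the journal setup on our OSDs (osd.0 is just an example; I'm assuming the default admin socket location and a filestore layout):

    ceph daemon osd.0 config get journal_dio    # whether journal writes use direct I/O
    ls -l /var/lib/ceph/osd/ceph-0/journal      # where the journal actually lives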
Detailed answers are welcome and thanks in advance!
Ceph version is Jewel 10.2.10.
Thanks.
Regards,
Ossi