Hello Cephers!
Over the weekend I had a node go nuts, apparently because of failed/bad memory modules and/or a bad motherboard.
This resulted in several OSDs blocking IO for > 128s (indefinitely).
I was not watching my alerts closely over the weekend, or I might have caught it early. Every server in the cluster that relies on Ceph stalled on the blocked IO from the failing node, and they had to be restarted after I took the faulty node offline.
So my question is: is there a way to tell Ceph to start marking OSDs out when an IO blockage exceeds a certain limit, or are the risks of doing so great enough that I would be better off dealing with a stalled Ceph cluster?
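To make the question concrete, here is a rough sketch of the kind of external watchdog I have in mind, not something Ceph provides. It assumes `ceph health detail` prints lines of the form "N ops are blocked > X sec on osd.ID" (the wording differs between releases), and the names `osds_to_mark_out`, `mark_out`, `limit_sec` and `max_out` are made up for illustration. The cap on how many OSDs may be marked out at once is my guess at one way to limit the risk.

```python
import re
import subprocess

# Assumed line format in `ceph health detail` output; adjust for your release.
BLOCKED_RE = re.compile(r"(\d+) ops are blocked > ([\d.]+) sec on osd\.(\d+)")


def osds_to_mark_out(health_detail, limit_sec=120.0, max_out=2):
    """Return ids of OSDs whose blocked-op age exceeds limit_sec.

    Returns an empty list when more than max_out OSDs qualify, so a
    cluster-wide problem does not cause a mass mark-out.
    """
    worst = {}
    for match in BLOCKED_RE.finditer(health_detail):
        age = float(match.group(2))
        osd = int(match.group(3))
        # Keep the largest blocked age reported for each OSD.
        worst[osd] = max(age, worst.get(osd, 0.0))
    over = sorted(osd for osd, age in worst.items() if age > limit_sec)
    return over if len(over) <= max_out else []


def mark_out(osd_ids, dry_run=True):
    """Run `ceph osd out <id>` for each OSD; only print when dry_run is set."""
    for osd in osd_ids:
        cmd = ["ceph", "osd", "out", str(osd)]
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Run from cron or a monitoring hook, this would feed the text of `ceph health detail` to `osds_to_mark_out` and pass the result to `mark_out`, with `dry_run=True` until the behaviour is trusted.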