Node failure -- corrupt memory

Shawn Iverson <iversons@xxxxxxxxxxxxxxxxxxx> · Mon, 11 Nov 2019 08:00:13 -0500

Hello Cephers!
Over the weekend one of my nodes went haywire, apparently because of failed memory modules and/or a bad motherboard.

As a result, several OSDs blocked IO for more than 128 s (in practice, indefinitely).

I was not watching my alerts closely over the weekend, or I might have caught it early. Every server in the cluster that relies on Ceph stalled on the blocked IO from the failing node, and each one had to be restarted after I took the faulty node offline.

So my question is: is there a way to tell Ceph to start marking OSDs out when an IO blockage exceeds a certain limit, or does doing so carry enough risk that I would be better off dealing with a stalled cluster?
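
To make the question concrete, here is a rough Python sketch of the kind of policy I have in mind. It is not an existing Ceph feature: the function names, the threshold, and the cap on how many OSDs may be marked out at once are all hypothetical placeholders, and collecting the blocked duration per OSD is deliberately left out. The only real Ceph piece is the `ceph osd out <id>` command, which the sketch builds but does not run.

```python
from typing import Dict, List


def osds_to_mark_out(blocked_secs: Dict[int, float],
                     threshold_secs: float,
                     max_out: int) -> List[int]:
    """Pick OSD ids whose IO has been blocked longer than threshold_secs.

    blocked_secs maps OSD id -> seconds its oldest request has been blocked
    (how that is measured is an assumption outside this sketch).
    max_out caps the selection so a cluster-wide stall cannot mark
    every OSD out at once.
    """
    over = [(secs, osd) for osd, secs in blocked_secs.items()
            if secs > threshold_secs]
    # Longest-blocked first; ties broken by OSD id for a stable order.
    over.sort(key=lambda pair: (-pair[0], pair[1]))
    return [osd for _, osd in over[:max(0, max_out)]]


def out_commands(osd_ids: List[int]) -> List[List[str]]:
    """Build (but do not execute) one `ceph osd out <id>` argv per OSD."""
    return [["ceph", "osd", "out", str(osd)] for osd in osd_ids]


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    blocked = {3: 512.0, 7: 40.0, 11: 2048.0, 12: 130.0}
    chosen = osds_to_mark_out(blocked, threshold_secs=128.0, max_out=2)
    for argv in out_commands(chosen):
        print(" ".join(argv))
```

The cap is there because of the risk I am asking about: if the blockage is cluster-wide rather than limited to one bad node, marking everything out automatically could make things worse.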

-- 
Shawn Iverson, CETL
Director of Technology
Rush County Schools
iversons@xxxxxxxxxxxxxxxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com