Node failure -- corrupt memory

Hello Cephers!

Over the weekend, one of my nodes went haywire, apparently because of failed memory modules and/or a bad motherboard.

This resulted in several OSDs blocking IO for more than 128s (in practice, indefinitely).

I was not watching my alerts closely over the weekend, or I might have caught it early. All the servers in the cluster that rely on Ceph stalled on the blocked IO from the failing node, and they had to be restarted after I took the faulty node offline.

So my question is: is there a way to tell Ceph to start marking OSDs out when an IO blockage exceeds a certain limit, or does doing so carry enough risk that I would be better off dealing with a stalled cluster?

--
Shawn Iverson, CETL
Director of Technology
Rush County Schools

