I have a cluster spread across 2 racks, with a CRUSH rule that splits data across those racks. To test a failure scenario we powered off one of the racks, expecting Ceph to keep running.

Of the 56 OSDs that were powered off, 52 were quickly marked down in the cluster (it took around 30 seconds), but the remaining 4, all on different hosts, only went down after the full 900s, with the "marked down after no pg stats for 900.118248seconds" message.

Now for some questions: Is it expected that some OSDs don't get marked down as quickly as they should? I/O was happening to the cluster before, during and after this event. Should we reduce the 900s timeout to something much lower? If so, how can we make sure beforehand that it isn't too low, and how low should we go? (What we're thinking of trying is sketched in the PS below.)

Once those 900 seconds (15 minutes) had elapsed, the cluster resumed all I/O and came back to life as expected. The big issue is that having these 4 OSDs down, but not yet marked down, caused almost all I/O to this cluster to stop, especially as the OSDs built up bigger and bigger queues of slow requests.

Another issue came up during recovery: after we powered those servers back on and started Ceph on those nodes, the cluster just ground to a halt for a while. "osd perf" showed all commit/apply latencies below 100 ms, slow requests were at the same level on all OSDs, the nodes had no iowait, CPUs were idle, and load average was under 0.5, so we couldn't find anything that pointed to a culprit. However, one of the OSDs timed out on a "tell osd.* version", and restarting that OSD made the cluster responsive again. Any idea how to detect this kind of situation? (The per-OSD probe we're considering is also sketched in the PS below.)

This cluster is running hammer (0.94.5) and has 112 OSDs, 56 in each rack.

thanks,
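
PS: For the 900s timeout, we're assuming the message comes from mon_osd_report_timeout (which defaults to 900 seconds); please correct me if that's the wrong knob. A minimal sketch of how we'd check and lower it, where "mon.a" is just a placeholder for one of our monitors:

    # show the current value on one of the monitors ("a" is a placeholder id)
    ceph daemon mon.a config show | grep mon_osd_report_timeout

    # try a lower value at runtime on all monitors (not persisted across restarts)
    ceph tell mon.* injectargs '--mon_osd_report_timeout 300'

    # if it works out, persist it in ceph.conf under [mon]:
    #   mon osd report timeout = 300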
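
PPS: For catching an unresponsive OSD like the one that hung "tell osd.* version", this is the kind of per-OSD probe we're thinking of running periodically. It assumes GNU timeout is available, and the 10s threshold is an arbitrary choice on our part:

    # ask every OSD for its version individually, with a short timeout;
    # an OSD that doesn't answer in time is a candidate for the stall we hit
    for id in $(ceph osd ls); do
        timeout 10 ceph tell osd.$id version > /dev/null 2>&1 \
            || echo "osd.$id did not answer within 10s"
    done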