I have a cluster spread across 2 racks, with a CRUSH rule that splits data across those racks. To test a failure scenario we powered off one of the racks, expecting Ceph to keep running.

Of the 56 OSDs that were powered off, 52 were quickly marked down in the cluster (it took around 30 seconds), but the remaining 4, all on different hosts, only went down after the full 900s, with the "marked down after no pg stats for 900.118248seconds" message.

Now for some questions: Is it expected that some OSDs don't get marked down as quickly as they should? I/O was happening to the cluster before, during and after this event. Should we reduce the 900s timeout to something much lower? If so, how can we make sure beforehand that it isn't too low, and how low should we go? (What we're thinking of trying is sketched in the PS below.)

Once those 900 seconds (15 minutes) had elapsed, the cluster resumed all I/O and came back to life as expected. The big issue is that having these 4 OSDs down, but not yet marked down, caused almost all I/O to this cluster to stop, especially as the OSDs built up bigger and bigger queues of slow requests.

Another issue came up during recovery: after we powered those servers back on and started Ceph on those nodes, the cluster just ground to a halt for a while. "osd perf" showed all commit/apply latencies below 100 ms, slow requests were at the same level on all OSDs, the nodes had no iowait, CPUs were idle, and load average was under 0.5, so we couldn't find anything that pointed to a culprit. However, one of the OSDs timed out on a "tell osd.* version", and restarting that OSD made the cluster responsive again. Any idea how to detect this kind of situation? (The per-OSD probe we're considering is also sketched in the PS below.)

This cluster is running hammer (0.94.5) and has 112 OSDs, 56 in each rack.

thanks,
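
PS: For the 900s timeout, we're assuming the message comes from mon_osd_report_timeout (which defaults to 900 seconds); please correct me if that's the wrong knob. A minimal sketch of how we'd check and lower it, where "mon.a" is just a placeholder for one of our monitors:

    # show the current value on one of the monitors ("a" is a placeholder id)
    ceph daemon mon.a config show | grep mon_osd_report_timeout

    # try a lower value at runtime on all monitors (not persisted across restarts)
    ceph tell mon.* injectargs '--mon_osd_report_timeout 300'

    # if it works out, persist it in ceph.conf under [mon]:
    #   mon osd report timeout = 300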
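
PPS: For catching an unresponsive OSD like the one that hung "tell osd.* version", this is the kind of per-OSD probe we're thinking of running periodically. It assumes GNU timeout is available, and the 10s threshold is an arbitrary choice on our part:

    # ask every OSD for its version individually, with a short timeout;
    # an OSD that doesn't answer in time is a candidate for the stall we hit
    for id in $(ceph osd ls); do
        timeout 10 ceph tell osd.$id version > /dev/null 2>&1 \
            || echo "osd.$id did not answer within 10s"
    done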