On Tue, Jan 22, 2013 at 10:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> We observed an interesting situation over the weekend. The XFS volume
> ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
> minutes. After 3 minutes (180s), ceph-osd gave up waiting and committed
> suicide. XFS seemed to unwedge itself a bit after that, as the daemon was
> able to restart and continue.
>
> The problem is that during that 180s the OSD was claiming to be alive but
> not able to do any IO. That heartbeat check is meant as a sanity check
> against a wedged kernel, but waiting so long meant that the ceph-osd
> wasn't failed by the cluster quickly enough and client IO stalled.
>
> We could simply change that timeout to something close to the heartbeat
> interval (currently default is 20s). That will make ceph-osd much more
> sensitive to fs stalls that may be transient (high load, whatever).
>
> Another option would be to make the osd heartbeat replies conditional on
> whether the internal heartbeat is healthy. Then the heartbeat warnings
> could start at 10-20s, ping replies would pause, but the suicide could
> still be 180s out. If the stall is short-lived, pings will continue, the
> osd will mark itself back up (if it was marked down) and continue.
>
> Having written that out, the last option sounds like the obvious choice.
> Any other thoughts?
>
> sage

By the way, is it worth preventing the situation shown in the log excerpt at the end of this message, where one host, buried in soft lockups, marks almost all of its neighbors down for a short time? It is mostly harmless, because all of the OSDs rejoin the cluster within the next couple of heartbeats, but all I/O stalls during that window rather than just the operations on PGs hosted by the failing OSD. So it may be useful to introduce some kind of down-mark quorum for such cases.
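Also, regarding the conditional-reply option above: a minimal sketch of what I understand the idea to be, with the ping handler consulting the internal heartbeat before answering. The names and structure here are illustrative only, not the actual ceph-osd code paths, and the 20s/180s values are just the defaults mentioned above:

// Illustrative sketch only, not the real OSD ping handler. The idea: a
// wedged fs makes us go silent on osd pings almost immediately, so peers
// report us failed quickly, while the 180s suicide timeout stays put.
#include <cstdlib>
#include <ctime>

struct InternalHeartbeat {
  time_t last_progress = 0;  // last time worker threads reported progress
  int grace = 20;            // stop answering pings after this many seconds
  int suicide_grace = 180;   // give up and abort after this many seconds

  bool is_healthy(time_t now) const {
    return (now - last_progress) < grace;
  }
  bool past_suicide(time_t now) const {
    return (now - last_progress) >= suicide_grace;
  }
};

// Called for each incoming osd ping (message/peer arguments elided).
void handle_osd_ping(const InternalHeartbeat &hb) {
  time_t now = time(nullptr);
  if (hb.past_suicide(now)) {
    // stuck for >= 180s: keep today's behaviour and commit suicide
    abort();
  }
  if (!hb.is_healthy(now)) {
    // internal heartbeat unhealthy (e.g. fs stall): drop the reply so
    // peers report us failed and the mon marks us down promptly
    return;
  }
  // healthy again: reply as usual so we stay (or get marked back) up
  // send_ping_reply(...);
}

Here is the log excerpt from the incident I mentioned: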
2013-01-22 14:40:31.481174 mon.0 [INF] osd.0 10.5.0.10:6800/6578 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481085 >= grace 20.000000)
2013-01-22 14:40:31.481293 mon.0 [INF] osd.1 10.5.0.11:6800/6488 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481228 >= grace 20.000000)
2013-01-22 14:40:31.481410 mon.0 [INF] osd.2 10.5.0.12:6803/7561 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481355 >= grace 20.000000)
2013-01-22 14:40:31.481522 mon.0 [INF] osd.4 10.5.0.14:6803/5697 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481467 >= grace 20.000000)
2013-01-22 14:40:31.481641 mon.0 [INF] osd.6 10.5.0.16:6803/5679 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481586 >= grace 20.000000)
2013-01-22 14:40:31.481746 mon.0 [INF] osd.8 10.5.0.10:6803/6638 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481700 >= grace 20.000000)
2013-01-22 14:40:31.481863 mon.0 [INF] osd.9 10.5.0.11:6803/6547 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481811 >= grace 20.000000)
2013-01-22 14:40:31.481976 mon.0 [INF] osd.10 10.5.0.12:6800/7019 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481916 >= grace 20.000000)
2013-01-22 14:40:31.482077 mon.0 [INF] osd.12 10.5.0.14:6800/5637 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482022 >= grace 20.000000)
2013-01-22 14:40:31.482184 mon.0 [INF] osd.14 10.5.0.16:6800/5620 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482130 >= grace 20.000000)
2013-01-22 14:40:31.482334 mon.0 [INF] osd.17 10.5.0.31:6800/5854 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482275 >= grace 20.000000)
2013-01-22 14:40:31.482436 mon.0 [INF] osd.18 10.5.0.32:6800/5981 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482389 >= grace 20.000000)
2013-01-22 14:40:31.482539 mon.0 [INF] osd.19 10.5.0.33:6800/5570 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482489 >= grace 20.000000)
2013-01-22 14:40:31.482667 mon.0 [INF] osd.20 10.5.0.34:6800/5643 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482620 >= grace 20.000000)
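On the quorum idea itself: each failure above was declared on "3 reports from 1 peers", i.e. a single wedged host is enough to take its neighbors down. If I remember the mon options correctly (names from memory, please correct me if they have changed), requiring reports from more than one distinct reporter would already blunt this without any new machinery, e.g.:

[mon]
        ; require failure reports from at least two distinct OSDs before a
        ; peer is marked down (the log shows "from 1 peers", which I believe
        ; is the current default)
        mon osd min down reporters = 2
        ; total number of failure reports required (the "3 reports" above)
        mon osd min down reports = 3

That would not be a full quorum check, but it would at least stop a single soft-locked host from marking the whole neighborhood down on its own.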