On Tue, Jan 22, 2013 at 10:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> We observed an interesting situation over the weekend. The XFS volume
> ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
> minutes. After 3 minutes (180s), ceph-osd gave up waiting and committed
> suicide. XFS seemed to unwedge itself a bit after that, as the daemon was
> able to restart and continue.
>
> The problem is that during that 180s the OSD was claiming to be alive but
> not able to do any IO. That heartbeat check is meant as a sanity check
> against a wedged kernel, but waiting so long meant that the ceph-osd
> wasn't failed by the cluster quickly enough and client IO stalled.
>
> We could simply change that timeout to something close to the heartbeat
> interval (currently default is 20s). That will make ceph-osd much more
> sensitive to fs stalls that may be transient (high load, whatever).
>
> Another option would be to make the osd heartbeat replies conditional on
> whether the internal heartbeat is healthy. Then the heartbeat warnings
> could start at 10-20s, ping replies would pause, but the suicide could
> still be 180s out. If the stall is short-lived, pings will continue, the
> osd will mark itself back up (if it was marked down) and continue.
>
> Having written that out, the last option sounds like the obvious choice.
> Any other thoughts?
>
> sage

By the way, is it worth preventing the situation shown in the log excerpt at the end of this message, where one host, buried in soft lockups, marks almost all of its neighbors down for a short time? It is mostly harmless, because all of the OSDs rejoin the cluster within the next couple of heartbeats, but all I/O stalls during that window rather than just the operations on PGs hosted by the failing OSD. So it may be useful to introduce some kind of down-mark quorum for such cases.
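Also, regarding the conditional-reply option above: a minimal sketch of what I understand the idea to be, with the ping handler consulting the internal heartbeat before answering. The names and structure here are illustrative only, not the actual ceph-osd code paths, and the 20s/180s values are just the defaults mentioned above:

// Illustrative sketch only, not the real OSD ping handler. The idea: a
// wedged fs makes us go silent on osd pings almost immediately, so peers
// report us failed quickly, while the 180s suicide timeout stays put.
#include <cstdlib>
#include <ctime>

struct InternalHeartbeat {
  time_t last_progress = 0;  // last time worker threads reported progress
  int grace = 20;            // stop answering pings after this many seconds
  int suicide_grace = 180;   // give up and abort after this many seconds

  bool is_healthy(time_t now) const {
    return (now - last_progress) < grace;
  }
  bool past_suicide(time_t now) const {
    return (now - last_progress) >= suicide_grace;
  }
};

// Called for each incoming osd ping (message/peer arguments elided).
void handle_osd_ping(const InternalHeartbeat &hb) {
  time_t now = time(nullptr);
  if (hb.past_suicide(now)) {
    // stuck for >= 180s: keep today's behaviour and commit suicide
    abort();
  }
  if (!hb.is_healthy(now)) {
    // internal heartbeat unhealthy (e.g. fs stall): drop the reply so
    // peers report us failed and the mon marks us down promptly
    return;
  }
  // healthy again: reply as usual so we stay (or get marked back) up
  // send_ping_reply(...);
}

Here is the log excerpt from the incident I mentioned: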
2013-01-22 14:40:31.481174 mon.0 [INF] osd.0 10.5.0.10:6800/6578 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481085 >= grace 20.000000)
2013-01-22 14:40:31.481293 mon.0 [INF] osd.1 10.5.0.11:6800/6488 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481228 >= grace 20.000000)
2013-01-22 14:40:31.481410 mon.0 [INF] osd.2 10.5.0.12:6803/7561 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481355 >= grace 20.000000)
2013-01-22 14:40:31.481522 mon.0 [INF] osd.4 10.5.0.14:6803/5697 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481467 >= grace 20.000000)
2013-01-22 14:40:31.481641 mon.0 [INF] osd.6 10.5.0.16:6803/5679 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481586 >= grace 20.000000)
2013-01-22 14:40:31.481746 mon.0 [INF] osd.8 10.5.0.10:6803/6638 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481700 >= grace 20.000000)
2013-01-22 14:40:31.481863 mon.0 [INF] osd.9 10.5.0.11:6803/6547 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481811 >= grace 20.000000)
2013-01-22 14:40:31.481976 mon.0 [INF] osd.10 10.5.0.12:6800/7019 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481916 >= grace 20.000000)
2013-01-22 14:40:31.482077 mon.0 [INF] osd.12 10.5.0.14:6800/5637 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482022 >= grace 20.000000)
2013-01-22 14:40:31.482184 mon.0 [INF] osd.14 10.5.0.16:6800/5620 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482130 >= grace 20.000000)
2013-01-22 14:40:31.482334 mon.0 [INF] osd.17 10.5.0.31:6800/5854 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482275 >= grace 20.000000)
2013-01-22 14:40:31.482436 mon.0 [INF] osd.18 10.5.0.32:6800/5981 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482389 >= grace 20.000000)
2013-01-22 14:40:31.482539 mon.0 [INF] osd.19 10.5.0.33:6800/5570 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482489 >= grace 20.000000)
2013-01-22 14:40:31.482667 mon.0 [INF] osd.20 10.5.0.34:6800/5643 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482620 >= grace 20.000000)
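On the quorum idea itself: each failure above was declared on "3 reports from 1 peers", i.e. a single wedged host is enough to take its neighbors down. If I remember the mon options correctly (names from memory, please correct me if they have changed), requiring reports from more than one distinct reporter would already blunt this without any new machinery, e.g.:

[mon]
        ; require failure reports from at least two distinct OSDs before a
        ; peer is marked down (the log shows "from 1 peers", which I believe
        ; is the current default)
        mon osd min down reporters = 2
        ; total number of failure reports required (the "3 reports" above)
        mon osd min down reports = 3

That would not be a full quorum check, but it would at least stop a single soft-locked host from marking the whole neighborhood down on its own.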