Re: handling fs errors

Wido den Hollander <wido@xxxxxxxxx> · Tue, 22 Jan 2013 14:12:23 +0100

On 01/22/2013 07:12 AM, Yehuda Sadeh wrote:
On Mon, Jan 21, 2013 at 10:05 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
We observed an interesting situation over the weekend.  The XFS volume
ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
minutes.  After 3 minutes (180s), ceph-osd gave up waiting and committed
suicide.  XFS seemed to unwedge itself a bit after that, as the daemon was
able to restart and continue.

The problem is that during that 180s the OSD was claiming to be alive but
not able to do any IO.  That heartbeat check is meant as a sanity check
against a wedged kernel, but waiting so long meant that the ceph-osd
wasn't failed by the cluster quickly enough and client IO stalled.

We could simply change that timeout to something close to the heartbeat
interval (currently default is 20s).  That will make ceph-osd much more
sensitive to fs stalls that may be transient (high load, whatever).

Another option would be to make the osd heartbeat replies conditional on
whether the internal heartbeat is healthy.  Then the heartbeat warnings
could start at 10-20s, ping replies would pause, but the suicide could
still be 180s out.  If the stall is short-lived, pings will continue, the
osd will mark itself back up (if it was marked down) and continue.
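
A minimal sketch of that option, in Python for brevity rather than as real ceph-osd code. The names InternalHeartbeat and handle_ping are hypothetical, and the 20s grace is an assumed value picked from the 10-20s range above. Only the 180s suicide timeout comes from the observed behaviour.

```python
from dataclasses import dataclass


@dataclass
class InternalHeartbeat:
    """Tracks when the OSD's worker threads last made progress (hypothetical)."""
    grace: float = 20.0             # assumed: stop answering pings after this long
    suicide_timeout: float = 180.0  # from the thread: give up after 180s
    last_progress: float = 0.0

    def touch(self, now: float) -> None:
        # Called whenever an internal thread completes some work.
        self.last_progress = now

    def is_healthy(self, now: float) -> bool:
        return now - self.last_progress <= self.grace

    def should_suicide(self, now: float) -> bool:
        return now - self.last_progress > self.suicide_timeout


def handle_ping(hb: InternalHeartbeat, now: float) -> str:
    """Decide what to do with an incoming peer ping."""
    if hb.should_suicide(now):
        return "suicide"    # wedged for too long, abort the daemon
    if not hb.is_healthy(now):
        return "no-reply"   # stay alive, but let peers report us as failed
    return "reply"
```

With these values, a stall that clears before 180s lets replies resume on the next ping, so the osd can be marked up again without a restart.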

Having written that out, the last option sounds like the obvious choice.
Any other thoughts?

Another option would be to have the osd reply to the ping with some
health description.

Looking ahead, with more monitoring in place, that might be a good idea.

If an OSD simply stops sending heartbeats when its internal conditions
aren't met, you don't know what's going on.

If the heartbeat carried metadata saying "I'm here, but not in such
good shape", that could be reported back to the monitors.

Monitoring tools could read this out and send notifications/alerts
wherever they want.

Right now we assume I/O stalls completely, but the metadata could also
report high latency. If the latency goes over threshold X you can still
mark the OSD out temporarily, since it will impact clients, but some
information towards the monitor might be useful.
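
To make the idea concrete, here is a rough Python sketch, not an existing Ceph message format. PingReply, Health, build_reply, monitor_actions and the 500 ms default for "threshold X" are all made-up names and values for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Health(Enum):
    OK = "ok"
    DEGRADED = "degraded"  # "I'm here, but not in such good shape"
    STALLED = "stalled"    # I/O is not completing at all


@dataclass
class PingReply:
    """Hypothetical heartbeat reply that carries health metadata."""
    osd_id: int
    health: Health
    io_latency_ms: float


def build_reply(osd_id: int, io_latency_ms: float, stalled: bool,
                latency_threshold_ms: float = 500.0) -> PingReply:
    # latency_threshold_ms stands in for "threshold X"; 500 is an assumed value.
    if stalled:
        health = Health.STALLED
    elif io_latency_ms > latency_threshold_ms:
        health = Health.DEGRADED
    else:
        health = Health.OK
    return PingReply(osd_id, health, io_latency_ms)


def monitor_actions(reply: PingReply) -> List[str]:
    """What a monitor could do with the extra information."""
    if reply.health is Health.OK:
        return []
    if reply.health is Health.DEGRADED:
        # Still reachable: take it out temporarily and tell the operator why.
        return ["mark-out-temporarily", "alert"]
    return ["mark-down", "alert"]
```

The difference from simply going silent is that the monitor learns why the OSD is unhealthy and can pass that on to whatever alerting is in place.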

Wido

Yehuda
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
