Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()

Sage Weil <sage@xxxxxxxxxxxx> · Fri, 25 Mar 2016 08:54:21 -0400 (EDT)

On Thu, 24 Mar 2016, Gregory Farnum wrote:
> On Thu, Mar 24, 2016 at 2:15 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >
> > Ultimately, this is about shrinking the time it takes for a MON to
> > notice the "oops".  Do we expect those things to be common and frequent
> > enough to justify an external daemon, however small and simple, on each
> > OSD node?
> 
> Let's not forget that extra daemons aren't free quite apart from
> having to build them. There's a lot of user education to happen.
> There's more stuff to install; we'll have extra cephx keys for them
> that need to get placed; we need to update all our install and
> management tools to set them up. We'll probably run into new kinds of
> resource exhaustion, and we'll hit new errors around the local
> communication setup. :/ I'm uneasy about creating *any* mechanism that
> automatically marks down OSDs, but isn't directed by the OSD in
> question.
> 
> Plus, I think there are other benefits of annotating our asserts more
> carefully. They're kind of a mess right now and if we were able to do
> more than crash on disk errors, it'd be nice when we move on to
> gathering statistics and things...

Yep, I'm sold!  :)

Going back to Igor's PR...

	https://github.com/ceph/ceph/pull/7740

I think perhaps the first thing to do is to add a function like the one
Ilya suggested, something like

	ceph_abort_markmedown()

and then sort out where/when to call it (instead of tackling signal 
handlers immediately).  It seems like the semantics need to be something 
like

 - queue the markdown message for the mon
 - wait for N seconds (where N=5 or so?)
 - ceph_abort()
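
Roughly, those three steps could be sketched as below.  This is only an
illustration, not code from the PR: abort_markmedown_sketch and
MarkMeDownHooks are made-up names, and the "queue the message" and "abort"
steps are passed in as callables so the ordering can be exercised without a
real mon connection or a real abort().

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical sketch, not Ceph code.  The two side effects are injected
// so the sequence can be tested in isolation.
struct MarkMeDownHooks {
  std::function<void()> queue_markdown;  // stand-in for queueing the mon message
  std::function<void()> do_abort;        // stand-in for ceph_abort()
};

inline void abort_markmedown_sketch(const MarkMeDownHooks& hooks,
                                    std::chrono::milliseconds grace) {
  assert(grace.count() >= 0);
  hooks.queue_markdown();              // 1. queue the markdown message
  std::this_thread::sleep_for(grace);  // 2. wait so the messenger can send it
  hooks.do_abort();                    // 3. abort whether or not it was sent
}
```

The wait is bounded and the abort is unconditional, so a broken messenger
only costs the grace period and never keeps the process alive.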

There are maybe three call sites that come to mind that will probably 
catch most issues:

 - the do_transaction (or equivalent) error code checks on write
 - a new helper that wraps up the checks/asserts about getting EIO on read
 - the internal heartbeat that goes off when a thread pool gets stuck
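
For the second item, the helper might look something like this.  Again a
sketch under assumptions: read_ok_or_fatal_eio is a hypothetical name, it
assumes reads return a negative errno on failure, and the on_eio hook stands
in for whatever fatal path we settle on.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical helper, not an existing Ceph function.  Routes EIO from a
// read to a fatal-error hook; other errors are left to the caller.
inline bool read_ok_or_fatal_eio(int r, const std::function<void()>& on_eio) {
  if (r == -EIO) {
    on_eio();      // e.g. mark-me-down, then abort
    return false;
  }
  return r >= 0;   // any other negative errno is reported but not fatal here
}
```

Having one helper means every read path treats EIO the same way instead of
each call site carrying its own assert.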

What else?

We could also go for an OSD signal handler, but it would have to be a
best-effort sort of thing (obviously it won't work if the messenger is
busted), and it worries me a bit.  What happens if there is a segv in the
memory allocator and we try to stay alive longer so that we can send
MarkMeDown?  We would continue processing some IO in the meantime, and
could let something corrupt reach disk or clients, or otherwise get worse
and propagate.

sage