Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()

Ilya Dryomov <idryomov@xxxxxxxxx> · Fri, 25 Mar 2016 15:30:51 +0100

On Fri, Mar 25, 2016 at 1:54 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 24 Mar 2016, Gregory Farnum wrote:
>> On Thu, Mar 24, 2016 at 2:15 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> >
>> > Ultimately, this is about shrinking the time it takes for a MON to
>> > notice the "oops".  Do we expect those things to be common and frequent
>> > enough to justify an external daemon, however small and simple, on each
>> > OSD node?
>>
>> Let's not forget that extra daemons aren't free quite apart from
>> having to build them. There's a lot of user education to happen.
>> There's more stuff to install; we'll have extra cephx keys for them
>> that need to get placed; we need to update all our install and
>> management tools to set them up. We'll probably run into new kinds of
>> resource exhaustion, and we'll hit new errors around the local
>> communication setup. :/ I'm uneasy about creating *any* mechanism that
>> automatically marks down OSDs, but isn't directed by the OSD in
>> question.
>>
>> Plus, I think there are other benefits of annotating our asserts more
>> carefully. They're kind of a mess right now and if we were able to do
>> more than crash on disk errors, it'd be nice when we move on to
>> gathering statistics and things...
>
> Yep, I'm sold!  :)
>
> Going back to Igor's PR...
>
>         https://github.com/ceph/ceph/pull/7740
>
> I think perhaps the first thing to do is to add a function like the one
> Ilya suggested,
>
>         ceph_abort_markmedown()
>
> and then sort out where/when to call it (instead of tackling signal
> handlers immediately).  It seems like the semantics need to be something
> like
>
>  - queue the markdown message for the mon
>  - wait for N seconds (where N=5 or so?)
>  - ceph_abort()

Is the wait there to let the message go out?  If so, maybe request
a MarkMeDown ack and have an N second Cond timeout?  Modifying
OSD::dispatch() or wiring it up through the service abstraction
shouldn't be hard - an ack would take a lot less than a second.
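
Roughly what I have in mind, as a standalone sketch and not actual Ceph
code: it uses plain std::mutex / std::condition_variable in place of
Ceph's Mutex/Cond, and MarkMeDownWaiter, request_markdown_and_wait(),
abort_markmedown_sketch() and the send_request callback are all
made-up names for illustration.  The assumption is that the dispatch
path calls handle_ack() when the mon's ack arrives.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdlib>
#include <functional>
#include <mutex>

// Hypothetical helper (not a Ceph class): remembers whether the mon has
// acked our MarkMeDown request and lets the aborting thread wait for it.
class MarkMeDownWaiter {
public:
  // Assumed to be called from the dispatch path when the ack arrives.
  void handle_ack() {
    {
      std::lock_guard<std::mutex> l(lock_);
      acked_ = true;
    }
    cond_.notify_all();
  }

  // Block until the ack arrives or the timeout expires.
  // Returns true if the ack was seen, false on timeout.
  bool wait_for_ack(std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    std::unique_lock<std::mutex> l(lock_);
    // The predicate form copes with spurious wakeups and with an ack
    // that arrived before we started waiting.
    return cond_.wait_for(l, timeout, [this] { return acked_; });
  }

private:
  std::mutex lock_;
  std::condition_variable cond_;
  bool acked_ = false;
};

// Queue the request (send_request is a stand-in for whatever actually
// sends MarkMeDown to the mon), then wait at most `timeout` for the ack.
inline bool request_markdown_and_wait(MarkMeDownWaiter& w,
                                      const std::function<void()>& send_request,
                                      std::chrono::milliseconds timeout) {
  send_request();
  return w.wait_for_ack(timeout);
}

// Sketch of the proposed semantics: try to get marked down, then abort
// regardless of whether the ack made it back in time.
[[noreturn]] inline void abort_markmedown_sketch(
    MarkMeDownWaiter& w,
    const std::function<void()>& send_request,
    std::chrono::milliseconds timeout) {
  (void)request_markdown_and_wait(w, send_request, timeout);
  std::abort();
}
```

With an ack, the common case returns as soon as the mon replies, and
the N second timeout only matters when the messenger or the mon is
unreachable.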

>
> There are maybe three call sites that come to mind that will probably
> catch most issues:
>
>  - the do_transaction (or equivalent) error code checks on write
>  - a new helper that wraps up the checks/asserts about getting EIO on read
>  - the internal heartbeat that goes off when a thread pool gets stuck
>
> What else?
>
> We could also go for an OSD signal handler, but it would have to be a
> best-effort sort of thing (obviously won't work if the messenger is
> busted), and it worries me a bit: what happens if there is a segv in the
> memory allocator, we try to stay alive longer so that we can send
> MarkMeDown, and as a result continue processing some IO but in the
> meantime let something corrupt reach disk or clients or otherwise get
> worse and propagate?

IMHO it's entirely unnecessary.  An "oops" assert should just abort() -
we are not the kernel, after all.

Thanks,

                Ilya