Re: Ceph watchdog-like thing to reduce IO block during process goes down by abort()

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 24 Mar 2016 14:20:17 -0700

On Thu, Mar 24, 2016 at 2:15 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> Ultimately, this is about shrinking the time it takes for a MON to
> notice the "oops".  Do we expect those things to be common and frequent
> enough to justify an external daemon, however small and simple, on each
> OSD node?

Let's not forget that extra daemons aren't free, quite apart from
having to build them. There's a lot of user education that has to happen.
There's more stuff to install; we'll have extra cephx keys for them
that need to get placed; we need to update all our install and
management tools to set them up. We'll probably run into new kinds of
resource exhaustion, and we'll hit new errors around the local
communication setup. :/ I'm uneasy about creating *any* mechanism that
automatically marks down OSDs, but isn't directed by the OSD in
question.

Plus, I think there are other benefits to annotating our asserts more
carefully. They're kind of a mess right now, and being able to do
more than crash on disk errors would be nice when we move on to
gathering statistics and things...
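
To make "annotating our asserts" a bit more concrete, here is a rough
C++ sketch, not a proposal for the actual code. Every name in it
(FailureKind, Action, note_failure, notify_going_down,
ANNOTATED_ASSERT) is made up for illustration and not taken from the
Ceph tree. The assumption is that each assert carries a failure
category, the category is counted so it can feed statistics later, and
a small policy function decides whether the process should announce
that it is going down before it aborts. That keeps the mark-down
directed by the OSD itself rather than by an external daemon.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical failure categories; an illustration, not a real taxonomy.
enum class FailureKind { LogicBug, DiskError, ResourceExhaustion, Count };

// What the process should do once an annotated assert fails.
enum class Action { Abort, MarkSelfDownThenAbort };

inline std::size_t kind_index(FailureKind kind) {
  return static_cast<std::size_t>(kind);
}

// Per-category failure counters, kept so they could be reported as statistics.
inline std::array<std::size_t, static_cast<std::size_t>(FailureKind::Count)>&
failure_counts() {
  static std::array<std::size_t, static_cast<std::size_t>(FailureKind::Count)>
      counts{};
  return counts;
}

// Policy: only disk errors ask for a self-directed "going down" notice here.
inline Action action_for(FailureKind kind) {
  switch (kind) {
    case FailureKind::DiskError:
      return Action::MarkSelfDownThenAbort;
    default:
      return Action::Abort;
  }
}

// Record the failure and return the chosen action. Kept separate from the
// actual abort so the bookkeeping and policy can be exercised in tests.
inline Action note_failure(FailureKind kind, const char* expr,
                           const char* file, int line) {
  ++failure_counts()[kind_index(kind)];
  std::fprintf(stderr, "%s:%d: assert failed: %s (kind %zu)\n", file, line,
               expr, kind_index(kind));
  return action_for(kind);
}

// Placeholder: a real daemon would tell the monitors it is going down.
inline void notify_going_down() {
  std::fprintf(stderr, "would notify the monitors here\n");
}

// Assert annotated with a failure category. Always aborts on failure; for
// some categories it first runs the self-directed notification.
#define ANNOTATED_ASSERT(kind, expr)                               \
  do {                                                             \
    if (!(expr)) {                                                 \
      if (note_failure((kind), #expr, __FILE__, __LINE__) ==       \
          Action::MarkSelfDownThenAbort) {                         \
        notify_going_down();                                       \
      }                                                            \
      std::abort();                                                \
    }                                                              \
  } while (0)
```

A call site would then look like ANNOTATED_ASSERT(FailureKind::DiskError,
r == 0), where r is whatever return code the caller is checking. The
point of splitting note_failure from the abort is only that the policy
and counters stay testable without killing the process.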
-Greg