On Mon, Jul 16, 2018 at 10:24 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Mon, Jul 16, 2018 at 2:08 PM, John Spray <jspray@xxxxxxxxxx> wrote:
> > BTW there's already an interesting opportunity for someone to write a
> > chaosmonkey-type ceph-mgr module that periodically does things like
> > taking an OSD out and letting the cluster rebalance, randomly killing
> > an MDS from time to time, etc.
>
> What's the goal here? Are there tickets or something about this?

Nope, it's just chat for now.

The use case I have in mind is to validate that deployed systems have
enough slack in them to work well through failures, or at least give
the operator a good sense of what things are going to look like when
the cluster isn't at 100%.

We periodically advise people to check that their configurations are
robust through some failures, but leave it as an exercise to the reader
to actually generate those failures, so I imagine some sites may be
doing less of this kind of validation than would be ideal.

The qa thrasher deserves an honorable mention, but something in the
installed system should be different -- a smaller, simpler surface area
with some sensible defaults, and leaving out anything dangerous (like
the ability to ask the system to take out three OSDs at once!).

John

> I ask because anybody working on something like this should at least
> be aware of the thrasher code in
> https://github.com/ceph/ceph/blob/master/qa/tasks/ceph_manager.py#L98
> I would not say it is, uh, *good*, but we use it extensively in
> teuthology testing. If we're going to build another one into the
> manager it might be nice to switch to relying on that instead,
> assuming it's feasible. Fewer code bases are generally better! (Of
> course, then we've got a great circular testing thing to work out
> too...)
> -Greg
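
For concreteness, here is a minimal sketch of the kind of conservative,
one-fault-at-a-time loop described above. It is not the proposed ceph-mgr
module: it drives the stock `ceph` CLI rather than the MgrModule API, and
the interval, the single-OSD limit, and the health gate are assumed
defaults rather than anything specified in the thread.

    #!/usr/bin/env python3
    # Hypothetical sketch only -- not the proposed ceph-mgr module. It uses
    # the stock `ceph` CLI so it makes no assumptions about mgr-internal
    # APIs. Defaults are deliberately tame: one OSD out at a time, only
    # when the cluster is already healthy, with a long wait before marking
    # it back in.

    import json
    import random
    import subprocess
    import time


    def ceph_json(*args):
        # Run a ceph CLI command and parse its JSON output.
        out = subprocess.check_output(["ceph", *args, "--format=json"])
        return json.loads(out)


    def cluster_healthy():
        # `ceph health --format=json` reports an overall "status" field.
        return ceph_json("health").get("status") == "HEALTH_OK"


    def thrash_one_osd(out_seconds):
        # Mark one random OSD out, wait, then mark it back in.
        osds = ceph_json("osd", "ls")
        victim = str(random.choice(osds))
        subprocess.check_call(["ceph", "osd", "out", victim])
        time.sleep(out_seconds)
        subprocess.check_call(["ceph", "osd", "in", victim])


    if __name__ == "__main__":
        while True:
            # Never stack a new fault on top of an existing problem.
            if cluster_healthy():
                thrash_one_osd(out_seconds=300)
            time.sleep(600)

The health gate and the single-OSD limit are meant to mirror the "sensible
defaults, nothing dangerous" point above; randomly failing an MDS would be
another action behind the same gate.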