On Mon, Jul 16, 2018 at 10:24 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Mon, Jul 16, 2018 at 2:08 PM, John Spray <jspray@xxxxxxxxxx> wrote:
> > BTW there's already an interesting opportunity for someone to write a
> > chaosmonkey-type ceph-mgr module that periodically does things like
> > taking an OSD out and letting the cluster rebalance, randomly killing
> > an MDS from time to time, etc.
>
> What's the goal here? Are there tickets or something about this?

Nope, it's just chat for now.

The use case I have in mind is to validate that deployed systems have
enough slack in them to work well through failures, or at least give
the operator a good sense of what things are going to look like when
the cluster isn't at 100%.

We periodically advise people to check that their configurations are
robust through some failures, but leave it as an exercise to the reader
to actually generate those failures, so I imagine some sites may be
doing less of this kind of validation than would be ideal.

The qa thrasher deserves an honorable mention, but something in the
installed system should be different -- a smaller, simpler surface area
with some sensible defaults, and leaving out anything dangerous (like
the ability to ask the system to take out three OSDs at once!).

John

> I ask because anybody working on something like this should at least
> be aware of the thrasher code in
> https://github.com/ceph/ceph/blob/master/qa/tasks/ceph_manager.py#L98
> I would not say it is, uh, *good*, but we use it extensively in
> teuthology testing. If we're going to build another one into the
> manager it might be nice to switch to relying on that instead,
> assuming it's feasible. Fewer code bases are generally better! (Of
> course, then we've got a great circular testing thing to work out
> too...)
> -Greg
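
For concreteness, here is a minimal sketch of the kind of conservative,
one-fault-at-a-time loop described above. It is not the proposed ceph-mgr
module: it drives the stock `ceph` CLI rather than the MgrModule API, and
the interval, the single-OSD limit, and the health gate are assumed
defaults rather than anything specified in the thread.

    #!/usr/bin/env python3
    # Hypothetical sketch only -- not the proposed ceph-mgr module. It uses
    # the stock `ceph` CLI so it makes no assumptions about mgr-internal
    # APIs. Defaults are deliberately tame: one OSD out at a time, only
    # when the cluster is already healthy, with a long wait before marking
    # it back in.

    import json
    import random
    import subprocess
    import time


    def ceph_json(*args):
        # Run a ceph CLI command and parse its JSON output.
        out = subprocess.check_output(["ceph", *args, "--format=json"])
        return json.loads(out)


    def cluster_healthy():
        # `ceph health --format=json` reports an overall "status" field.
        return ceph_json("health").get("status") == "HEALTH_OK"


    def thrash_one_osd(out_seconds):
        # Mark one random OSD out, wait, then mark it back in.
        osds = ceph_json("osd", "ls")
        victim = str(random.choice(osds))
        subprocess.check_call(["ceph", "osd", "out", victim])
        time.sleep(out_seconds)
        subprocess.check_call(["ceph", "osd", "in", victim])


    if __name__ == "__main__":
        while True:
            # Never stack a new fault on top of an existing problem.
            if cluster_healthy():
                thrash_one_osd(out_seconds=300)
            time.sleep(600)

The health gate and the single-OSD limit are meant to mirror the "sensible
defaults, nothing dangerous" point above; randomly failing an MDS would be
another action behind the same gate.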