Re: Improving alerting/health checks

>> I think there are really two layers of alerting - state alerting, and
>> trend based alerting (time series). State alerting is where I'd see a
>> mgr module adding value, where as trend based alerting is more likely
>> to sit outside ceph within prometheus, zabbix, influx etc
> 
> I’m not sure I entirely agree with this. The example of trend alerting
> that sticks out in my mind is changing from WARN to ERROR based on
> whether recovery appears to be succeeding or not following an OSD
> failure.
> That is:
> * an OSD failing should not be an error; we are designed for failure!

Agreed in principle, though something I’ve seen over and over again is an OSD failing and nobody noticing until the failures pile up.  E.g., an OSD node crashes; when it comes back up one of the drives is toast and that OSD doesn’t start - the cluster just backfills / recovers a bit more than it otherwise would have.  One should probably periodically check that #osds == #up == #in (see the sketch below).  Or when someone completely removes an OSD and fails to redeploy a replacement - that’s trickier, though with the `destroyed` state we will hopefully see less of that.
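As a back-of-the-envelope illustration, something like this cron-able check is roughly what I have in mind.  The JSON field names (num_osds / num_up_osds / num_in_osds) are assumptions based on `ceph osd stat -f json` and the nesting varies somewhat across releases, so treat this strictly as a sketch:

#!/usr/bin/env python3
# Hypothetical periodic sanity check that every OSD is up and in.
# Field names / nesting assumed from `ceph osd stat -f json` output
# and may differ across Ceph releases.
import json
import subprocess
import sys

out = subprocess.check_output(["ceph", "osd", "stat", "-f", "json"])
stat = json.loads(out)
stat = stat.get("osdmap", stat)  # some releases nest under "osdmap"

total = stat["num_osds"]
up = stat["num_up_osds"]
in_ = stat["num_in_osds"]

if not (total == up == in_):
    print("OSD count mismatch: %d total, %d up, %d in" % (total, up, in_))
    sys.exit(1)

print("OK: all %d OSDs are up and in" % total)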

In the past I stumbled upon a cluster that someone had built improperly, with the wrong hardware, without monitoring, and without telling anyone.  Over half the OSD drives had failed (those of you who are acquainted with me know the etiology) in such a fashion/cadence that there were unfound objects.  I ended up having to blow away the RGW pools to recover; fortunately nobody was using them for anything official.  While this is both hyperbolic and tangential to the point at hand, it has left me sensitive to alerting on OSD failures.

I really do love the idea of finer granularity than just WARN and ERR, which we’ve already started to see in `ceph health detail`.

> * but if the system isn't going to recover on its own, THAT is an error
> 
> Identifying "isn't going to recover on its own" can be sufficiently
> complicated it seems like we ought to answer that ourselves instead of
> making every alerting system re-implement it (badly).

Interesting point.  My first thought is that while Ceph is perhaps in a better position to know when that is the case, I suspect it would still be quite tricky.  Would a criterion be N (default 0?) successful recovery operations in a (configurable) period of time?  No decrease in `degraded` PGs over a period of time?  Those off-the-cuff thoughts do sound like time-series / trend things; how readily would they fit into Ceph’s way of doing things?
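To make the second idea concrete, here’s a rough sketch of a "recovery appears stalled" heuristic, using the degraded object count as a stand-in for degraded PGs.  The pgmap field names are assumptions drawn from `ceph status -f json` and may differ by release:

#!/usr/bin/env python3
# Hypothetical heuristic: sample the degraded object count twice,
# WINDOW seconds apart, and escalate from WARN to ERR if it has not
# gone down.  Field names assumed from `ceph status -f json`.
import json
import subprocess
import sys
import time

WINDOW = 600  # seconds between samples; would be configurable

def degraded_objects():
    out = subprocess.check_output(["ceph", "status", "-f", "json"])
    pgmap = json.loads(out).get("pgmap", {})
    return pgmap.get("degraded_objects", 0)

before = degraded_objects()
time.sleep(WINDOW)
after = degraded_objects()

if after > 0 and after >= before:
    print("ERR: degraded objects not decreasing (%d -> %d)" % (before, after))
    sys.exit(2)
elif after > 0:
    print("WARN: still recovering (%d -> %d degraded)" % (before, after))
    sys.exit(1)
else:
    print("OK: no degraded objects")
    sys.exit(0)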

>> I also don't think alert management (snoozing, muting etc) should fall
>> to Ceph - let the monitoring/alert layer handle that. This keeps
>> things simple(ish) and helps define the 'alert' role as a health-check
>> and notifier, leaving more advanced controls to higher levels in the
>> monitoring stack.

Moreover, organizations generally have their own established infrastructure for that: PagerDuty, home-brew systems, et al.  Having to do something different because of a goofy choice a vendor hardcoded into a product is one of the reasons we cite for using Ceph in the first place!  Anyone remember how interoperable FlexLM wasn’t in the ’90s?


>> I've been thinking about a "notifier" mgr module to fulfill the
>> state-based alerting, based around the notion of  notification
>> channels (similar to Grafana). The idea being that when a problem is
>> seen the notifier calls the send_alert method of the channel, allowing
>> multiple channels to be notified (UI, SNMP, etc)

So long as it’s pluggable, i.e. one could envision a spooled check for Nagios / check_mk / Icinga - a generalized Ceph status check that could evolve without having to reconfigure the receiving framework.  I really don’t want to have to rely on SNMP traps for this sort of thing.
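For the sake of discussion, here’s roughly what I’d imagine a pluggable channel looking like.  Only the send_alert method name comes from the proposal quoted above; the class names, severity mapping, and the Nagios spool mechanics are all made up for illustration:

# Rough sketch of pluggable notification channels for a "notifier"
# mgr module.  Everything here except send_alert is hypothetical.
import time

class Channel:
    def send_alert(self, severity, check, message):
        raise NotImplementedError

class LogChannel(Channel):
    # Simplest possible channel: just print the alert.
    def send_alert(self, severity, check, message):
        print("[%s] %s: %s" % (severity, check, message))

class NagiosSpoolChannel(Channel):
    # Writes a passive check result in Nagios external-command
    # format; a real implementation would target the command file
    # or checkresults directory of Nagios / Icinga / check_mk.
    RETURN_CODES = {"OK": 0, "WARN": 1, "ERR": 2}

    def __init__(self, spool_dir, host):
        self.spool_dir = spool_dir
        self.host = host

    def send_alert(self, severity, check, message):
        rc = self.RETURN_CODES.get(severity, 3)
        line = ("[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n"
                % (int(time.time()), self.host, check, rc, message))
        with open("%s/ceph-%s.cmd" % (self.spool_dir, check), "w") as f:
            f.write(line)

# The notifier would fan each health event out to every channel:
channels = [LogChannel(), NagiosSpoolChannel("/var/spool/nagios", "ceph1")]
for ch in channels:
    ch.send_alert("WARN", "osd_down", "2 of 48 OSDs down")

The nice thing about that shape is that the receiving framework never needs to know Ceph grew a new health check; the channel just forwards whatever the notifier hands it.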
