Re: Improving alerting/health checks

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 28 Jun 2018 17:30:40 -0700

On Thu, Jun 28, 2018 at 3:49 PM, Paul Cuzner <pcuzner@xxxxxxxxxx> wrote:
> Just to add my 0.02c.
>
> I think there are really two layers of alerting - state alerting, and
> trend-based alerting (time series). State alerting is where I'd see a
> mgr module adding value, whereas trend-based alerting is more likely
> to sit outside ceph within prometheus, zabbix, influx etc.

I’m not sure I entirely agree with this. The example of trend alerting
that sticks out in my mind is changing from WARN to ERROR based on
whether recovery appears to be succeeding or not following an OSD
failure.
That is:
* an OSD failing should not be an error; we are designed for failure!
* objects are obviously going to go degraded in that case, but again
it is not a system error
* but if the system isn't going to recover on its own, THAT is an error

Identifying "isn't going to recover on its own" can be complicated
enough that it seems like we ought to answer that ourselves instead of
making every alerting system re-implement it (badly).
-Greg
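
A minimal sketch of the WARN-to-ERROR escalation described above, under
loud assumptions: classify_recovery, its (time, degraded count) samples,
and the stall_seconds threshold are all hypothetical names invented for
illustration, not an existing Ceph API. Only the HEALTH_OK / HEALTH_WARN /
HEALTH_ERR status names come from Ceph. "Isn't going to recover on its
own" is approximated here as "the degraded count has not dropped within
the stall window", which is far cruder than what a real check would need.

```python
from typing import List, Tuple


def classify_recovery(samples: List[Tuple[float, int]],
                      stall_seconds: float = 600.0) -> str:
    """Map a degraded-object history to a health status (hypothetical helper).

    samples: (unix_time, degraded_object_count) pairs, oldest first.
    stall_seconds: assumed window with no progress before escalating.
    """
    # Nothing degraded (or no data at all): nothing to report.
    if not samples or samples[-1][1] == 0:
        return "HEALTH_OK"

    latest_time, latest_count = samples[-1]

    # Too little history to judge a trend. Degraded objects right after
    # an OSD failure are expected, so this is only a warning.
    if latest_time - samples[0][0] < stall_seconds:
        return "HEALTH_WARN"

    # Samples that fall inside the stall window (always includes the latest).
    window = [c for t, c in samples if latest_time - t <= stall_seconds]

    # Progress means the count is now below its peak within the window.
    made_progress = latest_count < max(window)
    return "HEALTH_WARN" if made_progress else "HEALTH_ERR"
```

With the default 600-second window, a history of 100 degraded objects at
t=0, 300 and 700 escalates to HEALTH_ERR, while 100, 90, 50 at the same
times stays at HEALTH_WARN. A count that rises because a second OSD fails
would also escalate in this sketch, which a real implementation would
probably want to treat separately.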

>
> I also don't think alert management (snoozing, muting etc) should fall
> to Ceph - let the monitoring/alert layer handle that. This keeps
> things simple(ish) and helps define the 'alert' role as a health-check
> and notifier, leaving more advanced controls to higher levels in the
> monitoring stack.
>
> I've been thinking about a "notifier" mgr module to fulfill the
> state-based alerting, based around the notion of notification
> channels (similar to Grafana). The idea is that when a problem is
> seen, the notifier calls the send_alert method of each channel, allowing
> multiple channels to be notified (UI, SNMP, etc.).
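
The channel idea quoted above could look roughly like the following. This
is only a sketch under assumptions: the Channel, MemoryChannel and Notifier
classes and the send_alert argument list are hypothetical, since the
proposal names only a send_alert method per channel. It is not an existing
mgr module interface. OSD_DOWN is used purely as an example health check
code.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class Channel(ABC):
    """One notification target (UI, SNMP, ...). Hypothetical interface."""

    @abstractmethod
    def send_alert(self, check: str, severity: str, message: str) -> None:
        ...


class MemoryChannel(Channel):
    """Stand-in channel that records alerts instead of sending them."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str, str]] = []

    def send_alert(self, check: str, severity: str, message: str) -> None:
        self.sent.append((check, severity, message))


class Notifier:
    """Fans a state-based alert out to every registered channel."""

    def __init__(self, channels: List[Channel]) -> None:
        self.channels = list(channels)

    def notify(self, check: str, severity: str, message: str) -> int:
        """Call send_alert on each channel; return how many succeeded."""
        delivered = 0
        for channel in self.channels:
            try:
                channel.send_alert(check, severity, message)
                delivered += 1
            except Exception:
                # One broken channel should not block the others.
                continue
        return delivered


ui, snmp = MemoryChannel(), MemoryChannel()
notifier = Notifier([ui, snmp])
delivered = notifier.notify("OSD_DOWN", "HEALTH_WARN", "1 osds down")
```

Snoozing and muting are deliberately absent here, in line with the
suggestion to leave those to the monitoring layer.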
>
> On Wed, Jun 27, 2018 at 10:45 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>>
>> On Mon, Jun 25, 2018 at 3:55 AM, John Spray <jspray@xxxxxxxxxx> wrote:
>> > Hi all,
>> >
>> > Recently I've heard from a few different people about needs to have
>> > nicer alerting in Ceph, both for GUIs and for emitting alerts
>> > externally (e.g. over SNMP).  I'm keen to make sure we get the right
>> > common bits in, to avoid modules having to do their own thing too
>> > much.
>> >
>> > Points that have come up recently:
>> >  - How to integrate Ceph health checks with alerts generated in Prometheus?
>> >  - Filtering/muting particular health checks
>>
>> +snoozing
>>
>> --
>> Patrick Donnelly
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html