+1. In that example, I'd see the ceph health checks determining the issue - 100% agree that the place for ceph health logic is in the health checks - not the alert/monitoring system further up the stack.

The ceph health check alerts would provide WARN/ERROR type events, whereas the timeseries stuff is more likely to be WARN. Examples of timeseries events could include:

- free capacity projections by pool (i.e. when do I need to add more capacity)
- network errors (e.g. from prom node_exporter stats)
- spinner dev util% avg exceeding thresholds (from prom data, indicating more spindles are needed)
- OS level triggers for CPU/RAM thresholds

I'd see the default target for any health-check derived alert being an API endpoint within the dashboard.

On Fri, Jun 29, 2018 at 12:30 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Thu, Jun 28, 2018 at 3:49 PM, Paul Cuzner <pcuzner@xxxxxxxxxx> wrote:
> > Just to add my 0.02c.
> >
> > I think there are really two layers of alerting - state alerting, and
> > trend-based alerting (time series). State alerting is where I'd see a
> > mgr module adding value, whereas trend-based alerting is more likely
> > to sit outside Ceph within Prometheus, Zabbix, InfluxDB, etc.
>
> I'm not sure I entirely agree with this. The example of trend alerting
> that sticks out in my mind is changing from WARN to ERROR based on
> whether recovery appears to be succeeding or not following an OSD
> failure.
> That is:
> * an OSD failing should not be an error; we are designed for failure!
> * objects are obviously going to go degraded in that case, but again
>   it is not a system error
> * but if the system isn't going to recover on its own, THAT is an error
>
> Identifying "isn't going to recover on its own" can be sufficiently
> complicated that it seems like we ought to answer that ourselves instead
> of making every alerting system re-implement it (badly).
> -Greg
>
> >
> > I also don't think alert management (snoozing, muting, etc.) should fall
> > to Ceph - let the monitoring/alert layer handle that. This keeps
> > things simple(ish) and helps define the 'alert' role as a health-check
> > and notifier, leaving more advanced controls to higher levels in the
> > monitoring stack.
> >
> > I've been thinking about a "notifier" mgr module to fulfill the
> > state-based alerting, based around the notion of notification
> > channels (similar to Grafana). The idea being that when a problem is
> > seen, the notifier calls the send_alert method of the channel, allowing
> > multiple channels to be notified (UI, SNMP, etc.)
> >
> > On Wed, Jun 27, 2018 at 10:45 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> >>
> >> On Mon, Jun 25, 2018 at 3:55 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> >> > Hi all,
> >> >
> >> > Recently I've heard from a few different people about the need for
> >> > nicer alerting in Ceph, both for GUIs and for emitting alerts
> >> > externally (e.g. over SNMP). I'm keen to make sure we get the right
> >> > common bits in, to avoid modules having to do their own thing too
> >> > much.
> >> >
> >> > Points that have come up recently:
> >> > - How to integrate Ceph health checks with alerts generated in Prometheus?
> >> > - Filtering/muting particular health checks
> >>
> >> +snoozing
> >>
> >> --
> >> Patrick Donnelly
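
A rough sketch of how the notification-channel idea from the thread could look. Everything here is hypothetical illustration: AlertChannel, Notifier, DashboardChannel and the /api/alerts URL are made-up names, not existing ceph-mgr or dashboard interfaces. The only piece borrowed from Ceph itself is the shape of the "checks" map that `ceph health detail --format json` returns.

    # Hypothetical sketch only: AlertChannel, Notifier, DashboardChannel and
    # the /api/alerts URL are illustrative names, not an existing ceph-mgr or
    # dashboard interface.
    import json
    import urllib.request
    from abc import ABC, abstractmethod


    class AlertChannel(ABC):
        """A destination for state-based alerts (UI, SNMP, email, ...)."""

        @abstractmethod
        def send_alert(self, check_name, severity, summary):
            """Deliver one alert event to this channel."""


    class LogChannel(AlertChannel):
        """Simplest possible channel: write the alert to stdout."""

        def send_alert(self, check_name, severity, summary):
            print("[{}] {}: {}".format(severity, check_name, summary))


    class DashboardChannel(AlertChannel):
        """POSTs alerts to an assumed dashboard REST endpoint."""

        def __init__(self, url):
            self.url = url  # e.g. "http://localhost:8080/api/alerts" (hypothetical)

        def send_alert(self, check_name, severity, summary):
            payload = json.dumps({"check": check_name,
                                  "severity": severity,
                                  "summary": summary}).encode()
            req = urllib.request.Request(self.url, data=payload,
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)


    class Notifier:
        """Fans a health-check state change out to every registered channel."""

        def __init__(self, channels):
            self.channels = channels
            self.last_seen = {}  # check name -> severity, so we alert on change only

        def process_health(self, checks):
            for name, check in checks.items():
                severity = check.get("severity", "HEALTH_WARN")
                if self.last_seen.get(name) != severity:
                    self.last_seen[name] = severity
                    message = check.get("summary", {}).get("message", "")
                    for channel in self.channels:
                        channel.send_alert(name, severity, message)


    # Usage: feed it the "checks" map from `ceph health detail --format json`.
    notifier = Notifier([LogChannel()])
    notifier.process_health({
        "OSD_DOWN": {"severity": "HEALTH_WARN",
                     "summary": {"message": "1 osds down"}}})

The Notifier only calls send_alert when a check's severity changes, which keeps channels such as SNMP or the dashboard from being flooded on every poll; snoozing/muting would then live in whatever consumes the channel, per the thread.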
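
On Greg's WARN-to-ERROR example, one possible heuristic is to sample the degraded object count over a sliding window and escalate only when it stops trending downward. This is just an illustration of the idea, not how Ceph implements it; RecoveryWatch and the sampling source (e.g. pg stats) are assumed.

    # Hypothetical heuristic only: one way a mgr module might decide that
    # recovery has stalled. The name RecoveryWatch and the sampling source
    # are assumptions for illustration.
    from collections import deque


    class RecoveryWatch:
        """Escalate to ERROR when degraded objects stop trending downward."""

        def __init__(self, window=10):
            self.samples = deque(maxlen=window)  # recent degraded-object counts

        def add_sample(self, degraded_objects):
            self.samples.append(degraded_objects)

        def severity(self):
            if not self.samples or self.samples[-1] == 0:
                return "HEALTH_OK"        # nothing degraded
            if len(self.samples) < self.samples.maxlen:
                return "HEALTH_WARN"      # not enough history to judge yet
            # If the newest sample is not lower than the oldest one in the
            # window, recovery is not making progress: treat it as an error.
            if self.samples[-1] >= self.samples[0]:
                return "HEALTH_ERR"
            return "HEALTH_WARN"


    watch = RecoveryWatch(window=3)
    for count in (120, 120, 121):   # degraded objects not going down
        watch.add_sample(count)
    print(watch.severity())          # -> HEALTH_ERR

The window size would clearly need tuning; recovery can legitimately plateau for a few samples under heavy client load without actually being stuck, which is exactly why centralising this logic in Ceph (rather than every alerting system re-implementing it) seems attractive.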