Re: Improving alerting/health checks

Paul Cuzner <pcuzner@xxxxxxxxxx> · Tue, 3 Jul 2018 11:52:40 +1200

Good point.

I guess I was looking at this from a multi-cluster context where
there'd be a NOC - the assumption being that the monitoring people
manage alerts centrally, rather than going out to individual clusters
to make changes.

Agreed that each cluster should have the ability to 'snooze' a
specific alert trigger.
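
As a rough illustration of what a per-cluster snooze could look like
(a sketch only, not existing Ceph code: SnoozeRegistry,
effective_status and the SEVERITY table are names made up for this
mail, and the check codes are just examples), a snoozed check is
skipped when working out the overall status until its snooze expires:

```python
import time

# Assumed severity ordering for the three health states.
SEVERITY = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}


class SnoozeRegistry:
    """Tracks which health check codes are snoozed, and until when."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so expiry can be tested
        self._until = {}      # check code -> expiry timestamp (seconds)

    def snooze(self, code, seconds):
        self._until[code] = self._clock() + seconds

    def is_snoozed(self, code):
        expiry = self._until.get(code)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._until[code]  # snooze expired, check counts again
            return False
        return True


def effective_status(checks, registry):
    """Return the worst severity among checks that are not snoozed.

    `checks` maps a check code to its severity string.
    """
    worst = "HEALTH_OK"
    for code, severity in checks.items():
        if registry.is_snoozed(code):
            continue
        if SEVERITY[severity] > SEVERITY[worst]:
            worst = severity
    return worst
```

The expiry is what separates a snooze from a permanent mute: once the
time is up the check counts towards the overall status again without
anyone having to remember to re-enable it.
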

On Fri, Jun 29, 2018 at 8:41 PM John Spray <jspray@xxxxxxxxxx> wrote:
>
> On Thu, Jun 28, 2018 at 11:49 PM Paul Cuzner <pcuzner@xxxxxxxxxx> wrote:
> >
> > Just to add my 0.02c.
> >
> > I think there are really two layers of alerting - state alerting, and
> > trend-based alerting (time series). State alerting is where I'd see a
> > mgr module adding value, whereas trend-based alerting is more likely
> > to sit outside ceph within prometheus, zabbix, influx etc.
> >
> > I also don't think alert management (snoozing, muting etc) should fall
> > to Ceph - let the monitoring/alert layer handle that. This keeps
> > things simple(ish) and helps define the 'alert' role as a health-check
> > and notifier, leaving more advanced controls to higher levels in the
> > monitoring stack.
>
> Part of me feels the same way, but I'm also conscious that there are
> downsides: if we rely on higher layers to do snoozing, then
>  - it prevents us from building that "snooze" button into the dashboard
>  - the alert will still be active in "ceph status" even if it's
> filtered somewhere else
>
> The main use case I've heard for doing snoozing/muting is that some
> people monitor the overall HEALTH_OK of their Ceph cluster (not
> anything finer grained).  When there is a health check they don't care
> about, they want to mute it to get their external monitoring green
> again.  For those people, if we say that any muting is an external
> job, then we're kind of forcing them to monitor Ceph at a finer level
> of detail than they really want to.
>
> John
>
> >
> > I've been thinking about a "notifier" mgr module to fulfill the
> > state-based alerting, based around the notion of notification
> > channels (similar to Grafana). The idea is that when a problem is
> > seen, the notifier calls the send_alert method of each channel,
> > allowing multiple channels to be notified (UI, SNMP, etc.).
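
A minimal sketch of the channel idea quoted above, assuming a very
simple interface. Only the send_alert method name comes from the
proposal; the Channel, LogChannel and Notifier classes, the argument
list and the update() call are hypothetical, and LogChannel merely
stands in for a real UI or SNMP channel:

```python
class Channel:
    """Base class for a notification channel (hypothetical interface)."""

    def send_alert(self, code, severity, summary):
        raise NotImplementedError


class LogChannel(Channel):
    """Stand-in for a real channel (UI, SNMP, ...): it records alerts."""

    def __init__(self):
        self.sent = []

    def send_alert(self, code, severity, summary):
        self.sent.append((code, severity, summary))


class Notifier:
    """Fans newly raised health checks out to every registered channel."""

    def __init__(self, channels):
        self.channels = list(channels)
        self._active = set()  # check codes seen on the previous update

    def update(self, checks):
        """`checks` maps check code -> (severity, summary)."""
        for code, (severity, summary) in checks.items():
            if code in self._active:
                continue  # already notified; only fire on a new check
            for channel in self.channels:
                channel.send_alert(code, severity, summary)
        # Forget cleared checks so they notify again if they recur.
        self._active = set(checks)
```

Notifying only when a check first appears is one possible choice for
state-based alerting; re-sending on every update would be the obvious
alternative.
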
> > On Wed, Jun 27, 2018 at 10:45 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Jun 25, 2018 at 3:55 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> > > > Hi all,
> > > >
> > > > Recently I've heard from a few different people about the need for
> > > > nicer alerting in Ceph, both for GUIs and for emitting alerts
> > > > externally (e.g. over SNMP).  I'm keen to make sure we get the right
> > > > common bits in, to avoid modules having to do their own thing too
> > > > much.
> > > >
> > > > Points that have come up recently:
> > > >  - How to integrate Ceph health checks with alerts generated in Prometheus?
> > > >  - Filtering/muting particular health checks
> > >
> > > +snoozing
> > >
> > > --
> > > Patrick Donnelly
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html