Re: Improving alerting/health checks

On Thu, Jun 28, 2018 at 11:49 PM Paul Cuzner <pcuzner@xxxxxxxxxx> wrote:
>
> Just to add my 0.02c.
>
> I think there are really two layers of alerting: state alerting and
> trend-based alerting (time series). State alerting is where I'd see a
> mgr module adding value, whereas trend-based alerting is more likely
> to sit outside Ceph, within Prometheus, Zabbix, Influx, etc.
>
> I also don't think alert management (snoozing, muting etc) should fall
> to Ceph - let the monitoring/alert layer handle that. This keeps
> things simple(ish) and helps define the 'alert' role as a health-check
> and notifier, leaving more advanced controls to higher levels in the
> monitoring stack.

Part of me feels the same way, but I'm also conscious that there are
downsides: if we rely on higher layers to do snoozing, then
 - it prevents us from building that "snooze" button into the dashboard
 - the alert will still be active in "ceph status" even if it's
filtered somewhere else

The main use case I've heard for doing snoozing/muting is that some
people monitor the overall HEALTH_OK of their Ceph cluster (not
anything finer grained).  When there is a health check they don't care
about, they want to mute it to get their external monitoring green
again.  For those people, if we say that any muting is an external
job, then we're effectively forcing them to monitor Ceph at a finer
level of detail than they really want.
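
To make that use case concrete, here is a rough Python sketch of how
muting could feed into the overall status. It is only an illustration
under assumptions: overall_status() and the dict layout are
hypothetical, not an existing Ceph interface, and the check codes are
just sample inputs.

```python
# Hypothetical sketch: function name and data layout are assumptions,
# not part of any existing Ceph API.
SEVERITY_ORDER = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}


def overall_status(checks, muted=frozenset()):
    """Return the worst severity among checks whose code is not muted.

    checks maps a health check code to its severity string.
    """
    worst = "HEALTH_OK"
    for code, severity in checks.items():
        if code in muted:
            continue  # a muted check does not affect the overall status
        if SEVERITY_ORDER[severity] > SEVERITY_ORDER[worst]:
            worst = severity
    return worst


# Sample input: two warnings the operator has decided not to care about.
checks = {"OSD_NEARFULL": "HEALTH_WARN", "MON_CLOCK_SKEW": "HEALTH_WARN"}
unmuted = overall_status(checks)
all_muted = overall_status(checks, muted={"OSD_NEARFULL", "MON_CLOCK_SKEW"})
```

With muting done at this level, an external monitor that only looks at
the single overall value goes green again without having to know about
individual checks.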

John

>
> I've been thinking about a "notifier" mgr module to fulfill the
> state-based alerting, based around the notion of notification
> channels (similar to Grafana). The idea is that when a problem is
> seen, the notifier calls the send_alert method of each channel,
> allowing multiple channels to be notified (UI, SNMP, etc.).
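
The channel idea above could look roughly like the following Python
sketch. Only the send_alert name comes from the description; the
Channel, RecordingChannel and Notifier classes, their arguments, and
the "notify on new or changed severity" rule are assumptions made for
illustration, not a proposed mgr module interface.

```python
from abc import ABC, abstractmethod


class Channel(ABC):
    """Hypothetical base class for a notification channel."""

    @abstractmethod
    def send_alert(self, code, severity, summary):
        """Deliver one alert; signature is an assumption."""


class RecordingChannel(Channel):
    """Stand-in for a real channel (UI, SNMP, ...): it only records alerts."""

    def __init__(self):
        self.sent = []

    def send_alert(self, code, severity, summary):
        self.sent.append((code, severity, summary))


class Notifier:
    """Fans out state changes in health checks to every channel."""

    def __init__(self, channels):
        self.channels = list(channels)
        self.active = {}  # check code -> last severity seen

    def update(self, checks):
        """checks maps a check code to a (severity, summary) tuple."""
        for code, (severity, summary) in checks.items():
            # Notify only when a check is new or its severity changed.
            if self.active.get(code) != severity:
                for channel in self.channels:
                    channel.send_alert(code, severity, summary)
        self.active = {code: sev for code, (sev, _) in checks.items()}


ui, snmp = RecordingChannel(), RecordingChannel()
notifier = Notifier([ui, snmp])
notifier.update({"OSD_DOWN": ("HEALTH_WARN", "1 osds down")})
notifier.update({"OSD_DOWN": ("HEALTH_WARN", "1 osds down")})  # no repeat
```

Keeping the per-channel code behind a single method means adding a new
destination would not touch the logic that decides when to alert.
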
> On Wed, Jun 27, 2018 at 10:45 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> >
> > On Mon, Jun 25, 2018 at 3:55 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> > > Hi all,
> > >
> > > Recently I've heard from a few different people about needs to have
> > > nicer alerting in Ceph, both for GUIs and for emitting alerts
> > > externally (e.g. over SNMP).  I'm keen to make sure we get the right
> > > common bits in, to avoid modules having to do their own thing too
> > > much.
> > >
> > > Points that have come up recently:
> > >  - How to integrate Ceph health checks with alerts generated in Prometheus?
> > >  - Filtering/muting particular health checks
> >
> > +snoozing
> >
> > --
> > Patrick Donnelly
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html