Good point. I guess I was looking at this from a multi-cluster context
where there'd be a NOC - the assumption being that the monitoring people
manage alerts centrally, rather than going out to individual clusters to
make changes.

Agreed that each cluster should have the ability to 'snooze' a specific
alert trigger.

On Fri, Jun 29, 2018 at 8:41 PM John Spray <jspray@xxxxxxxxxx> wrote:
>
> On Thu, Jun 28, 2018 at 11:49 PM Paul Cuzner <pcuzner@xxxxxxxxxx> wrote:
> >
> > Just to add my 0.02c.
> >
> > I think there are really two layers of alerting - state alerting, and
> > trend based alerting (time series). State alerting is where I'd see a
> > mgr module adding value, whereas trend based alerting is more likely
> > to sit outside ceph within prometheus, zabbix, influx etc
> >
> > I also don't think alert management (snoozing, muting etc) should fall
> > to Ceph - let the monitoring/alert layer handle that. This keeps
> > things simple(ish) and helps define the 'alert' role as a health-check
> > and notifier, leaving more advanced controls to higher levels in the
> > monitoring stack.
>
> Part of me feels the same way, but I'm also conscious that there are
> downsides: if we rely on higher layers to do snoozing, then
> - it prevents us from building that "snooze" button into the dashboard
> - the alert will still be active in "ceph status" even if it's
>   filtered somewhere else
>
> The main use case I've heard for doing snoozing/muting is that some
> people monitor the overall HEALTH_OK of their Ceph cluster (not
> anything finer grained). When there is a health check they don't care
> about, they want to mute it to get their external monitoring green
> again. For those people, if we say that any muting is an external
> job, then we're kind of forcing them to monitor Ceph in a finer level
> of detail than they really want to.
>
> John
>
> >
> > I've been thinking about a "notifier" mgr module to fulfill the
> > state-based alerting, based around the notion of notification
> > channels (similar to Grafana). The idea being that when a problem is
> > seen the notifier calls the send_alert method of the channel, allowing
> > multiple channels to be notified (UI, SNMP, etc)
> >
> > On Wed, Jun 27, 2018 at 10:45 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Jun 25, 2018 at 3:55 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> > > > Hi all,
> > > >
> > > > Recently I've heard from a few different people about needs to have
> > > > nicer alerting in Ceph, both for GUIs and for emitting alerts
> > > > externally (e.g. over SNMP). I'm keen to make sure we get the right
> > > > common bits in, to avoid modules having to do their own thing too
> > > > much.
> > > >
> > > > Points that have come up recently:
> > > > - How to integrate Ceph health checks with alerts generated in Prometheus?
> > > > - Filtering/muting particular health checks
> > >
> > > +snoozing
> > >
> > > --
> > > Patrick Donnelly
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
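
For anyone skimming the thread, here is a rough sketch of the "notifier
with channels" idea plus a per-cluster snooze, as discussed above. It is
not an existing Ceph mgr module or API: only the send_alert method name
comes from the thread, while Alert, Channel, ListChannel, Notifier,
add_channel, snooze and notify are hypothetical names made up for
illustration. It also only suppresses outgoing notifications; it would
not change what "ceph status" reports, which is the downside John
raised.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record built from a health check."""
    check: str      # health check code, e.g. "OSD_NEARFULL"
    severity: str   # e.g. "HEALTH_WARN"
    summary: str


class Channel(Protocol):
    """Anything with a send_alert method can act as a channel."""
    def send_alert(self, alert: Alert) -> None: ...


class ListChannel:
    """Stand-in for a real channel (UI, SNMP, ...): records alerts in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sent: List[Alert] = []

    def send_alert(self, alert: Alert) -> None:
        self.sent.append(alert)


class Notifier:
    """Fans alerts out to every registered channel unless the check is snoozed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._channels: List[Channel] = []
        self._snoozed: Dict[str, float] = {}  # check code -> expiry time
        self._clock = clock

    def add_channel(self, channel: Channel) -> None:
        self._channels.append(channel)

    def snooze(self, check: str, seconds: float) -> None:
        """Suppress notifications for one health check for a limited time."""
        self._snoozed[check] = self._clock() + seconds

    def is_snoozed(self, check: str) -> bool:
        expiry = self._snoozed.get(check)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._snoozed[check]  # snooze has expired, forget it
            return False
        return True

    def notify(self, alert: Alert) -> int:
        """Send the alert to all channels; return how many were notified."""
        if self.is_snoozed(alert.check):
            return 0
        for channel in self._channels:
            channel.send_alert(alert)
        return len(self._channels)


if __name__ == "__main__":
    notifier = Notifier()
    ui, snmp = ListChannel("ui"), ListChannel("snmp")
    notifier.add_channel(ui)
    notifier.add_channel(snmp)

    alert = Alert("OSD_NEARFULL", "HEALTH_WARN", "1 nearfull osd(s)")
    print(notifier.notify(alert))      # both channels receive it
    notifier.snooze("OSD_NEARFULL", 3600)
    print(notifier.notify(alert))      # suppressed while snoozed
```

Keeping the snooze keyed on the health check code, with an expiry, means
a forgotten snooze eventually clears itself. Whether that state belongs
in the notifier or lower down in the health check layer is exactly the
open question in this thread.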