+1. In that example, I'd see the ceph health checks determining the issue - 100% agree that the place for ceph health logic is in the health checks - not the alert/monitoring system further up the stack.

The ceph health check alerts would provide WARN/ERROR type events, whereas the timeseries stuff is more likely to be WARN. Examples of timeseries events could include:

- free capacity projections by pool (i.e. when do I need to add more capacity)
- network errors (e.g. from prom node_exporter stats)
- spinner dev util% avg exceeding thresholds (from prom data, indicating more spindles are needed)
- OS level triggers for CPU/RAM thresholds

I'd see the default target for any health-check derived alert being an API endpoint within the dashboard.

On Fri, Jun 29, 2018 at 12:30 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Thu, Jun 28, 2018 at 3:49 PM, Paul Cuzner <pcuzner@xxxxxxxxxx> wrote:
> > Just to add my 0.02c.
> >
> > I think there are really two layers of alerting - state alerting, and
> > trend-based alerting (time series). State alerting is where I'd see a
> > mgr module adding value, whereas trend-based alerting is more likely
> > to sit outside Ceph within Prometheus, Zabbix, InfluxDB, etc.
>
> I'm not sure I entirely agree with this. The example of trend alerting
> that sticks out in my mind is changing from WARN to ERROR based on
> whether recovery appears to be succeeding or not following an OSD
> failure.
> That is:
> * an OSD failing should not be an error; we are designed for failure!
> * objects are obviously going to go degraded in that case, but again
>   it is not a system error
> * but if the system isn't going to recover on its own, THAT is an error
>
> Identifying "isn't going to recover on its own" can be sufficiently
> complicated that it seems like we ought to answer that ourselves instead
> of making every alerting system re-implement it (badly).
> -Greg
>
> >
> > I also don't think alert management (snoozing, muting, etc.) should fall
> > to Ceph - let the monitoring/alert layer handle that. This keeps
> > things simple(ish) and helps define the 'alert' role as a health-check
> > and notifier, leaving more advanced controls to higher levels in the
> > monitoring stack.
> >
> > I've been thinking about a "notifier" mgr module to fulfill the
> > state-based alerting, based around the notion of notification
> > channels (similar to Grafana). The idea being that when a problem is
> > seen, the notifier calls the send_alert method of the channel, allowing
> > multiple channels to be notified (UI, SNMP, etc.)
> >
> > On Wed, Jun 27, 2018 at 10:45 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> >>
> >> On Mon, Jun 25, 2018 at 3:55 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> >> > Hi all,
> >> >
> >> > Recently I've heard from a few different people about the need for
> >> > nicer alerting in Ceph, both for GUIs and for emitting alerts
> >> > externally (e.g. over SNMP). I'm keen to make sure we get the right
> >> > common bits in, to avoid modules having to do their own thing too
> >> > much.
> >> >
> >> > Points that have come up recently:
> >> > - How to integrate Ceph health checks with alerts generated in Prometheus?
> >> > - Filtering/muting particular health checks
> >>
> >> +snoozing
> >>
> >> --
> >> Patrick Donnelly
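
A rough sketch of how the notification-channel idea from the thread could look. Everything here is hypothetical illustration: AlertChannel, Notifier, DashboardChannel and the /api/alerts URL are made-up names, not existing ceph-mgr or dashboard interfaces. The only piece borrowed from Ceph itself is the shape of the "checks" map that `ceph health detail --format json` returns.

    # Hypothetical sketch only: AlertChannel, Notifier, DashboardChannel and
    # the /api/alerts URL are illustrative names, not an existing ceph-mgr or
    # dashboard interface.
    import json
    import urllib.request
    from abc import ABC, abstractmethod


    class AlertChannel(ABC):
        """A destination for state-based alerts (UI, SNMP, email, ...)."""

        @abstractmethod
        def send_alert(self, check_name, severity, summary):
            """Deliver one alert event to this channel."""


    class LogChannel(AlertChannel):
        """Simplest possible channel: write the alert to stdout."""

        def send_alert(self, check_name, severity, summary):
            print("[{}] {}: {}".format(severity, check_name, summary))


    class DashboardChannel(AlertChannel):
        """POSTs alerts to an assumed dashboard REST endpoint."""

        def __init__(self, url):
            self.url = url  # e.g. "http://localhost:8080/api/alerts" (hypothetical)

        def send_alert(self, check_name, severity, summary):
            payload = json.dumps({"check": check_name,
                                  "severity": severity,
                                  "summary": summary}).encode()
            req = urllib.request.Request(self.url, data=payload,
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)


    class Notifier:
        """Fans a health-check state change out to every registered channel."""

        def __init__(self, channels):
            self.channels = channels
            self.last_seen = {}  # check name -> severity, so we alert on change only

        def process_health(self, checks):
            for name, check in checks.items():
                severity = check.get("severity", "HEALTH_WARN")
                if self.last_seen.get(name) != severity:
                    self.last_seen[name] = severity
                    message = check.get("summary", {}).get("message", "")
                    for channel in self.channels:
                        channel.send_alert(name, severity, message)


    # Usage: feed it the "checks" map from `ceph health detail --format json`.
    notifier = Notifier([LogChannel()])
    notifier.process_health({
        "OSD_DOWN": {"severity": "HEALTH_WARN",
                     "summary": {"message": "1 osds down"}}})

The Notifier only calls send_alert when a check's severity changes, which keeps channels such as SNMP or the dashboard from being flooded on every poll; snoozing/muting would then live in whatever consumes the channel, per the thread.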
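
On Greg's WARN-to-ERROR example, one possible heuristic is to sample the degraded object count over a sliding window and escalate only when it stops trending downward. This is just an illustration of the idea, not how Ceph implements it; RecoveryWatch and the sampling source (e.g. pg stats) are assumed.

    # Hypothetical heuristic only: one way a mgr module might decide that
    # recovery has stalled. The name RecoveryWatch and the sampling source
    # are assumptions for illustration.
    from collections import deque


    class RecoveryWatch:
        """Escalate to ERROR when degraded objects stop trending downward."""

        def __init__(self, window=10):
            self.samples = deque(maxlen=window)  # recent degraded-object counts

        def add_sample(self, degraded_objects):
            self.samples.append(degraded_objects)

        def severity(self):
            if not self.samples or self.samples[-1] == 0:
                return "HEALTH_OK"        # nothing degraded
            if len(self.samples) < self.samples.maxlen:
                return "HEALTH_WARN"      # not enough history to judge yet
            # If the newest sample is not lower than the oldest one in the
            # window, recovery is not making progress: treat it as an error.
            if self.samples[-1] >= self.samples[0]:
                return "HEALTH_ERR"
            return "HEALTH_WARN"


    watch = RecoveryWatch(window=3)
    for count in (120, 120, 121):   # degraded objects not going down
        watch.add_sample(count)
    print(watch.severity())          # -> HEALTH_ERR

The window size would clearly need tuning; recovery can legitimately plateau for a few samples under heavy client load without actually being stuck, which is exactly why centralising this logic in Ceph (rather than every alerting system re-implementing it) seems attractive.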