Re: Prometheus Alerts for Ceph: Reference Rules

Jan Fajerski <jan.fajerski@xxxxxxxx> · Thu, 11 Apr 2019 12:58:04 +0200



On Thu, Apr 11, 2019 at 12:19:51PM +0200, Ernesto Puerta wrote:
  Hi Cephers,
  With the Nautilus release, Ceph-dashboard can now display Prometheus
  alerts as pop-up notifications (aka toasts). [1]
  As we did for Grafana, there was a proposal to provide a reference
  alert rule file, [2] but no action has been taken for a while, so I'd
  like to revive this discussion.
  With the help of Jan and Boris, I've got the following sources:
  - DeepSea [3]
  - Ceph-Ansible [4]
  - Ceph-mixins (Jsonnet) [5]
  The three of them share some common alerts (health, OSDs or Mons down,
  network drops/errors), but each one brings its own (and IMHO worthy)
  alerts:
  - DeepSea: alerts based on predicted filling rates (both for pool and
  disk).
  - Ceph-Ansible: slow OSD response, host loss check.
  This is the complete list:
  [common]
  - Ceph Health: Warning/Error
  - OSDs Down
  - Mon down/quorum
  - Disks/OSDs near full
  - Network errors/drops
  - PG checks: inactive/unclean/stuck
  - PG count: high pg count deviation/OSD(s) with High PG Count
  - Pool capacity
  [deepsea]
  - 10% OSDs down
  - flap osd
  - root volume full
  - storage filling up
  - pool filling up
  [ceph-ansible]
  - OSD Host(s) Down
  - OSD Host Loss Check
  - Slow OSD Responses
  - Cluster Capacity Low
  My specific questions:
  - Anyone else currently undertaking this?
I have this on my agenda before Cephalocon, though I would be more than happy to
collaborate with someone.
The linked DeepSea alert file is still based on the DigitalOcean ceph exporter
(plus node_exporter). One task would be to port the alerts to the metrics exported
by the mgr module and a more up-to-date node_exporter. I plan to do this early
next week.
  - Any objections to having this reference set in the ceph repo
  (monitoring/prometheus), or alternative proposals?
I think the ceph repo is a good place for this.
  - Any relevant alert missing there?
This looks like a decent set of default alerts. I'm pretty sure users will want
to add to it regardless of how many alerts we ship. I'd advocate for a fairly
minimal set that applies to most (if not all) clusters.
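
For illustration only: the "storage filling up" and "pool filling up" alerts in
the DeepSea list rest on extrapolating recent usage. Below is a minimal Python
sketch of that idea, assuming a least-squares fit similar in spirit to PromQL's
predict_linear(). The function names, sample values and thresholds are
hypothetical and are not taken from any of the linked rule files.

```python
def predict_linear(samples, horizon_seconds):
    """Extrapolate a least-squares line through (timestamp, value) samples.

    Returns the predicted value horizon_seconds after the last sample.
    Hypothetical helper; it only mimics the idea behind PromQL's
    predict_linear() and is not its implementation.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Slope = covariance(t, v) / variance(t)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("all samples share the same timestamp")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    t_future = samples[-1][0] + horizon_seconds
    return slope * t_future + intercept


def will_fill_up(samples, capacity_bytes, horizon_seconds):
    """True if the extrapolated usage reaches capacity within the horizon."""
    return predict_linear(samples, horizon_seconds) >= capacity_bytes


if __name__ == "__main__":
    # Made-up usage samples: (unix seconds, bytes used), growing linearly.
    usage = [(0, 100), (3600, 200), (7200, 300)]
    print(will_fill_up(usage, capacity_bytes=350, horizon_seconds=3600))
```

An alert built this way fires before the disk or pool is actually full, which
is the main difference from a plain "near full" threshold check.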
  Any other feedback will also be appreciated!
  Kind Regards,
  Ernesto
  [1]: http://docs.ceph.com/docs/nautilus/mgr/dashboard/#enabling-prometheus-alerting
  [2]: https://tracker.ceph.com/issues/24977
  [3]: https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
  [4]: https://github.com/ceph/ceph-ansible/blob/wip-dashboard/roles/ceph-prometheus/files/ceph_dashboard.yml
  [5]: https://github.com/ceph/ceph-mixins/tree/master/alerts

--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)