I do set up Prometheus to use a file glob, but by "remotely" here I meant a (REST) API for doing that. The file-glob approach requires direct access to the entire host just to add a set of alerting rules.

On Thu, Apr 11, 2019 at 1:02 PM Jan Fajerski <jan.fajerski@xxxxxxxx> wrote:
>
> On Thu, Apr 11, 2019 at 12:51:34PM +0200, Boris Ranto wrote:
> >Hey all,
> >
> >if we want to have a set of rules in the official Ceph repo then I
> >would recommend merging all the alerting rules and letting the
> >products decide which alerts they want to mute.
> +1
> >
> >My reasoning here is that afaik, you can configure AlertManager (even
> >remotely) to silence certain alerts but you can't configure Prometheus
> >remotely to include new alerting rules.
> One solution is to configure the prometheus config variable rule_files with an
> appropriate file glob (say /etc/prometheus/rules/*). Then a user can simply drop
> a rules file in that location and voila.
> >
> >Regards,
> >Boris
> >
> >On Thu, Apr 11, 2019 at 12:20 PM Ernesto Puerta <epuertat@xxxxxxxxxx> wrote:
> >>
> >> Hi Cephers,
> >>
> >> With the Nautilus release, Ceph-dashboard can now display Prometheus alerts as pop-up notifications (aka toasts). [1]
> >>
> >> As we did for Grafana, there was a proposal to provide a reference alert rule file [2], but no action has been taken for a while, so I'd like to revive this discussion.
> >>
> >> With the help of Jan and Boris, I've got the following sources:
> >>
> >> - DeepSea [3]
> >> - Ceph-Ansible [4]
> >> - Ceph-mixins (Jsonnet) [5]
> >>
> >> The three of them share some common alerts (health, OSD or Mons down, network drops/errors), but each one brings its own (and IMHO worthy) alerts:
> >>
> >> - DeepSea: alerts based on predicted filling rates (both for pool and disk).
> >> - CephAnsible: slow OSD response, host loss check.
> >>
> >> This is the complete list:
> >>
> >> [common]
> >> - Ceph Health: Warning/Error
> >> - OSDs Down
> >> - Mon down/quorum
> >> - Disks/OSDs near full
> >> - Network errors/drops
> >> - PG checks: inactive/unclean/stuck
> >> - PG count: high pg count deviation/OSD(s) with High PG Count
> >> - Pool capacity
> >>
> >> [deepsea]
> >> - 10% OSDs down
> >> - flap osd
> >> - root volume full
> >> - storage filling up
> >> - pool filling up
> >>
> >> [ceph-ansible]
> >> - OSD Host(s) Down
> >> - OSD Host Loss Check
> >> - Slow OSD Responses
> >> - Cluster Capacity Low
> >>
> >> My specific questions:
> >> - Anyone else currently undertaking this?
> >> - Any objections to have this reference set in ceph repo (monitoring/prometheus) or alternative proposals?
> >> - Any relevant alert missing there?
> >>
> >> Any other feedback will also be appreciated!
> >>
> >> Kind Regards,
> >> Ernesto
> >>
> >> [1]: http://docs.ceph.com/docs/nautilus/mgr/dashboard/#enabling-prometheus-alerting
> >> [2]: https://tracker.ceph.com/issues/24977
> >> [3]: https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
> >> [4]: https://github.com/ceph/ceph-ansible/blob/wip-dashboard/roles/ceph-prometheus/files/ceph_dashboard.yml
> >> [5]: https://github.com/ceph/ceph-mixins/tree/master/alerts
> >
>
> --
> Jan Fajerski
> Engineer Enterprise Storage
> SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)
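
For illustration only, here is a rough Python sketch of the two mechanisms contrasted in this thread: dropping a rules file into a directory matched by a rule_files glob (which needs filesystem access on the Prometheus host), and building a silence that could be sent to Alertmanager over HTTP. The rule content, the alert name CephHealthError, the helper names, and the file name are all hypothetical. The ceph_health_status metric and the silence payload shape (as accepted by Alertmanager's /api/v2/silences endpoint) are assumptions to check against your own versions. Nothing here talks to a real server.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
import json
import tempfile

# Hypothetical example rule. Assumes the ceph-mgr prometheus module exports
# ceph_health_status (0 = OK, 1 = WARN, 2 = ERR).
EXAMPLE_RULES = """\
groups:
  - name: ceph-example
    rules:
      - alert: CephHealthError
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: critical
"""


def drop_rules_file(rules_dir: Path, filename: str, content: str) -> Path:
    """Write a rules file into a directory covered by the rule_files glob.

    This is the local, host-access path: Prometheus still has to reload
    its configuration before the new rules take effect.
    """
    rules_dir.mkdir(parents=True, exist_ok=True)
    target = rules_dir / filename
    target.write_text(content)
    return target


def build_silence(alertname, hours, created_by, comment, now=None):
    """Build a silence payload that could be POSTed to Alertmanager.

    This is the remote path: no access to the host is needed, only HTTP.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    # A temporary directory stands in for /etc/prometheus/rules/.
    with tempfile.TemporaryDirectory() as tmp:
        path = drop_rules_file(Path(tmp) / "rules", "ceph_example.yml", EXAMPLE_RULES)
        print("wrote", path)
    print(json.dumps(build_silence("CephHealthError", 2, "ops", "maintenance"), indent=2))
```

The asymmetry is the point: the silence is just a JSON document sent over HTTP, while the rules file has to land on the Prometheus host's filesystem and be picked up by a reload.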