FWIW, I added a few things to https://pad.ceph.com/p/alert-conditions and will
circulate this mail a bit wider.
Or maybe there is not all that much interest in alerting...
On Mon, Jul 23, 2018 at 06:10:04PM +0200, Jan Fajerski wrote:
Hi community,
the topic of alerting conditions for a ceph cluster comes up in
various contexts. Some folks use prometheus or grafana, (I believe)
some people would like snmp traps from ceph, the mgr dashboard could
provide basic alerting capabilities and there is of course ceph -s.
Also see "Improving alerting/health checks" on ceph-devel.
Working on some prometheus stuff I think it would be nice to have some
basic alerting rules in the ceph repo. This could serve as an
out-of-the-box default as well as an example or best practice for which
conditions should be watched.
So I'm wondering: what does the community think? What do operators use
as alert conditions or find alert-worthy?
I'm aware that this is very open-ended, highly dependent on the
cluster and its workload and can range from obvious (health_err
anyone?) to intricate conditions that are designed for a certain
cluster. I'm wondering if we can distill some non-trivial alert
conditions that ceph itself does not (yet) provide.
If you have any conditions fitting that description, feel free to add
them to https://pad.ceph.com/p/alert-conditions. Otherwise looking
forward to feedback.
jan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)