We monitor a few things:
- cluster health (error only, ignoring warnings since we have
separate checks for interesting things)
- if all PGs are active (number of active replicas >= min_size)
- if there are any blocked requests (it's a good indicator, in our
case, that some disk is going to fail soon)
- if all monitors are up and in quorum (checking via admin socket)
- if there are any unfound objects
- if there are scrub/deep-scrub errors
- monitor clock skew
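
For the first item in that list, a minimal sketch of an "error only" health check could look like the following. It assumes the overall status string starts with HEALTH_OK, HEALTH_WARN or HEALTH_ERR, and that the check reports through Nagios-style exit codes. The function name `health_exit_code` is hypothetical and not part of any Ceph tooling.

```python
# Nagios-style plugin exit codes (assumed convention for this sketch).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def health_exit_code(status: str) -> int:
    """Map an overall Ceph health string to an exit code.

    Only HEALTH_ERR raises an alert; HEALTH_WARN is deliberately treated
    as OK because the interesting warnings have their own checks.
    """
    status = status.strip()
    if status.startswith("HEALTH_ERR"):
        return CRITICAL
    if status.startswith(("HEALTH_OK", "HEALTH_WARN")):
        return OK
    # Anything unrecognised (empty output, timeout text) is reported as unknown.
    return UNKNOWN
```

Feeding it the first line of the `ceph health` output would be one way to use it; how that output is collected is left out here on purpose.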
On 13.01.2017 21:35, David Turner wrote:
We don't currently monitor that, but my todo list has an item to
alert as critical on requests that have been blocked for longer
than 500 seconds. You can see how long they've been blocked for
from `ceph health detail`. Our cluster doesn't need to be super
fast at any given point, but it does need to be progressing.
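
A rough sketch of that 500-second check, working on text captured from `ceph health detail`. The line shape it looks for ("N ops are blocked > X sec" or "N requests are blocked > X sec") is an assumption, since the exact wording differs between Ceph releases. The names `longest_blocked_seconds` and `is_critical` are made up for illustration.

```python
import re

# Assumed line shape, e.g. "2 ops are blocked > 524.288 sec on osd.3".
# Adjust the pattern to whatever your Ceph release actually prints.
BLOCKED_RE = re.compile(
    r"(\d+)\s+(?:ops|requests)\s+(?:are\s+)?blocked\s*>\s*(\d+(?:\.\d+)?)\s*sec"
)


def longest_blocked_seconds(health_detail: str) -> float:
    """Return the largest 'blocked > N sec' value in the text, or 0.0 if none."""
    durations = [float(m.group(2)) for m in BLOCKED_RE.finditer(health_detail)]
    return max(durations, default=0.0)


def is_critical(health_detail: str, threshold: float = 500.0) -> bool:
    """True when any request has been blocked for longer than the threshold."""
    return longest_blocked_seconds(health_detail) > threshold
```

The threshold is a parameter so the same function can cover shorter windows, such as the 32-second one asked about further down the thread.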
Thanks.
What about 'NN ops > 32 sec' (blocked ops) type alerts? Does
anyone monitor for those, and if so, what criteria do you use?
Thanks again!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
PS