We don't use many critical alerts (the kind that have our NOC wake up an engineer), but the main one we do have is a check that tells us whether there are 2 or more hosts with osds that are down. We have clusters with 60 servers in them, so a single osd dying and being backfilled off of isn't something to wake up for in the middle of the night, but having osds down on 2 servers is 1 osd away from data loss. A quick reference for how to do this check in bash is below.
# count hosts in the osd tree that have at least one osd reported down
hosts_with_down_osds=`ceph osd tree | grep 'host\|down' | grep -B1 down | grep host | wc -l`

if [ "$hosts_with_down_osds" -ge 2 ]
then
    echo critical
elif [ "$hosts_with_down_osds" -eq 1 ]
then
    echo warning
elif [ "$hosts_with_down_osds" -eq 0 ]
then
    echo ok
else
    echo unknown
fi
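If your NOC tooling expects a Nagios/Icinga-style plugin rather than a plain echo, a rough sketch of the same check using the conventional 0/1/2/3 plugin exit codes might look something like the following. This is just a sketch (the exit-code convention and output format are assumptions, not part of the check above); adapt it to whatever your alerting stack expects.

#!/bin/bash
# Sketch of the same check as a Nagios-style plugin.
# Assumed exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.

# Count hosts in the osd tree that have at least one osd marked down.
hosts_with_down_osds=$(ceph osd tree | grep 'host\|down' | grep -B1 down | grep -c host)

if ! [[ "$hosts_with_down_osds" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN - could not parse ceph osd tree output"
    exit 3
elif [ "$hosts_with_down_osds" -ge 2 ]; then
    echo "CRITICAL - down osds on $hosts_with_down_osds hosts"
    exit 2
elif [ "$hosts_with_down_osds" -eq 1 ]; then
    echo "WARNING - down osds on 1 host"
    exit 1
else
    echo "OK - no hosts with down osds"
    exit 0
fi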
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Chris Jones [cjones@xxxxxxxxxxx]
Sent: Friday, January 13, 2017 1:15 PM
To: ceph-users@xxxxxxxx
Subject: [ceph-users] Ceph Monitoring

General question/survey:
For those of you with larger clusters, how are you doing alerting/monitoring? Meaning, do you trigger off of 'HEALTH_WARN', etc.? I'm not really talking about collectd-related metrics, but more about the initial alerts of an issue or potential issue. What threshold do you use, basically? Just trying to get a pulse of what others are doing.
Thanks in advance.
Best Regards,
Chris Jones
Bloomberg