Re: Ceph Monitoring

We monitor a few things (a rough CLI sketch of these checks follows the list):
- cluster health (error only, ignoring warnings since we have separate checks for interesting things)
- if all PGs are active (number of active replicas >= min_size)
- if there are any blocked requests (it's a good indicator, in our case, that some disk is going to fail soon)
- if all monitors are up and in quorum (checking via admin socket)
- if there are any unfound objects
- if there are scrub/deep-scrub errors
- monitor clock skew
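
(For illustration, here is a rough sketch of these checks against the ceph CLI. This is not our production code; the grep patterns assume the health strings a Jewel-era `ceph health detail` prints, so adjust them for your release.)

#!/bin/bash
# Rough sketch of the checks above using the ceph CLI.

# cluster health: alert on errors only
ceph health | grep -q HEALTH_ERR && echo "critical: HEALTH_ERR"

# all PGs active (health reports stuck inactive / incomplete PGs otherwise)
ceph health detail | grep -q 'stuck inactive\|incomplete' && echo "critical: inactive PGs"

# blocked requests
ceph health detail | grep -q 'requests are blocked' && echo "warning: blocked requests"

# monitors up and in quorum, via the local admin socket on each mon host
ceph daemon mon.$(hostname -s) mon_status | \
    grep -q '"state": "leader"\|"state": "peon"' || echo "critical: mon not in quorum"

# unfound objects
ceph health detail | grep -q 'unfound' && echo "critical: unfound objects"

# scrub / deep-scrub errors
ceph health detail | grep -q 'scrub error\|inconsistent' && echo "warning: scrub errors"

# monitor clock skew
ceph health detail | grep -q 'clock skew' && echo "warning: clock skew"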


On 13.01.2017 21:35, David Turner wrote:
We don't currently monitor that, but my todo list has an item to add a critical alert for requests that have been blocked for longer than 500 seconds.  You can see how long they've been blocked from `ceph health detail`.  Our cluster doesn't need to be super fast at any given point, but it does need to be progressing.
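
(A rough sketch of what that check could look like, assuming `ceph health detail` prints per-OSD lines like "1 ops are blocked > 524.288 sec on osd.12"; the 500-second threshold is the one from my todo item:)

#!/bin/bash
# Sketch: go critical once any request has been blocked for > 500 seconds.
# Parses lines like "1 ops are blocked > 524.288 sec on osd.12" from
# `ceph health detail` (format assumed; adjust the pattern for your release).
threshold=500
oldest=$(ceph health detail | grep -o 'blocked > [0-9.]* sec' | awk '{print $3}' | sort -n | tail -1)
if [ -n "$oldest" ] && awk -v o="$oldest" -v t="$threshold" 'BEGIN { exit !(o > t) }'
then
    echo "critical: requests blocked for ${oldest}s"
else
    echo "ok"
fi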

David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943





From: Chris Jones [cjones@xxxxxxxxxxx]
Sent: Friday, January 13, 2017 1:31 PM
To: David Turner
Cc: ceph-users@xxxxxxxx
Subject: Re: Ceph Monitoring

Thanks.

What about 'NN ops > 32 sec' (blocked ops) type alerts? Does anyone monitor for that type of alert, and if so, what criteria do you use?

Thanks again!

On Fri, Jan 13, 2017 at 3:28 PM, David Turner <david.turner@xxxxxxxxxxxxxxxx> wrote:
We don't use many critical alerts (the kind that have our NOC wake up an engineer), but the main one we do have is a check that tells us if there are 2 or more hosts with OSDs that are down.  We have clusters with 60 servers in them, so having an OSD die and backfill off of it isn't something to wake up for in the middle of the night, but having OSDs down on 2 servers means we are 1 OSD away from data loss.  A quick reference for how to do this check in bash is below.

# Count hosts in the CRUSH tree that have at least one down OSD.
hosts_with_down_osds=$(ceph osd tree | grep 'host\|down' | grep -B1 down | grep -c host)

if [ "$hosts_with_down_osds" -ge 2 ]
then
    echo critical
elif [ "$hosts_with_down_osds" -eq 1 ]
then
    echo warning
elif [ "$hosts_with_down_osds" -eq 0 ]
then
    echo ok
else
    echo unknown
fi
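
If this feeds a Nagios-style checker, those echoed strings map naturally onto exit codes. A minimal wrapper could look like the following (the script name check_down_osd_hosts.sh and the exit-code convention are assumptions about your monitoring setup):

# Hypothetical wrapper: translate the check's output into Nagios-style
# exit codes (0=ok, 1=warning, 2=critical, 3=unknown).
case "$(./check_down_osd_hosts.sh)" in
    ok)       exit 0 ;;
    warning)  exit 1 ;;
    critical) exit 2 ;;
    *)        exit 3 ;;
esac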



From: ceph-users [ceph-users-bounces@lists.ceph.com] on behalf of Chris Jones [cjones@xxxxxxxxxxx]
Sent: Friday, January 13, 2017 1:15 PM
To: ceph-users@xxxxxxxx
Subject: Ceph Monitoring

General question/survey:

For those of you with larger clusters, how are you doing alerting/monitoring? Meaning, do you trigger off of 'HEALTH_WARN', etc.? I'm not really talking about collectd-style metrics collection, but about initial alerts of an issue or potential issue. Basically, what thresholds do you use? Just trying to get a pulse of what others are doing.

Thanks in advance.  

--
Best Regards,
Chris Jones
Bloomberg






--
Best Regards,
Chris Jones

(p) 770.655.0770






-- 
PS
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
