cluster health checks

I've put together a rework of the cluster health checks at

	https://github.com/ceph/ceph/pull/15643

based on John's original proposal in

	http://tracker.ceph.com/issues/7192

(with a few changes).  I think it's pretty complete except that the 
MDSMonitor new-style checks aren't implemented yet.

This would be a semi-incompatible change relative to pre-luminous ceph in that

 - the structured (json/xml) health output is totally different (see the
sketch below for what consuming it might look like)
 - the plaintext health *detail* output is different
 - specific error messages are a bit different.  While reimplementing them
I took the liberty of revising what information goes in the summary and
detail in several cases.
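
For anyone scripting against the health output, the intent is that you key
off the structured form rather than scraping the plaintext.  Roughly,
walking the new json (field names as in the example further down; this is
just a consumer-side sketch, not anything shipped with the PR) might look
like:

#!/usr/bin/env python
# rough sketch: walk the new structured health output
# (the 'checks'/'status'/'detail' fields are as in the json example below)
import json, subprocess

health = json.loads(subprocess.check_output(
    ['ceph', 'health', 'detail', '-f', 'json-pretty']))

print('overall: %s' % health['status'])
for code, check in health['checks'].items():
    print('%s %s: %s' % (check['severity'], code, check['message']))
    for line in health['detail'].get(code, []):
        print('    %s' % line)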

Let me know what you think!

Thanks-
sage


$ ceph -s
  cluster:
    id:     9ee7f49c-57c3-4686-afd1-75b3a8f08c73
    health: HEALTH_WARN
            2 osds down
            1 host (2 osds) down
            1 root (2 osds) down
            8 pgs stale

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    osd: 2 osds: 0 up, 2 in

  data:
    pools:   1 pools, 8 pgs
    objects: 0 objects, 0 bytes
    usage:   414 GB used, 330 GB / 744 GB avail
    pgs:     8 stale+active+clean

$ ceph health detail -f json-pretty
{
    "checks": {
        "OSD_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "2 osds down"
        },
        "OSD_HOST_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "1 host (2 osds) down"
        },
        "OSD_ROOT_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "1 root (2 osds) down"
        },
        "PG_STALE": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs stale"
        }
    },
    "status": "HEALTH_WARN",
    "detail": {
        "OSD_DOWN": [
            "osd.0 (root=default,host=gnit) is down",
            "osd.1 (root=default,host=gnit) is down"
        ],
        "OSD_HOST_DOWN": [
            "host gnit (root=default) (2 osds) is down"
        ],
        "OSD_ROOT_DOWN": [
            "root default (2 osds) is down"
        ],
        "PG_STALE": [
            "pg 0.7 is stale+active+clean, acting [1,0]",
            "pg 0.6 is stale+active+clean, acting [0,1]",
            "pg 0.5 is stale+active+clean, acting [0,1]",
            "pg 0.4 is stale+active+clean, acting [0,1]",
            "pg 0.0 is stale+active+clean, acting [0,1]",
            "pg 0.1 is stale+active+clean, acting [1,0]",
            "pg 0.2 is stale+active+clean, acting [0,1]",
            "pg 0.3 is stale+active+clean, acting [0,1]"
        ]
    }
}
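
One nice property of keying the checks like this is that external tooling
can match on the code rather than on the message string.  A hypothetical
probe that only cares about a couple of the codes above might be as simple
as (again just a sketch against the fields shown in the example):

# hypothetical probe: read the json health report on stdin and exit
# nonzero if any of the watched codes are present, e.g.
#   ceph health detail -f json-pretty | python probe.py
import json, sys

WATCHED = {'OSD_DOWN', 'PG_STALE'}      # codes from the example above

health = json.loads(sys.stdin.read())
firing = WATCHED & set(health.get('checks', {}))
for code in sorted(firing):
    print('%s: %s' % (code, health['checks'][code]['message']))
sys.exit(0 if not firing else (2 if health['status'] == 'HEALTH_ERR' else 1))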

$ ceph health detail
HEALTH_WARN 2 osds down; 1 host (2 osds) down; 1 root (2 osds) down; 8 pgs stale
OSD_DOWN 2 osds down
    osd.0 (root=default,host=gnit) is down
    osd.1 (root=default,host=gnit) is down
OSD_HOST_DOWN 1 host (2 osds) down
    host gnit (root=default) (2 osds) is down
OSD_ROOT_DOWN 1 root (2 osds) down
    root default (2 osds) is down
PG_STALE 8 pgs stale
    pg 0.7 is stale+active+clean, acting [1,0]
    pg 0.6 is stale+active+clean, acting [0,1]
    pg 0.5 is stale+active+clean, acting [0,1]
    pg 0.4 is stale+active+clean, acting [0,1]
    pg 0.0 is stale+active+clean, acting [0,1]
    pg 0.1 is stale+active+clean, acting [1,0]
    pg 0.2 is stale+active+clean, acting [0,1]
    pg 0.3 is stale+active+clean, acting [0,1]
