I've put together a rework of the cluster health checks at

	https://github.com/ceph/ceph/pull/15643

based on John's original proposal in

	http://tracker.ceph.com/issues/7192

(with a few changes).  I think it's pretty complete, except that the MDSMonitor new-style checks aren't implemented yet.

This would be a semi-incompatible change with pre-luminous ceph in that

 - the structured (json/xml) health output is totally different
 - the plaintext health *detail* output is different
 - specific error messages are a bit different.  Since I was reimplementing them anyway, I took the liberty of revising what information goes into the summary and detail in several cases.

Let me know what you think!

Thanks-
sage


$ ceph -s
  cluster:
    id:     9ee7f49c-57c3-4686-afd1-75b3a8f08c73
    health: HEALTH_WARN
            2 osds down
            1 host (2 osds) down
            1 root (2 osds) down
            8 pgs stale

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    osd: 2 osds: 0 up, 2 in

  data:
    pools:   1 pools, 8 pgs
    objects: 0 objects, 0 bytes
    usage:   414 GB used, 330 GB / 744 GB avail
    pgs:     8 stale+active+clean

$ ceph health detail -f json-pretty
{
    "checks": {
        "OSD_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "2 osds down"
        },
        "OSD_HOST_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "1 host (2 osds) down"
        },
        "OSD_ROOT_DOWN": {
            "severity": "HEALTH_WARN",
            "message": "1 root (2 osds) down"
        },
        "PG_STALE": {
            "severity": "HEALTH_WARN",
            "message": "8 pgs stale"
        }
    },
    "status": "HEALTH_WARN",
    "detail": {
        "OSD_DOWN": [
            "osd.0 (root=default,host=gnit) is down",
            "osd.1 (root=default,host=gnit) is down"
        ],
        "OSD_HOST_DOWN": [
            "host gnit (root=default) (2 osds) is down"
        ],
        "OSD_ROOT_DOWN": [
            "root default (2 osds) is down"
        ],
        "PG_STALE": [
            "pg 0.7 is stale+active+clean, acting [1,0]",
            "pg 0.6 is stale+active+clean, acting [0,1]",
            "pg 0.5 is stale+active+clean, acting [0,1]",
            "pg 0.4 is stale+active+clean, acting [0,1]",
            "pg 0.0 is stale+active+clean, acting [0,1]",
            "pg 0.1 is stale+active+clean, acting [1,0]",
            "pg 0.2 is stale+active+clean, acting [0,1]",
            "pg 0.3 is stale+active+clean, acting [0,1]"
        ]
    }
}

$ ceph health detail
HEALTH_WARN 2 osds down; 1 host (2 osds) down; 1 root (2 osds) down; 8 pgs stale
OSD_DOWN 2 osds down
    osd.0 (root=default,host=gnit) is down
    osd.1 (root=default,host=gnit) is down
OSD_HOST_DOWN 1 host (2 osds) down
    host gnit (root=default) (2 osds) is down
OSD_ROOT_DOWN 1 root (2 osds) down
    root default (2 osds) is down
PG_STALE 8 pgs stale
    pg 0.7 is stale+active+clean, acting [1,0]
    pg 0.6 is stale+active+clean, acting [0,1]
    pg 0.5 is stale+active+clean, acting [0,1]
    pg 0.4 is stale+active+clean, acting [0,1]
    pg 0.0 is stale+active+clean, acting [0,1]
    pg 0.1 is stale+active+clean, acting [1,0]
    pg 0.2 is stale+active+clean, acting [0,1]
    pg 0.3 is stale+active+clean, acting [0,1]
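
For anyone scripting against the new structured output, here is a minimal Python sketch of how a monitoring script might consume it.  The field names (status, checks, severity, message, detail) are taken from the json-pretty example above; the subprocess invocation and helper names are just assumptions for illustration, not part of the PR.

import json
import subprocess


def fetch_health_report():
    # Ask the cluster for the structured health report.  Assumes the
    # "ceph" CLI is on PATH and can reach a running cluster.
    out = subprocess.check_output(
        ["ceph", "health", "detail", "-f", "json-pretty"])
    return json.loads(out)


def print_health_report(report):
    # Overall status, e.g. "HEALTH_WARN".
    print(report["status"])
    # One block per named check: severity and summary message, followed
    # by any per-check detail lines keyed by the same check name.
    details = report.get("detail", {})
    for name, check in report["checks"].items():
        print("{} [{}]: {}".format(name, check["severity"], check["message"]))
        for line in details.get(name, []):
            print("    " + line)


if __name__ == "__main__":
    print_health_report(fetch_health_report())

The idea is that the stable check names (OSD_DOWN, PG_STALE, ...) give tools a key to match on, instead of parsing free-form summary strings.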