Re: ceph health JSON format has changed sync?

Thomas Byrne - UKRI STFC <tom.byrne@xxxxxxxxxx> · Wed, 2 Jan 2019 11:18:20 +0000

I recently spent some time looking at this, and I believe the 'summary' and 'overall_status' sections are now deprecated. The 'status' and 'checks' fields are the ones to use now.

The 'status' field gives you the OK/WARN/ERR, but returning the most severe error condition from the 'checks' section is less trivial. AFAIK all HEALTH_WARN states are treated as equally severe, and the same goes for HEALTH_ERR. We ended up formatting our single-line human-readable output as something like:

"HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN: 20 large omap objects"

This makes it obvious which check is causing which state. We needed to suppress specific checks for callouts, so we had to look at each check and its resulting state. If you're not trying to do something similar, there may be a more lightweight way to go about it.
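
For illustration, here is a rough Python sketch of that approach, not our actual script. It assumes the 'health' section of the JSON has a 'status' string and a 'checks' map in which each entry carries a 'severity' and a 'summary' with a 'message'; check that against the output of your own cluster. The sample data, the summarise_health() name and the 'suppressed' parameter are made up for the example.

```python
import json

# Illustrative sample only: the shape is an assumption about Luminous-style
# output, and the messages are shortened to match the example line above.
SAMPLE = json.loads("""
{
  "health": {
    "status": "HEALTH_ERR",
    "checks": {
      "LARGE_OMAP_OBJECTS": {"severity": "HEALTH_WARN",
                             "summary": {"message": "20 large omap objects"}},
      "PG_DAMAGED": {"severity": "HEALTH_ERR",
                     "summary": {"message": "1 inconsistent pg"}},
      "OSD_SCRUB_ERRORS": {"severity": "HEALTH_ERR",
                           "summary": {"message": "1 scrub error"}}
    }
  }
}
""")

# Lower number sorts first; unknown severities sort last.
SEVERITY_ORDER = {"HEALTH_ERR": 0, "HEALTH_WARN": 1}


def summarise_health(report, suppressed=frozenset()):
    """Return (status, one_line_text) built from the non-suppressed checks.

    Hypothetical helper: the status is derived from the remaining checks
    rather than read from the top-level 'status' field, so a suppressed
    check cannot raise the reported state.
    """
    checks = report.get("health", {}).get("checks", {})
    active = [
        (check.get("severity", "UNKNOWN"),
         check.get("summary", {}).get("message", name))
        for name, check in checks.items()
        if name not in suppressed
    ]
    if not active:
        return "HEALTH_OK", "HEALTH_OK"
    # sort() is stable, so checks of equal severity keep their input order.
    active.sort(key=lambda item: SEVERITY_ORDER.get(item[0], 2))
    status = active[0][0]
    line = ", ".join(f"{severity}: {message}" for severity, message in active)
    return status, line


if __name__ == "__main__":
    print(summarise_health(SAMPLE))
    print(summarise_health(SAMPLE, suppressed={"PG_DAMAGED", "OSD_SCRUB_ERRORS"}))
```

In real use you would feed it the parsed output of ceph --status --format=json instead of SAMPLE. If you don't need to suppress anything, reading the 'status' field directly is simpler.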

Cheers,
Tom

> -----Original Message-----
> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> On Behalf Of Jan
> Kasprzak
> Sent: 02 January 2019 09:29
> To: ceph-users@xxxxxxxxxxxxxx
> Subject:  ceph health JSON format has changed sync?
> 
> 	Hello, Ceph users,
> 
> I am afraid the following question is a FAQ, but I still was not able to find the
> answer:
> 
> I use ceph --status --format=json-pretty as a source of CEPH status for my
> Nagios monitoring. After upgrading to Luminous, I see the following in the
> JSON output when the cluster is not healthy:
> 
>         "summary": [
>             {
>                 "severity": "HEALTH_WARN",
>                 "summary": "'ceph health' JSON format has changed in luminous. If
> you see this your monitoring system is scraping the wrong fields. Disable this
> with 'mon health preluminous compat warning = false'"
>             }
>         ],
> 
> Apart from that, the JSON data seems reasonable. My question is which part
> of JSON structure are the "wrong fields" I have to avoid. Is it just the
> "summary" section, or some other parts as well? Or should I avoid the whole
> ceph --status and use something different instead?
> 
> What I want is a single machine-readable value with OK/WARNING/ERROR
> meaning, and a single human-readable text line, describing the most severe
> error condition which is currently present. What is the preferred way to get
> this data in Luminous?
> 
> 	Thanks,
> 
> -Yenya
> 
> --
> | Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
> | http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5 |
>  This is the world we live in: the way to deal with computers is to google  the
> symptoms, and hope that you don't have to watch a video. --P. Zaitcev
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com