Re: Random Health_warn

Scottix <scottix@xxxxxxxxx> · Thu, 23 Feb 2017 23:15:57 +0000

That sounds about right; I do see blocked requests sometimes when the cluster is under really heavy load.
Looking at some examples, I think the "summary" field should list the issues:
"summary": [],
"overall_status": "HEALTH_OK",

I'll try logging that too.
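
A minimal sketch of what logging the summary next to the status could look like. It assumes the JSON comes from `ceph health --format json` and that each "summary" entry is an object with a "summary" text field, as seen on Jewel; `format_health_line` is a hypothetical helper name, not part of Ceph.

```python
import json
from datetime import datetime


def format_health_line(health_json, now=None):
    """Build one log line from the JSON printed by `ceph health --format json`.

    Assumption: each entry in "summary" is an object carrying a "summary"
    text field. Missing keys fall back to safe defaults.
    """
    health = json.loads(health_json)
    status = health.get("overall_status", "UNKNOWN")
    reasons = [item.get("summary", "") for item in health.get("summary", [])]
    # Same timestamp layout as the monitoring log quoted further down.
    stamp = (now or datetime.now()).strftime("%m/%d/%Y %I:%M:%S %p")
    line = f"{stamp} {status}"
    if reasons:
        # Append the reasons so a HEALTH_WARN line explains itself.
        line += " " + "; ".join(reasons)
    return line
```

With an empty "summary" list the line is just the timestamp and status, so HEALTH_OK entries stay as short as they are today.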

Scott

On Thu, Feb 23, 2017 at 3:00 PM David Turner <david.turner@xxxxxxxxxxxxxxxx> wrote:

There are multiple ways to get more information about the health state.  The CLI has these two options:

ceph health detail

ceph status

I also like using ceph-dash.  ( https://github.com/Crapworks/ceph-dash )  It has an associated nagios check to scrape the ceph-dash page.

I personally run `watch ceph status` when I'm monitoring the cluster closely.  It will show you things like blocked requests, flapping OSDs, mon clock skew, or whatever problem is causing the HEALTH_WARN state.  The most likely cause of an intermittent HEALTH_WARN is blocked requests.  Those can be caused by any number of things, which you would need to diagnose further if that is what is causing the HEALTH_WARN state.

David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943

If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.

________________________________________

From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of John Spray [jspray@xxxxxxxxxx]
Sent: Thursday, February 23, 2017 3:47 PM
To: Scottix
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Random Health_warn

On Thu, Feb 23, 2017 at 9:49 PM, Scottix <scottix@xxxxxxxxx> wrote:

> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
> We are seeing a weird behavior or not sure how to diagnose what could be
> going on. We started monitoring the overall_status from the json query and
> every once in a while we would get a HEALTH_WARN for a minute or two.
>
> Monitoring logs.
> 02/23/2017 07:25:54 AM HEALTH_OK
> 02/23/2017 07:24:54 AM HEALTH_WARN
> 02/23/2017 07:23:55 AM HEALTH_OK
> 02/23/2017 07:22:54 AM HEALTH_OK
> ...
> 02/23/2017 05:13:55 AM HEALTH_OK
> 02/23/2017 05:12:54 AM HEALTH_WARN
> 02/23/2017 05:11:54 AM HEALTH_WARN
> 02/23/2017 05:10:54 AM HEALTH_OK
> 02/23/2017 05:09:54 AM HEALTH_OK
>
> When I check the mon leader logs there is no indication of an error or
> issues that could be occurring. Is there a way to find what is causing the
> HEALTH_WARN?

Possibly not without grabbing more than just the overall status at the
same time as you're grabbing the OK/WARN status.

Internally, the OK/WARN/ERROR health state is generated on-demand by
applying a bunch of checks to the state of the system when the user
runs the health command -- the system doesn't know it's in a warning
state until it's asked.  Often you will see a corresponding log
message, but not necessarily.
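
Because the health state is evaluated on demand, the status and its reasons are best taken from the same call rather than from a follow-up query. A rough sketch under assumptions: it shells out to `ceph health detail --format json` and expects top-level "overall_status", "summary" and "detail" keys, which is how Jewel-era output looks; `snapshot` and the injectable `run` parameter are hypothetical names for illustration.

```python
import json
import subprocess


def run_ceph_health_detail():
    """Run `ceph health detail --format json` once and return its stdout."""
    return subprocess.check_output(
        ["ceph", "health", "detail", "--format", "json"]
    ).decode("utf-8")


def snapshot(run=run_ceph_health_detail):
    """Return (status, summary, detail) from a single health evaluation.

    `run` is any callable returning the JSON text; it is a parameter so the
    parsing can be exercised without a live cluster.
    """
    health = json.loads(run())
    return (
        health.get("overall_status", "UNKNOWN"),
        health.get("summary", []),
        health.get("detail", []),
    )
```

A poller that stores all three values on each run keeps the reasons that were in effect at the moment a HEALTH_WARN was reported, even if the condition clears a minute later.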

John

> Best,
> Scott
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com