Re: Random Health_warn

That sounds about right; I do see blocked requests sometimes when the cluster is under really heavy load.

Looking at some examples, I think the summary field should list the issues:
"summary": [],
"overall_status": "HEALTH_OK",

I'll try logging that too.
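
A minimal sketch of what logging both fields could look like, in Python. It assumes the JSON carries `overall_status` and `summary` as in the snippet above, either at the top level or nested under a `health` key, and that each `summary` entry is an object with a `summary` text field; that entry shape is an assumption and should be checked against the actual output. The name `format_health_line` is hypothetical.

```python
import json
from datetime import datetime


def format_health_line(raw_json, now=None):
    """Build one log line with overall_status and any summary messages."""
    data = json.loads(raw_json)
    # Assumption: the fields are either at the top level or under "health".
    health = data if "overall_status" in data else data.get("health", {})
    status = health.get("overall_status", "UNKNOWN")

    messages = []
    for item in health.get("summary", []):
        # Assumed entry shape: {"severity": ..., "summary": ...}
        if isinstance(item, dict):
            messages.append(item.get("summary", json.dumps(item)))
        else:
            messages.append(str(item))

    # Same timestamp layout as the existing monitoring log.
    stamp = (now or datetime.now()).strftime("%m/%d/%Y %I:%M:%S %p")
    reasons = "; ".join(messages) if messages else "-"
    return f"{stamp} {status} {reasons}"
```

With an empty `summary` list the line ends in `-`, so a HEALTH_WARN entry without a listed reason still stands out in the log.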

Scott

On Thu, Feb 23, 2017 at 3:00 PM David Turner <david.turner@xxxxxxxxxxxxxxxx> wrote:

There are multiple ways to get more information about the health state.  The CLI has these two options:
ceph health detail
ceph status

I also like using ceph-dash.  ( https://github.com/Crapworks/ceph-dash )  It has an associated nagios check to scrape the ceph-dash page.

I personally run `watch ceph status` when I'm monitoring the cluster closely.  It will show you things like blocked requests, flapping OSDs, mon clock skew, or whatever else is causing the HEALTH_WARN state.  The most likely cause of an intermittent HEALTH_WARN is blocked requests.  Those can have any number of causes, which you would need to diagnose further if they turn out to be behind the warning.


David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943




________________________________________
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of John Spray [jspray@xxxxxxxxxx]
Sent: Thursday, February 23, 2017 3:47 PM
To: Scottix
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Random Health_warn


On Thu, Feb 23, 2017 at 9:49 PM, Scottix <scottix@xxxxxxxxx> wrote:
> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
> We are seeing some weird behavior and are not sure how to diagnose what
> could be going on. We started monitoring the overall_status from the json
> query, and every once in a while we get a HEALTH_WARN for a minute or two.
>
> Monitoring logs.
> 02/23/2017 07:25:54 AM HEALTH_OK
> 02/23/2017 07:24:54 AM HEALTH_WARN
> 02/23/2017 07:23:55 AM HEALTH_OK
> 02/23/2017 07:22:54 AM HEALTH_OK
> ...
> 02/23/2017 05:13:55 AM HEALTH_OK
> 02/23/2017 05:12:54 AM HEALTH_WARN
> 02/23/2017 05:11:54 AM HEALTH_WARN
> 02/23/2017 05:10:54 AM HEALTH_OK
> 02/23/2017 05:09:54 AM HEALTH_OK
>
> When I check the mon leader logs there is no indication of an error or
> issues that could be occurring. Is there a way to find what is causing the
> HEALTH_WARN?

Possibly not without grabbing more than just the overall status at the
same time as you're grabbing the OK/WARN status.

Internally, the OK/WARN/ERROR health state is generated on-demand by
applying a bunch of checks to the state of the system when the user
runs the health command -- the system doesn't know it's in a warning
state until it's asked.  Often you will see a corresponding log
message, but not necessarily.
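
Because the state is computed on demand, one way to capture the reason is to take the status and the detail from a single call and keep the full output whenever the result is not HEALTH_OK. A rough Python sketch, assuming `ceph health detail --format json` is available on the monitoring host and returns a top-level `overall_status` field; the function names and the injectable `run` parameter are hypothetical, added only so the logic can be exercised without a cluster:

```python
import json
import subprocess


def run_ceph_health_detail():
    """Run the CLI once and return its JSON output as text."""
    result = subprocess.run(
        ["ceph", "health", "detail", "--format", "json"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout


def poll_once(out, run=run_ceph_health_detail):
    """Write the status, plus the raw output when it is not HEALTH_OK."""
    raw = run()
    # Status and detail come from the same evaluation of the checks.
    status = json.loads(raw).get("overall_status", "UNKNOWN")
    out.write(status + "\n")
    if status != "HEALTH_OK":
        out.write(raw.rstrip("\n") + "\n")
    return status
```

Calling `poll_once(sys.stdout)` from the existing once-a-minute monitoring job would then leave the detail next to each HEALTH_WARN entry.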

John

> Best,
> Scott
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
