On 26-11-15 07:58, Wido den Hollander wrote:
> On 11/25/2015 10:46 PM, Gregory Farnum wrote:
>> On Wed, Nov 25, 2015 at 11:09 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>> Hi,
>>>
>>> Currently we have OK, WARN and ERR as states for a Ceph cluster.
>>>
>>> Now, it can happen that while a Ceph cluster is in the WARN state
>>> certain PGs are not available because they are peering or in some
>>> other non-active state.
>>>
>>> When monitoring a Ceph cluster you usually want to see OK and not
>>> worry when the cluster is in WARN.
>>>
>>> However, in the current situation you need to check whether there are
>>> any PGs in a non-active state, since that means they are currently
>>> not serving any I/O.
>>>
>>> For example, size is set to 3 and min_size is set to 2. One OSD fails
>>> and the cluster starts to recover/backfill. A second OSD fails, which
>>> causes certain PGs to become undersized and no longer serve I/O.
>>>
>>> I've seen such situations happen multiple times: VMs were running, a
>>> few PGs became non-active, and that effectively brought almost all
>>> I/O to a halt.
>>>
>>> The health stays in WARN, but part of the cluster is not serving I/O.
>>>
>>> My suggestion would be:
>>>
>>> OK: All PGs are active+clean and there are no other issues
>>> WARN: All PGs are active+? (degraded, recovery_wait, backfilling, etc.)
>>> ERR: One or more PGs are not active
>>> DISASTER: Anything which currently triggers ERR
>>>
>>> This way you can monitor for ERR. If the cluster goes into >= ERR you
>>> know you have to take action. <= WARN is something you might want to
>>> look into, but not at 03:00 on a Sunday morning.
>>>
>>> Does this sound reasonable?
>>
>> It sounds like you basically want a way of distinguishing between
>> states that require manual intervention and bad states which are going
>> to be repaired on their own. That sounds like a good idea to me, but
>> I'm not sure how feasible the specific thing here is. How long does a
>> PG need to be in a non-active state before you shift into alert mode?
>> PGs can go through peering for a second or so when a node dies, and
>> that will block I/O but probably shouldn't trigger alerts.
>
> Hmm, let's say:
>
> mon_pg_inactive_timeout = 30
>
> If one or more PGs are inactive for longer than 30 seconds we go into
> the error state. This gives PGs time to go through peering where
> needed; if the situation isn't resolved within 30 seconds we switch to
> HEALTH_ERR. Admins can monitor for HEALTH_ERR and send out an alert
> when that happens.
>
> This way you can ignore HEALTH_WARN since you know all I/O is
> continuing.

I created an issue for this: http://tracker.ceph.com/issues/13923

Wido

>> -Greg
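
Until something like the proposed mon_pg_inactive_timeout exists in the
monitors, the same idea can be approximated from the monitoring side:
poll the PG states and only raise an alert once a PG has been non-active
for longer than a grace period. Below is a minimal Python sketch of that
approach. It assumes "ceph pg dump pgs_brief --format=json" prints a JSON
list of objects with "pgid" and "state" fields (the exact JSON layout
differs between Ceph releases, so adjust the parsing for your version);
the 30-second threshold mirrors the value suggested above.

#!/usr/bin/env python
# Poll PG states and raise an alert only when a PG has been non-active
# for longer than a grace period -- the same idea as the proposed
# mon_pg_inactive_timeout, but implemented from the monitoring side.
#
# Assumption: "ceph pg dump pgs_brief --format=json" prints a JSON list
# of objects with "pgid" and "state" fields. Newer Ceph releases wrap
# this output differently, so adjust the parsing for your version.

import json
import subprocess
import time

INACTIVE_TIMEOUT = 30   # seconds, the value suggested in this thread
POLL_INTERVAL = 5       # seconds between polls


def pg_states():
    """Return a dict mapping pgid -> current state string."""
    out = subprocess.check_output(
        ["ceph", "pg", "dump", "pgs_brief", "--format=json"])
    return {pg["pgid"]: pg["state"]
            for pg in json.loads(out.decode("utf-8"))}


def main():
    first_inactive = {}  # pgid -> timestamp it was first seen non-active

    while True:
        now = time.time()
        for pgid, state in pg_states().items():
            if "active" in state.split("+"):
                # The PG serves I/O again; forget any earlier observation.
                first_inactive.pop(pgid, None)
            else:
                first_inactive.setdefault(pgid, now)

        stuck = sorted(pgid for pgid, since in first_inactive.items()
                       if now - since >= INACTIVE_TIMEOUT)
        if stuck:
            # Hook your real alerting (Nagios, Zabbix, email, ...) in here.
            print("ALERT: PGs non-active for >= %ds: %s"
                  % (INACTIVE_TIMEOUT, ", ".join(stuck)))

        time.sleep(POLL_INTERVAL)


if __name__ == "__main__":
    main()

Run something like this on a monitoring host with a client keyring that
is allowed to read the PG map, and wire the print statement into whatever
alerting system you already use; HEALTH_WARN itself can then be ignored
for paging purposes, as suggested in the thread.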