On 11/25/2015 10:46 PM, Gregory Farnum wrote:
> On Wed, Nov 25, 2015 at 11:09 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> Hi,
>>
>> Currently we have OK, WARN and ERR as states for a Ceph cluster.
>>
>> Now, it could happen that while a Ceph cluster is in WARN state certain
>> PGs are not available due to being in peering or any non-active+? state.
>>
>> When monitoring a Ceph cluster you usually want to see OK and not worry
>> when a cluster is in WARN.
>>
>> However, with the current situation you need to check whether there are any
>> PGs in a non-active state, since that means they are currently not doing
>> any I/O.
>>
>> For example, size is set to 3, min_size is set to 2. One OSD fails and the
>> cluster starts to recover/backfill. A second OSD fails, which causes certain
>> PGs to become undersized and no longer serve I/O.
>>
>> I've seen such situations happen multiple times. VMs were running and a few
>> PGs became non-active, which effectively brought almost all I/O to a halt.
>>
>> The health stays in WARN, but a part of the cluster is not serving I/O.
>>
>> My suggestion would be:
>>
>> OK: All PGs are active+clean and no other issues
>> WARN: All PGs are active+? (degraded, recovery_wait, backfilling, etc)
>> ERR: One or more PGs are not active
>> DISASTER: Anything which currently triggers ERR
>>
>> This way you can monitor for ERR. If the cluster goes into >= ERR you
>> know you have to take action. <= WARN is just something you might
>> want to look into, but not at 03:00 on a Sunday morning.
>>
>> Does this sound reasonable?
>
> It sounds like basically you want a way of distinguishing between
> manual intervention required, and bad states which are going to be
> repaired on their own. That sounds like a good idea to me, but I'm not
> sure how feasible the specific thing here is. How long does a PG need
> to be in a not-active state before you shift into the alert mode? They
> can go through peering for a second or so when a node dies, and that
> will block IO but probably shouldn't trigger alerts.

Hmm, let's say:

mon_pg_inactive_timeout = 30

If one or more PGs are inactive for longer than 30 seconds, we go into the
error state. This gives PGs time to get through peering where needed; if the
situation isn't resolved within 30 seconds, we switch to HEALTH_ERR.

Admins can monitor for HEALTH_ERR and send out an alert when that happens.
This way you can ignore HEALTH_WARN, since you know all I/O is continuing.

> -Greg
>

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
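
[Editor's note: until something like the proposed mon_pg_inactive_timeout
exists in the monitors, the same distinction can be approximated on the
monitoring side. Below is a minimal sketch of a Nagios/Icinga-style check
that only goes critical when one or more PGs are not active, mirroring the
OK/WARN/ERR mapping proposed above. It assumes the pgmap/pgs_by_state layout
of `ceph status --format json`; field names can differ between Ceph releases,
so verify them against your cluster before relying on this.]

#!/usr/bin/env python
"""Sketch of a check that alerts only when PGs are not active."""

import json
import subprocess
import sys

# Exit codes understood by Nagios/Icinga.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def ceph_status():
    """Return `ceph status` as a parsed JSON dict."""
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    return json.loads(out)


def main():
    try:
        status = ceph_status()
    except Exception as exc:
        print("UNKNOWN: could not query ceph: %s" % exc)
        return UNKNOWN

    pgs_by_state = status.get("pgmap", {}).get("pgs_by_state", [])

    # A PG is considered to be serving I/O as long as 'active' is one
    # of its states (active+degraded, active+recovery_wait, ...).
    inactive = [s for s in pgs_by_state
                if "active" not in s.get("state_name", "").split("+")]
    not_clean = [s for s in pgs_by_state
                 if s.get("state_name") != "active+clean"]

    if inactive:
        summary = ", ".join("%s=%s" % (s["state_name"], s["count"])
                            for s in inactive)
        print("CRITICAL: PGs not active, I/O blocked: %s" % summary)
        return CRITICAL

    if not_clean:
        summary = ", ".join("%s=%s" % (s["state_name"], s["count"])
                            for s in not_clean)
        print("WARNING: all PGs active, but not clean: %s" % summary)
        return WARNING

    print("OK: all PGs active+clean")
    return OK


if __name__ == "__main__":
    sys.exit(main())

[The 30-second grace period from the proposal does not need to live in the
script: the monitoring system's retry/soft-state settings (e.g. a 15-second
check interval with two retries before a hard state) give peering a chance
to finish before an alert is sent.]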