I don't have any comment on Greg's specific concerns, but I agree that, conceptually, distinguishing between states that are likely to resolve themselves and ones that require intervention would be a nice addition.
QH
On Wed, Nov 25, 2015 at 2:46 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
It sounds like basically you want a way of distinguishing between
On Wed, Nov 25, 2015 at 11:09 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Hi,
>
> Currently we have OK, WARN and ERR as states for a Ceph cluster.
>
> Now, it could happen that while a Ceph cluster is in WARN state certain
> PGs are not available due to being in peering or any non-active+? state.
>
> When monitoring a Ceph cluster you usually want to see OK and not worry
> when a cluster is in WARN.
>
> However, with the current situation you need to check if there are any
> PGs in a non-active state since that means they are currently not doing
> any I/O.
>
> For example, size is set to 3 and min_size is set to 2. One OSD fails
> and the cluster starts to recover/backfill. A second OSD fails, which
> causes certain PGs to become undersized and no longer serve I/O.
>
> I've seen such situations happen multiple times. VMs keep running while
> a few PGs become non-active, which effectively causes almost all I/O to stop.
>
> The health stays at WARN, but part of the cluster is not serving I/O.
>
> My suggestion would be:
>
> OK: All PGs are active+clean and no other issues
> WARN: All PGs are active+? (degraded, recovery_wait, backfilling, etc)
> ERR: One or more PGs are not active
> DISASTER: Anything which currently triggers ERR
>
> This way you can monitor for ERR. If the cluster goes to >= ERR you
> know you have to take action. <= WARN is just something you might
> want to look into, but not at 03:00 on a Sunday morning.
>
> Does this sound reasonable?
bad states where manual intervention is required, and bad states which
are going to be repaired on their own. That sounds like a good idea to me, but I'm not
sure how feasible the specific thing here is. How long does a PG need
to be in a not-active state before you shift into the alert mode? They
can go through peering for a second or so when a node dies, and that
will block IO but probably shouldn't trigger alerts.
-Greg
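For anyone who wants to approximate this check today, here is a rough
sketch along the lines of what Wido describes, including a grace period to
avoid alerting on the short peering blips Greg mentions. It assumes the
`pgmap.pgs_by_state` layout found in `ceph status --format json` output
(verify the field names against your Ceph release); the 60-second grace
period is an illustrative guess, not a recommendation.

```python
import json
import subprocess
import time

def inactive_pg_count(pgs_by_state):
    """Count PGs in states that block I/O.

    Any state name without "active" in it (peering, down, incomplete,
    stale, ...) means those PGs are not serving I/O. `pgs_by_state` is
    a list of {"state_name": ..., "count": ...} dicts, as assumed to
    appear under pgmap.pgs_by_state in `ceph status --format json`.
    """
    return sum(entry["count"] for entry in pgs_by_state
               if "active" not in entry["state_name"])

def wait_for_inactive_pgs(grace_seconds=60, poll_interval=5):
    """Return the number of non-active PGs once they have persisted
    longer than grace_seconds, so a brief peering episode after a node
    death does not page anyone at 03:00 on a Sunday morning."""
    first_seen = None
    while True:
        out = subprocess.check_output(
            ["ceph", "status", "--format", "json"])
        pgs = json.loads(out)["pgmap"].get("pgs_by_state", [])
        count = inactive_pg_count(pgs)
        if count == 0:
            first_seen = None          # everything recovered; reset timer
        elif first_seen is None:
            first_seen = time.time()   # start the grace period
        elif time.time() - first_seen >= grace_seconds:
            return count               # caller raises the "ERR"-level alert
        time.sleep(poll_interval)
```

In the proposed scheme, `inactive_pg_count(...) > 0` would map to ERR,
while clusters that are degraded but fully active stay at WARN.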
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com