Would HEALTH_DISASTER be a good addition?

Hi,

Currently we have OK, WARN and ERR as states for a Ceph cluster.

Now, it can happen that while a Ceph cluster is in WARN state certain
PGs are unavailable because they are peering or in some other non-active
state.

When monitoring a Ceph cluster you usually want to see OK and not worry
when a cluster is in WARN.

However, with the current situation you need to check if there are any
PGs in a non-active state since that means they are currently not doing
any I/O.

For example, size is set to 3 and min_size is set to 2. One OSD fails and
the cluster starts to recover/backfill. Then a second OSD fails, which
leaves certain PGs with fewer than min_size replicas, so they stop
serving I/O.
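
As a rough illustration of the arithmetic in that example, here is a
minimal sketch. The function names are hypothetical and not part of
Ceph; it only assumes that a PG serves I/O while the number of
surviving replicas is at least min_size.

```python
def replicas_left(size: int, failed: int) -> int:
    """Replicas of a PG still available after `failed` of its OSDs are lost."""
    return max(size - failed, 0)


def serves_io(size: int, min_size: int, failed: int) -> bool:
    """Assumption: a PG keeps serving I/O while replicas_left >= min_size."""
    return replicas_left(size, failed) >= min_size


# size=3, min_size=2, as in the example above
after_one_failure = serves_io(size=3, min_size=2, failed=1)
after_two_failures = serves_io(size=3, min_size=2, failed=2)
print(after_one_failure, after_two_failures)
```

With one failed OSD a PG still has two replicas and stays active; with
two failed OSDs it has one replica, which is below min_size.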

I've seen such situations happen multiple times: VMs were running, a few
PGs became non-active, and that effectively stopped almost all I/O.

The health stays in WARN, but a certain part of the cluster is not
serving I/O.

My suggestion would be:

OK: All PGs are active+clean and no other issues
WARN: All PGs are active+? (degraded, recovery_wait, backfilling, etc)
ERR: One or more PGs are not active
DISASTER: Anything which currently triggers ERR
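
To make the proposed mapping concrete, here is a minimal sketch of how
the four levels could be derived from a list of PG state strings. This
is not how Ceph computes health today; the names Health and classify,
and the two boolean parameters, are hypothetical. It also assumes that
a PG whose state contains both "active" and "clean" (for example
active+clean+scrubbing) counts as clean.

```python
from enum import IntEnum


class Health(IntEnum):
    # Ordered so that ">= ERR" comparisons work for alerting.
    OK = 0
    WARN = 1
    ERR = 2
    DISASTER = 3


def classify(pg_states, other_issues=False, legacy_err=False):
    """Map PG states to the proposed health level.

    pg_states:    iterable of strings such as "active+clean" or "peering"
    other_issues: True if anything else would raise a warning today
    legacy_err:   True if anything that currently triggers ERR is present
    """
    if legacy_err:
        return Health.DISASTER
    flags = [set(state.split("+")) for state in pg_states]
    if any("active" not in f for f in flags):
        return Health.ERR        # one or more PGs are not active
    if other_issues or any("clean" not in f for f in flags):
        return Health.WARN       # all PGs active, but not all clean
    return Health.OK


level = classify(["active+clean", "active+undersized+degraded", "peering"])
if level >= Health.ERR:
    print("take action:", level.name)
```

Because the levels are ordered, a monitoring check only needs a single
comparison against ERR to decide whether to page someone.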

This way you can monitor for ERR. If the cluster goes into >= ERR you
know you have to take action. <= WARN is just something you might
want to look into, but not at 03:00 on Sunday morning.

Does this sound reasonable?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


