Hi,

Currently we have OK, WARN and ERR as health states for a Ceph cluster.

It can happen that while a Ceph cluster is in WARN state, certain PGs are unavailable because they are peering or in some other non-active+? state.

When monitoring a Ceph cluster you usually want to see OK, and you don't want to have to worry when a cluster is in WARN. With the current situation, however, you still need to check whether any PGs are in a non-active state, since those PGs are not doing any I/O.

For example: size is set to 3 and min_size is set to 2. One OSD fails and the cluster starts to recover/backfill. A second OSD fails, which leaves certain PGs undersized with fewer than min_size replicas, so they no longer serve I/O.

I've seen such situations happen multiple times: VMs are running, a few PGs become non-active, and effectively almost all I/O stops. The health stays in WARN, but part of the cluster is not serving I/O.

My suggestion would be:

OK: All PGs are active+clean and no other issues
WARN: All PGs are active+? (degraded, recovery_wait, backfilling, etc)
ERR: One or more PGs are not active
DISASTER: Anything which currently triggers ERR

This way you can monitor for ERR. If the cluster goes into >= ERR you know you have to take action. <= WARN is just something you might want to look into, but not at 03:00 on Sunday morning.

Does this sound reasonable?

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
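
P.S. To make the proposed mapping concrete, here is a rough Python sketch of how the four states could be derived from a list of PG state strings. It is only an illustration of the suggestion, not how Ceph computes health today. The function name `proposed_health` and the flags `other_issues` and `triggers_current_err` are hypothetical, and I am assuming that a PG counts as active whenever "active" is one of its "+"-separated flags, and that any state containing both "active" and "clean" is good enough for OK.

```python
from typing import Iterable


def proposed_health(
    pg_states: Iterable[str],
    other_issues: bool = False,
    triggers_current_err: bool = False,
) -> str:
    """Map PG states to the proposed OK / WARN / ERR / DISASTER levels.

    pg_states            -- PG state strings such as "active+clean" or "peering"
    other_issues         -- hypothetical flag: any non-PG warning is present
    triggers_current_err -- hypothetical flag: something that raises ERR today
    """
    # DISASTER: anything which currently triggers ERR.
    if triggers_current_err:
        return "DISASTER"

    # Split each state into its flags, e.g. "active+degraded" -> {"active", "degraded"}.
    flag_sets = [set(state.split("+")) for state in pg_states]

    # ERR: one or more PGs are not active (and so are not serving I/O).
    if any("active" not in flags for flags in flag_sets):
        return "ERR"

    # WARN: all PGs are active, but some are not clean, or another issue exists
    # (assumption: other issues alone are enough to leave OK).
    if other_issues or any("clean" not in flags for flags in flag_sets):
        return "WARN"

    # OK: all PGs are active+clean and there are no other issues.
    return "OK"


if __name__ == "__main__":
    # One PG stuck in peering is enough to raise the proposed ERR level.
    print(proposed_health(["active+clean", "active+degraded", "peering"]))
```

With this ordering a monitoring check only has to alert on ERR or DISASTER, which is the point of the proposal.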