On 26-11-15 07:58, Wido den Hollander wrote:
> On 11/25/2015 10:46 PM, Gregory Farnum wrote:
>> On Wed, Nov 25, 2015 at 11:09 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>> Hi,
>>>
>>> Currently we have OK, WARN and ERR as states for a Ceph cluster.
>>>
>>> Now, it can happen that while a Ceph cluster is in the WARN state
>>> certain PGs are not available because they are peering or in some
>>> other non-active state.
>>>
>>> When monitoring a Ceph cluster you usually want to see OK and not
>>> worry when the cluster is in WARN.
>>>
>>> However, in the current situation you need to check whether there are
>>> any PGs in a non-active state, since that means they are currently
>>> not serving any I/O.
>>>
>>> For example, size is set to 3 and min_size is set to 2. One OSD fails
>>> and the cluster starts to recover/backfill. A second OSD fails, which
>>> causes certain PGs to become undersized and no longer serve I/O.
>>>
>>> I've seen such situations happen multiple times: VMs were running, a
>>> few PGs became non-active, and that effectively brought almost all
>>> I/O to a halt.
>>>
>>> The health stays in WARN, but part of the cluster is not serving I/O.
>>>
>>> My suggestion would be:
>>>
>>> OK: All PGs are active+clean and there are no other issues
>>> WARN: All PGs are active+? (degraded, recovery_wait, backfilling, etc.)
>>> ERR: One or more PGs are not active
>>> DISASTER: Anything which currently triggers ERR
>>>
>>> This way you can monitor for ERR. If the cluster goes into >= ERR you
>>> know you have to take action. <= WARN is something you might want to
>>> look into, but not at 03:00 on a Sunday morning.
>>>
>>> Does this sound reasonable?
>>
>> It sounds like you basically want a way of distinguishing between
>> states that require manual intervention and bad states which are going
>> to be repaired on their own. That sounds like a good idea to me, but
>> I'm not sure how feasible the specific thing here is. How long does a
>> PG need to be in a non-active state before you shift into alert mode?
>> PGs can go through peering for a second or so when a node dies, and
>> that will block I/O but probably shouldn't trigger alerts.
>
> Hmm, let's say:
>
> mon_pg_inactive_timeout = 30
>
> If one or more PGs are inactive for longer than 30 seconds we go into
> the error state. This gives PGs time to go through peering where
> needed; if the situation isn't resolved within 30 seconds we switch to
> HEALTH_ERR. Admins can monitor for HEALTH_ERR and send out an alert
> when that happens.
>
> This way you can ignore HEALTH_WARN since you know all I/O is
> continuing.

I created an issue for this: http://tracker.ceph.com/issues/13923

Wido

>> -Greg
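
Until something like the proposed mon_pg_inactive_timeout exists in the
monitors, the same idea can be approximated from the monitoring side:
poll the PG states and only raise an alert once a PG has been non-active
for longer than a grace period. Below is a minimal Python sketch of that
approach. It assumes "ceph pg dump pgs_brief --format=json" prints a JSON
list of objects with "pgid" and "state" fields (the exact JSON layout
differs between Ceph releases, so adjust the parsing for your version);
the 30-second threshold mirrors the value suggested above.

#!/usr/bin/env python
# Poll PG states and raise an alert only when a PG has been non-active
# for longer than a grace period -- the same idea as the proposed
# mon_pg_inactive_timeout, but implemented from the monitoring side.
#
# Assumption: "ceph pg dump pgs_brief --format=json" prints a JSON list
# of objects with "pgid" and "state" fields. Newer Ceph releases wrap
# this output differently, so adjust the parsing for your version.

import json
import subprocess
import time

INACTIVE_TIMEOUT = 30   # seconds, the value suggested in this thread
POLL_INTERVAL = 5       # seconds between polls


def pg_states():
    """Return a dict mapping pgid -> current state string."""
    out = subprocess.check_output(
        ["ceph", "pg", "dump", "pgs_brief", "--format=json"])
    return {pg["pgid"]: pg["state"]
            for pg in json.loads(out.decode("utf-8"))}


def main():
    first_inactive = {}  # pgid -> timestamp it was first seen non-active

    while True:
        now = time.time()
        for pgid, state in pg_states().items():
            if "active" in state.split("+"):
                # The PG serves I/O again; forget any earlier observation.
                first_inactive.pop(pgid, None)
            else:
                first_inactive.setdefault(pgid, now)

        stuck = sorted(pgid for pgid, since in first_inactive.items()
                       if now - since >= INACTIVE_TIMEOUT)
        if stuck:
            # Hook your real alerting (Nagios, Zabbix, email, ...) in here.
            print("ALERT: PGs non-active for >= %ds: %s"
                  % (INACTIVE_TIMEOUT, ", ".join(stuck)))

        time.sleep(POLL_INTERVAL)


if __name__ == "__main__":
    main()

Run something like this on a monitoring host with a client keyring that
is allowed to read the PG map, and wire the print statement into whatever
alerting system you already use; HEALTH_WARN itself can then be ignored
for paging purposes, as suggested in the thread.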