Hi,

(In Nautilus) I noticed that the acting set for a PG in active+remapped+backfilling (or backfill_wait) can violate the failure domain rule.

We have a 3x replicated pool with rack as the failure domain. Following a host outage I noticed several PGs like:

  75.200  7570  0  7570  0  31583745536  0  0  3058  active+remapped+backfill_wait  55m  3208334'13480873  3208334:23611310  [596,502,717]p596  [596,502,424]p596  2020-05-28 05:44:42.604307  2020-05-28 05:44:42.604307

Checking the up and acting sets of those PGs, I have:

  OK:   PG 75.200 has no failure domain problem in up set [596, 502, 717] with racks ['BA09', 'BA10', 'BA12']
  WARN: PG 75.200 has a failure domain problem in acting set [596, 502, 424] with racks ['BA09', 'BA10', 'BA09']

If rack BA09 were to go down, the PG would drop below min_size.

(The full pg query output is at https://pastebin.com/rrEQMnyC)

Is this https://tracker.ceph.com/issues/3360 ?

  # ceph osd dump | grep 75.200
  pg_temp 75.200 [596,502,424]

Shouldn't we mark such a PG degraded (or similar) and raise HEALTH_WARN when the failure domain is violated?

Cheers, Dan
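
P.S. For anyone who wants to reproduce the check above, here is a minimal sketch of how it could be scripted. It assumes the Nautilus JSON layouts of `ceph osd tree -f json` and `ceph pg dump pgs_brief -f json`; the helper names (ceph_json, osd_to_rack, check_pgs) are just illustrative, and key names may differ on other releases.

#!/usr/bin/env python3
# Sketch: flag PGs whose acting set places two replicas in the same rack.
# Assumes Nautilus-style JSON from `ceph osd tree` and `ceph pg dump pgs_brief`;
# treat this as illustrative rather than a drop-in tool.
import json
import subprocess


def ceph_json(*args):
    """Run a ceph CLI command with -f json and parse the output."""
    return json.loads(subprocess.check_output(["ceph", *args, "-f", "json"]))


def osd_to_rack():
    """Map OSD id -> rack name by walking the CRUSH tree."""
    tree = ceph_json("osd", "tree")
    nodes = {n["id"]: n for n in tree["nodes"]}
    mapping = {}
    for node in tree["nodes"]:
        if node["type"] != "rack":
            continue
        # Depth-first walk below this rack to find all OSDs it contains.
        stack = list(node.get("children", []))
        while stack:
            child = nodes[stack.pop()]
            if child["type"] == "osd":
                mapping[child["id"]] = node["name"]
            else:
                stack.extend(child.get("children", []))
    return mapping


def check_pgs():
    racks = osd_to_rack()
    pgs = ceph_json("pg", "dump", "pgs_brief")
    # Nautilus wraps the list in "pg_stats"; fall back to a bare list otherwise.
    stats = pgs.get("pg_stats", []) if isinstance(pgs, dict) else pgs
    for pg in stats:
        acting_racks = [racks.get(osd, "unknown") for osd in pg["acting"]]
        if len(set(acting_racks)) < len(acting_racks):
            print(f"WARN: {pg['pgid']} acting {pg['acting']} spans racks {acting_racks}")


if __name__ == "__main__":
    check_pgs()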