Hi,

We recently had a problem that took 3 of our 32 OSD hosts offline for about 10 minutes. The hosts are now back in the cluster as expected and backfilling is in progress. However, we are seeing a couple of problems:

1. Ceph is flagging a handful of PGs as backfill_toofull when they aren't: https://tracker.ceph.com/issues/61839
2. Periodically it raises an OSD_FULL error.

Are there any plans to look at resolving bug #61839?

Each time an OSD_FULL error has been reported, the OSD in question has been at less than 75% usage, and that OSD wasn't used by any of the PGs reporting backfill_toofull. I currently have full_ratio set to 0.97 and nearfull_ratio set to 0.87, so the OSDs are nowhere near these levels. The raw usage of the individual OSDs ranges from roughly 60% to 80%, and the raw usage of the cluster as a whole is about 75%. We do not get any "near full" warnings before OSD_FULL is set. Having a production system go offline instantly and without warning isn't ideal, and these things always seem to choose the least convenient moment.

These problems only occur after a host failure. Each time we have added additional OSD hosts to the cluster, the backfilling has finished without problems. We are currently running Reef 18.2.4, but I experienced the same problems on Pacific 16.2.10.

Has anyone else seen this behaviour?

Thanks,
Gerard
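
P.S. In case it helps anyone reproduce the checks: the ratios, per-OSD usage and flagged PGs mentioned above can be read with the standard CLI, roughly along these lines (assuming a node with the ceph client and an admin keyring; output is obviously cluster-specific):

    # full_ratio / backfillfull_ratio / nearfull_ratio as stored in the OSD map
    ceph osd dump | grep full_ratio

    # per-OSD utilisation, to compare against those ratios
    ceph osd df

    # PGs currently flagged backfill_toofull, with their UP/ACTING OSD sets
    ceph pg ls backfill_toofull

    # detailed health output, including which OSD(s) the OSD_FULL warning names
    ceph health detail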