OSD_FULL after OSD Node Failures

Hi,

We recently had a problem that took 3 of our 32 OSD hosts offline for about 10 minutes.  The hosts are now back in the cluster as expected and backfilling is in progress.  However, we are seeing a couple of problems.

We are seeing:

1. Ceph is flagging a handful of PGs as backfill_toofull when they aren't actually too full (see the sketch after this list): https://tracker.ceph.com/issues/61839
2. Periodically it generates an OSD_FULL error.  
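To illustrate point 1, the sort of cross-check I mean is sketched below: list the PGs currently carrying the backfill_toofull state along with the OSDs they map to, so they can be compared against the actual OSD usage.  This is only a rough Python sketch that shells out to the ceph CLI; the JSON field names it reads ('pg_stats', 'pgid', 'state', 'up', 'acting') are assumptions that may differ between releases.

#!/usr/bin/env python3
"""Rough sketch: list PGs flagged backfill_toofull and the OSDs they map to.

Assumes the 'ceph' CLI is on PATH; the JSON layout of 'ceph pg dump pgs'
(e.g. a top-level 'pg_stats' list with 'pgid', 'state', 'up' and 'acting'
per entry) varies between releases, so adjust the field names as needed.
"""
import json
import subprocess


def ceph_json(*args):
    """Run a ceph subcommand and return its parsed JSON output."""
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)


def main():
    dump = ceph_json("pg", "dump", "pgs")
    # Some releases wrap the PG list in 'pg_stats'; others return it directly.
    pg_stats = dump.get("pg_stats", dump) if isinstance(dump, dict) else dump

    for pg in pg_stats:
        if "backfill_toofull" in pg.get("state", ""):
            print(f"{pg['pgid']}  state={pg['state']}  "
                  f"up={pg.get('up')}  acting={pg.get('acting')}")


if __name__ == "__main__":
    main()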

Are there any plans to look at resolving bug #61839?

Each time an OSD_FULL error has been reported, the OSD in question has been at less than 75% usage, and it wasn't being used by any of the PGs reporting backfill_toofull.  I currently have full_ratio set to 0.97 and nearfull_ratio set to 0.87, so the OSDs are nowhere near those levels.  The raw usage of individual OSDs in the cluster ranges from about 60% to 80%, and the raw usage of the cluster as a whole is about 75%.
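For reference, this is roughly how the per-OSD usage can be cross-checked against the configured ratios.  Again, it's only a rough Python sketch that shells out to the ceph CLI; the JSON keys it reads ('full_ratio', 'backfillfull_ratio', 'nearfull_ratio', 'nodes', 'utilization') are assumptions from memory and may differ between releases.

#!/usr/bin/env python3
"""Rough sketch: compare each OSD's utilisation against the cluster ratios.

Assumes the 'ceph' CLI is on PATH and that 'ceph osd dump' / 'ceph osd df'
expose the '*_ratio' keys and a 'nodes[].utilization' percentage in their
JSON output (field names may differ between releases).
"""
import json
import subprocess


def ceph_json(*args):
    """Run a ceph subcommand and return its parsed JSON output."""
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)


def main():
    osd_map = ceph_json("osd", "dump")
    full = osd_map.get("full_ratio", 0.95)
    backfillfull = osd_map.get("backfillfull_ratio", 0.90)  # backfill_toofull threshold
    nearfull = osd_map.get("nearfull_ratio", 0.85)
    print(f"full={full} backfillfull={backfillfull} nearfull={nearfull}")

    for node in ceph_json("osd", "df")["nodes"]:
        util = node["utilization"] / 100.0  # 'utilization' is reported as a percentage
        if util >= nearfull:
            level = ("FULL" if util >= full
                     else "BACKFILLFULL" if util >= backfillfull
                     else "NEARFULL")
            print(f"osd.{node['id']}  {util:.1%}  -> {level}")


if __name__ == "__main__":
    main()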

We do not get any "near full" warnings before OSD_FULL is set.  Having a production system go offline instantly and without warning isn't ideal, and these things always seem to happen at the least convenient moment.

These problems only happen after a host failure.  Each time we have added additional OSD hosts to the cluster, the backfilling has completed without any problems.

We are currently running Reef 18.2.4, but I experienced the same problems on Pacific 16.2.10.

Has anyone else seen this behaviour?

Thanks
Gerard
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


