We're still struggling to get our Ceph cluster back to HEALTH_OK. As I
understand it, we have several compounding issues interfering with recovery.
To summarize: we have a cluster of 22 OSD nodes running Ceph 16.2.x.
About a month back one of the OSD nodes broke down (just the OS disk,
but we didn't have a cold spare available, so it took a week to get it
fixed). Since the node failure, Ceph has of course been repairing the
damage, but that exposed another problem: our OSDs are really unevenly
balanced (the lowest below 50% full, the highest around 85%). So
whenever a disk fails (and two have since then), the load spreads over
the remaining OSDs and our fullest OSDs go over the 85% threshold,
slowing down recovery, normal use and rebalancing.
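For reference, the imbalance shows up directly in the per-OSD
utilization report (standard commands, nothing cluster-specific
assumed):

```shell
# Per-OSD utilization; the %USE column is what's so uneven here.
ceph osd df tree

# The last lines of `ceph osd df` summarize the spread
# (TOTAL plus MIN/MAX VAR and STDDEV across OSDs):
ceph osd df | tail -n 2
```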
We also had issues with degraded PGs that weren't being repaired
(because we had enabled scrubbing during recovery, after getting
messages that lots of PGs weren't being scrubbed in time).
Now there's still one PG degraded because one object is unfound. This
whole error state has been dragging on far too long, and while it has, I
started wondering why the balancer wasn't doing its job. It turns out
the balancer only runs when the cluster is OK, or at least has nothing
degraded in it. But the balancer hadn't done its job even when our
cluster was healthy for a long time before: we added some 8 nodes a few
years ago, and the newer nodes still have the lowest-used OSDs.
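The balancer side can at least be inspected with the standard mgr
balancer module commands (assuming the module is enabled); switching to
upmap mode is often suggested for clusters that crush-compat leaves
unbalanced, though that's an assumption about this cluster, not
something I've verified here:

```shell
# Is the balancer active, and in which mode?
ceph balancer status

# upmap mode generally balances much better than crush-compat,
# but requires all clients to be Luminous or newer:
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
```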
Our cluster is at about 70-71% usage overall, but with this imbalance we
cannot grow any more. Between the single-node failure (now resolved) and
ongoing disk failures (we are seeing a handful of OSDs with
read-repaired messages), it looks like we won't get back to HEALTH_OK
for a while.
I'm trying to mitigate this by reweighting the fullest OSDs, but they
keep going back over the threshold, while the emptiest OSDs have plenty
of space (just 55% full now).
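As a back-of-the-envelope for the manual reweighting: scaling an OSD's
reweight by (cluster average utilization / that OSD's utilization)
gives a first guess at a value that pulls it back toward the mean. The
70/85 figures below are just this cluster's numbers, plugged in as an
example:

```shell
# Rough reweight estimate for one overfull OSD:
#   new_reweight = current_reweight * (avg %USE / this OSD's %USE)
avg=70    # cluster average %USE (from `ceph osd df`)
use=85    # this OSD's %USE
cur=1.0   # this OSD's current reweight
awk -v a="$avg" -v u="$use" -v c="$cur" \
    'BEGIN { printf "%.2f\n", c * a / u }'   # prints 0.82
# then: ceph osd reweight <osd-id> 0.82
```

Ceph can also do this across the board: `ceph osd
test-reweight-by-utilization` shows what it would change (dry run), and
`ceph osd reweight-by-utilization` applies it.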
If you've read this far ;-) my question is: can I force a repair of that
PG, around all the restrictions, so it doesn't block automatic
rebalancing? It seems to me that would help, but perhaps there are other
things I can do as well?
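For the unfound object specifically, the standard escape hatch I've
found in the docs looks like the below (`<pgid>` is a placeholder for
the degraded PG's id); I'd appreciate confirmation that this is the
right move, since it means giving up on that object:

```shell
# Identify the degraded PG and its unfound object:
ceph health detail
ceph pg <pgid> list_unfound

# If the object is truly unrecoverable, tell Ceph to give up on it.
# WARNING: this is data loss for that object. "revert" rolls back to a
# previous version if one exists; "delete" forgets the object entirely.
ceph pg <pgid> mark_unfound_lost revert
# or: ceph pg <pgid> mark_unfound_lost delete
```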
(Budget wise, adding more OSD nodes is a bit difficult at the moment...)
Thanks for reading!
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx